BICF Genomics Analysis: Genetic Variation (5/1/2019)

This morning we are going to:

First, we will log into a compute node after installing a necessary package:

  1. Open using a web browser

  2. Input your training ID (e.g. train03) and password

  3. Click the Launch Job button on the bottom to launch your web visualization GUI

Note: The Linux commands shown in red need to be executed correctly in your home directory, while the other commands shown in black are optional for practice.

List of Linux commands

  1. Show the current working directory

  2. Change directory to /usr
    cd /usr

  3. Change directory to /archive/nanocourse/genome_analysis/trainXX, where trainXX is your account
    cd /archive/nanocourse/genome_analysis/trainXX

  4. Change directory to your home directory
    cd ~

  5. Show files under the current directory
    ls -l

  6. Show the files used in the morning session
    ls -l /archive/nanocourse/genome_analysis/shared/session1

  7. Make a shortcut to the course files
    ln -s /archive/nanocourse/genome_analysis/shared/session1 session1

The file names and paths to the real paired-end sequencing reads for this workshop are:

To examine the files, you can optionally perform the following commands:

  1. Look at genomic regions where reads are sequenced
    cat session1/etc/regions.txt

  2. Extract a compressed read file and redirect to a text file
    gzip -cd session1/reads/HD728.R1.fastq.gz > HD728.R1.fastq

  3. Show the text file
    cat HD728.R1.fastq

  4. Show the first 10 lines of the read file
    head HD728.R1.fastq

  5. Show the first page of the read file (type space to go to the next page, and type q to quit)
    less HD728.R1.fastq

  6. Show the first page of the read file directly on the compressed file using pipe |
    gzip -cd session1/reads/HD728.R1.fastq.gz | less

  7. Count the number of lines of a read file
    wc -l HD728.R1.fastq

  8. Divide the number of lines we calculated above by 4 to get the number of reads
    expr 2019172 / 4

Now we are going to run our alignment using the paired-end reads against hg38 (the alignment step using HISAT2 might take about one minute)

  1. We will use HISAT2 with a graph index to align reads as follows:
    session1/programs/hisat2 --no-spliced-alignment -x session1/indexes/genome_snp -1 session1/reads/HD728.R1.fastq.gz -2 session1/reads/HD728.R2.fastq.gz > HD728.sam

  2. As an alternative, you can use Bowtie2 as follows (the process may take about XX minutes):
    session1/programs/bowtie2 -x session1/indexes/genome -1 session1/reads/HD728.R1.fastq.gz -2 session1/reads/HD728.R2.fastq.gz > HD728.bowtie2.sam

  3. You can also use BWA-mem as follows (the process may take about XX minutes):
    session1/programs/bwa mem -M session1/indexes/genome.fa session1/reads/HD728.R1.fastq.gz session1/reads/HD728.R2.fastq.gz > HD728.bwamem.sam

Next we will view the results of the alignment using samtools in two different formats - SAM (Sequence Alignment & Mapping) and BAM (Binary version of SAM):

  1. Look at the SAM file (use space to go to next page and q to quit)
    less HD728.sam

  2. Convert the SAM file into a BAM file
    session1/programs/samtools view -bS HD728.sam > HD728.unsorted.bam

  3. Create a sorted BAM file
    session1/programs/samtools sort HD728.unsorted.bam -o HD728.sorted.bam

  4. Make an index for the sorted BAM file
    session1/programs/samtools index HD728.sorted.bam

  5. Look at alignments between 114,709,902 and 114,709,964 on Chromosome 1
    session1/programs/samtools view HD728.sorted.bam 1:114,709,902-114,709,964

  6. Look at bases of reads at a particular locus to identify variants
    session1/programs/samtools mpileup -f session1/indexes/genome_snp.fa -r 1:114,709,902-114,709,964 HD728.sorted.bam

Now we will look at the alignment using IGV (Integrative Genomics Viewer):

  1. Run IGV
    module add IGV/2.3.90

  2. Load the genome using: Genomes -> Load from Server -> Select Human (hg38)

  3. Open the BAM file using File -> Load From File in IGV and choose HD728.sorted.bam

Try to answer the following questions:
  1. Look at the region: 1:114,709,933-114,709,995

    1. Can you see an SNV?

    2. At what position?

    3. How many reads support the alt allele?

    4. How many reads support the ref allele?

    5. Which gene is this region a part of?

  2. Look at the region: 2:208,245,395-208,245,533

    1. Can you see an indel?

    2. At what positions?

    3. How many reads support the alt allele?

    4. How many reads support the ref allele?

    5. Which gene is this region a part of?

    6. Is the indel an insertion or a deletion?

    7. Is the indel in an exon or an intron?

  3. If you have time, examine these other positions:

    1. 7:55,160,108-55,160,350

    2. 9:5,077,429-5,077,606

    3. 13:28,035,912-28,036,086

    4. 15:66,444,636-66,444,740

    5. 19:49,478,873-49,479,065