Workshop for Variant Prioritization

Today we are going to:

Log into BioHPC

First, we will log into the a compute node

cp -r /archive/nanocourse/May2018/shared/session3 .
cd session3

Practice VCF manipulation skills with BCFtools

First we are going to practice VCF manipulation skills on VCF from CEPH family 1463 with 17 members.

module load bcftools 
bgzip -c ceph1463.vcf > ceph1463.vcf.gz

The "-c" option write on standard output, keep original files unchanged.
If you don't want to keep the original file, do:

bgzip ceph1463.vcf

If you want to decompress do

bgzip -d ceph1463.vcf.gz
tabix ceph1463.vcf.gz
mkdir vcf_playground
bcftools --help
bcftools query --help
bcftools stats --help
bcftools filter --help
bcftools view --help

We will try out some of these tools in the following commands, you may refer to the documentation to understand the options we will be using.

bcftools query -l ceph1463.vcf.gz
bcftools stats ceph1463.vcf.gz > vcf_playground/ceph1463.stats.out
less vcf_playground/ceph1463.stats.out
bcftools query -f '%CHROM\t%POS\t[%GT ]\n' ceph1463.vcf.gz | less -S
bcftools filter --targets 10:96447911-96613017 ceph1463.vcf.gz | less
zcat ceph1463.vcf.gz | grep -v "^#" |wc -l
bcftools filter --SnpGap 10 ceph1463.vcf.gz | grep -v "^#" | wc -l
bcftools view --samples NA12891,NA12892,NA12878 ceph1463.vcf.gz -O z -o vcf_playground/maternal_family.vcf.gz
bcftools query -l vcf_playground/maternal_family.vcf.gz
bcftools filter -i 'TYPE="indel"' vcf_playground/maternal_family.vcf.gz -O z -o vcf_playground/maternal_indels.vcf.gz
bcftools stats vcf_playground/maternal_indels.vcf.gz | grep -A 8 "SN, Summary numbers:"
bcftools query -f '%CHROM\t%POS\t[%GT ]\n' vcf_playground/maternal_indels.vcf.gz | less -S
bcftools query -f '[%GT ]\n' vcf_playground/maternal_indels.vcf.gz|grep '0/0 0/0 0/0' | wc -l
bcftools view --private -s NA12891,NA12892,NA12878 vcf_playground/maternal_indels.vcf.gz -O z -o vcf_playground/maternal_only_indels.vcf.gz
bcftools query -f '[%GT ]\n' vcf_playground/maternal_only_indels.vcf.gz | grep '0/0 0/0 0/0' | wc -l
tabix vcf_playground/maternal_family.vcf.gz
tabix -l vcf_playground/maternal_family.vcf.gz
bcftools view vcf_playground/maternal_family.vcf.gz --targets ^X -O z -o vcf_playground/maternal_family_autosomes.vcf.gz
tabix vcf_playground/maternal_family_autosomes.vcf.gz
tabix -l vcf_playground/maternal_family_autosomes.vcf.gz
bcftools view vcf_playground/maternal_family.vcf.gz --targets X -O z -o vcf_playground/maternal_family_X.vcf.gz
tabix vcf_playground/maternal_family_X.vcf.gz
tabix -l vcf_playground/maternal_family_X.vcf.gz
bcftools concat vcf_playground/maternal_family_autosomes.vcf.gz vcf_playground/maternal_family_X.vcf.gz -O z -o vcf_playground/maternal_family_recombined.vcf.gz
tabix -l vcf_playground/maternal_family_recombined.vcf.gz

Perform project QC

source activate py2.7-nano
mkdir peddy_QC
cd peddy_QC
python -m peddy -p 16 --plot --prefix ceph-1463 ../ceph1463.vcf.gz ../ceph1463.ped
cd ..

Using GEMINI

module load gemini
mkdir gemini_practices
mkdir ~/.gemini
cp gemini-config.yaml ~/.gemini/
gemini --help
gemini load --help
gemini annotate --help
gemini query --help
gemini de_novo --help
gemini comp_hets --help
gemini autosomal_recessive --help
gemini x_linked_recessive --help
gemini set_somatic --help

We will try out some of these tools in the following commands, you may refer to the documentation to understand the options we will be using.

Solving a rare disease case from a trio

A patient presented at the hospital with hyperammonemia after giving birth. The patient and her parents have been sequenced. The VCF (Hg19) and ped files are provided for you. Can you find out the disease causing variant(s)?

gemini load -v GMDP_3_0054.annot.vt.vep.vcf.gz -t VEP -p GMDP_3_0054.ped --cores 32 GMDP_3_0054.db
gemini annotate -f GMDP_3_0054.annot.vt.vep.vcf.gz \
    -a extract -c CallSet -t text -e CallSet -o uniq_list GMDP_3_0054.db
gemini db_info GMDP_3_0054.db
cat GMDP_3_0054.ped
gemini query -q "SELECT name FROM samples" --header GMDP_3_0054.db
gemini query -q "SELECT name FROM samples WHERE phenotype == 2" GMDP_3_0054.db
gemini query -q "SELECT COUNT(*) FROM variants WHERE chrom == 'chr1'" GMDP_3_0054.db

Pathogenic variants from clinvar

gemini query -q \
    "select gene,chrom,start,end,ref,alt,impact,codon_change,aa_change,\
    gt_types,gt_depths,gt_ref_depths,gt_alt_depths,max_aaf_all,gnomad_num_hom_alt,gnomad_num_het,\
    clinvar_sig,clinvar_disease_name,clinvar_gene_phenotype,gerp_bp_score,cadd_scaled \
    from variants where clinvar_sig LIKE '%pathogenic%'" \
    --header GMDP_3_0054.db > gemini_practices/GMDP_3_0054.pathogenic.txt

Filter by different inheritance models

gemini de_novo GMDP_3_0054.db \
    --columns "gene,chrom,start,end,ref,alt,impact,codon_change,aa_change,\
    gt_types,gt_depths,gt_ref_depths,gt_alt_depths,max_aaf_all,gnomad_num_hom_alt,gnomad_num_het,\
    clinvar_sig,clinvar_disease_name,clinvar_gene_phenotype,gerp_bp_score,cadd_scaled" \
    --filter "max_aaf_all < 0.001 AND impact_severity != 'LOW' AND gnomad_num_het < 1">\
    gemini_practices/GMDP_3_0054.denovo.txt
gemini comp_hets GMDP_3_0054.db \
    --columns "gene,chrom,start,end,ref,alt,impact,codon_change,aa_change,\
    gt_types,gt_depths,gt_ref_depths,gt_alt_depths,max_aaf_all,gnomad_num_hom_alt,gnomad_num_het,\
    clinvar_sig,clinvar_disease_name,clinvar_gene_phenotype,gerp_bp_score,cadd_scaled" \
    --filter "max_aaf_all < 0.001 AND impact_severity != 'LOW' AND gnomad_num_het < 1">\
    gemini_practices/GMDP_3_0054.comp_hets.txt
gemini autosomal_recessive GMDP_3_0054.db \
    --columns "gene,chrom,start,end,ref,alt,impact,codon_change,aa_change,\
    gt_types,gt_depths,gt_ref_depths,gt_alt_depths,max_aaf_all,gnomad_num_hom_alt,gnomad_num_het,\
    clinvar_sig,clinvar_disease_name,clinvar_gene_phenotype,gerp_bp_score,cadd_scaled" \
    --filter "max_aaf_all < 0.001 AND impact_severity != 'LOW' AND gnomad_num_het < 1">\
    gemini_practices/GMDP_3_0054.autosomal_recessive.txt
gemini x_linked_recessive GMDP_3_0054.db \
    --columns "gene,chrom,start,end,ref,alt,impact,codon_change,aa_change,\
    gt_types,gt_depths,gt_ref_depths,gt_alt_depths,max_aaf_all,gnomad_num_hom_alt,gnomad_num_het,\
    clinvar_sig,clinvar_disease_name,clinvar_gene_phenotype,gerp_bp_score,cadd_scaled" \
    --filter "max_aaf_all < 0.001 AND impact_severity != 'LOW' AND gnomad_num_het < 1">\
    gemini_practices/GMDP_3_0054.x_linked_recessive.txt

Number of records under each inheritance mode

for file in gemini_practices/GMDP_3_0054.*.txt; do wc -l $file; done

You may open these files in excel and check further

Making a diagnosis

grep Hyperammonemia ALL_SOURCES_ALL_FREQUENCIES_diseases_to_genes_to_phenotypes.txt > gemini_practices/Hyperammonemia_candidate_genes.txt

Can you find out what might be the causal variant?

cd gemini_practices
# for each file generated by a specific inheritance model or ad hoc query:
for file in GMDP_3_0054.*.txt; do 
    # select unique genes from the first column as our candidate genes, loop through them:
    for candidate_gene in $(cut -f1 $file|sort|uniq); do 
        # for each known disease gene in the second column of the file we obtained through HPO term mapping:
        for disease_gene in $(cut -f2 Hyperammonemia_candidate_genes.txt); do 
            # match to see if the candidate gene is in a known disease gene, if it is, then
            if [ $candidate_gene = $disease_gene ]; then 
                # print the name of the disease gene, and the model name of the file
                echo "$disease_gene found under $(echo $file|cut -f2 -d .) mode"
                # create a result file (if it does not exist)
                touch trio_result.txt
                # write the header into this result file
                head -n1 $file >> trio_result.txt
                # append the variant records to this result file
                grep $disease_gene $file >> trio_result.txt
            fi
        done
    done
done
# check columns 1-6 and 10 of this result file
cut -f -6,10 trio_result.txt | column -t | less -S

Identifying somatic muations in paired cancer-normal cell lines.

A pair of cell lines from the same lung cancer patient (HCC4017, cancer cell line vs HBEC30, immortalized human bronchial epithelial cell line) have been sequenced and the VCF (lifted over from Hg38 to Hg19) and ped files are provided to you. Can you identify the somatic mutations?

gemini load -v CellLine.vt.vcf.gz -t snpEff -p CellLine.ped --cores 32 CellLine.db
cat CellLine.ped
gemini query -q "SELECT name FROM samples" --header CellLine.db
(cat somatic_mutation_header; \
gemini set_somatic \
    --min-depth 30 \
    --min-qual 20 \
    --min-somatic-score 15 \
    --min-tumor-depth 10 \
    --min-norm-depth 10 \
    CellLine.db) > gemini_practices/CellLine_Somatic_Mutations.txt
gemini query -q \
    "SELECT chrom,start,end,gene,vcf_id,somatic_score,gt_types,ref,alt,qual,type,sub_type,codon_change,aa_change,\
    aa_length,biotype,impact_severity,impact,is_somatic,gerp_bp_score,cadd_scaled,fitcons\
    FROM variants WHERE is_somatic==1 ORDER BY vcf_id DESC" \
    --header CellLine.db > gemini_practices/CellLine_Somatic_Mutations_additional.txt
column -t gemini_practices/CellLine_Somatic_Mutations_additional.txt | less -S

You may open these files in excel and check further