Workshop from Alignment to Differential Expression

Today we are going to:

Log into BioHPC

First, we will log into the log into a compute node,install a necessary package and then log into a compute node

ln -s /archive/nanocourse/June2018/shared/session2 /archive/nanocourse/June2018/trainXX/session2
ls /archive/nanocourse/June2018/trainXX/session2

Exploring GTF File

wc -l session2/gencode.gtf
head -n 5 session2/gencode.gtf

description: evidence-based annotation of the mouse genome (GRCm38), version M10 (Ensembl 85)

provider: GENCODE


format: gtf

date: 2016-07-19

Indicates the version of the annotation use and when it was generated.

grep 'Tnnt2' session2/gencode.gtf | cut -f9 | grep ENSMUST | cut -f2,8 -d ';' | sort | uniq
grep 'Tnnt2' session2/gencode.gtf | cut -f9 | grep ENSMUST | grep  'transcript_type "protein_coding"' | cut -f2,8 -d ';' | sort | uniq | wc

Run FeatureCounts to get counts on a file

module load subread/1.6.1
featureCounts -p -g gene_name -a session2/gencode.gtf -o heart_e11_5_rep1.cts session2/alignments/heart_e11_5_rep1.dedup.bam

featureCounts will print statistics on the screen. There is one line in the second box: "Successfully assigned fragments : 376082 (72.3%)"

Depending on how your samples are prepared, this value varies: * If you are using poly-A extraction, we expect this value to be at least 70% * If you are extracting total RNA, this value will be around 50% * When you have a percent less than 40%, you should look into this matter to figure out why

Now Let's take a look at featureCounts results

head heart_e11_5_rep1.cts.summary
head heart_e11_5_rep1.cts

Run feature counts on all of the files.

Then we need to combine all the counts into a matrix. We will be using an in-house Perl script "" to do this. It takes in the names of all genes (genenames.txt)

There are 2 parameter you need to provide: 1. The output folder -o (here we use the current directory ./ ) 2. the patter of all of the cts files (.cts)

perl session2/scripts/ -o ./ session2/counts/*.cts
head countTable.txt

Before we are ready for differential expression we need to create a design file. Thank full we have already done that for you.

Also lets copy 2 files to your home directory

cp design_se.txt /home2/trainXX
cp countTable.txt /home2/trainXX

Use RStudio on demand

Install edgeR for gene differential expression analysis

Install edgeR


For pheatmap package, click "package"->"install" Type 'a' if program asks you if you want to update packages. Install from:CRAN Packages: pheatmap Check "Install dependencies" Click "Install"

Click "Yes" if prompt window asks you if you want to use a personal library.

After everything finished, load the libraries using


Load Data

Set working directory

Session-> Set working directory -> choose directory

#Read data matrix and sample file



#Reorder the counts columns to match the order of sample file
cfile = cfile[c("heart_e11_5_rep1", "heart_e11_5_rep2", "heart_p0_rep1", "heart_p0_rep2")]

#It is good to set your control group label as the baseline. Especially you are going to use intercept
group = relevel(factor(coldata$SampleGroup),ref="heart_e11_5")
cds = DGEList(cfile,group=group)

Pre-filtering the low-expressed genes

Filter for keeping a gene if cpm (counts per million) exceeds 1 in at least 2 samples.

cds = cds[ rowSums(cpm(cds)>=1) >= 2, ,keep.lib.sizes=FALSE]

Exploratory analysis and some vizulization

Use cpm() function to get log2 transformed normalized counts

rld <- cpm(cds, log=TRUE)

Calculate the distance between sample pairs and do hierarchical clustering

sampleDists = dist(t(rld))

Use heatmap to show sample correlation**


Use MDS plot to check the relationship of replicates.

points = c(15,16)
colors = rep(c("red","blue"),4)
plotMDS(cds, col=colors[group], pch=points[group])
legend("bottomleft", legend=levels(group), pch=points, col=colors, ncol=2)

Make a design matrix for samplegroups

samplegroup <- factor(coldata$SampleGroup)

Normalize data and estimate dispersion. What is the norm.factors per sample?

cds = calcNormFactors(cds)

Use glmFit() and glmLRT() to test for differential expression. What are the top 10 differential genes sorted by logFC?

cds <- estimateDisp(cds,design)

fit <- glmFit(cds, design)
lrt <- glmLRT(fit, coef=2)
res <- topTags(lrt, n=dim(cfile)[1],"logFC")

Make a dataframe with column for genes

res_df = cbind(gene_name = rownames(res), data.frame(res))

Filter for genes that are logFC >= 1 & FDR <= 0.01

res_filt = res_df[(abs(res_df$logFC)>=1 & res_df$FDR<=0.01),]

Draw a heatmap of differential expressed genes meeting filter criteria.

res_filt_rld = rld[rownames(rld) %in% res_filt$gene_name,]
pheatmap(res_filt_rld,scale="row",show_rownames = F)