9. RNAseq

Learning objectives:

  • Install RNA-seq software (salmon and edgeR) using conda
  • Learn mapping and differential gene expression analysis of RNA-seq data
  • Interpret RNA-seq analysis results

9.1. Boot up a Jetstream instance

Boot an m1.medium Jetstream instance and log in.

9.2. Install software

We will be using salmon and edgeR. Salmon can be installed directly through conda, but edgeR needs an additional R install script:

cd ~

conda install -y salmon

curl -L -O https://raw.githubusercontent.com/ngs-docs/angus/2018/scripts/install-edgeR.R
sudo Rscript --no-save install-edgeR.R

9.4. Download the yeast reference transcriptome:

curl -O https://downloads.yeastgenome.org/sequence/S288C_reference/orf_dna/orf_coding.fasta.gz

9.5. Index the yeast transcriptome:

salmon index --index yeast_orfs --type quasi --transcripts orf_coding.fasta.gz

9.6. Run salmon on all the samples:

for i in *.fastq.gz
do
   salmon quant -i yeast_orfs --libType U -r "$i" -o "$i.quant" --seqBias --gcBias
done

Read up on libType in the salmon documentation. The U used above declares the reads as coming from an unstranded library.
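
Before moving on, it can be worth checking how many reads salmon was able to assign in each sample. The following is a minimal sketch, not part of the original lesson. It assumes each output directory from the loop above contains an aux_info/meta_info.json file with a percent_mapped field, as salmon normally writes; the function name mapping_rates is made up for illustration.

```python
import json
from pathlib import Path


def mapping_rates(root="."):
    """Return {quant_dir_name: percent_mapped} for every *.quant directory under root."""
    rates = {}
    for quant_dir in sorted(Path(root).glob("*.quant")):
        # Assumed location of salmon's run summary for this sample.
        meta_file = quant_dir / "aux_info" / "meta_info.json"
        if not meta_file.is_file():
            continue  # salmon did not finish (or has not run) for this sample
        with open(meta_file) as fh:
            meta = json.load(fh)
        rates[quant_dir.name] = meta["percent_mapped"]
    return rates


if __name__ == "__main__":
    for name, pct in mapping_rates().items():
        print(f"{name}\t{pct:.1f}% mapped")
```

A sample whose mapping rate is much lower than the others is worth investigating before differential expression analysis.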

9.7. Collect all of the sample counts using this Python script:

curl -L -O https://raw.githubusercontent.com/ngs-docs/2018-ggg201b/master/lab6-rnaseq/gather-counts.py
python2 gather-counts.py
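
To make the idea of "collecting counts" concrete, here is a rough sketch of reading the per-transcript read counts out of salmon's output. It does not reproduce gather-counts.py. It only assumes that each *.quant directory holds a tab-separated quant.sf file with Name and NumReads columns, which is salmon's usual output layout; the function names are hypothetical.

```python
import csv
from pathlib import Path


def read_num_reads(quant_sf):
    """Map transcript name -> estimated read count from one salmon quant.sf file."""
    counts = {}
    with open(quant_sf, newline="") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            # NumReads is an estimate, so it can be fractional.
            counts[row["Name"]] = float(row["NumReads"])
    return counts


def gather(root="."):
    """Collect counts for every *.quant directory under root, keyed by directory name."""
    return {
        d.name: read_num_reads(d / "quant.sf")
        for d in sorted(Path(root).glob("*.quant"))
        if (d / "quant.sf").is_file()
    }
```

Calling gather() in the working directory would return one dictionary of transcript counts per sample, which is the kind of table a differential expression tool takes as input.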

9.8. Run edgeR (in R) using this script and take a look at the output:

curl -L -O https://raw.githubusercontent.com/ngs-docs/angus/2018/scripts/yeast.salmon.R
Rscript --no-save yeast.salmon.R

This will produce two plots, yeast-edgeR-MA-plot.pdf and yeast-edgeR-MDS.pdf. You can view them by going to your RStudio server file viewer, changing to the directory rnaseq, and then clicking on them. If you see a “Popup Blocked” error, click the “Try again” button.
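
As background for reading the MA plot, the two axes can be computed by hand for a single gene. This is a simplified sketch under stated assumptions: it uses raw mean counts and a made-up pseudocount, whereas edgeR works with normalized log counts-per-million and its own fold-change estimates. The name ma_point is hypothetical.

```python
import math


def ma_point(mean_a, mean_b, pseudocount=0.5):
    """Return (A, M) for one gene given its mean counts in two conditions."""
    log_a = math.log2(mean_a + pseudocount)
    log_b = math.log2(mean_b + pseudocount)
    m = log_b - log_a          # M: log2 fold change (y axis)
    a = 0.5 * (log_a + log_b)  # A: average log2 abundance (x axis)
    return a, m


if __name__ == "__main__":
    # The same absolute difference of 4 reads, at low and at high abundance.
    for pair in [(2, 6), (1002, 1006)]:
        a, m = ma_point(*pair)
        print(f"counts={pair}  A={a:.2f}  M={m:.3f}")
```

A difference of a few reads produces a large M at low abundance and a negligible one at high abundance, which is one reason the cloud of points is wide on the left of the plot and narrow on the right.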

The yeast-edgeR.csv file contains the fold-change and significance information for each gene in spreadsheet form.
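
If you would rather filter the results programmatically than in a spreadsheet, something like the following sketch could work. The column name FDR is an assumption based on edgeR's usual results table; check the header of your own yeast-edgeR.csv first. The 0.05 cutoff and the function name are illustrative only.

```python
import csv


def significant_genes(csv_path, fdr_cutoff=0.05, fdr_column="FDR"):
    """Return the rows whose FDR is below the cutoff, smallest FDR first."""
    with open(csv_path, newline="") as fh:
        rows = [
            row for row in csv.DictReader(fh)
            if float(row[fdr_column]) < fdr_cutoff
        ]
    return sorted(rows, key=lambda row: float(row[fdr_column]))
```

For example, len(significant_genes("yeast-edgeR.csv")) would give the number of genes passing the cutoff, assuming the column name matches.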

9.9. Questions to ask/address

  1. What is the point or value of the multidimensional scaling (MDS) plot?

  2. Why does the MA-plot have that shape?

    Related: Why can’t we just use fold change to select the things we’re interested in?

    Related: How do we pick the FDR (false discovery rate) threshold?

  3. How do we know how many replicates (bio and/or technical) to do?

    Related: what confounding factors are there for RNAseq analysis?

    Related: what is our false positive/false negative rate?

  4. What happens when you add new replicates?
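
For the FDR question above, it may help to see how per-gene p-values are turned into FDR values. The sketch below implements the Benjamini-Hochberg adjustment, which is the default adjustment method in edgeR's results table as far as we know; it is written from the textbook definition rather than taken from edgeR, and the function name is our own.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the same order as the input."""
    n = len(pvalues)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, so each adjusted value is the
    # minimum of p * n / rank over this rank and all larger ranks.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted
```

Choosing an FDR threshold of 0.05 then means accepting that roughly 5% of the genes called significant are expected to be false discoveries.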

9.10. More reading

“How many biological replicates are needed in an RNA-seq experiment and which differential expression tool should you use?” Schurch et al., 2016.

“Salmon provides accurate, fast, and bias-aware transcript expression estimates using dual-phase inference” Patro et al., 2016.

Also see SEQanswers and Biostars.