10. RNA-seq read quantification with salmon

Learning objectives:

  • Install read quantification program Salmon
  • Learn quantification of RNA-seq data

10.1. Boot up a Jetstream

Boot an m1.medium Jetstream instance and log in.

You should have the quality trimmed data on your instance in the ~/quality/ folder. If not, you can get this data by running:

cd ~
mkdir -p quality
cd quality
curl -L https://osf.io/wfz34/download -o ERR458493.qc.fq.gz
curl -L https://osf.io/jxh4d/download -o ERR458494.qc.fq.gz
curl -L https://osf.io/zx7n3/download -o ERR458495.qc.fq.gz
curl -L https://osf.io/96mrj/download -o ERR458500.qc.fq.gz
curl -L https://osf.io/wc8yn/download -o ERR458501.qc.fq.gz
curl -L https://osf.io/sdtz3/download -o ERR458502.qc.fq.gz

10.2. Introduction to Salmon (adapted from salmon documentation)

Salmon is a tool for fast transcript quantification from RNA-seq data. It requires a set of target transcripts (either from a reference or de-novo assembly) to quantify and FASTA/FASTQ file(s) containing your reads.

Salmon runs in two phases, indexing and quantification. The indexing step is independent of the reads, and only needs to be run one for a particular set of reference transcripts. The quantification step is specific to the set of RNA-seq reads and is thus run more frequently.

10.3. Install software

Salmon is installed through conda.

Let’s make sure our conda channels are loaded.

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda install -y -c bioconda salmon

10.5. Download the yeast transcriptome:

curl -O ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/146/045/GCA_000146045.2_R64/GCA_000146045.2_R64_rna_from_genomic.fna.gz

10.6. Index the yeast transcriptome:

salmon index --index sc_index --type quasi --transcripts GCA_000146045.2_R64_rna_from_genomic.fna.gz

10.7. Run salmon on all the samples:

for i in *.qc.fq.gz
do
   salmon quant -i sc_index --libType A -r ${i} -o ${i}_quant --seqBias --gcBias --validateMappings
done

What do all of these flags do?

Flag Meaning
-i path to index folder
--libType The library type of the reads you are quantifying. A allows salmon to automatically detect the library type of the reads you are quantifying.
-r Input file (for single-end reads)
-o output folder
--seqBias learn and correct for sequence-specific biases in the input data
--gcBias learn and correct for fragment-level GC biases in the input data
--validateMappings Enables selective alignment, which improves salmon's sensitivity

As Salmon is running, a lot of information is printed to the screen. For example, we can see the mapping rate as the sample finishes:

[2019-06-29 18:39:18.367] [jointLog] [info] Mapping rate = 86.2687%

When it finished, Salmon outputs a folder for each input RNA-seq sample. Let’s take a look at one of the output folders.

ls ERR458493.qc.fq.gz_quant

You should see output like this:

aux_info/               lib_format_counts.json
cmd_info.json           logs/
libParams/              quant.sf

The information we saw scroll through our screens is captured in a log file in aux_info/.

less ERR458493.qc.fq.gz_quant/aux_info/meta_info.json

We see information about our run parameters and performance. To exit out of the file, press q.

We can also look at the count file:

less -S ERR458493.qc.fq.gz_quant/quant.sf

We see our transcript names, as well as the number of reads that aligned to each transcript.

PRACTICE!

How can we use the grep command to find the percent of reads mapped in ALL our json files? Make sure to print out the name of each file before you print the percent of reads mapped!

Solution
cd ~/quant/
for infile in *_quant/aux_info/meta_info.json
do
echo ${infile}
grep "percent_mapped" ${infile}
done

First we use echo to print the name of the file. Then, we use grep to find and print the line containing the percent of reads mapped.

In our next lesson, we will be reading these quant files into R and performing differential expression with them. “Salmon provides accurate, fast, and bias-aware transcript expression estimates using dual-phase inference” Patro et al., 2016.

Also see seqanswers and biostars.