Quantifying transcript expression with Salmon¶

During this lab, you’ll learn how to use salmon to rapidly quantify transcript-level expression from RNA-seq data.

Log into your instance¶

For this tutorial, we’ll use a c4.2xlarge instance. Make sure you create the instance with the volume containing the reads attached!:

> ssh -i ~/Downloads/?????.pem ubuntu@XX.XX.XX.XX``

Update the package list¶

> sudo apt-get update

Install some base packages¶

First, install the “build tools” (compilers etc. that may be needed):

> sudo apt-get install build-essential

Mounting the reads¶

We have prepared (thanks; @monsterbashseq!) an Amazon volume from which you can load the reads directly. When we created our AWS instance, we attached the volume with the reads to /dev/xvdf. We have to mount this device. Since we’re using the volume from yesterday, a place for the volume /mnt/reads already exists. Here, we just mount the device at the mount point:

>sudo mount /dev/xvdf /mnt/reads

When this command finishes (should only take a few seconds) we’re good to go, but just need to change the permissions on this folder.:

> sudo chown -R ubuntu:ubuntu /mnt/reads

Now all of the read files should be available in /mnt/reads. Check this out with:

> ls -lha /mnt/reads

You should see something similar to:

> ls -lha /mnt/reads/
total 72G
drwxr-xr-x 3 ubuntu ubuntu 4.0K Aug 10 22:54 .
drwxr-xr-x 3 root   root   4.0K Aug 11 03:08 ..
drwx------ 2 ubuntu ubuntu  16K Aug 10 20:56 lost+found
-rw-rw-r-- 1 ubuntu ubuntu 3.2G Aug  9 15:18 OREf_SAMm_sdE3_ATTCCT_L002_R1_001.fastq.gz
-rw-rw-r-- 1 ubuntu ubuntu 3.3G Aug  9 15:19 OREf_SAMm_sdE3_ATTCCT_L002_R2_001.fastq.gz
-rw-rw-r-- 1 ubuntu ubuntu 2.5G Aug  9 15:20 OREf_SAMm_w_GTCCGC_L006_R1_001.fastq.gz
-rw-rw-r-- 1 ubuntu ubuntu 2.5G Aug  9 15:21 OREf_SAMm_w_GTCCGC_L006_R2_001.fastq.gz
-rw-rw-r-- 1 ubuntu ubuntu 3.3G Aug  9 15:08 ORE_sdE3_r1_GTGGCC_L004_R1_001.fastq.gz
-rw-rw-r-- 1 ubuntu ubuntu 3.3G Aug  9 15:10 ORE_sdE3_r1_GTGGCC_L004_R2_001.fastq.gz
-rw-rw-r-- 1 ubuntu ubuntu 3.4G Aug  9 15:11 ORE_sdE3_r2_TGACCA_L005_R1_001.fastq.gz
-rw-rw-r-- 1 ubuntu ubuntu 3.5G Aug  9 15:13 ORE_sdE3_r2_TGACCA_L005_R2_001.fastq.gz
-rw-rw-r-- 1 ubuntu ubuntu 2.8G Aug  9 15:14 ORE_w_r1_ATCACG_L001_R1_001.fastq.gz
-rw-rw-r-- 1 ubuntu ubuntu 2.8G Aug  9 15:15 ORE_w_r1_ATCACG_L001_R2_001.fastq.gz
-rw-rw-r-- 1 ubuntu ubuntu 3.5G Aug  9 15:16 ORE_w_r2_GTTTCG_L002_R1_001.fastq.gz
-rw-rw-r-- 1 ubuntu ubuntu 3.5G Aug  9 15:17 ORE_w_r2_GTTTCG_L002_R2_001.fastq.gz
-rw-rw-r-- 1 ubuntu ubuntu 2.8G Aug  9 15:29 SAMf_OREm_sdE3_TAGCTT_L001_R1_001.fastq.gz
-rw-rw-r-- 1 ubuntu ubuntu 2.8G Aug  9 15:30 SAMf_OREm_sdE3_TAGCTT_L001_R2_001.fastq.gz
-rw-rw-r-- 1 ubuntu ubuntu 2.9G Aug  9 15:31 SAMf_OREm_w_CAGATC_L005_R1_001.fastq.gz
-rw-rw-r-- 1 ubuntu ubuntu 3.0G Aug  9 15:32 SAMf_OREm_w_CAGATC_L005_R2_001.fastq.gz
-rw-rw-r-- 1 ubuntu ubuntu 3.7G Aug  9 15:22 SAM_sdE3_r1_ATGTCA_L006_R1_001.fastq.gz
-rw-rw-r-- 1 ubuntu ubuntu 3.7G Aug  9 15:23 SAM_sdE3_r1_ATGTCA_L006_R2_001.fastq.gz
-rw-rw-r-- 1 ubuntu ubuntu 2.9G Aug  9 15:24 SAM_sdE3_r2_GCCAAT_L007_R1_001.fastq.gz
-rw-rw-r-- 1 ubuntu ubuntu 2.9G Aug  9 15:25 SAM_sdE3_r2_GCCAAT_L007_R2_001.fastq.gz
-rw-rw-r-- 1 ubuntu ubuntu 2.5G Aug  9 15:26 SAM_w_r1_ACTTGA_L003_R1_001.fastq.gz
-rw-rw-r-- 1 ubuntu ubuntu 2.6G Aug  9 15:27 SAM_w_r1_ACTTGA_L003_R2_001.fastq.gz
-rw-rw-r-- 1 ubuntu ubuntu 2.6G Aug  9 15:28 SAM_w_r2_GAGTGG_L004_R1_001.fastq.gz
-rw-rw-r-- 1 ubuntu ubuntu 2.6G Aug  9 15:29 SAM_w_r2_GAGTGG_L004_R2_001.fastq.gz

Obtaining the refernece data¶

Note

If you’re on an instance that already has the reference transcriptome from the mapping lab yesterday, then you can skip this step

We’ll be quantifying against the Drosophila transcriptome, so let’s grab that file again::

> wget ftp://ftp.flybase.net/releases/FB2016_04/dmel_r6.12/fasta/dmel-all-transcript-r6.12.fasta.gz

We’ll put this in a folder called ref, and unzip it there::

> mkdir ref
> mv dmel-all-transcript-r6.12.fasta.gz ref
> cd ref
> gunzip dmel-all-transcript-r6.12.fasta.gz
> cd ..

Great; now, let’s run salmon.

Installing Salmon¶

The latest release of Salmon is available either as a pre-compiled binary from GitHub, or via linuxbrew (thanks @sjackman!), we’ll grab a pre-compiled binary directly, or you can install via linuxbrew if you want. We can download it using wget like so:

> wget --no-check-certificate 'https://github.com/COMBINE-lab/salmon/releases/download/v0.7.0/Salmon-0.7.0_linux_x86_64.tar.gz'

and we can untar and unzip the resulting file with the following command:

> tar xzf Salmon-0.7.0_linux_x86_64.tar.gz

Finally, so that we can simply type salmon to execute salmon, we’ll add the appropriate directory to our path variable again.:

> echo 'export PATH="/home/ubuntu/SalmonBeta-0.7.0_linux_x86_64/bin:$PATH"' >>~/.bashrc

Running Salmon¶

Creating the Salmon index¶

Since Salmon uses quasi-mapping behind the scenes, we’ll need to build an index on the transcriptome. Building the salmon index is relatively quick, we do it with the following command:

> salmon index -t ref/dmel-all-transcript-r6.12.fasta -i salmon_index

The -t option tells salmon where to look for the transcript sequences and -i tells it where to write the index.

Quantifying with Salmon¶

Now, we’ll run Salmon on all of our samples. We’re let salmon use defaults for almost all parameters, but I’ll explain the options and their arguments below. It will be rather burdensome to run salmon by hand for each sample, so we’ll write a small shell script to run each of the samples one-by-one. Here’s the shell script we’ll use:

#!/bin/bash

for fn in /mnt/reads/*R1_001.fastq.gz
do

# get the path to the file
dir=`dirname $fn`;

# get just the file (without the path)
base=`basename $fn`;

# the read filename, without the _R1_001.fastq.gz suffix
rf=${base%_R1_001.fastq.gz};

# Do whatever we want with it
salmon quant -i salmon_index -p 8 -l IU -1 <(gunzip -c ${dir}/${rf}_R1_001.fastq.gz) -2 <(gunzip -c ${dir}/${rf}_R2_001.fastq.gz) -o quants/${rf}

done

The call to salmon takes a few arguments; almost all of them required:

-i tells salmon where to look for the index
-p tells salmon how many threads to use
-l tells salmon the type of the read library (here, inward facing, unstranded reads). For a more in-depth description of the library types and how to specify them in salmon, have a look here in the docs.
-1 similar to RapMap, this tells salmon where to find the first reads of the pair
-2 tells salmon where to find the second reads of the pair
-o tells salmon where (the directory) to write the output for this sample. The directory (and the path to it) will be created if it doesn’t exist.

Attention

We are quantifying all 12 samples here. This totals ~400 – 500 million read pairs (~800M — 1B individual reads). Salmon will take ~4 minutes per sample, so this process should take 40 - 50 minutes. This is a good time for us to chat, or for you to ask questions you may have thought of during the lecture or up until this point in the practical.

Taking a look at the quantifications¶

For Ian’s lecture on differential expression, you’ll need the quantification results on your local machine, so let’s pull them down:

> scp -i ?????????.pem -r ubuntu@XX.XX.XX.XX:~/quants .

This will copy the quants directory, recursively, from the server to your local machine. Let’s take a quick peek at some of the quantification results (we’ll use R). Open up RStudio, and set the current directory as the working directory. We’ll do some “sanity checks” using the commands here (please don’t make fun of my lack of R-fu — I’m a Pythonista).

Note: A pre-computed version of the quantification results is available here.

TERMINATE YOUR INSTANCE!!!

LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- fork @ github. Presentations (PPT/PDF) and PDFs are the property of their respective owners and are under the terms indicated within the presentation.