RNA-seq: mapping to a reference genome with tophat and counting with HT-seqΒΆ

In this tutorial, we’ll use some sample data from a project we did on flies (Drosophila melanogaster) to illustrate how you can use RNA-seq data to look for differentially expressed genes. We’ll try a few different approaches to see whether different tools give similar results. Here’s some brief background on the project: we’re trying to understand how different wild-type genetic backgrounds can influence the phenotypic effects of mutations, using the developing fly wing as our model system. We have several mutations that disrupt wing development, and we’ve backcrossed them into the genetic backgrounds of two different wild-type fly strains (SAM and ORE). For this tutorial, we’ve taken a subset of the data–we’ll look for expression differences in developing wing tissues between non-mutant and mutant (sd[E3]) flies in each of the two genetic backgrounds, and in flies with a “hybrid” genetic background (i.e., crosses between SAM/ORE flies, again both with and without the mutation). To make things run a little bit faster, we’ve included only sequence reads that map to X-linked genes.

First, launch an EC2 instance and log in. Next, create an EBS volume from our snapshot (snap-642349cb). Make sure to create your EC2 instance and your EBS volume in the same availability zone! The snapshot has the raw reads, as well as pre-computed results files so we don’t need to wait for every step to finish running before we proceed.

Attach the EBS volume to your instance and mount it in /mnt/ebs/. If you don’t know how to do this, ask for a demonstration.

First, we’ll need to install a bunch of software. Some of these tools can be installed using apt-get. Note that apt-get does not necessarily always install the most up-to-date versions of this software! You should always double check versions when you do this. For instance, when I was writing this tutorial, apt-get gave me a warning that cufflinks might be out of date, so we’re going to install by downloading it directly from the authors.

#Install samtools, bowtie, tophat
apt-get -y install samtools
apt-get -y install bowtie
apt-get -y install tophat
apt-get -y install python-pip
pip install pysam

#Create a working directory to hold software that we're going to install other tools
cd /mnt/ebs
mkdir tools
cd tools

#Download and install HTSeq
curl -O https://pypi.python.org/packages/source/H/HTSeq/HTSeq-0.6.1.tar.gz
tar -xzvf HTSeq-0.6.1.tar.gz
cd HTSeq-0.6.1/
python setup.py build
python setup.py install

#Download and install cufflinks
cd ..
curl -O http://cufflinks.cbcb.umd.edu/downloads/cufflinks-2.2.1.Linux_x86_64.tar.gz
tar -xzvf cufflinks-2.2.1.Linux_x86_64.tar.gz
cd cufflinks-2.2.1.Linux_x86_64/
find . -type f -executable -exec cp {} /usr/local/bin \;
cd ..

Next, we need to get our reference genome. This is another area where you want to be careful and pay attention to version numbers–the public data from genome projects are often updated, and gene ID’s, coordinates, etc., can sometimes change. At the very least, you need to pay attention to exactly which version you’re working with so you can be consistent throughout all your analyses.

In this case, we’ll download the reference Drosophila genome and annotation file (which has the ID’s and coordinates of known transcripts, etc.) from ensembl. We’ll put it in its own directory so we keep our files organized:

cd /mnt/ebs
mkdir references
cd references
curl -O ftp://ftp.ensembl.org/pub/release-75/fasta/drosophila_melanogaster/dna/Drosophila_melanogaster.BDGP5.75.dna.toplevel.fa.gz
gunzip Drosophila_melanogaster.BDGP5.75.dna.toplevel.fa.gz
curl -O ftp://ftp.ensembl.org/pub/release-75/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP5.75.gtf.gz
gunzip Drosophila_melanogaster.BDGP5.75.gtf.gz

We also need to prepare the genomes for use with our software tools by indexing them. This is simple to do but takes a little time for large genomes. You can run the following code, but do not have to, since we’ve included pre-computed indexes on the snapshot:

bowtie-build Drosophila_melanogaster.BDGP5.75.dna.toplevel.fa Drosophila_melanogaster.BDGP5.75.dna.toplevel
samtools faidx Drosophila_melanogaster.BDGP5.75.dna.toplevel.fa

Now we’re ready for the first step: mapping our RNA-seq reads to the genome. We will use tophat+bowtie1, which together are a splicing-aware read aligner. The raw sequencing reads are in /mnt/ebs/drosophila_reads/. Feel free to take a look at how we’ve named and organized the files:

cd /mnt/ebs/drosophila_reads/
ls -lh

Don’t forget that with your reads, you’ll want to take care of the usual QC steps before you actually begin your mapping. The drosophila_reads directory contains raw reads; the trimmed_x directory contains reads that have already been cleaned using Trimmomatic. We’ll use these for the remainder of the tutorial, but you may want to try running it with the raw reads for comparison.

Since we have a lot of files to map, it would take a long time to re-write the mapping commands for each one. And with so many parameters, we might make a mistake or typo. It’s usually safer to use a simple shell script with shell variables to be sure that we do the exact same thing to each file. Using well-named shell variables also makes our code a little bit more readable:

#Create an array to hold the names of all our samples
#Later, we can then cycle through each sample using a simple foor loop
samples[1]=ORE_wt_rep1
samples[2]=ORE_wt_rep2
samples[3]=ORE_sdE3_rep1
samples[4]=ORE_sdE3_rep2
samples[5]=SAM_wt_rep1
samples[6]=SAM_wt_rep2
samples[7]=SAM_sdE3_rep1
samples[8]=SAM_sdE3_rep2
samples[9]=HYB_wt_rep1
samples[10]=HYB_wt_rep2
samples[11]=HYB_sdE3_rep1
samples[12]=HYB_sdE3_rep2

#Create shell variables to store the location of our reference genome and annotation file
#Note that we are leaving off the .fa from the reference genome file name, because some of the later commands will require just the base of the file name
reference=/mnt/ebs/references/Drosophila_melanogaster.BDGP5.75.dna.toplevel
annotation=/mnt/ebs/references/Drosophila_melanogaster.BDGP5.75.gtf

#Make sure w are in the right directory
#Let's store all of our mapping results in /mnt/ebs/rnaseq_mapping/ to make sure we stay organized
cd /mnt/ebs
mkdir rnaseq_mapping
cd rnaseq_mapping

#Now we can actually do the mapping
for i in 1 2 3 4 5 6 7 8 9 10 11 12
do
    sample=${samples[${i}]}
    #Map the reads
    tophat -p 4 -G ${annotation} -o ${sample} ${reference} /mnt/ebs/trimmed_x/${sample}_1_pe /mnt/ebs/trimmed_x/${sample}_2_pe
    #Count the number of reads mapping to each feature using HTSeq
    htseq-count --format=bam --stranded=no --order=pos ${sample}/accepted_hits.bam ${annotation} > ${sample}_htseq_counts.txt
done

We now have count files for each sample. Take a look at one of the count files using less. You’ll notice there are a lot of zeros, but that’s partially because we’ve already filtered the dataset for you to include only reads that map to the X chromosome.

less HYB_sdE3_rep1_htseq_counts.txt

You can also visualize these read mapping using tview (Variant calling)

samtools index HYB_sdE3_rep1/accepted_hits.bam samtools tview HYB_sdE3_rep1/accepted_hits.bam ${reference}.fa

Now we’ll need to import them into R to use additional analysis packages to look for differentially expressed genes–in this case, DESeq. At this point I usually download these data files and run the analyses locally. I would suggest copying the files using scp or through a synchronized Dropbox folder. Once you’ve got them downloaded, we’re now ready to start crunching some numbers.


LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- comments powered by Disqus