7. Short read quality and trimming

Learning objectives:

  • Install software (fastqc, multiqc) via conda
  • download data
  • visualize read quality
  • quality filter and trim reads

Start up a Jetstream m1.medium or larger as per Jetstream startup instructions.


You should now be logged into your Jetstream computer! You should see something like this

titus@js-17-71:~$ 

7.1. Getting started

Change to your home directory:

cd ~/

and install FastQC, MultiQC, and trimmomatic:

conda install -y fastqc multiqc trimmomatic

7.2. Data source

Make a “data” directory:

cd ~/
mkdir -p data
cd data

and download download some data from the Schurch et al, 2016 yeast RNAseq study:

curl -L https://osf.io/5daup/download -o ERR458493.fastq.gz
curl -L https://osf.io/8rvh5/download -o ERR458494.fastq.gz
curl -L https://osf.io/2wvn3/download -o ERR458495.fastq.gz
curl -L https://osf.io/xju4a/download -o ERR458500.fastq.gz
curl -L https://osf.io/nmqe6/download -o ERR458501.fastq.gz
curl -L https://osf.io/qfsze/download -o ERR458502.fastq.gz

Let’s make sure we downloaded all of our data using md5sum. An md5sum hash is a fingerprint that lets you compare with the original file. Some sequencing facilities will provide a file containing md5sum hashes for the files being delivered to you. By comparing your md5sum with the original md5sum generated by the sequencing facility, you can make sure you have downloaded your files completely. If there was a disruption that occurred during the download process, not all the bytes of the file will be transferred properly and you will likely be missing some of your sequences.

md5sum *.fastq.gz

You should see this:

2b8c708cce1fd88e7ddecd51e5ae2154  ERR458493.fastq.gz
36072a519edad4fdc0aeaa67e9afc73b  ERR458494.fastq.gz
7a06e938a99d527f95bafee77c498549  ERR458495.fastq.gz
107aad97e33ef1370cb03e2b4bab9a52  ERR458500.fastq.gz
fe39ff194822b023c488884dbf99a236  ERR458501.fastq.gz
db614de9ed03a035d3d82e5fe2c9c5dc  ERR458502.fastq.gz

To check whether your md5sum hashes match with a file containing md5sum hashes:

md5sum *.fastq.gz > md5sum.txt
md5sum -c err_md5sum.txt

(First we’re creating a file containing md5sum of the files, then checking it. In reality, you wouldn’t be making this file to check. The facility would be creating it for you.)

You should see this:

ERR458493.fastq.gz: OK
ERR458494.fastq.gz: OK
ERR458495.fastq.gz: OK
ERR458500.fastq.gz: OK
ERR458501.fastq.gz: OK
ERR458502.fastq.gz: OK

Now if you type:

ls -l

you should see something like:

-rw-rw-r-- 1 titus titus  59532325 Jun 29 09:22 ERR458493.fastq.gz
-rw-rw-r-- 1 titus titus  58566854 Jun 29 09:22 ERR458494.fastq.gz
-rw-rw-r-- 1 titus titus  58114810 Jun 29 09:22 ERR458495.fastq.gz
-rw-rw-r-- 1 titus titus 102201086 Jun 29 09:22 ERR458500.fastq.gz
-rw-rw-r-- 1 titus titus 101222099 Jun 29 09:22 ERR458501.fastq.gz
-rw-rw-r-- 1 titus titus 100585843 Jun 29 09:22 ERR458502.fastq.gz

These are six data files from the yeast study. of the file.

One problem with these files is that they are writeable - by default, UNIX makes things writeable by the file owner. This poses an issue with creating typos or errors in raw data. Let’s fix that before we go on any further:

chmod a-w *

Take a look at their permissions now –

ls -l

and you should see that the ‘w’ in the original permission string (-rw-rw-r--) has been removed from each file and now it should look like -r--r--r--.

We’ll talk about what these files are below.

7.2.1. 1. Linking data to our working location

First, make a new working directory:

mkdir -p ~/quality
cd ~/quality

Now, we’re going to make a “link” to our quality-trimmed data in our current working directory:

ln -fs ~/data/* .

and you will see that they are now linked in the current directory when you do an ls. These links save us from having to specify the full path (address) to their location on the computer, without us needing to actually move or copy the files. But note that changing these files here still changes the original files!

These are FASTQ files – let’s take a look at them:

less ERR458493.fastq.gz

(use the spacebar to scroll down, and type ‘q’ to exit less)

Question:

  • where does the filename come from?

Links:

7.2.2. 2. FastQC

We’re going to use FastQC summarize the data. We already installed ‘fastqc’ above, with the conda command.

Now, run FastQC on two files:

fastqc ERR458493.fastq.gz
fastqc ERR458500.fastq.gz

Now type ‘ls’:

ls -d *fastqc.zip*

to list the files, and you should see:

ERR458493_fastqc.zip
ERR458500_fastqc.zip

Inside each of the fastqc directories you will find reports from the fastqc program. You can download these files using your RStudio Server console, if you like. (@CTB)

or you can look at these copies of them:

Questions:

  • What should you pay attention to in the FastQC report?

Links:

There are several caveats about FastQC - the main one is that it only calculates certain statistics (like duplicated sequences) for subsets of the data (e.g. duplicate sequences are only analyzed for the first 100,000 sequences in each file

7.2.3. 3. Trimmomatic

Now we’re going to do some trimming! We’ll be using Trimmomatic, which (as with fastqc) we’ve already installed via conda.

The first thing we’ll need are the adapters to trim off:

cp /opt/miniconda/pkgs/trimmomatic-*/share/trimmomatic-*/adapters/TruSeq2-PE.fa .

(you can look at the contents of this file with cat TruSeq2-PE.fa)

Now, to run Trimmomatic on both of them:

trimmomatic SE ERR458493.fastq.gz \
        ERR458493.qc.fq.gz \
        ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
        LEADING:2 TRAILING:2 \
        SLIDINGWINDOW:4:2 \
        MINLEN:25
        
trimmomatic SE ERR458500.fastq.gz \
        ERR458500.qc.fq.gz \
        ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
        LEADING:2 TRAILING:2 \
        SLIDINGWINDOW:4:2 \
        MINLEN:25
        

You should see output that looks like this:

...
Input Reads: 1093957 Surviving: 1092715 (99.89%) Dropped: 1242 (0.11%)
TrimmomaticSE: Completed successfully

We can also run the same process for all 6 samples more efficiently using a for loop, as follows:

for filename in *.fastq.gz
do
        # first, make the base by removing fastq.gz
        base=$(basename $filename .fastq.gz)
        echo $base

        trimmomatic SE ${base}.fastq.gz \
                ${base}.qc.fq.gz \
                ILLUMINACLIP:TruSeq2-PE.fa:2:40:15 \
                LEADING:2 TRAILING:2 \
                SLIDINGWINDOW:4:2 \
                MINLEN:25
done

This script will go through each for the filenames that end with fastq.gz and run Trimmomatic for it.

Questions:

  • How do you figure out what the parameters mean?
  • How do you figure out what parameters to use?
  • What adapters do you use?
  • What version of Trimmomatic are we using here? (And FastQC?)
  • Do you think parameters are different for RNAseq and genomic data sets?
  • What’s with these annoyingly long and complicated filenames?

For a discussion of optimal trimming strategies, see MacManes, 2014 – it’s about RNAseq but similar arguments should apply to metagenome assembly.

Links:

7.2.4. 4. FastQC again

Run FastQC again on the trimmed files:

fastqc ERR458493.qc.fq.gz
fastqc ERR458500.qc.fq.gz

And now view my copies of these files:

Let’s take a look at the output files:

less ERR458493.qc.fq.gz                                                       

(again, use spacebar to scroll, ‘q’ to exit less).

7.2.5. 5. MultiQc

MultiQC aggregates results across many samples into a single report for easy comparison.

Run Mulitqc on both the untrimmed and trimmed files

multiqc .

And now you should see output that looks like this:

[INFO   ]         multiqc : This is MultiQC v1.0
[INFO   ]         multiqc : Template    : default
[INFO   ]         multiqc : Searching '.'
Searching 15 files..  [####################################]  100%
[INFO   ]          fastqc : Found 4 reports
[INFO   ]         multiqc : Compressing plot data
[INFO   ]         multiqc : Report      : multiqc_report.html
[INFO   ]         multiqc : Data        : multiqc_data
[INFO   ]         multiqc : MultiQC complete

You can view the output html file multiqc_report.html by going to RStudio, selecting the file, and saying “view in Web browser.”

Questions:

  • is the quality trimmed data “better” than before?
  • Does it matter that you still have adapters!?