Short Read Quality Control

As useful as BLAST is, we really want to get into sequencing data, right? One of the first things you must do with your data is to evaluate its quality and try to improve it.

Summary: a sequencing run may contain data of low reliability. It may also contain various contaminants and artificial sequence fragments. Some (but not all) of these problems can be corrected.

Caution: Don’t apply any type of correction without evaluating the results it produces. In general it is best to be conservative with QC. We are altering the data based on our expectations of what it should be like! The process may also introduce its own biases into the dataset.

Biostar QoD (questions of the day)

QC is one of the most mis-under-estimated tasks of NGS data analysis. People assume there is very little to it once they know how to run the tool. The reality is more complicated than that.

QC also happens to be a pet peeve of mine (Istvan), as demonstrated in the following Biostar threads (and others):

FastQC quality control shootout

So What Does The Sequence Duplication Rate Really Mean In A Fastqc Report

Revisiting the FastQC read duplication report

Quick Start

The first part of this tutorial will run on your own computer. It assumes that you have Java installed. Download both FastQC and the two small datasets onto your system:

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

http://apollo.huck.psu.edu/data/SRR519926_1.fastq.zip

http://apollo.huck.psu.edu/data/SRR447649_1.fastq.zip

Run FastQC on each of the files from the graphical interface. Let’s discuss the output in more detail.

FastQC at the command line

Before you can do that, though, you need to install a bunch o’ software.

We will use a so-called Makefile, a simple text file that contains a series of commands that you could otherwise type in yourself. There will be more information on shell programming and automation later. For now, think of a Makefile as a simple way to pick which group of commands to run. Let’s get started. Install make:

sudo apt-get install make -y

You can also investigate the Makefile yourself: https://github.com/ngs-docs/angus/blob/2014/files/Makefile-short-read-quality
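To make the idea of a Makefile concrete, here is a minimal sketch. It is not the tutorial's Makefile: the file name demo.mk and the target name hello are made up for illustration, and the file is written to a temporary directory so that nothing in your working folder is touched.

```shell
# Work in a throwaway directory (hypothetical demo, not part of the tutorial data)
demo=$(mktemp -d)

# A Makefile is a list of named targets, each followed by the commands to run.
# Every command line must start with a TAB character (written as \t here).
printf 'hello:\n\t@echo "hello from make"\n' > "$demo/demo.mk"

# Run one target by name; -f selects a Makefile other than the default "Makefile"
make -f "$demo/demo.mk" hello
```

Typing make followed by a target name runs only the commands listed under that target.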

This tutorial will download datasets. You may want to create a directory (folder) that stores this data:

mkdir qc
cd qc

We assume that you are running the subsequent scripts from this folder.

Important

Obtain the Makefile and save it onto your cloud system:

# This is where you get the Makefile
wget https://raw.githubusercontent.com/ngs-docs/angus/2014/files/Makefile-short-read-quality -O Makefile

You can investigate the file:

# Look at the Makefile
# more Makefile or pico Makefile

So we now have a Makefile, and our system can execute it via the make command:

make

Setup

In our case you always have to specify which section (target) of the Makefile you wish to execute. For example, you can type:

make setup

This will execute the part of the Makefile that is listed below:

        #
        # Run initial setup tasks
        #

        # This directory will contain the executables
        mkdir -p ~/bin

        # Add the ~/bin to the user path
        echo 'export PATH=$PATH:~/bin' >> ~/.bashrc

        # Install the unzip library
        sudo apt-get install -y unzip

        # Change the mount point so it is user writeable
        sudo chmod 777 /mnt

        # Update the installation sources
        sudo apt-get update

        # Install java
        sudo apt-get install -y default-jdk

Note that you could also just type in these commands yourself for the same results. The Makefile just automates this.
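One side effect worth knowing about: the echo line above appends to ~/.bashrc every time the setup runs, so running it twice leaves two identical lines. A small sketch of an append-only-once helper follows. The function name add_once is hypothetical, and the example writes to a temporary file rather than your real ~/.bashrc.

```shell
# Stand-in for ~/.bashrc so the sketch does not touch real settings
rcfile=$(mktemp)
line='export PATH=$PATH:~/bin'

# Append a line to a file only if an identical line is not already there.
# -x matches whole lines, -F treats the text literally (no regex), -q is quiet.
add_once() {
    grep -qxF -- "$1" "$2" || printf '%s\n' "$1" >> "$2"
}

add_once "$line" "$rcfile"
add_once "$line" "$rcfile"   # the second call changes nothing

# Count how many times the line occurs in the file
grep -cxF -- "$line" "$rcfile"
```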

Software Install

The next step is installing FastQC and Trimmomatic on your instance:

make install

This command will execute the following lines:

        #
        # The src folder will contain the downloaded software
        #
        mkdir -p ~/src

        #
        # The bin folder will contain the binaries that you have downloaded
        #
        mkdir -p ~/bin

        #
        # Install Trimmomatic
        #
        curl http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.32.zip -o ~/src/Trimmomatic-0.32.zip
        unzip -o ~/src/Trimmomatic-0.32.zip -d ~/src
        ln -fs ~/src/Trimmomatic-0.32/trimmomatic-0.32.jar ~/bin

        #
        # Install FastQC
        #
        curl http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.2.zip -o ~/src/fastqc_v0.11.2.zip
        unzip -o ~/src/fastqc_v0.11.2.zip -d ~/src
        chmod +x ~/src/FastQC/fastqc
        ln -fs ~/src/FastQC/fastqc ~/bin/fastqc

Where did stuff go? The downloaded code went into ~/src, and the binaries are linked into ~/bin. To test that everything works, type:

fastqc -v

This will print the version number of FastQC.
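If fastqc -v reports "command not found", the usual cause is that ~/bin is not yet on your PATH (open a new shell or source ~/.bashrc). The sketch below checks this in one go. The helper name check_tool is made up for this example, and the list of tools is an assumption based on what the steps above install.

```shell
# Report whether a program can be found on the PATH
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found at $(command -v "$1")"
    else
        echo "$1: NOT found on PATH"
        return 1
    fi
}

missing=0
for tool in java unzip fastqc; do
    check_tool "$tool" || missing=1
done

# Trimmomatic is a jar file, not a program on the PATH, so test for the link itself
if [ -e ~/bin/trimmomatic-0.32.jar ]; then
    echo "trimmomatic jar: linked in ~/bin"
else
    echo "trimmomatic jar: missing from ~/bin"
fi
```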

Download Data

Start at your EC2 prompt, then type:

make download

This will execute the following lines. Remember that you could also type these in yourself.

        #
        # Data download
        #
        curl apollo.huck.psu.edu/data/SRR.tar.gz -o SRR.tar.gz

        # Unpack the gzipped data
        tar xzvf SRR.tar.gz
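Before unpacking an archive it can be useful to see what is inside it. The sketch below builds a tiny stand-in archive in a temporary directory so that it runs without a network connection. The file names are made up; on the real data you would run the last command on SRR.tar.gz.

```shell
# Build a small stand-in archive (hypothetical file names)
workdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n' > "$workdir/demo_1.fastq"
tar czf "$workdir/demo.tar.gz" -C "$workdir" demo_1.fastq

# t lists the contents instead of extracting them (x); z handles gzip; f names the file
tar tzf "$workdir/demo.tar.gz"
```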

The FASTQ Format

The format will be explained in class. See a good description at http://en.wikipedia.org/wiki/FASTQ_format

If you don’t understand the format, you don’t understand the basic premise of the data generation!
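As a quick illustration of the record structure, here is a sketch that uses two made-up reads. It assumes the common four-lines-per-record layout (name, sequence, separator, qualities); FASTQ files with wrapped sequence lines would need a real parser.

```shell
# Two made-up records written to a temporary directory
demo=$(mktemp -d)
cat > "$demo/demo.fastq" <<'EOF'
@read1
ACGTACGTAC
+
IIIIIIIIII
@read2
ACGTAC
+
IIII##
EOF

# Each record is 4 lines, so the number of reads is the line count divided by 4
awk 'END { print NR / 4 }' "$demo/demo.fastq"

# Sanity check: every quality string must be as long as its sequence
awk 'NR % 4 == 2 { seqlen = length($0) }
     NR % 4 == 0 && length($0) != seqlen { bad++ }
     END { print bad + 0, "records with mismatched lengths" }' "$demo/demo.fastq"
```

Each character on the quality line encodes the quality of the base at the same position in the sequence line, which is why the two lengths must agree.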

Run a FastQC analysis on each dataset:

make fastqc

This would run the commands:

        #
        # Run FastQC on every file that we have downloaded.
        #
        fastqc *.fastq

This command will generate an HTML file for each file. Copy these files to your Dropbox and look at them (a short walkthrough on what each plot means).

Alternatively, you can write the FastQC output directly to your Dropbox like so (the output directory must already exist; FastQC will not create it):

fastqc *.fastq -o /mnt/Dropbox

Pattern Matching

We can also investigate what the files contain by matching:

# Find start codons
grep ATG SRR519926_1.fastq  --color=always | head

# Find a subsequence
grep AGATCGGAAG SRR519926_1.fastq  --color=always | head

Pattern matching via regular expressions is an extremely powerful concept. We’ll revisit it later.
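The grep commands above search every line of the file, including read names and quality strings. A sketch of a slightly stricter count follows: it keeps only the sequence lines first and then counts how many contain the pattern. The three reads are made up, and the four-lines-per-record layout is assumed.

```shell
# Three made-up reads; two of them contain the pattern AGATCGGAAG
demo=$(mktemp -d)
cat > "$demo/demo_adapter.fastq" <<'EOF'
@read1
ACGTACGTAGATCGGAAGAG
+
IIIIIIIIIIIIIIIIIIII
@read2
ACGTACGTACGTACGTACGT
+
IIIIIIIIIIIIIIIIIIII
@read3
AGATCGGAAGAGCACACGTC
+
IIIIIIIIIIIIIIIIIIII
EOF

# Line 2 of every 4-line record is the sequence; grep -c counts matching lines
awk 'NR % 4 == 2' "$demo/demo_adapter.fastq" | grep -c AGATCGGAAG
```

Comparing this count with the total number of reads gives a rough idea of how common the pattern is.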

Trimming

Note that there is a very large number of tools that perform quality/adapter trimming.

Now, run Trimmomatic to eliminate Illumina adapters from your sequences. First, link the adapter sequences that ship with Trimmomatic into the current directory:

ln -s ~/src/Trimmomatic-0.32/adapters/TruSeq3-SE.fa

Tip

You may also want to shorten the command line like so:

alias trim='java -jar ~/src/Trimmomatic-0.32/trimmomatic-0.32.jar'

You can now invoke the tool just by typing:

trim
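Aliases are normally only expanded in interactive shells, so the alias above will not work inside a script. A shell function is one alternative that works in both places. This is a sketch: the variable name TRIMMOMATIC_JAR is made up here, and the path is the one used elsewhere in this tutorial.

```shell
# Location of the jar as installed by "make install" (hypothetical variable name)
TRIMMOMATIC_JAR=~/src/Trimmomatic-0.32/trimmomatic-0.32.jar

# "$@" passes every argument through to Trimmomatic unchanged
trim() {
    java -jar "$TRIMMOMATIC_JAR" "$@"
}
```

If you already defined the alias in the same session, run unalias trim before defining the function, otherwise bash expands the alias inside the definition and reports a syntax error.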

Among the (many) agonizing decisions that you will have to make is what parameters to pick: how big should the window be, how long should the reads be, what should the average quality be, and what kinds of contaminants do you have? Run, rerun and evaluate. Err on the side of caution.
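To get a feel for what the window size and quality threshold mean, the sketch below computes the mean quality of every window along one made-up quality string. It is a simplified illustration and not Trimmomatic's actual implementation. It assumes Phred+33 encoding (character code minus 33 equals the quality score); the function name window_means and its defaults (window of 4, threshold of 25) are chosen for this example only.

```shell
# Usage: window_means QUALITY_STRING [WINDOW_SIZE] [MIN_MEAN_QUALITY]
window_means() {
    printf '%s\n' "$1" | awk -v w="${2:-4}" -v minq="${3:-25}" '
    # Build a lookup table from each printable character to its ASCII code
    BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }
    {
        n = length($0)
        for (s = 1; s + w - 1 <= n; s++) {
            sum = 0
            # Phred+33: quality score = ASCII code - 33
            for (j = s; j < s + w; j++) sum += ord[substr($0, j, 1)] - 33
            mean = sum / w
            flag = (mean < minq) ? "  <- below threshold" : ""
            printf "bases %d-%d: mean Q %.1f%s\n", s, s + w - 1, mean, flag
        }
    }'
}

# Eight high-quality bases (I = Q40) followed by four poor ones (# = Q2)
window_means 'IIIIIIII####'
```

In this example the window covering bases 7-10 is the first whose mean drops below 25. Rerunning with a different window size or threshold shows how sensitive the cut point is to those two choices.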

Trim by quality alone:

        #
        # Run the quality trimming.
        #
        java -jar ~/src/Trimmomatic-0.32/trimmomatic-0.32.jar SE SRR447649_1.fastq good.fq SLIDINGWINDOW:4:25 MINLEN:36
        fastqc good.fq -o /mnt/Dropbox
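Besides looking at the FastQC report, it helps to check what the trimming did to the read lengths. The sketch below tabulates read lengths and counts reads shorter than a cutoff. The function names and the generated demo file are made up for illustration; on real data you would run them on SRR447649_1.fastq and good.fq and compare.

```shell
# Generate three made-up reads of length 40, 40 and 20 in a temporary directory
demo=$(mktemp -d)
awk 'BEGIN {
    split("40 40 20", len, " ")
    for (r = 1; r <= 3; r++) {
        seq = ""; qual = ""
        for (i = 1; i <= len[r]; i++) { seq = seq "A"; qual = qual "I" }
        printf "@read%d\n%s\n+\n%s\n", r, seq, qual
    }
}' > "$demo/demo_len.fastq"

# Print "length count" pairs, shortest first (4-line records assumed)
length_hist() {
    awk 'NR % 4 == 2 { n[length($0)]++ }
         END { for (l in n) print l, n[l] }' "$1" | sort -n
}

# Count reads shorter than a minimum length (compare with MINLEN)
count_shorter() {
    awk -v min="$2" 'NR % 4 == 2 && length($0) < min { n++ }
                     END { print n + 0 }' "$1"
}

length_hist "$demo/demo_len.fastq"
count_shorter "$demo/demo_len.fastq" 36
```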

Quality and clipping:

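        #
        # Run quality trimming and clipping at the same time.
        # Trimmomatic applies the steps in the order given on the command line,
        # so here ILLUMINACLIP runs after SLIDINGWINDOW and MINLEN.
        # In PE mode good1.fq/good2.fq receive reads whose mate also survived;
        # bad1.fq/bad2.fq receive reads whose mate was dropped.
        #
        java -jar ~/src/Trimmomatic-0.32/trimmomatic-0.32.jar PE SRR519926_1.fastq good1.fq bad1.fq SRR519926_2.fastq  good2.fq bad2.fq SLIDINGWINDOW:4:25 MINLEN:36 ILLUMINACLIP:TruSeq3-SE.fa:2:30:10
        fastqc good1.fq -o /mnt/Dropbox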

Now for some homework:

Note

Read the manual for Trimmomatic. Trim the reads in parallel for both read files in a sample.

Note

BTW: cool kids have pretty prompts, but you too can be cool, all you need to do is:

echo "export PS1='\[\e]0;\w\a\]\n\[\e[32m\]\u@\h \[\e[33m\]\w\[\e[0m\]\n\$ '" >> ~/.bashrc

Then relog. Don’t ask why this works, it is one of those things that is best left undisturbed.

LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- fork @ github. Presentations (PPT/PDF) and PDFs are the property of their respective owners and are under the terms indicated within the presentation.