# Short read quality and trimming

Learning objectives:
- Install software (fastqc, multiqc, trimmomatic) via conda
- download data
- visualize read quality
- quality filter and trim reads


If you don't have one running currently, start up a Jetstream m1.medium or larger instance ([as detailed here](jetstream/boot.html#login-launch-instance)), and then `ssh` login through your terminal ([as shown here](jetstream/boot.html#ssh-secure-login)).

---

You should now be logged into your Jetstream computer! You should see
something like this:

```
(base) dibada@js-171-9:~$ 
```

## Getting started

Change to your home directory:

```
cd ~/
```

Let's make sure our conda channels are loaded, just in case.

```
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
```

and install [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/), [MultiQC](https://multiqc.info/), and [trimmomatic](http://www.usadellab.org/cms/?page=trimmomatic):

```
conda install -y -c bioconda fastqc multiqc trimmomatic
```


## Data source

Make a "data" directory:

```
cd ~/
mkdir data
cd data
```

and download some data from the
[Schurch et al, 2016](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4878611/)
yeast RNAseq study. We will download three wildtype yeast datasets, and
three mutant datasets (*SNF2* knockout; *SNF2* is a global transcription
regulator). 

```
curl -L https://osf.io/5daup/download -o ERR458493.fastq.gz
curl -L https://osf.io/8rvh5/download -o ERR458494.fastq.gz
curl -L https://osf.io/2wvn3/download -o ERR458495.fastq.gz
curl -L https://osf.io/xju4a/download -o ERR458500.fastq.gz
curl -L https://osf.io/nmqe6/download -o ERR458501.fastq.gz
curl -L https://osf.io/qfsze/download -o ERR458502.fastq.gz
curl -L https://osf.io/9c5jz/download -o err_md5sum.txt
```

We also downloaded a file holding some [md5sum](https://en.wikipedia.org/wiki/Md5sum) values for our sequence files ("err_md5sum.txt"). An md5sum hash is like a unique fingerprint for a file. If we compute the md5sum for the files we downloaded, and they don't match what is specified in the "err_md5sum.txt" file that came with them, then we would know if any of the files were corrupted or cut short while we were downloading them.

If we look at what's in that file:

```
head err_md5sum.txt
```

We can see a value for each of our sequence files:

```
2b8c708cce1fd88e7ddecd51e5ae2154  ERR458493.fastq.gz
36072a519edad4fdc0aeaa67e9afc73b  ERR458494.fastq.gz
7a06e938a99d527f95bafee77c498549  ERR458495.fastq.gz
107aad97e33ef1370cb03e2b4bab9a52  ERR458500.fastq.gz
fe39ff194822b023c488884dbf99a236  ERR458501.fastq.gz
db614de9ed03a035d3d82e5fe2c9c5dc  ERR458502.fastq.gz
```

Those are the md5sums they are supposed to have if they transferred properly. We can have the computer calculate md5sums for the files we have now and check that they match up like this:

```
md5sum -c err_md5sum.txt
```

And if all is well, it will print out "OK" for each of the files:

```
ERR458493.fastq.gz: OK
ERR458494.fastq.gz: OK
ERR458495.fastq.gz: OK
ERR458500.fastq.gz: OK
ERR458501.fastq.gz: OK
ERR458502.fastq.gz: OK
```

Often the sequencing facility will provide an md5sum file with your data, or you may see them with large datasets in public repositories. 

If we wanted to create our own md5sum file to keep with our data, we could do so like this for one file:

```
md5sum ERR458493.fastq.gz > my_md5sum.txt
```

Or we could use the **`*`** wildcard to help specify all of our .fastq.gz files here (overwriting the previous file):

```
md5sum *.fastq.gz > my_md5sum.txt
```

Back to our data, if we type:

```
ls -l *.fastq.gz
```

We should see something like:

```
-rw-rw-r-- 1 dibada dibada  59532325 Jul  4 03:23 ERR458493.fastq.gz
-rw-rw-r-- 1 dibada dibada  58566854 Jul  4 03:24 ERR458494.fastq.gz
-rw-rw-r-- 1 dibada dibada  58114810 Jul  4 03:24 ERR458495.fastq.gz
-rw-rw-r-- 1 dibada dibada 102201086 Jul  4 03:24 ERR458500.fastq.gz
-rw-rw-r-- 1 dibada dibada 101222099 Jul  4 03:24 ERR458501.fastq.gz
-rw-rw-r-- 1 dibada dibada 100585843 Jul  4 03:24 ERR458502.fastq.gz
```

These are six data files from the yeast study.

One problem with these files is that they are writeable – which means we can edit or delete them. By default,
UNIX makes things writeable by the file owner. This poses an issue
with creating typos or errors in raw data. Let's fix that before we
go on any further:

```
chmod a-w *
```

Here we are using the `chmod` command (for **ch**ange **mod**e, which handles permissions of files/directories), and providing the argument `a-w`, which means for all users (**`a`**), take away (**`-`**) the permission to write (**`w`**). 

If we take a look at their permissions now:
 
```
ls -l
```

We can see that the 'w' in the original permission string
(`-rw-rw-r--`) has been removed from each file and now it should look like 
`-r--r--r--`.

We'll talk about what these files are below.

### 1. Linking data to our working location

First, make a new working directory:
```
mkdir ~/quality
cd ~/quality
```

Now, we're going to make a "link" to our quality-trimmed data in our current working directory:

```
ln -fs ~/data/* .
```

and you will see that they are now linked in the current directory when you do an
`ls`. These links save us from having to specify the full path (address) to their location on the computer, without us needing to actually move or copy the files. But note that changing these files here still changes the original files!

These are FASTQ files -- let's take a look at them:

```
gunzip -c ERR458493.fastq.gz | head
```

FASTQ files are sequencing files that contain reads and quality information about each base-call within each read. A score 
of 0 is the lowest quality score, and a score of 40 is the highest. Instead
of typing the scores as numbers, they are each encoded as a character.   

```
Quality encoding: !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHI
                  |         |         |         |         |
Quality score:    0........10........20........30........40  
``` 

A score of 10 indicates a 1 in 10 chance that the base is inaccurate. A score of 20 is a 1 in 100 chance that the base is inaccurate. 30 is 1 in 1,000. And 40 in 1 in 10,000. 

Links:

* [FASTQ Format](http://en.wikipedia.org/wiki/FASTQ_format)

### 2. FastQC


We're going to use 
[FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
summarize the data. We already installed 'fastqc' above, with the
conda command.

Now, run FastQC on two files:

```
fastqc ERR458493.fastq.gz
fastqc ERR458500.fastq.gz
```

Now let's use `ls`:

```
ls *fastqc.zip
```

to list the files, and you should see:

```
ERR458493_fastqc.zip
ERR458500_fastqc.zip
```

Inside each of those zip files are directories holding reports from the fastqc program. But these are also nicely summarized in HTML files:

```
ls *.html
```

Let's transfer the HTML files to our local computers so we can open and view
them in our browsers. We'll use the command `scp` to do this. 

`scp` stands for secure copy. We can use this command to transfer files between
two computers. We need to run this command on our *local* computers (i.e. from
your laptop).

The `scp` command looks like this:

```
scp <file I want to move> <where I want to move it>
```

First we will make a new directory on our computer to store the HTML files we’re
transfering. Let’s put it on our desktop for now. Make sure the terminal program
you used this morning to `ssh` onto your instance is open. If you're in it now,
you can open a new tab you can use the pull down menu at the top of your 
screen or the `cmd + t` or maybe `ctrl + t` keyboard shortcut). Type:

```
mkdir ~/Desktop/fastqc_html 
```

Now we can transfer our HTML files to our local computer using `scp`, which will look something like this:

```
scp username@ipaddress:~/quality/*.html ~/Desktop/fastqc_html/
```

The first part of the command is the address for your remote 
computer. Make sure you replace everything after username@ with your IP address (the one you used to log in with the `ssh` command).

The second part starts with a : and then gives the absolute path of the files 
you want to transfer from your remote computer. Don’t forget the `:`. We use a 
wildcard (\*.html) to indicate that we want all of the HTML files.

The third part of the command gives the absolute path of the location you want 
to put the files. This is on your local computer and is the directory we just 
created `~/Desktop/fastqc_html`.

You will be prompted for your password. Enter the password we set for your instance during the morning session.

You should see a status output like this:

```
ERR458493_fastqc.html                      100%  642KB 370.1KB/s   00:01    
ERR458500_fastqc.html                      100%  641KB 592.2KB/s   00:01  
```

Now we can go to our new directory and open the HTML files.

You can also download and look at these copies of them:

* [ERR458493_fastqc.html](http://htmlpreview.github.com/?https://github.com/ngs-docs/angus/blob/2018/_static/ERR458493_fastqc.html)
* [ERR458500_fastqc.html](http://htmlpreview.github.com/?https://github.com/ngs-docs/angus/blob/2018/_static/ERR458500_fastqc.html)

These files contain a lot of information! Depending on your application, some of these modules are more helpful than others. For example, for RNA-seq data, 
failing duplication levels aren't necessarily a problem -- we expect to have
duplicated sequences given that some transcripts are very abundant.

Questions:

* What should you pay attention to in the FastQC report?
* Is one sample better than another?

Links:

* [FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
* [FastQC tutorial video](http://www.youtube.com/watch?v=bz93ReOv87Y)

There are some important things to know about FastQC - the main one is that it only calculates certain statistics (like duplicated sequences) for subsets of the data (e.g. duplicate sequences are only analyzed within the first 100,000 sequences in each file). You can read more about each module at [their docs site](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/).


### 3. Trimmomatic

Now we're going to do some trimming!  Let's switch back to our instance
terminal, as we will be running these commands on our remote computers.
We'll be using
[Trimmomatic](http://www.usadellab.org/cms/?page=trimmomatic), which
(as with fastqc) we've already installed via conda.

The first thing we'll need is a file holding the adapters we need to trim off. These are artificial and added as part of the sequencing process. The trimmomatic program comes with standard adapter files, so here we are going to copy them over to our current working directory.

```
cp /opt/miniconda/pkgs/trimmomatic-*/share/trimmomatic-*/adapters/TruSeq2-SE.fa .
```

(you can look at the contents of this file with `cat TruSeq2-SE.fa`)

Now, let's run Trimmomatic on a couple samples:

```
trimmomatic SE ERR458493.fastq.gz \
        ERR458493.qc.fq.gz \
        ILLUMINACLIP:TruSeq2-SE.fa:2:0:15 \
        LEADING:2 TRAILING:2 \
        SLIDINGWINDOW:4:2 \
        MINLEN:25
        
trimmomatic SE ERR458500.fastq.gz \
        ERR458500.qc.fq.gz \
        ILLUMINACLIP:TruSeq2-SE.fa:2:0:15 \
        LEADING:2 TRAILING:2 \
        SLIDINGWINDOW:4:2 \
        MINLEN:25
        
```

> **CODE BREAKDOWN**
> 
> Note that trimmomatic is actually a java program that we call from the command line, so it's syntax is a little different than what we've been using. The arguments are followed by a colon, and then what you are specifying to them. 
> * `SE` - this is specified because we are working with single-ended data, rather than paired-end data where there are forward and reverse reads
> * `ERR458493.fastq.gz` - the first positional argument we are providing is the input fastq file (note it is okay to provide compressed .gz files to trimmomatic)
> * `ERR458493.qc.fq.gz` - the second positional argument specifies the name of the output files the program will generate
> * `ILLUMINACLIP:TruSeq2-SE.fa:2:0:15` - here we are specifying the argument we're addressing "ILLUMINACLIP", first specifying the file holding the adapter sequences, then 3 numbers: "2" which states how many mismatches between the adapter sequence and what's found are allowed; "0" which is only relevant when there are forward and reverse reads; and "15" which specifies how accurate the match must be
> * `LEADING:2 TRAILING:2` - both of these are stating the minimum quality score at the start or end of the read, if lower, it will be trimmed off
> * `SLIDINGWINDOW:4:2` - this looks at 4 basepairs at a time, and if the average quality of that window of 4 drops below 2, the read is trimmed there
> * `MINLEN:25` - after all the above trimming steps, if the read is shorter than 25 bps, it will be discarded


You should see output that looks like this:

```
...
Input Reads: 1093957 Surviving: 1092715 (99.89%) Dropped: 1242 (0.11%)
TrimmomaticSE: Completed successfully
```


We can also run the same process for all 6 samples more efficiently using a `for` loop, as follows. We're going to use a 
new command called `basename` in the for loop, which we introduce below.

#### Using basename in for loops

`basename` is a function in UNIX that is helpful for removing a uniform part of a name from a list of files. 
In this case, we will use basename to remove the `.fastq.gz` extension from the files that we’ve been working with. 

```
basename ERR458493.fastq.gz .fastq.gz
```

We see that this returns just the SRR accession, and no longer has the .fastq file extension on it.

```
ERR458493
```

If we try the same thing but use `.fasta` as the file extension instead, nothing happens. 
This is because basename only works when it exactly matches a string in the file.

```
basename ERR458493.fastq.gz .fasta
```

```
ERR458493.fastq.gz
```

Basename is really powerful when used in a for loop. It allows to access just the file prefix, which you 
can use to name things. Let's try this.

Inside our for loop, we create a new name variable. We call the basename function inside the parenthesis, 
then give our variable name from the for loop, in this case `${filename}`, and finally state that `.fastq.gz` 
should be removed from the file name. It’s important to note that we’re not changing the actual files, we’re 
creating a new variable called name. The line > echo $name will print to the terminal the variable name each 
time the for loop runs. Because we are iterating over two files, we expect to see two lines of output.

```
for filename in *.fastq.gz
do
  name=$(basename ${filename} .fastq.gz)
  echo ${name}
done
```

A note on variables: We can either refer a variable as `$filename` or `${filename}`. They mean the same thing,
but using `{}` explicitly tells bash when we are done refering to the variable. This makes it safer to refer
to variables, especially when we're combining it with other text. This is different than `()`, which tells bash
that we want to execute something. 

#### Trimming files using basename and for loops

Now we will use `basename` inside of our for loop to run trimmomatic on all 6 of our files. 

```
for filename in *.fastq.gz
do
        # first, make the base by removing fastq.gz
        base=$(basename $filename .fastq.gz)
        echo $base

        trimmomatic SE ${base}.fastq.gz \
                ${base}.qc.fq.gz \
                ILLUMINACLIP:TruSeq2-SE.fa:2:0:15 \
                LEADING:2 TRAILING:2 \
                SLIDINGWINDOW:4:2 \
                MINLEN:25
done
```

The `for` loop above will go through each for the filenames that end with `fastq.gz` and 
run Trimmomatic for it.

Questions:

* How do you figure out what the parameters mean?
* How do you figure out what parameters to use?
* What adapters do you use?
* What version of Trimmomatic are we using here? (And FastQC?)
* Do you think parameters are different for RNAseq and genomic data sets?
* What's with these annoyingly long and complicated filenames?

For a discussion of optimal trimming strategies, see 
[MacManes, 2014](http://journal.frontiersin.org/Journal/10.3389/fgene.2014.00013/abstract) -- it's about RNAseq but similar arguments should apply to metagenome
assembly.

Links:

* [Trimmomatic](http://www.usadellab.org/cms/?page=trimmomatic)

### 4. FastQC again

Let's take a look at the output files:

```
gunzip -c ERR458493.qc.fq.gz | head                                                        
```

It's hard to get a feel for how different these files are by just looking at 
them, so let's run FastQC again on the trimmed files:

```
fastqc ERR458493.qc.fq.gz
fastqc ERR458500.qc.fq.gz
```

And now view my copies of these files: 

* [ERR458493.qc_fastqc.html](http://htmlpreview.github.com/?https://github.com/ngs-docs/angus/blob/2018/_static/ERR458493.qc_fastqc.html)
* [ERR458500.qc_fastqc.html](http://htmlpreview.github.com/?https://github.com/ngs-docs/angus/blob/2018/_static/ERR458500.qc_fastqc.html)

Alternatively, if there's time, you can use `scp` to view the files from your 
local computer. 

### 5. MultiQc
[MultiQC](http://multiqc.info/) is a great tool that aggregates multiple fastqc results into a single report for easy comparison.

Run Mulitqc on both the untrimmed and trimmed files

```
multiqc .
```

And now you should see output that looks like this:

```
[INFO   ]         multiqc : This is MultiQC v1.0
[INFO   ]         multiqc : Template    : default
[INFO   ]         multiqc : Searching '.'
Searching 15 files..  [####################################]  100%
[INFO   ]          fastqc : Found 4 reports
[INFO   ]         multiqc : Compressing plot data
[INFO   ]         multiqc : Report      : multiqc_report.html
[INFO   ]         multiqc : Data        : multiqc_data
[INFO   ]         multiqc : MultiQC complete
```

You can view the output html file
[multiqc_report.html](http://htmlpreview.github.com/?https://github.com/ngs-docs/angus/blob/2018/_static/multiqc_report.html) by going to RStudio, selecting the file, and saying "view in Web browser."

Questions:

* is the quality trimmed data "better" than before?
* Does it matter that you still have adapters!?