# Short read quality and trimming Learning objectives: - Install software (fastqc, multiqc, trimmomatic) via conda - download data - visualize read quality - quality filter and trim reads If you don't have one running currently, start up a Jetstream m1.medium or larger instance ([as detailed here](jetstream/boot.html#login-launch-instance)), and then `ssh` login through your terminal ([as shown here](jetstream/boot.html#ssh-secure-login)). --- You should now be logged into your Jetstream computer! You should see something like this: ``` (base) dibada@js-171-9:~$ ``` ## Getting started Change to your home directory: ``` cd ~/ ``` Let's make sure our conda channels are loaded, just in case. ``` conda config --add channels defaults conda config --add channels bioconda conda config --add channels conda-forge ``` and install [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/), [MultiQC](https://multiqc.info/), and [trimmomatic](http://www.usadellab.org/cms/?page=trimmomatic): ``` conda install -y -c bioconda fastqc multiqc trimmomatic ``` ## Data source Make a "data" directory: ``` cd ~/ mkdir data cd data ``` and download some data from the [Schurch et al, 2016](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4878611/) yeast RNAseq study. We will download three wildtype yeast datasets, and three mutant datasets (*SNF2* knockout; *SNF2* is a global transcription regulator). ``` curl -L https://osf.io/5daup/download -o ERR458493.fastq.gz curl -L https://osf.io/8rvh5/download -o ERR458494.fastq.gz curl -L https://osf.io/2wvn3/download -o ERR458495.fastq.gz curl -L https://osf.io/xju4a/download -o ERR458500.fastq.gz curl -L https://osf.io/nmqe6/download -o ERR458501.fastq.gz curl -L https://osf.io/qfsze/download -o ERR458502.fastq.gz curl -L https://osf.io/9c5jz/download -o err_md5sum.txt ``` We also downloaded a file holding some [md5sum](https://en.wikipedia.org/wiki/Md5sum) values for our sequence files ("err_md5sum.txt"). An md5sum hash is like a unique fingerprint for a file. If we compute the md5sum for the files we downloaded, and they don't match what is specified in the "err_md5sum.txt" file that came with them, then we would know if any of the files were corrupted or cut short while we were downloading them. If we look at what's in that file: ``` head err_md5sum.txt ``` We can see a value for each of our sequence files: ``` 2b8c708cce1fd88e7ddecd51e5ae2154 ERR458493.fastq.gz 36072a519edad4fdc0aeaa67e9afc73b ERR458494.fastq.gz 7a06e938a99d527f95bafee77c498549 ERR458495.fastq.gz 107aad97e33ef1370cb03e2b4bab9a52 ERR458500.fastq.gz fe39ff194822b023c488884dbf99a236 ERR458501.fastq.gz db614de9ed03a035d3d82e5fe2c9c5dc ERR458502.fastq.gz ``` Those are the md5sums they are supposed to have if they transferred properly. We can have the computer calculate md5sums for the files we have now and check that they match up like this: ``` md5sum -c err_md5sum.txt ``` And if all is well, it will print out "OK" for each of the files: ``` ERR458493.fastq.gz: OK ERR458494.fastq.gz: OK ERR458495.fastq.gz: OK ERR458500.fastq.gz: OK ERR458501.fastq.gz: OK ERR458502.fastq.gz: OK ``` Often the sequencing facility will provide an md5sum file with your data, or you may see them with large datasets in public repositories. If we wanted to create our own md5sum file to keep with our data, we could do so like this for one file: ``` md5sum ERR458493.fastq.gz > my_md5sum.txt ``` Or we could use the **`*`** wildcard to help specify all of our .fastq.gz files here (overwriting the previous file): ``` md5sum *.fastq.gz > my_md5sum.txt ``` Back to our data, if we type: ``` ls -l *.fastq.gz ``` We should see something like: ``` -rw-rw-r-- 1 dibada dibada 59532325 Jul 4 03:23 ERR458493.fastq.gz -rw-rw-r-- 1 dibada dibada 58566854 Jul 4 03:24 ERR458494.fastq.gz -rw-rw-r-- 1 dibada dibada 58114810 Jul 4 03:24 ERR458495.fastq.gz -rw-rw-r-- 1 dibada dibada 102201086 Jul 4 03:24 ERR458500.fastq.gz -rw-rw-r-- 1 dibada dibada 101222099 Jul 4 03:24 ERR458501.fastq.gz -rw-rw-r-- 1 dibada dibada 100585843 Jul 4 03:24 ERR458502.fastq.gz ``` These are six data files from the yeast study. One problem with these files is that they are writeable – which means we can edit or delete them. By default, UNIX makes things writeable by the file owner. This poses an issue with creating typos or errors in raw data. Let's fix that before we go on any further: ``` chmod a-w * ``` Here we are using the `chmod` command (for **ch**ange **mod**e, which handles permissions of files/directories), and providing the argument `a-w`, which means for all users (**`a`**), take away (**`-`**) the permission to write (**`w`**). If we take a look at their permissions now: ``` ls -l ``` We can see that the 'w' in the original permission string (`-rw-rw-r--`) has been removed from each file and now it should look like `-r--r--r--`. We'll talk about what these files are below. ### 1. Linking data to our working location First, make a new working directory: ``` mkdir ~/quality cd ~/quality ``` Now, we're going to make a "link" to our quality-trimmed data in our current working directory: ``` ln -fs ~/data/* . ``` and you will see that they are now linked in the current directory when you do an `ls`. These links save us from having to specify the full path (address) to their location on the computer, without us needing to actually move or copy the files. But note that changing these files here still changes the original files! These are FASTQ files -- let's take a look at them: ``` gunzip -c ERR458493.fastq.gz | head ``` FASTQ files are sequencing files that contain reads and quality information about each base-call within each read. A score of 0 is the lowest quality score, and a score of 40 is the highest. Instead of typing the scores as numbers, they are each encoded as a character. ``` Quality encoding: !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHI | | | | | Quality score: 0........10........20........30........40 ``` A score of 10 indicates a 1 in 10 chance that the base is inaccurate. A score of 20 is a 1 in 100 chance that the base is inaccurate. 30 is 1 in 1,000. And 40 in 1 in 10,000. Links: * [FASTQ Format](http://en.wikipedia.org/wiki/FASTQ_format) ### 2. FastQC We're going to use [FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) summarize the data. We already installed 'fastqc' above, with the conda command. Now, run FastQC on two files: ``` fastqc ERR458493.fastq.gz fastqc ERR458500.fastq.gz ``` Now let's use `ls`: ``` ls *fastqc.zip ``` to list the files, and you should see: ``` ERR458493_fastqc.zip ERR458500_fastqc.zip ``` Inside each of those zip files are directories holding reports from the fastqc program. But these are also nicely summarized in HTML files: ``` ls *.html ``` Let's transfer the HTML files to our local computers so we can open and view them in our browsers. We'll use the command `scp` to do this. `scp` stands for secure copy. We can use this command to transfer files between two computers. We need to run this command on our *local* computers (i.e. from your laptop). The `scp` command looks like this: ``` scp ``` First we will make a new directory on our computer to store the HTML files we’re transfering. Let’s put it on our desktop for now. Make sure the terminal program you used this morning to `ssh` onto your instance is open. If you're in it now, you can open a new tab you can use the pull down menu at the top of your screen or the `cmd + t` or maybe `ctrl + t` keyboard shortcut). Type: ``` mkdir ~/Desktop/fastqc_html ``` Now we can transfer our HTML files to our local computer using `scp`, which will look something like this: ``` scp username@ipaddress:~/quality/*.html ~/Desktop/fastqc_html/ ``` The first part of the command is the address for your remote computer. Make sure you replace everything after username@ with your IP address (the one you used to log in with the `ssh` command). The second part starts with a : and then gives the absolute path of the files you want to transfer from your remote computer. Don’t forget the `:`. We use a wildcard (\*.html) to indicate that we want all of the HTML files. The third part of the command gives the absolute path of the location you want to put the files. This is on your local computer and is the directory we just created `~/Desktop/fastqc_html`. You will be prompted for your password. Enter the password we set for your instance during the morning session. You should see a status output like this: ``` ERR458493_fastqc.html 100% 642KB 370.1KB/s 00:01 ERR458500_fastqc.html 100% 641KB 592.2KB/s 00:01 ``` Now we can go to our new directory and open the HTML files. You can also download and look at these copies of them: * [ERR458493_fastqc.html](http://htmlpreview.github.com/?https://github.com/ngs-docs/angus/blob/2018/_static/ERR458493_fastqc.html) * [ERR458500_fastqc.html](http://htmlpreview.github.com/?https://github.com/ngs-docs/angus/blob/2018/_static/ERR458500_fastqc.html) These files contain a lot of information! Depending on your application, some of these modules are more helpful than others. For example, for RNA-seq data, failing duplication levels aren't necessarily a problem -- we expect to have duplicated sequences given that some transcripts are very abundant. Questions: * What should you pay attention to in the FastQC report? * Is one sample better than another? Links: * [FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) * [FastQC tutorial video](http://www.youtube.com/watch?v=bz93ReOv87Y) There are some important things to know about FastQC - the main one is that it only calculates certain statistics (like duplicated sequences) for subsets of the data (e.g. duplicate sequences are only analyzed within the first 100,000 sequences in each file). You can read more about each module at [their docs site](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/). ### 3. Trimmomatic Now we're going to do some trimming! Let's switch back to our instance terminal, as we will be running these commands on our remote computers. We'll be using [Trimmomatic](http://www.usadellab.org/cms/?page=trimmomatic), which (as with fastqc) we've already installed via conda. The first thing we'll need is a file holding the adapters we need to trim off. These are artificial and added as part of the sequencing process. The trimmomatic program comes with standard adapter files, so here we are going to copy them over to our current working directory. ``` cp /opt/miniconda/pkgs/trimmomatic-*/share/trimmomatic-*/adapters/TruSeq2-SE.fa . ``` (you can look at the contents of this file with `cat TruSeq2-SE.fa`) Now, let's run Trimmomatic on a couple samples: ``` trimmomatic SE ERR458493.fastq.gz \ ERR458493.qc.fq.gz \ ILLUMINACLIP:TruSeq2-SE.fa:2:0:15 \ LEADING:2 TRAILING:2 \ SLIDINGWINDOW:4:2 \ MINLEN:25 trimmomatic SE ERR458500.fastq.gz \ ERR458500.qc.fq.gz \ ILLUMINACLIP:TruSeq2-SE.fa:2:0:15 \ LEADING:2 TRAILING:2 \ SLIDINGWINDOW:4:2 \ MINLEN:25 ``` > **CODE BREAKDOWN** > > Note that trimmomatic is actually a java program that we call from the command line, so it's syntax is a little different than what we've been using. The arguments are followed by a colon, and then what you are specifying to them. > * `SE` - this is specified because we are working with single-ended data, rather than paired-end data where there are forward and reverse reads > * `ERR458493.fastq.gz` - the first positional argument we are providing is the input fastq file (note it is okay to provide compressed .gz files to trimmomatic) > * `ERR458493.qc.fq.gz` - the second positional argument specifies the name of the output files the program will generate > * `ILLUMINACLIP:TruSeq2-SE.fa:2:0:15` - here we are specifying the argument we're addressing "ILLUMINACLIP", first specifying the file holding the adapter sequences, then 3 numbers: "2" which states how many mismatches between the adapter sequence and what's found are allowed; "0" which is only relevant when there are forward and reverse reads; and "15" which specifies how accurate the match must be > * `LEADING:2 TRAILING:2` - both of these are stating the minimum quality score at the start or end of the read, if lower, it will be trimmed off > * `SLIDINGWINDOW:4:2` - this looks at 4 basepairs at a time, and if the average quality of that window of 4 drops below 2, the read is trimmed there > * `MINLEN:25` - after all the above trimming steps, if the read is shorter than 25 bps, it will be discarded You should see output that looks like this: ``` ... Input Reads: 1093957 Surviving: 1092715 (99.89%) Dropped: 1242 (0.11%) TrimmomaticSE: Completed successfully ``` We can also run the same process for all 6 samples more efficiently using a `for` loop, as follows. We're going to use a new command called `basename` in the for loop, which we introduce below. #### Using basename in for loops `basename` is a function in UNIX that is helpful for removing a uniform part of a name from a list of files. In this case, we will use basename to remove the `.fastq.gz` extension from the files that we’ve been working with. ``` basename ERR458493.fastq.gz .fastq.gz ``` We see that this returns just the SRR accession, and no longer has the .fastq file extension on it. ``` ERR458493 ``` If we try the same thing but use `.fasta` as the file extension instead, nothing happens. This is because basename only works when it exactly matches a string in the file. ``` basename ERR458493.fastq.gz .fasta ``` ``` ERR458493.fastq.gz ``` Basename is really powerful when used in a for loop. It allows to access just the file prefix, which you can use to name things. Let's try this. Inside our for loop, we create a new name variable. We call the basename function inside the parenthesis, then give our variable name from the for loop, in this case `${filename}`, and finally state that `.fastq.gz` should be removed from the file name. It’s important to note that we’re not changing the actual files, we’re creating a new variable called name. The line > echo $name will print to the terminal the variable name each time the for loop runs. Because we are iterating over two files, we expect to see two lines of output. ``` for filename in *.fastq.gz do name=$(basename ${filename} .fastq.gz) echo ${name} done ``` A note on variables: We can either refer a variable as `$filename` or `${filename}`. They mean the same thing, but using `{}` explicitly tells bash when we are done refering to the variable. This makes it safer to refer to variables, especially when we're combining it with other text. This is different than `()`, which tells bash that we want to execute something. #### Trimming files using basename and for loops Now we will use `basename` inside of our for loop to run trimmomatic on all 6 of our files. ``` for filename in *.fastq.gz do # first, make the base by removing fastq.gz base=$(basename $filename .fastq.gz) echo $base trimmomatic SE ${base}.fastq.gz \ ${base}.qc.fq.gz \ ILLUMINACLIP:TruSeq2-SE.fa:2:0:15 \ LEADING:2 TRAILING:2 \ SLIDINGWINDOW:4:2 \ MINLEN:25 done ``` The `for` loop above will go through each for the filenames that end with `fastq.gz` and run Trimmomatic for it. Questions: * How do you figure out what the parameters mean? * How do you figure out what parameters to use? * What adapters do you use? * What version of Trimmomatic are we using here? (And FastQC?) * Do you think parameters are different for RNAseq and genomic data sets? * What's with these annoyingly long and complicated filenames? For a discussion of optimal trimming strategies, see [MacManes, 2014](http://journal.frontiersin.org/Journal/10.3389/fgene.2014.00013/abstract) -- it's about RNAseq but similar arguments should apply to metagenome assembly. Links: * [Trimmomatic](http://www.usadellab.org/cms/?page=trimmomatic) ### 4. FastQC again Let's take a look at the output files: ``` gunzip -c ERR458493.qc.fq.gz | head ``` It's hard to get a feel for how different these files are by just looking at them, so let's run FastQC again on the trimmed files: ``` fastqc ERR458493.qc.fq.gz fastqc ERR458500.qc.fq.gz ``` And now view my copies of these files: * [ERR458493.qc_fastqc.html](http://htmlpreview.github.com/?https://github.com/ngs-docs/angus/blob/2018/_static/ERR458493.qc_fastqc.html) * [ERR458500.qc_fastqc.html](http://htmlpreview.github.com/?https://github.com/ngs-docs/angus/blob/2018/_static/ERR458500.qc_fastqc.html) Alternatively, if there's time, you can use `scp` to view the files from your local computer. ### 5. MultiQc [MultiQC](http://multiqc.info/) is a great tool that aggregates multiple fastqc results into a single report for easy comparison. Run Mulitqc on both the untrimmed and trimmed files ``` multiqc . ``` And now you should see output that looks like this: ``` [INFO ] multiqc : This is MultiQC v1.0 [INFO ] multiqc : Template : default [INFO ] multiqc : Searching '.' Searching 15 files.. [####################################] 100% [INFO ] fastqc : Found 4 reports [INFO ] multiqc : Compressing plot data [INFO ] multiqc : Report : multiqc_report.html [INFO ] multiqc : Data : multiqc_data [INFO ] multiqc : MultiQC complete ``` You can view the output html file [multiqc_report.html](http://htmlpreview.github.com/?https://github.com/ngs-docs/angus/blob/2018/_static/multiqc_report.html) by going to RStudio, selecting the file, and saying "view in Web browser." Questions: * is the quality trimmed data "better" than before? * Does it matter that you still have adapters!?