# Transcriptome assembly - some basics

Learning objectives:

	* Learn what is Transcriptome assembly?
	* Different types of assemblies
	* How do assemblers work?
	* Checking the quality of assembly
	* Understanding Transcriptome assembly
	
In [variant calling](http://angus.readthedocs.io/en/2018/mapping-variant-calling.html), we mapped reads to a reference and looked systematically for differences.

But what if you don't have a reference? How do you construct one?

The answer is *de novo* assembly, and the basic idea is you feed in your reads and you get out a bunch of *contigs*, that represent stretches of RNA present in the reads that don't have any long repeats or much significant polymorphism.  Like everything else, the basic idea is that you run a program, feed in the reads, and get out a pile of assembled RNA.

Trinity, developed at the [Broad Institute](http://www.broadinstitute.org/) and the [Hebrew University of Jerusalem](http://www.cs.huji.ac.il/, represents a novel method for the efficient and robust de novo reconstruction of transcriptomes from RNA-seq data. We will be using the [eel-pond protocol](https://eel-pond.readthedocs.io/en/latest) for our guide to doing RNA-seq assembly.

## Boot up a Jetstream

[Boot an m1.medium Jetstream instance](jetstream/boot.md) and log in.

## Install the TRINITY assembler

The [Trinity assembler](https://www.ncbi.nlm.nih.gov/pubmed/21572440) can also install it through `conda`:

```
conda install -y trinity 
```

## Change to a new working directory and link the original data

We will be using the same data as before ([Schurch et al, 2016](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4878611/)), so the following commands will create a new folder `assembly` and link the trimmed data we prepared earlier in the newly created folder:

```
cd ~/
mkdir -p assembly
cd assembly

ln -fs ~/quality/*.qc.fq.gz .
ls
```

## Applying Digital Normalization

In this section, we’ll apply [digital normalization](http://arxiv.org/abs/1203.4802) and [variable-coverage k-mer abundance trimming](https://peerj.com/preprints/890/) to the reads prior to assembly. This has the effect of reducing the computational cost of assembly [without negatively affecting the quality of the assembly](https://peerj.com/preprints/505/). Although the appropriate approach would be to use all 6 samples, for time consideration we will be using just the first one, i.e. ERR458493.qc.fq.gz

```
normalize-by-median.py --ksize 20 --cutoff 20 -M 10G --savegraph normC20k20.ct --force_single ERR458493.qc.fq.gz
```

This tools works mainly for paired-end reads, combined with a one file containing single-end reads. Given that all our samples are single end, we'll use the `--force_single` flag to force all reads to be considered as single-end. The parameter `--cutoff` indicates that when the median k-mer coverage level is above this number the read is not kept. Also note the `-M` parameter. This specifies how much memory diginorm should use, and should be less than the total memory on the computer you’re using. (See [choosing hash sizes for khmer for more information](http://khmer.readthedocs.io/en/v2.1.1/user/choosing-table-sizes.html).

(This step should take about 2-3 minutes to complete)


## Trim off likely erroneous k-mers

Now, run through all the reads and trim off low-abundance parts of high-coverage reads

```
filter-abund.py --threads 4 --variable-coverage --normalize-to 18 normC20k20.ct *.keep
```

The parameter `--variable-coverage` requests that only trim low-abundance k-mers from sequences that have high coverage. The parameter `--normalize-to` bases the variable-coverage cutoff on this median k-mer abundance.

(This step should take about 2-3 minutes to complete)

### Run the assembler


Trinity works both with paired-end reads as well as single-end reads (including simultaneously both types at the same time). In the general case, the paired-end files are defined as `--left left.fq` and `--right right.fq` respectively. The single-end reads (a.k.a _orphans_) are defined by the flag `--single`. 

First of all though, we need to make sure that there are no whitespaces in the header of the input fastq file. This is done using the following command:

```
cat ERR458493.qc.fq.gz.keep.abundfilt | tr -d ' ' > ERR458493.qc.fq.gz.keep.abundfilt.clean
```

So let's run the assembler as follows:

```
time Trinity --seqType fq --max_memory 10G --CPU 4 --single ERR458493.qc.fq.gz.keep.abundfilt.clean --output yeast_trinity
```

(This will take about 20 minutes)

You should see something like:

```
** Harvesting all assembled transcripts into a single multi-fasta file...

Saturday, June 30, 2018: 16:42:08       CMD: find /home/tx160085/assembly/yeast_trinity/read_partitions/ -name '*inity.fasta'  | /opt/miniconda/opt/trinity-2.6.6/util/support_scripts /partitioned_trinity_aggregator.pl TRINITY_DN > /home/tx160085/assembly/yeast_trinity/Trinity.fasta.tmp 
-relocating /home/tx160085/assembly/yeast_trinity/Trinity.fasta.tmp to /home/tx160085/assembly/yeast_trinity/Trinity.fasta
Saturday, June 30, 2018: 16:42:08       CMD: mv /home/tx160085/assembly/yeast_trinity/Trinity.fasta.tmp /home/tx160085/assembly/yeast_trinity/Trinity.fasta

###################################################################
Trinity assemblies are written to /home/tx160085/assembly/yeast_trinity/Trinity.fasta
###################################################################

Saturday, June 30, 2018: 16:42:08       CMD: /opt/miniconda/opt/trinity-2.6.6/util/support_scripts/get_Trinity_gene_to_trans_map.pl /home/tx160085/assembly/yeast_trinity/Trinity.fasta > /home/tx160085/assembly/yeast_trinity/Trinity.fasta.gene_trans_map
```

at the end.



### Looking at the assembly

First, save the assembly:

```
cp yeast_trinity/Trinity.fasta yeast-transcriptome-assembly.fa
``` 
 
Now, look at the beginning:

```
head yeast-transcriptome-assembly.fa
```
    
It's RNA! Yay!

Let's capture also some statistics of the Trinity assembly. Trinity provides a handy tool to do exactly that:

```
TrinityStats.pl yeast-transcriptome-assembly.fa
```

The output should look something like the following:

```
################################
## Counts of transcripts, etc.
################################
Total trinity 'genes':  3305
Total trinity transcripts:      3322
Percent GC: 42.05

########################################
Stats based on ALL transcript contigs:
########################################

        Contig N10: 1355
        Contig N20: 1016
        Contig N30: 781
        Contig N40: 617
        Contig N50: 502

        Median contig length: 319
        Average contig: 441.65
        Total assembled bases: 1467173

#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

        Contig N10: 1358
        Contig N20: 1016
        Contig N30: 781
        Contig N40: 617
        Contig N50: 500

        Median contig length: 319
        Average contig: 440.93
        Total assembled bases: 1457279
```

This is a set of summary stats about your assembly. Are they good? Bad? How would you know?