# De novo transcriptome assembly

Learning objectives:

	* Learn what transcriptome assembly is
	* Learn to differentiate different types of assemblies
	* Discuss how do assemblers work
	* Learn to check the quality of a transcriptome assembly

What if you are working with a species with no existing reference genome or transcriptome...how do you construct a reference?

The answer is *de novo* assembly. The basic idea with *de novo* transcriptome assembly is you feed in your reads and you get out a bunch of *contigs* that represent transcripts, or stretches of RNA present in the reads that don't have any long repeats or much significant polymorphism. You run a  *de novo* transcriptome assembly program using the trimmed reads as input and get out a pile of assembled RNA.

Trinity, one of the leading *de novo* transcriptome assemblers, was developed at the [Broad Institute](http://www.broadinstitute.org/) and the [Hebrew University of Jerusalem](http://www.cs.huji.ac.il/). We will be losely following steps from the [Eel pond protocol](https://eel-pond.readthedocs.io/en/latest) for our guide to doing RNA-seq assembly.

## Download and trim the data

We will be using a set of *Nematostella vectensis* mRNAseq reads from [Tulin et al., 2013](https://evodevojournal.biomedcentral.com/articles/10.1186/2041-9139-4-16).

To avoid needing to go though the trimming steps as before, we'll download a snakefile to take care of these steps for us. If you look at this file, you may notice it is very similar to the one you generated during the [snakemake challenge](./snakemake_for_qc.md).


Make sure you have snakemake installed in your base environment, where we will be running the snakefile:
```
conda install -c conda-forge snakemake-minimal
```

Download the snakefile and a corresponding conda environment file:
```
cd ~ # cd to home directory
curl -L https://osf.io/nqh6p/download -o nema_trim.snakefile
curl -L https://osf.io/xs6k7/download -o trim-env.yml
```

Let's take a look at the environment file to see what software snakemake will be putting in the environment it creates:

```
less trim-env.yml
```

Run the snakefile to download and trim the *Nematostella* reads:
```
snakemake -s nema_trim.snakefile --use-conda --cores 6
```

Here, we run snakemake. We use the `-s` command to tell snakemake where it can find our snakefile, 
tell it to build and use the environment above using the `--use-conda` flag, and have it execute 
the workflow over 6 cores using `--cores 6`.

The trimmed data should now be within the `nema_trimmed` folder.


## Generate a _de novo_ assembly

The following commands will create a new folder `assembly` and link the trimmed data we prepared in our 
snakemake workflow into the `assembly` folder:

```
cd 
mkdir assembly
cd assembly

ln -fs ../nema_trimmed/*.qc.fq.gz .
ls
```

Next we will combine our all forward reads into a single file and all reverse reads into a single file.
Usually, you want to have many samples from a single individual that are combined, but that minimize 
polymorphism and improve assembly. See this 
[preprint](https://www.biorxiv.org/content/10.1101/035642v2) for best practices for care and 
feeding of your transcriptome.

Use the following command to combine all fq into 2 files (left.fq and right.fq). 

```
cat *_1.pe.qc.fq.gz *se.qc.fq.gz > left.fq.gz
cat *_2.pe.qc.fq.gz > right.fq.gz
```

_Note: this step just makes it easier for us to type out the trinity command. Trinity can accept comma-separated lists of files as inputs - check the trinity documentation for details._


## Run the assembler

First, we'll create and activate an environment where Trinity is installed:

```
conda create -y -n trinity-env trinity
conda activate trinity-env
```

Trinity works both with paired-end reads as well as single-end reads (including with both types of reads at the same time). In the general case, the paired-end files are defined as `--left left.fq` and `--right right.fq` respectively. Our single-end reads (orphans) have been concatenated onto the `left.fq` file. 


So let's run the assembler as follows:

```
time Trinity --seqType fq --max_memory 16G --CPU 6 --left left.fq.gz --right right.fq.gz --output nema_trinity
```

(This will take a few minutes)

The `time` command allows us to see how long Trinity takes to run, but is not a part of the Trinity command.

You should see something like:

```
** Harvesting all assembled transcripts into a single multi-fasta file...

Thursday, July 11, 2019: 18:21:18       CMD: find /home/dibada/assembly/nema_trinity/read_partitions/ -name '*inity.fasta'  | /opt/miniconda/envs/trinity-env/opt/trinity-2.8.5/util/support_scripts/partitioned_trinity_aggregator.pl --token_prefix TRINITY_DN --output_prefix /home/dibada/assembly/nema_trinity/Trinity.tmp
-relocating Trinity.tmp.fasta to /home/dibada/assembly/nema_trinity/Trinity.fasta
Thursday, July 11, 2019: 18:21:18       CMD: mv Trinity.tmp.fasta /home/dibada/assembly/nema_trinity/Trinity.fasta


###################################################################
Trinity assemblies are written to /home/dibada/assembly/nema_trinity/Trinity.fasta
###################################################################


Thursday, July 11, 2019: 18:21:18       CMD: /opt/miniconda/envs/trinity-env/opt/trinity-2.8.5/util/support_scripts/get_Trinity_gene_to_trans_map.pl /home/dibada/assembly/nema_trinity/Trinity.fasta > /home/dibada/assembly/nema_trinity/Trinity.fasta.gene_trans_map
```

at the end.


## Looking at the assembly

First, save the assembly:

```
cp nema_trinity/Trinity.fasta nema-trinity.fa
``` 
 
Now, look at the beginning:

```
head nema-trinity.fa
```
    
These are the transcripts! Yay!

Let's capture also some statistics of the Trinity assembly. Trinity provides a handy tool to do exactly that:

```
TrinityStats.pl nema-trinity.fa
```

The output should look something like the following:

```
################################
## Counts of transcripts, etc.
################################
Total trinity 'genes':  18
Total trinity transcripts:      46
Percent GC: 46.68

########################################
Stats based on ALL transcript contigs:
########################################

        Contig N10: 3631
        Contig N20: 2497
        Contig N30: 2145
        Contig N40: 2097
        Contig N50: 1977

        Median contig length: 1593
        Average contig: 1459.98
        Total assembled bases: 67159


#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

        Contig N10: 4447
        Contig N20: 3631
        Contig N30: 3631
        Contig N40: 2497
        Contig N50: 2151

        Median contig length: 1107
        Average contig: 1422.83
        Total assembled bases: 25611
```

This is a set of summary stats about your assembly. Are they good? Bad? How would you know?

Lastly, we'll deactivate the environment where Trinity is installed.

```
conda deactivate
```

## Suggestions for next steps 

After generating a *de novo* transcriptome assembly:
* [annotation](https://angus.readthedocs.io/en/2019/dammit_annotation.html)
* [evaluation](https://dibsi-rnaseq.readthedocs.io/en/latest/evaluation.html)