23. De novo transcriptome assembly

Learning objectives:

* Learn what transcriptome assembly is
* Learn to differentiate different types of assemblies
* Discuss how do assemblers work
* Learn to check the quality of a transcriptome assembly

What if you are working with a species with no existing reference genome or transcriptome…how do you construct a reference?

The answer is de novo assembly. The basic idea with de novo transcriptome assembly is you feed in your reads and you get out a bunch of contigs that represent transcripts, or stretches of RNA present in the reads that don’t have any long repeats or much significant polymorphism. You run a de novo transcriptome assembly program using the trimmed reads as input and get out a pile of assembled RNA.

Trinity, one of the leading de novo transcriptome assemblers, was developed at the Broad Institute and the Hebrew University of Jerusalem. We will be losely following steps from the Eel pond protocol for our guide to doing RNA-seq assembly.

23.1. Download and trim the data

We will be using a set of Nematostella vectensis mRNAseq reads from Tulin et al., 2013.

To avoid needing to go though the trimming steps as before, we’ll download a snakefile to take care of these steps for us. If you look at this file, you may notice it is very similar to the one you generated during the snakemake challenge.

Make sure you have snakemake installed in your base environment, where we will be running the snakefile:

conda install -c conda-forge snakemake-minimal

Download the snakefile and a corresponding conda environment file:

cd ~ # cd to home directory
curl -L https://osf.io/nqh6p/download -o nema_trim.snakefile
curl -L https://osf.io/xs6k7/download -o trim-env.yml

Let’s take a look at the environment file to see what software snakemake will be putting in the environment it creates:

less trim-env.yml

Run the snakefile to download and trim the Nematostella reads:

snakemake -s nema_trim.snakefile --use-conda --cores 6

Here, we run snakemake. We use the -s command to tell snakemake where it can find our snakefile, tell it to build and use the environment above using the --use-conda flag, and have it execute the workflow over 6 cores using --cores 6.

The trimmed data should now be within the nema_trimmed folder.

23.2. Generate a de novo assembly

The following commands will create a new folder assembly and link the trimmed data we prepared in our snakemake workflow into the assembly folder:

cd 
mkdir assembly
cd assembly

ln -fs ../nema_trimmed/*.qc.fq.gz .
ls

Next we will combine our all forward reads into a single file and all reverse reads into a single file. Usually, you want to have many samples from a single individual that are combined, but that minimize polymorphism and improve assembly. See this preprint for best practices for care and feeding of your transcriptome.

Use the following command to combine all fq into 2 files (left.fq and right.fq).

cat *_1.pe.qc.fq.gz *se.qc.fq.gz > left.fq.gz
cat *_2.pe.qc.fq.gz > right.fq.gz

Note: this step just makes it easier for us to type out the trinity command. Trinity can accept comma-separated lists of files as inputs - check the trinity documentation for details.

23.3. Run the assembler

First, we’ll create and activate an environment where Trinity is installed:

conda create -y -n trinity-env trinity
conda activate trinity-env

Trinity works both with paired-end reads as well as single-end reads (including with both types of reads at the same time). In the general case, the paired-end files are defined as --left left.fq and --right right.fq respectively. Our single-end reads (orphans) have been concatenated onto the left.fq file.

So let’s run the assembler as follows:

time Trinity --seqType fq --max_memory 16G --CPU 6 --left left.fq.gz --right right.fq.gz --output nema_trinity

(This will take a few minutes)

The time command allows us to see how long Trinity takes to run, but is not a part of the Trinity command.

You should see something like:

** Harvesting all assembled transcripts into a single multi-fasta file...

Thursday, July 11, 2019: 18:21:18       CMD: find /home/dibada/assembly/nema_trinity/read_partitions/ -name '*inity.fasta'  | /opt/miniconda/envs/trinity-env/opt/trinity-2.8.5/util/support_scripts/partitioned_trinity_aggregator.pl --token_prefix TRINITY_DN --output_prefix /home/dibada/assembly/nema_trinity/Trinity.tmp
-relocating Trinity.tmp.fasta to /home/dibada/assembly/nema_trinity/Trinity.fasta
Thursday, July 11, 2019: 18:21:18       CMD: mv Trinity.tmp.fasta /home/dibada/assembly/nema_trinity/Trinity.fasta


###################################################################
Trinity assemblies are written to /home/dibada/assembly/nema_trinity/Trinity.fasta
###################################################################


Thursday, July 11, 2019: 18:21:18       CMD: /opt/miniconda/envs/trinity-env/opt/trinity-2.8.5/util/support_scripts/get_Trinity_gene_to_trans_map.pl /home/dibada/assembly/nema_trinity/Trinity.fasta > /home/dibada/assembly/nema_trinity/Trinity.fasta.gene_trans_map

at the end.

23.4. Looking at the assembly

First, save the assembly:

cp nema_trinity/Trinity.fasta nema-trinity.fa

Now, look at the beginning:

head nema-trinity.fa

These are the transcripts! Yay!

Let’s capture also some statistics of the Trinity assembly. Trinity provides a handy tool to do exactly that:

TrinityStats.pl nema-trinity.fa

The output should look something like the following:

################################
## Counts of transcripts, etc.
################################
Total trinity 'genes':  18
Total trinity transcripts:      46
Percent GC: 46.68

########################################
Stats based on ALL transcript contigs:
########################################

        Contig N10: 3631
        Contig N20: 2497
        Contig N30: 2145
        Contig N40: 2097
        Contig N50: 1977

        Median contig length: 1593
        Average contig: 1459.98
        Total assembled bases: 67159


#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

        Contig N10: 4447
        Contig N20: 3631
        Contig N30: 3631
        Contig N40: 2497
        Contig N50: 2151

        Median contig length: 1107
        Average contig: 1422.83
        Total assembled bases: 25611

This is a set of summary stats about your assembly. Are they good? Bad? How would you know?

Lastly, we’ll deactivate the environment where Trinity is installed.

conda deactivate

23.5. Suggestions for next steps

After generating a de novo transcriptome assembly: