Transcriptome assembly - some basics

Learning objectives:

* Learn what transcriptome assembly is
* Learn to differentiate different types of assemblies
* Discuss how do assemblers work
* Learn to check the quality of a transcriptome assembly

In variant calling, we mapped reads to a reference and looked systematically for differences.

But what if you don’t have a reference? How do you construct one?

The answer is de novo assembly, and the basic idea is you feed in your reads and you get out a bunch of contigs, that represent stretches of RNA present in the reads that don’t have any long repeats or much significant polymorphism. Like everything else, the basic idea is that you run a program, feed in the reads, and get out a pile of assembled RNA.

Trinity, developed at the Broad Institute and the [Hebrew University of Jerusalem](http://www.cs.huji.ac.il/, represents a novel method for the efficient and robust de novo reconstruction of transcriptomes from RNA-seq data. We will be using the eel-pond protocol, which uses trinity, as our guide to doing RNA-seq assembly.

Boot up a Jetstream

Boot an m1.medium Jetstream instance and log in.

Install the TRINITY assembler in a conda environment

Create and environment and install the Trinity assembler through conda:

conda create -y -n trinity-env trinity khmer

where:

  • -y says to install the pkgs without double checking with me
  • -n names the environment “trinity-env”

Now, let’s activate that environment:

conda activate trinity-env

Make sure you’ve got the data

We will be using the same data as before (Schurch et al, 2016), so the following commands will create a new folder assembly and link the trimmed data we prepared earlier in the newly created folder:

cd ~
ls -l quality/*qc* .

You should see the following:

-rw-rw-r-- 1 dibada dibada  59465832 Jul 11 15:45 quality/ERR458493.qc.fq.gz
-rw-rw-r-- 1 dibada dibada  58503534 Jul 11 15:46 quality/ERR458494.qc.fq.gz
-rw-rw-r-- 1 dibada dibada  58054460 Jul 11 15:46 quality/ERR458495.qc.fq.gz
-rw-rw-r-- 1 dibada dibada 102164315 Jul 11 15:44 quality/ERR458500.qc.fq.gz
-rw-rw-r-- 1 dibada dibada 101187090 Jul 11 15:44 quality/ERR458501.qc.fq.gz
-rw-rw-r-- 1 dibada dibada 100550095 Jul 11 15:45 quality/ERR458502.qc.fq.gz

If you don’t see the data, you can download it using the following commands:

cd ~
mkdir -p quality
cd quality
curl -L https://osf.io/wfz34/download -o ERR458493.qc.fq.gz
curl -L https://osf.io/jxh4d/download -o ERR458494.qc.fq.gz
curl -L https://osf.io/zx7n3/download -o ERR458495.qc.fq.gz
curl -L https://osf.io/96mrj/download -o ERR458500.qc.fq.gz
curl -L https://osf.io/wc8yn/download -o ERR458501.qc.fq.gz
curl -L https://osf.io/sdtz3/download -o ERR458502.qc.fq.gz

Applying Digital Normalization

In this section, we’ll apply digital normalization and variable-coverage k-mer abundance trimming to the reads prior to assembly. This has the effect of reducing the computational cost of assembly without negatively affecting the quality of the assembly. Although the appropriate approach would be to use all 6 samples, for time consideration we will be using just the first one, i.e. ERR458493.qc.fq.gz

normalize-by-median.py --ksize 20 --cutoff 20 -M 10G --savegraph normC20k20.ct --force_single ERR458493.qc.fq.gz

This tools works mainly for paired-end reads, combined with a one file containing single-end reads. Given that all our samples are single end, we’ll use the --force_single flag to force all reads to be considered as single-end. The parameter --cutoff indicates that when the median k-mer coverage level is above this number the read is not kept. Also note the -M parameter. This specifies how much memory diginorm should use, and should be less than the total memory on the computer you’re using. (See choosing hash sizes for khmer for more information.

(This step should take about 2-3 minutes to complete)

Trim off likely erroneous k-mers

Now, run through all the reads and trim off low-abundance parts of high-coverage reads

filter-abund.py --threads 6 --variable-coverage --normalize-to 18 normC20k20.ct *.keep

The parameter --variable-coverage requests that only trim low-abundance k-mers from sequences that have high coverage. The parameter --normalize-to bases the variable-coverage cutoff on this median k-mer abundance.

(This step should take about 2-3 minutes to complete)

Run the assembler

Trinity works both with paired-end reads as well as single-end reads (including simultaneously both types at the same time). In the general case, the paired-end files are defined as --left left.fq and --right right.fq respectively. If assembling from single-end reads (as in this case), we use the flag --single.

First of all though, we need to make sure that there are no whitespaces in the header of the input fastq file. This is done using the following command:

cat ERR458493.qc.fq.gz.keep.abundfilt | tr -d ' ' > ERR458493.qc.fq.gz.keep.abundfilt.clean

So let’s run the assembler as follows:

time Trinity --seqType fq --max_memory 10G --CPU 4 --single ERR458493.qc.fq.gz.keep.abundfilt.clean --output yeast_trinity

(This will take about 20 minutes)

You should see something like:

All commands completed successfully. :-)

** Harvesting all assembled transcripts into a single multi-fasta file...

Thursday, July 11, 2019: 17:15:47	CMD: find /home/dibada/assembly/yeast_trinity/read_partitions/ -name '*inity.fasta'  | /opt/miniconda/envs/trinity-env/opt/trinity-2.8.5/util/support_scripts/partitioned_trinity_aggregator.pl --token_prefix TRINITY_DN --output_prefix /home/dibada/assembly/yeast_trinity/Trinity.tmp
-relocating Trinity.tmp.fasta to /home/dibada/assembly/yeast_trinity/Trinity.fasta
Thursday, July 11, 2019: 17:15:47	CMD: mv Trinity.tmp.fasta /home/dibada/assembly/yeast_trinity/Trinity.fasta

###################################################################
Trinity assemblies are written to /home/dibada/assembly/yeast_trinity/Trinity.fasta
###################################################################

Thursday, July 11, 2019: 17:15:47	CMD: /opt/miniconda/envs/trinity-env/opt/trinity-2.8.5/util/support_scripts/get_Trinity_gene_to_trans_map.pl /home/dibada/assembly/yeast_trinity/Trinity.fasta > /home/dibada/assembly/yeast_trinity/Trinity.fasta.gene_trans_map

at the end.

Looking at the assembly

First, save the assembly:

cp yeast_trinity/Trinity.fasta yeast-transcriptome-assembly.fa

Now, look at the beginning:

head yeast-transcriptome-assembly.fa

It’s RNA! Yay!

Let’s capture also some statistics of the Trinity assembly. Trinity provides a handy tool to do exactly that:

TrinityStats.pl yeast-transcriptome-assembly.fa

The output should look something like the following:


################################
## Counts of transcripts, etc.
################################
Total trinity 'genes':	3296
Total trinity transcripts:	3320
Percent GC: 42.02

########################################
Stats based on ALL transcript contigs:
########################################

	Contig N10: 1420
	Contig N20: 1069
	Contig N30: 802
	Contig N40: 634
	Contig N50: 514

	Median contig length: 321
	Average contig: 448.98
	Total assembled bases: 1490618

#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

	Contig N10: 1385
	Contig N20: 1021
	Contig N30: 787
	Contig N40: 618
	Contig N50: 502

	Median contig length: 319
	Average contig: 442.56
	Total assembled bases: 1458676

This is a set of summary stats about your assembly. Are they good? Bad? How would you know?