10. Transcriptome assembly - some basics¶

Learning objectives:

* Learn what is Transcriptome assembly?
* Different types of assemblies
* How do assemblers work?
* Checking the quality of assembly
* Understanding Transcriptome assembly

In variant calling, we mapped reads to a reference and looked systematically for differences.

But what if you don’t have a reference? How do you construct one?

The answer is de novo assembly, and the basic idea is you feed in your reads and you get out a bunch of contigs, that represent stretches of RNA present in the reads that don’t have any long repeats or much significant polymorphism. Like everything else, the basic idea is that you run a program, feed in the reads, and get out a pile of assembled RNA.

Trinity, developed at the Broad Institute and the [Hebrew University of Jerusalem](http://www.cs.huji.ac.il/, represents a novel method for the efficient and robust de novo reconstruction of transcriptomes from RNA-seq data. We will be using the eel-pond protocol for our guide to doing RNA-seq assembly.

10.1. Boot up a Jetstream¶

Boot an m1.medium Jetstream instance and log in.

10.2. Install the TRINITY assembler¶

The Trinity assembler can also install it through conda:

conda install -y trinity 

10.3. Change to a new working directory and link the original data¶

We will be using the same data as before (Schurch et al, 2016), so the following commands will create a new folder assembly and link the trimmed data we prepared earlier in the newly created folder:

cd ~/
mkdir -p assembly
cd assembly

ln -fs ~/quality/*.qc.fq.gz .
ls

10.4. Applying Digital Normalization¶

In this section, we’ll apply digital normalization and variable-coverage k-mer abundance trimming to the reads prior to assembly. This has the effect of reducing the computational cost of assembly without negatively affecting the quality of the assembly. Although the appropriate approach would be to use all 6 samples, for time consideration we will be using just the first one, i.e. ERR458493.qc.fq.gz

normalize-by-median.py --ksize 20 --cutoff 20 -M 10G --savegraph normC20k20.ct --force_single ERR458493.qc.fq.gz

This tools works mainly for paired-end reads, combined with a one file containing single-end reads. Given that all our samples are single end, we’ll use the --force_single flag to force all reads to be considered as single-end. The parameter --cutoff indicates that when the median k-mer coverage level is above this number the read is not kept. Also note the -M parameter. This specifies how much memory diginorm should use, and should be less than the total memory on the computer you’re using. (See choosing hash sizes for khmer for more information.

(This step should take about 2-3 minutes to complete)

10.5. Trim off likely erroneous k-mers¶

Now, run through all the reads and trim off low-abundance parts of high-coverage reads

filter-abund.py --threads 4 --variable-coverage --normalize-to 18 normC20k20.ct *.keep

The parameter --variable-coverage requests that only trim low-abundance k-mers from sequences that have high coverage. The parameter --normalize-to bases the variable-coverage cutoff on this median k-mer abundance.

(This step should take about 2-3 minutes to complete)

10.5.1. Run the assembler¶

Trinity works both with paired-end reads as well as single-end reads (including simultaneously both types at the same time). In the general case, the paired-end files are defined as --left left.fq and --right right.fq respectively. The single-end reads (a.k.a orphans) are defined by the flag --single.

First of all though, we need to make sure that there are no whitespaces in the header of the input fastq file. This is done using the following command:

cat ERR458493.qc.fq.gz.keep.abundfilt | tr -d ' ' > ERR458493.qc.fq.gz.keep.abundfilt.clean

So let’s run the assembler as follows:

time Trinity --seqType fq --max_memory 10G --CPU 4 --single ERR458493.qc.fq.gz.keep.abundfilt.clean --output yeast_trinity

(This will take about 20 minutes)

You should see something like:

** Harvesting all assembled transcripts into a single multi-fasta file...

Saturday, June 30, 2018: 16:42:08       CMD: find /home/tx160085/assembly/yeast_trinity/read_partitions/ -name '*inity.fasta'  | /opt/miniconda/opt/trinity-2.6.6/util/support_scripts /partitioned_trinity_aggregator.pl TRINITY_DN > /home/tx160085/assembly/yeast_trinity/Trinity.fasta.tmp 
-relocating /home/tx160085/assembly/yeast_trinity/Trinity.fasta.tmp to /home/tx160085/assembly/yeast_trinity/Trinity.fasta
Saturday, June 30, 2018: 16:42:08       CMD: mv /home/tx160085/assembly/yeast_trinity/Trinity.fasta.tmp /home/tx160085/assembly/yeast_trinity/Trinity.fasta

###################################################################
Trinity assemblies are written to /home/tx160085/assembly/yeast_trinity/Trinity.fasta
###################################################################

Saturday, June 30, 2018: 16:42:08       CMD: /opt/miniconda/opt/trinity-2.6.6/util/support_scripts/get_Trinity_gene_to_trans_map.pl /home/tx160085/assembly/yeast_trinity/Trinity.fasta > /home/tx160085/assembly/yeast_trinity/Trinity.fasta.gene_trans_map

at the end.

10.5.2. Looking at the assembly¶

First, save the assembly:

cp yeast_trinity/Trinity.fasta yeast-transcriptome-assembly.fa

Now, look at the beginning:

head yeast-transcriptome-assembly.fa

It’s RNA! Yay!

Let’s capture also some statistics of the Trinity assembly. Trinity provides a handy tool to do exactly that:

TrinityStats.pl yeast-transcriptome-assembly.fa

The output should look something like the following:

################################
## Counts of transcripts, etc.
################################
Total trinity 'genes':  3305
Total trinity transcripts:      3322
Percent GC: 42.05

########################################
Stats based on ALL transcript contigs:
########################################

        Contig N10: 1355
        Contig N20: 1016
        Contig N30: 781
        Contig N40: 617
        Contig N50: 502

        Median contig length: 319
        Average contig: 441.65
        Total assembled bases: 1467173

#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

        Contig N10: 1358
        Contig N20: 1016
        Contig N30: 781
        Contig N40: 617
        Contig N50: 500

        Median contig length: 319
        Average contig: 440.93
        Total assembled bases: 1457279

This is a set of summary stats about your assembly. Are they good? Bad? How would you know?

Contents

Previous topic

Next topic

This Page