Assembling E. coli sequences with Velvet

Warning

These documents are not maintained and their instructions may be out of date. However the GED Lab does maintain the khmer protocols which may cover similar topics. See also the installation instructions for the current version of the khmer project.

Install screed:

cd /usr/local/share
git clone https://github.com/ged-lab/screed.git
cd screed
python setup.py install

Install khmer:

cd /usr/local/share
git clone https://github.com/ged-lab/khmer.git
cd khmer
make

Grab the ngs-scripts:

git clone https://github.com/ngs-docs/ngs-scripts.git /root/ngs-scripts

and install the Velvet assembler:

cd /root
curl -O http://www.ebi.ac.uk/~zerbino/velvet/velvet_1.2.10.tgz
tar xzf velvet_1.2.10.tgz
cd velvet_1.2.10
make MAXKMERLENGTH=51
cp velvet? /usr/local/bin

Now, let’s create a working directory:

cd /mnt
mkdir assembly
cd assembly

Download some E. coli data. The first data set (ecoli_ref-5m-trim.fastq.gz) is the trimmed PE data sets from the other day (see Evaluating the quality of your short reads, and trimming them), and the second data set is a specially processed data set using digital normalization that will assemble quickly.

curl -O https://s3.amazonaws.com/public.ged.msu.edu/ecoli_ref-5m-trim.fastq.gz
curl -O https://s3.amazonaws.com/public.ged.msu.edu/ecoli-reads-5m-dn-paired.fa.gz

Now... assemble the small, fast data sets, using the Velvet assembler. Here we will set the required parameter k=31:

velveth ecoli.31 31 -shortPaired -fasta.gz ecoli-reads-5m-dn-paired.fa.gz
velvetg ecoli.31 -exp_cov auto

Check out the stats for the assembled contigs for a cutoff of 1000:

python /usr/local/share/khmer/sandbox/assemstats3.py 1000 ecoli.*/contigs.fa

Also try assembling with k=33 and k=35:

velveth ecoli.33 33 -shortPaired -fasta.gz ecoli-reads-5m-dn-paired.fa.gz
velvetg ecoli.33 -exp_cov auto

velveth ecoli.35 35 -shortPaired -fasta.gz ecoli-reads-5m-dn-paired.fa.gz
velvetg ecoli.35 -exp_cov auto

Now check out the stats for the assembled contigs for a cutoff of 1000:

python /usr/local/share/khmer/sandbox/assemstats3.py 1000 ecoli.*/contigs.fa

(Also read: What does k control in de Bruijn graph assemblers?.)

Over lunch, run an assembly of the larger read data set. (You might want to do this in screen.)

velveth ecoli-full.31 31 -short -fastq.gz ecoli_ref-5m-trim.fastq.gz
velvetg ecoli-full.31 -exp_cov auto

Comparing and evaluating assemblies - Mapping

Install bowtie:

cd /root

curl -O -L http://sourceforge.net/projects/bowtie-bio/files/bowtie/0.12.7/bowtie-0.12.7-linux-x86_64.zip
unzip bowtie-0.12.7-linux-x86_64.zip
cd bowtie-0.12.7
cp bowtie bowtie-build bowtie-inspect /usr/local/bin

Now, for each contigs file, extract the contigs over (say) 300 bases –

cd /mnt/assembly

python /usr/local/share/khmer/sandbox/extract-long-sequences.py 300 ecoli.31/contigs.fa > ecoli-31.fa
python /usr/local/share/khmer/sandbox/extract-long-sequences.py 300 ecoli.33/contigs.fa > ecoli-33.fa
python /usr/local/share/khmer/sandbox/extract-long-sequences.py 300 ecoli.35/contigs.fa > ecoli-35.fa

...and build the indices:

for i in 31 33 35; do
   bowtie-build ecoli-$i.fa ecoli-$i
done

Finally, map a subset of the reads to each subset:

gunzip -c ecoli_ref-5m-trim.fastq.gz | head -1200000 > 300k-reads.fq

for i in 31 33 35; do
   echo mapping to ecoli-$i
   bowtie -p 2 ecoli-$i -q 300k-reads.fq 300k-reads.x.$i.map
done

You can look at the statistics (# of reads aligned) as a proxy for “completeness” of the assembly –

mapping to ecoli-31
# reads processed: 300000
# reads with at least one reported alignment: 266468 (88.82%)
# reads that failed to align: 33532 (11.18%)
Reported 266468 alignments to 1 output stream(s)

mapping to ecoli-33
# reads processed: 300000
# reads with at least one reported alignment: 256539 (85.51%)
# reads that failed to align: 43461 (14.49%)
Reported 256539 alignments to 1 output stream(s)

mapping to ecoli-35
# reads processed: 300000
# reads with at least one reported alignment: 243645 (81.22%)
# reads that failed to align: 56355 (18.79%)
Reported 243645 alignments to 1 output stream(s)

generally you want this number to be as high as possible, because it means that you’ve got an assembly representing more of the reads.

Note that you can also ‘wc’ to count the number of lines in the map files:

wc -l *.map

from which you can generate the numbers above.

Comparing and evaluating assemblies – BLAST

Install BLAST:

cd /root

curl -O ftp://ftp.ncbi.nih.gov/blast/executables/release/2.2.24/blast-2.2.24-x64-linux.tar.gz
tar xzf blast-2.2.24-x64-linux.tar.gz
cp blast-2.2.24/bin/* /usr/local/bin
cp -r blast-2.2.24/data /usr/local/blast-data

Build BLAST databases for the assemblies you’ve done:

cd /mnt/assembly

for i in 31 33 35
do
   formatdb -i ecoli-$i.fa -o T -p F
done

Also download the true reference genome:

curl -O https://s3.amazonaws.com/public.ged.msu.edu/ecoliMG1655.fa.gz
gunzip ecoliMG1655.fa.gz
formatdb -i ecoliMG1655.fa -o T -p F

And now do a few pairwise BLASTs:

blastall -i ecoli-31.fa -d ecoliMG1655.fa -p blastn -e 1e-20 -o 31.x.mg.blastn &
blastall -d ecoli-31.fa -i ecoliMG1655.fa -p blastn -e 1e-20 -o mg.x.31.blastn &
blastall -i ecoli-31.fa -d ecoli-33.fa -p blastn -e 1e-20 -o 31.x.33.blastn &
blastall -d ecoli-31.fa -i ecoli-33.fa -p blastn -e 1e-20 -o 33.x.31.blastn &

Compute overlaps:

python /root/ngs-scripts/blast/calc-blast-cover.py ecoli-33.fa 31.x.33.blastn 300 ecoli-31.fa
python /root/ngs-scripts/blast/calc-blast-cover.py ecoli-31.fa 33.x.31.blastn 300 ecoli-33.fa

python /root/ngs-scripts/blast/calc-blast-cover.py ecoliMG1655.fa 31.x.mg.blastn 300 ecoli-31.fa
python /root/ngs-scripts/blast/calc-blast-cover.py ecoli-31.fa mg.x.31.blastn 300 ecoliMG1655.fa

You should see:

total bases in reference: 4588331
total ref bases covered : 4232257
fraction                : 0.92239574695
reference               : ecoli-33.fa
blast file              : 31.x.33.blastn
query sequences         : ecoli-31.fa

which tells you that 4.23 mb of the 4.58 mb (92.2% of the bases) in ecoli-33 have an alignment in ecoli-31. Similarly,

total bases in reference: 4606276
total ref bases covered : 4223094
fraction                : 0.916813061137
reference               : ecoli-31.fa
blast file              : 33.x.31.blastn
query sequences         : ecoli-33.fa

tells you that 91.6% (4.22/4.61 mb) of the bases in ecoli-31.fa have an alignment in ecoli-33.

The numbers for the reference genome are a bit strange, though –

total bases in reference: 4639675
total ref bases covered : 4409820
fraction                : 0.950458814464
reference               : ecoliMG1655.fa
blast file              : 31.x.mg.blastn
query sequences         : ecoli-31.fa

total bases in reference: 4606276
total ref bases covered : 896964
fraction                : 0.194726499237
reference               : ecoli-31.fa
blast file              : mg.x.31.blastn
query sequences         : ecoliMG1655.fa

What? So, 95% of the reference genome is covered by ecoli-31, but only 19.4% of ecoli-31.fa is covered by the E. coli genome? How is that possible??

What’s going on? Any ideas?

It turns out that, by default, BLAST only shows the first 200-250 matches for each query sequence! In the case where you have one query sequence (the E. coli genome) and many query contigs (ecoli-31.fa) there are several thousand hits; BLAST is only showing you the first few hundred.

To fix this, re-run BLAST with different -b and -v parameters, to increase the number of hits reported, and then re-run the calc-blast-cover command:

blastall -d ecoli-31.fa -i ecoliMG1655.fa -p blastn -e 1e-20 -o mg.x.31.blastn -b 5000 -v 5000
python /root/ngs-scripts/blast/calc-blast-cover.py ecoli-31.fa mg.x.31.blastn 300 ecoliMG1655.fa

and you should get this:

total bases in reference: 4606276
total ref bases covered : 4354561
fraction                : 0.945353904108
reference               : ecoli-31.fa
blast file              : mg.x.31.blastn
query sequences         : ecoliMG1655.fa

Much better!

Comparing and evaluating assemblies – contig size distributions

Go to IPython Notebook (https:// + machine name), and enter the code below in different cells. Use SHIFT-Enter to execute the Python code.

First, write a function to load in the lengths of sequences in an assembly:

import screed

def get_length_distribution(filename):
    x = []
    for record in screed.open(filename):
        x.append(len(record.sequence))
    return x

Then load the data from E. coli 31 and E. coli 33.

ecoli31 = get_length_distribution('/mnt/assembly/ecoli-31.fa')
ecoli33 = get_length_distribution('/mnt/assembly/ecoli-33.fa')

Now look at a histogram of the length distribution:

hist(ecoli31)

The problem with this graph is that it shows the number of contigs by bin, not the number of bases. Let’s fix:

def generate_cumulative_above(contig_sizes):
    sizes = list(reversed(sorted(contig_sizes)))

    sizesum = []
    sofar = 0
    for size in sizes:
        sofar += size
        sizesum.append((size, sofar))

    sizesum.sort()

    return numpy.array(sizesum)

Now, plot:

cum31 = generate_cumulative_above(ecoli31)
cum33 = generate_cumulative_above(ecoli33)

plot(cum31[:,0], cum31[:,1], label='k=31')
plot(cum33[:,0], cum33[:,1], label='k=33')
legend()

Suggested exercises

  1. Plot the above graphs for k=35 on the digitally normalized data, as well on the ecoli-full.31 data set.

  2. For the digitally normalized data, find the “best” k value. Why is it the best? :)

  3. Note that the ‘ecoli_ref-5m-trim.fastq.gz’ is interleaved but contains both pairs and orphans. Use:

    python /usr/local/share/khmer/sandbox/strip-and-split-for-assembly.py
    

    to split things up into a .pe and .se file, and then modify the velveth command to use ‘-shortPaired’ on the .pe file, and ‘-short’ on the .se file.

comments powered by Disqus