========================================
Assembling E. coli sequences with Velvet
========================================
.. Warning:: These documents are not maintained and their instructions may be
out of date. However the GED Lab does maintain the `khmer protocols
`__ which may cover similar
topics. See also the `installation instructions for the current version
of the khmer project `__.
Install screed::
cd /usr/local/share
git clone https://github.com/ged-lab/screed.git
cd screed
python setup.py install
Install khmer::
cd /usr/local/share
git clone https://github.com/ged-lab/khmer.git
cd khmer
make
Grab the ngs-scripts::
git clone https://github.com/ngs-docs/ngs-scripts.git /root/ngs-scripts
and install the Velvet assembler::
cd /root
curl -O http://www.ebi.ac.uk/~zerbino/velvet/velvet_1.2.10.tgz
tar xzf velvet_1.2.10.tgz
cd velvet_1.2.10
make MAXKMERLENGTH=51
cp velvet? /usr/local/bin
Now, let's create a working directory::
cd /mnt
mkdir assembly
cd assembly
Download some E. coli data. The first data set
(ecoli_ref-5m-trim.fastq.gz) is the trimmed PE data sets from the
other day (see :doc:`short-read-quality-evaluation`), and the second
data set is a specially processed data set using `digital
normalization `__ that will
assemble quickly. ::
curl -O https://s3.amazonaws.com/public.ged.msu.edu/ecoli_ref-5m-trim.fastq.gz
curl -O https://s3.amazonaws.com/public.ged.msu.edu/ecoli-reads-5m-dn-paired.fa.gz
Now... assemble the small, fast data sets, using the Velvet assembler. Here
we will set the required parameter k=31::
velveth ecoli.31 31 -shortPaired -fasta.gz ecoli-reads-5m-dn-paired.fa.gz
velvetg ecoli.31 -exp_cov auto
Check out the stats for the assembled contigs for a cutoff of 1000::
python /usr/local/share/khmer/sandbox/assemstats3.py 1000 ecoli.*/contigs.fa
Also try assembling with k=33 and k=35::
velveth ecoli.33 33 -shortPaired -fasta.gz ecoli-reads-5m-dn-paired.fa.gz
velvetg ecoli.33 -exp_cov auto
velveth ecoli.35 35 -shortPaired -fasta.gz ecoli-reads-5m-dn-paired.fa.gz
velvetg ecoli.35 -exp_cov auto
Now check out the stats for the assembled contigs for a cutoff of 1000::
python /usr/local/share/khmer/sandbox/assemstats3.py 1000 ecoli.*/contigs.fa
(Also read: `What does k control in de Bruijn graph assemblers? `__.)
Over lunch, run an assembly of the larger read data set. (You might want to
do this in screen.) ::
velveth ecoli-full.31 31 -short -fastq.gz ecoli_ref-5m-trim.fastq.gz
velvetg ecoli-full.31 -exp_cov auto
Comparing and evaluating assemblies - Mapping
=============================================
Install bowtie::
cd /root
curl -O -L http://sourceforge.net/projects/bowtie-bio/files/bowtie/0.12.7/bowtie-0.12.7-linux-x86_64.zip
unzip bowtie-0.12.7-linux-x86_64.zip
cd bowtie-0.12.7
cp bowtie bowtie-build bowtie-inspect /usr/local/bin
Now, for each contigs file, extract the contigs over (say) 300 bases -- ::
cd /mnt/assembly
python /usr/local/share/khmer/sandbox/extract-long-sequences.py 300 ecoli.31/contigs.fa > ecoli-31.fa
python /usr/local/share/khmer/sandbox/extract-long-sequences.py 300 ecoli.33/contigs.fa > ecoli-33.fa
python /usr/local/share/khmer/sandbox/extract-long-sequences.py 300 ecoli.35/contigs.fa > ecoli-35.fa
...and build the indices::
for i in 31 33 35; do
bowtie-build ecoli-$i.fa ecoli-$i
done
Finally, map a subset of the reads to each subset::
gunzip -c ecoli_ref-5m-trim.fastq.gz | head -1200000 > 300k-reads.fq
for i in 31 33 35; do
echo mapping to ecoli-$i
bowtie -p 2 ecoli-$i -q 300k-reads.fq 300k-reads.x.$i.map
done
You can look at the statistics (# of reads aligned) as a proxy for
"completeness" of the assembly -- ::
mapping to ecoli-31
# reads processed: 300000
# reads with at least one reported alignment: 266468 (88.82%)
# reads that failed to align: 33532 (11.18%)
Reported 266468 alignments to 1 output stream(s)
mapping to ecoli-33
# reads processed: 300000
# reads with at least one reported alignment: 256539 (85.51%)
# reads that failed to align: 43461 (14.49%)
Reported 256539 alignments to 1 output stream(s)
mapping to ecoli-35
# reads processed: 300000
# reads with at least one reported alignment: 243645 (81.22%)
# reads that failed to align: 56355 (18.79%)
Reported 243645 alignments to 1 output stream(s)
generally you want this number to be as high as possible, because it
means that you've got an assembly representing more of the reads.
Note that you can also 'wc' to count the number of lines in the map files::
wc -l *.map
from which you can generate the numbers above.
Comparing and evaluating assemblies -- BLAST
============================================
Install BLAST::
cd /root
curl -O ftp://ftp.ncbi.nih.gov/blast/executables/release/2.2.24/blast-2.2.24-x64-linux.tar.gz
tar xzf blast-2.2.24-x64-linux.tar.gz
cp blast-2.2.24/bin/* /usr/local/bin
cp -r blast-2.2.24/data /usr/local/blast-data
Build BLAST databases for the assemblies you've done::
cd /mnt/assembly
for i in 31 33 35
do
formatdb -i ecoli-$i.fa -o T -p F
done
Also download the true reference genome::
curl -O https://s3.amazonaws.com/public.ged.msu.edu/ecoliMG1655.fa.gz
gunzip ecoliMG1655.fa.gz
formatdb -i ecoliMG1655.fa -o T -p F
And now do a few pairwise BLASTs::
blastall -i ecoli-31.fa -d ecoliMG1655.fa -p blastn -e 1e-20 -o 31.x.mg.blastn &
blastall -d ecoli-31.fa -i ecoliMG1655.fa -p blastn -e 1e-20 -o mg.x.31.blastn &
blastall -i ecoli-31.fa -d ecoli-33.fa -p blastn -e 1e-20 -o 31.x.33.blastn &
blastall -d ecoli-31.fa -i ecoli-33.fa -p blastn -e 1e-20 -o 33.x.31.blastn &
Compute overlaps::
python /root/ngs-scripts/blast/calc-blast-cover.py ecoli-33.fa 31.x.33.blastn 300 ecoli-31.fa
python /root/ngs-scripts/blast/calc-blast-cover.py ecoli-31.fa 33.x.31.blastn 300 ecoli-33.fa
python /root/ngs-scripts/blast/calc-blast-cover.py ecoliMG1655.fa 31.x.mg.blastn 300 ecoli-31.fa
python /root/ngs-scripts/blast/calc-blast-cover.py ecoli-31.fa mg.x.31.blastn 300 ecoliMG1655.fa
You should see::
total bases in reference: 4588331
total ref bases covered : 4232257
fraction : 0.92239574695
reference : ecoli-33.fa
blast file : 31.x.33.blastn
query sequences : ecoli-31.fa
which tells you that 4.23 mb of the 4.58 mb (92.2% of the bases) in
ecoli-33 have an alignment in ecoli-31. Similarly, ::
total bases in reference: 4606276
total ref bases covered : 4223094
fraction : 0.916813061137
reference : ecoli-31.fa
blast file : 33.x.31.blastn
query sequences : ecoli-33.fa
tells you that 91.6% (4.22/4.61 mb) of the bases in ecoli-31.fa have an
alignment in ecoli-33.
The numbers for the reference genome are a bit strange, though -- ::
total bases in reference: 4639675
total ref bases covered : 4409820
fraction : 0.950458814464
reference : ecoliMG1655.fa
blast file : 31.x.mg.blastn
query sequences : ecoli-31.fa
total bases in reference: 4606276
total ref bases covered : 896964
fraction : 0.194726499237
reference : ecoli-31.fa
blast file : mg.x.31.blastn
query sequences : ecoliMG1655.fa
What? So, 95% of the reference genome is covered by ecoli-31, but
only 19.4% of ecoli-31.fa is covered by the E. coli genome? How is
that possible??
What's going on? Any ideas?
It turns out that, by default, BLAST only shows the first 200-250 matches
for each query sequence! In the case where you have one query
sequence (the E. coli genome) and many query contigs (ecoli-31.fa)
there are several thousand hits; BLAST is only showing you the first
few hundred.
To fix this, re-run BLAST with different -b and -v parameters, to
increase the number of hits reported, and then re-run the
calc-blast-cover command::
blastall -d ecoli-31.fa -i ecoliMG1655.fa -p blastn -e 1e-20 -o mg.x.31.blastn -b 5000 -v 5000
python /root/ngs-scripts/blast/calc-blast-cover.py ecoli-31.fa mg.x.31.blastn 300 ecoliMG1655.fa
and you should get this::
total bases in reference: 4606276
total ref bases covered : 4354561
fraction : 0.945353904108
reference : ecoli-31.fa
blast file : mg.x.31.blastn
query sequences : ecoliMG1655.fa
Much better!
Comparing and evaluating assemblies -- contig size distributions
================================================================
Go to IPython Notebook (https:// + machine name), and enter the code
below in different cells. Use SHIFT-Enter to execute the Python code.
First, write a function to load in the lengths of sequences in an
assembly::
import screed
def get_length_distribution(filename):
x = []
for record in screed.open(filename):
x.append(len(record.sequence))
return x
Then load the data from E. coli 31 and E. coli 33. ::
ecoli31 = get_length_distribution('/mnt/assembly/ecoli-31.fa')
ecoli33 = get_length_distribution('/mnt/assembly/ecoli-33.fa')
Now look at a histogram of the length distribution::
hist(ecoli31)
The problem with this graph is that it shows the *number of contigs*
by bin, not the *number of bases*. Let's fix::
def generate_cumulative_above(contig_sizes):
sizes = list(reversed(sorted(contig_sizes)))
sizesum = []
sofar = 0
for size in sizes:
sofar += size
sizesum.append((size, sofar))
sizesum.sort()
return numpy.array(sizesum)
Now, plot::
cum31 = generate_cumulative_above(ecoli31)
cum33 = generate_cumulative_above(ecoli33)
plot(cum31[:,0], cum31[:,1], label='k=31')
plot(cum33[:,0], cum33[:,1], label='k=33')
legend()
.. @SGA
Suggested exercises
-------------------
1. Plot the above graphs for k=35 on the digitally normalized data, as well
on the ecoli-full.31 data set.
2. For the digitally normalized data, find the "best" k value. Why is it
the best? :)
3. Note that the 'ecoli_ref-5m-trim.fastq.gz' is interleaved but contains
both pairs and orphans. Use::
python /usr/local/share/khmer/sandbox/strip-and-split-for-assembly.py
to split things up into a .pe and .se file, and then modify the velveth
command to use '-shortPaired' on the .pe file, and '-short' on the .se
file.