======================================== Assembling E. coli sequences with Velvet ======================================== .. Warning:: These documents are not maintained and their instructions may be out of date. However the GED Lab does maintain the `khmer protocols `__ which may cover similar topics. See also the `installation instructions for the current version of the khmer project `__. Install screed:: cd /usr/local/share git clone https://github.com/ged-lab/screed.git cd screed python setup.py install Install khmer:: cd /usr/local/share git clone https://github.com/ged-lab/khmer.git cd khmer make Grab the ngs-scripts:: git clone https://github.com/ngs-docs/ngs-scripts.git /root/ngs-scripts and install the Velvet assembler:: cd /root curl -O http://www.ebi.ac.uk/~zerbino/velvet/velvet_1.2.10.tgz tar xzf velvet_1.2.10.tgz cd velvet_1.2.10 make MAXKMERLENGTH=51 cp velvet? /usr/local/bin Now, let's create a working directory:: cd /mnt mkdir assembly cd assembly Download some E. coli data. The first data set (ecoli_ref-5m-trim.fastq.gz) is the trimmed PE data sets from the other day (see :doc:`short-read-quality-evaluation`), and the second data set is a specially processed data set using `digital normalization `__ that will assemble quickly. :: curl -O https://s3.amazonaws.com/public.ged.msu.edu/ecoli_ref-5m-trim.fastq.gz curl -O https://s3.amazonaws.com/public.ged.msu.edu/ecoli-reads-5m-dn-paired.fa.gz Now... assemble the small, fast data sets, using the Velvet assembler. Here we will set the required parameter k=31:: velveth ecoli.31 31 -shortPaired -fasta.gz ecoli-reads-5m-dn-paired.fa.gz velvetg ecoli.31 -exp_cov auto Check out the stats for the assembled contigs for a cutoff of 1000:: python /usr/local/share/khmer/sandbox/assemstats3.py 1000 ecoli.*/contigs.fa Also try assembling with k=33 and k=35:: velveth ecoli.33 33 -shortPaired -fasta.gz ecoli-reads-5m-dn-paired.fa.gz velvetg ecoli.33 -exp_cov auto velveth ecoli.35 35 -shortPaired -fasta.gz ecoli-reads-5m-dn-paired.fa.gz velvetg ecoli.35 -exp_cov auto Now check out the stats for the assembled contigs for a cutoff of 1000:: python /usr/local/share/khmer/sandbox/assemstats3.py 1000 ecoli.*/contigs.fa (Also read: `What does k control in de Bruijn graph assemblers? `__.) Over lunch, run an assembly of the larger read data set. (You might want to do this in screen.) :: velveth ecoli-full.31 31 -short -fastq.gz ecoli_ref-5m-trim.fastq.gz velvetg ecoli-full.31 -exp_cov auto Comparing and evaluating assemblies - Mapping ============================================= Install bowtie:: cd /root curl -O -L http://sourceforge.net/projects/bowtie-bio/files/bowtie/0.12.7/bowtie-0.12.7-linux-x86_64.zip unzip bowtie-0.12.7-linux-x86_64.zip cd bowtie-0.12.7 cp bowtie bowtie-build bowtie-inspect /usr/local/bin Now, for each contigs file, extract the contigs over (say) 300 bases -- :: cd /mnt/assembly python /usr/local/share/khmer/sandbox/extract-long-sequences.py 300 ecoli.31/contigs.fa > ecoli-31.fa python /usr/local/share/khmer/sandbox/extract-long-sequences.py 300 ecoli.33/contigs.fa > ecoli-33.fa python /usr/local/share/khmer/sandbox/extract-long-sequences.py 300 ecoli.35/contigs.fa > ecoli-35.fa ...and build the indices:: for i in 31 33 35; do bowtie-build ecoli-$i.fa ecoli-$i done Finally, map a subset of the reads to each subset:: gunzip -c ecoli_ref-5m-trim.fastq.gz | head -1200000 > 300k-reads.fq for i in 31 33 35; do echo mapping to ecoli-$i bowtie -p 2 ecoli-$i -q 300k-reads.fq 300k-reads.x.$i.map done You can look at the statistics (# of reads aligned) as a proxy for "completeness" of the assembly -- :: mapping to ecoli-31 # reads processed: 300000 # reads with at least one reported alignment: 266468 (88.82%) # reads that failed to align: 33532 (11.18%) Reported 266468 alignments to 1 output stream(s) mapping to ecoli-33 # reads processed: 300000 # reads with at least one reported alignment: 256539 (85.51%) # reads that failed to align: 43461 (14.49%) Reported 256539 alignments to 1 output stream(s) mapping to ecoli-35 # reads processed: 300000 # reads with at least one reported alignment: 243645 (81.22%) # reads that failed to align: 56355 (18.79%) Reported 243645 alignments to 1 output stream(s) generally you want this number to be as high as possible, because it means that you've got an assembly representing more of the reads. Note that you can also 'wc' to count the number of lines in the map files:: wc -l *.map from which you can generate the numbers above. Comparing and evaluating assemblies -- BLAST ============================================ Install BLAST:: cd /root curl -O ftp://ftp.ncbi.nih.gov/blast/executables/release/2.2.24/blast-2.2.24-x64-linux.tar.gz tar xzf blast-2.2.24-x64-linux.tar.gz cp blast-2.2.24/bin/* /usr/local/bin cp -r blast-2.2.24/data /usr/local/blast-data Build BLAST databases for the assemblies you've done:: cd /mnt/assembly for i in 31 33 35 do formatdb -i ecoli-$i.fa -o T -p F done Also download the true reference genome:: curl -O https://s3.amazonaws.com/public.ged.msu.edu/ecoliMG1655.fa.gz gunzip ecoliMG1655.fa.gz formatdb -i ecoliMG1655.fa -o T -p F And now do a few pairwise BLASTs:: blastall -i ecoli-31.fa -d ecoliMG1655.fa -p blastn -e 1e-20 -o 31.x.mg.blastn & blastall -d ecoli-31.fa -i ecoliMG1655.fa -p blastn -e 1e-20 -o mg.x.31.blastn & blastall -i ecoli-31.fa -d ecoli-33.fa -p blastn -e 1e-20 -o 31.x.33.blastn & blastall -d ecoli-31.fa -i ecoli-33.fa -p blastn -e 1e-20 -o 33.x.31.blastn & Compute overlaps:: python /root/ngs-scripts/blast/calc-blast-cover.py ecoli-33.fa 31.x.33.blastn 300 ecoli-31.fa python /root/ngs-scripts/blast/calc-blast-cover.py ecoli-31.fa 33.x.31.blastn 300 ecoli-33.fa python /root/ngs-scripts/blast/calc-blast-cover.py ecoliMG1655.fa 31.x.mg.blastn 300 ecoli-31.fa python /root/ngs-scripts/blast/calc-blast-cover.py ecoli-31.fa mg.x.31.blastn 300 ecoliMG1655.fa You should see:: total bases in reference: 4588331 total ref bases covered : 4232257 fraction : 0.92239574695 reference : ecoli-33.fa blast file : 31.x.33.blastn query sequences : ecoli-31.fa which tells you that 4.23 mb of the 4.58 mb (92.2% of the bases) in ecoli-33 have an alignment in ecoli-31. Similarly, :: total bases in reference: 4606276 total ref bases covered : 4223094 fraction : 0.916813061137 reference : ecoli-31.fa blast file : 33.x.31.blastn query sequences : ecoli-33.fa tells you that 91.6% (4.22/4.61 mb) of the bases in ecoli-31.fa have an alignment in ecoli-33. The numbers for the reference genome are a bit strange, though -- :: total bases in reference: 4639675 total ref bases covered : 4409820 fraction : 0.950458814464 reference : ecoliMG1655.fa blast file : 31.x.mg.blastn query sequences : ecoli-31.fa total bases in reference: 4606276 total ref bases covered : 896964 fraction : 0.194726499237 reference : ecoli-31.fa blast file : mg.x.31.blastn query sequences : ecoliMG1655.fa What? So, 95% of the reference genome is covered by ecoli-31, but only 19.4% of ecoli-31.fa is covered by the E. coli genome? How is that possible?? What's going on? Any ideas? It turns out that, by default, BLAST only shows the first 200-250 matches for each query sequence! In the case where you have one query sequence (the E. coli genome) and many query contigs (ecoli-31.fa) there are several thousand hits; BLAST is only showing you the first few hundred. To fix this, re-run BLAST with different -b and -v parameters, to increase the number of hits reported, and then re-run the calc-blast-cover command:: blastall -d ecoli-31.fa -i ecoliMG1655.fa -p blastn -e 1e-20 -o mg.x.31.blastn -b 5000 -v 5000 python /root/ngs-scripts/blast/calc-blast-cover.py ecoli-31.fa mg.x.31.blastn 300 ecoliMG1655.fa and you should get this:: total bases in reference: 4606276 total ref bases covered : 4354561 fraction : 0.945353904108 reference : ecoli-31.fa blast file : mg.x.31.blastn query sequences : ecoliMG1655.fa Much better! Comparing and evaluating assemblies -- contig size distributions ================================================================ Go to IPython Notebook (https:// + machine name), and enter the code below in different cells. Use SHIFT-Enter to execute the Python code. First, write a function to load in the lengths of sequences in an assembly:: import screed def get_length_distribution(filename): x = [] for record in screed.open(filename): x.append(len(record.sequence)) return x Then load the data from E. coli 31 and E. coli 33. :: ecoli31 = get_length_distribution('/mnt/assembly/ecoli-31.fa') ecoli33 = get_length_distribution('/mnt/assembly/ecoli-33.fa') Now look at a histogram of the length distribution:: hist(ecoli31) The problem with this graph is that it shows the *number of contigs* by bin, not the *number of bases*. Let's fix:: def generate_cumulative_above(contig_sizes): sizes = list(reversed(sorted(contig_sizes))) sizesum = [] sofar = 0 for size in sizes: sofar += size sizesum.append((size, sofar)) sizesum.sort() return numpy.array(sizesum) Now, plot:: cum31 = generate_cumulative_above(ecoli31) cum33 = generate_cumulative_above(ecoli33) plot(cum31[:,0], cum31[:,1], label='k=31') plot(cum33[:,0], cum33[:,1], label='k=33') legend() .. @SGA Suggested exercises ------------------- 1. Plot the above graphs for k=35 on the digitally normalized data, as well on the ecoli-full.31 data set. 2. For the digitally normalized data, find the "best" k value. Why is it the best? :) 3. Note that the 'ecoli_ref-5m-trim.fastq.gz' is interleaved but contains both pairs and orphans. Use:: python /usr/local/share/khmer/sandbox/strip-and-split-for-assembly.py to split things up into a .pe and .se file, and then modify the velveth command to use '-shortPaired' on the .pe file, and '-short' on the .se file.