Assembling E. coli sequences with Velvet¶
The goal of this tutorial is to show you the basics of assembly using the Velvet assembler.
We’ll be using data from Efficient de novo assembly of single-cell bacterial genomes from short-read data sets, Chitsaz et al., 2011.
Booting an Amazon AMI¶
Start up an Amazon computer (m1.large or m1.xlarge) using AMI ami-7607d01e (see Start up an EC2 instance and Starting up a custom operating system).
Log in with Windows or from Mac OS X.
Updating the operating system¶
Copy and paste the following two commands
apt-get update
apt-get -y install screen git curl gcc make g++ python-dev unzip \
default-jre pkg-config libncurses5-dev r-base-core \
r-cran-gplots python-matplotlib sysstat
to update the computer with all the bundled software you’ll need.
Packages to install¶
Install khmer:
cd /usr/local/share
git clone https://github.com/ged-lab/khmer.git
cd khmer
git checkout v1.1
make install
and install the Velvet assembler:
cd /root
curl -O http://www.ebi.ac.uk/~zerbino/velvet/velvet_1.2.10.tgz
tar xzf velvet_1.2.10.tgz
cd velvet_1.2.10
make MAXKMERLENGTH=51
cp velvet? /usr/local/bin
as well as Quast, software for evaluating the assembly against the known reference:
cd /root
curl -O -L https://downloads.sourceforge.net/project/quast/quast-2.3.tar.gz
tar xzf quast-2.3.tar.gz
Getting the data¶
Now, let’s create a working directory:
cd /mnt
mkdir assembly
cd assembly
Download some E. coli data. The first data set (ecoli_ref-5m-trim.fastq.gz) is the trimmed PE data sets from the other day (see Short Read Quality Control), and the second data set is a specially processed data set using digital normalization that will assemble quickly.
curl -O https://s3.amazonaws.com/public.ged.msu.edu/ecoli_ref-5m-trim.fastq.gz
curl -O https://s3.amazonaws.com/public.ged.msu.edu/ecoli-reads-5m-dn-paired.fa.gz
Running an assembly¶
Now... assemble the small, fast data sets, using the Velvet assembler. Here we will set the required parameter k=21:
velveth ecoli.21 21 -shortPaired -fasta.gz ecoli-reads-5m-dn-paired.fa.gz
velvetg ecoli.21 -exp_cov auto
Check out the stats for the assembled contigs for a cutoff of 1000:
python /usr/local/share/khmer/sandbox/assemstats3.py 1000 ecoli.*/contigs.fa
Also try assembling with k=23 and k=25:
velveth ecoli.23 23 -shortPaired -fasta.gz ecoli-reads-5m-dn-paired.fa.gz
velvetg ecoli.23 -exp_cov auto
velveth ecoli.25 25 -shortPaired -fasta.gz ecoli-reads-5m-dn-paired.fa.gz
velvetg ecoli.25 -exp_cov auto
Now check out the stats for the assembled contigs for a cutoff of 1000:
python /usr/local/share/khmer/sandbox/assemstats3.py 1000 ecoli.*/contigs.fa
(Also read: What does k control in de Bruijn graph assemblers?.)
Comparing and evaluating assemblies - QUAST¶
Download the true reference genome:
cd /mnt/assembly
curl -O https://s3.amazonaws.com/public.ged.msu.edu/ecoliMG1655.fa.gz
gunzip ecoliMG1655.fa.gz
and run QUAST:
/root/quast-2.3/quast.py -R ecoliMG1655.fa ecoli.*/contigs.fa
Note that here we’re looking at all the assemblies we’ve generated.
Now look at the results:
more quast_results/latest/report.txt
The first bits to look at are Genome fraction (%) and # misassembled contigs, I think.
Searching assemblies – BLAST¶
Install BLAST:
cd /root
curl -O ftp://ftp.ncbi.nih.gov/blast/executables/release/2.2.24/blast-2.2.24-x64-linux.tar.gz
tar xzf blast-2.2.24-x64-linux.tar.gz
cp blast-2.2.24/bin/* /usr/local/bin
cp -r blast-2.2.24/data /usr/local/blast-data
Build BLAST databases for the assemblies you’ve done:
cd /mnt/assembly
for i in 21 23 25
do
extract-long-sequences.py -o ecoli-$i.fa -l 500 ecoli.$i/contigs.fa
formatdb -i ecoli-$i.fa -o T -p F
done
and then let’s search for a specific gene – first, download a file containing your protein sequence of interest:
curl -O http://athyra.idyll.org/~t/crp.fa
and now search:
blastall -i crp.fa -d ecoli-21.fa -p tblastn -b 1 -v 1
Questions and Discussion Points¶
Why do we use a lower cutoff of 1kb for the assemstats3 links, above? Why not 0?
Followup work¶
Try running an assembly of the larger read data set:
velveth ecoli-full.31 31 -short -fastq.gz ecoli_ref-5m-trim.fastq.gz
velvetg ecoli-full.31 -exp_cov auto
LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- fork @ github. Presentations (PPT/PDF) and PDFs are the property of their respective owners and are under the terms indicated within the presentation.
comments powered by Disqus