Assembling E. coli sequences with SPAdes¶
The goal of this tutorial is to show you the basics of assembly using the SPAdes assembler.
We’ll be using data from Efficient de novo assembly of single-cell bacterial genomes from short-read data sets, Chitsaz et al., 2011.
Booting an Amazon AMI¶
Start up an Amazon computer (m3.large or m3.xlarge) running Ubuntu 14.04, as in Getting started with Amazon EC2, and log in.
Logging in¶
Log in and type:
sudo apt-get update && \
sudo apt-get -y install screen git curl gcc make g++ python-dev unzip \
default-jre pkg-config libncurses5-dev r-base-core \
r-cran-gplots python-matplotlib sysstat python-virtualenv \
python-setuptools cmake
to update the computer with all the bundled software you’ll need.
At this time, you might also make /mnt writeable:
sudo chmod a+rwxt /mnt
Packages to install¶
Install khmer:
cd
python -m virtualenv env
source env/bin/activate
pip install -U setuptools
pip install khmer==1.4.1
and download and compile the SPAdes assembler:
cd
curl -O http://spades.bioinf.spbau.ru/release3.5.0/SPAdes-3.5.0.tar.gz
tar xvf SPAdes-3.5.0.tar.gz
cd SPAdes-3.5.0
./spades_compile.sh
export PATH="$PATH:$(pwd)/bin"
as well as Quast, software for evaluating the assembly against the known reference:
cd
curl -L http://sourceforge.net/projects/quast/files/quast-3.0.tar.gz/download > quast-3.0.tar.gz
tar xvf quast-3.0.tar.gz
Getting the data¶
Now, let’s create a working directory:
cd /mnt
mkdir assembly
cd assembly
Download some E. coli data. This data set (ecoli_ref-5m-trim.fastq.gz) is the trimmed data from the Chitsaz paper, E. coli reference sequencing.
curl -O https://s3.amazonaws.com/public.ged.msu.edu/ecoli_ref-5m-trim.fastq.gz
Now, pull out the paired reads:
extract-paired-reads.py ecoli_ref-5m-trim.fastq.gz
mv ecoli_ref-5m-trim.fastq.gz.se ecoli_ref-5m-trim.se.fq
mv ecoli_ref-5m-trim.fastq.gz.pe ecoli_ref-5m-trim.pe.fq
Running an assembly¶
Now, let’s run an assembly:
spades.py --12 ecoli_ref-5m-trim.pe.fq -s ecoli_ref-5m-trim.se.fq -o spades.d
This will take about 15 minutes; it should end with:
* Corrected reads are in /mnt/assembly/spades.d/corrected/
* Assembled contigs are in /mnt/assembly/spades.d/contigs.fasta (contigs.fastg)
* Assembled scaffolds are in /mnt/assembly/spades.d/scaffolds.fasta (scaffolds.fastg)
Looking at the assembly¶
Run QUAST:
~/quast-3.0/quast.py spades.d/scaffolds.fasta -o report
and then look at the report:
less report/report.txt
You should see:
All statistics are based on contigs of size >= 500 bp, unless otherwise noted (e.g., "# contigs (>= 0 bp)" and "Total length (>= 0 bp)" include all contigs).
Assembly scaffolds
# contigs (>= 0 bp) 160
# contigs (>= 1000 bp) 84
Total length (>= 0 bp) 4571783
Total length (>= 1000 bp) 4551354
# contigs 93
Largest contig 264754
Total length 4557807
GC (%) 50.75
N50 132618
N75 64692
L50 12
L75 24
# N's per 100 kbp 0.00
Comparing and evaluating assemblies - QUAST¶
Download the true reference genome:
cd /mnt/assembly
curl -O https://s3.amazonaws.com/public.ged.msu.edu/ecoliMG1655.fa.gz
gunzip ecoliMG1655.fa.gz
and run QUAST again:
~/quast-3.0/quast.py -R ecoliMG1655.fa spades.d/scaffolds.fasta -o report
Note that here we’re looking at all the assemblies we’ve generated.
Now look at the results:
less report/report.txt
and now we have a lot more information!
A second assembler - MEGAHIT¶
Let’s try out the MEGAHIT assembler. MEGAHIT is primarily intended for metagenomes but works well on microbial genomes in general.
The MEGAHIT source code is on GitHub, here: https://github.com/voutcn/megahit. Let’s go grab it and build it!
cd
git clone https://github.com/voutcn/megahit.git
cd megahit
make
Now, let’s go run an assembly –
cd /mnt/assembly
~/megahit/megahit --12 *.pe.fq -r *.se.fq
This will take about a minute, and the output will be placed in
megahit_out/final.contigs.fa
. Let’s evaluate it against the SPAdes
assembly with QUAST:
cp spades.d/scaffolds.fasta spades-assembly.fa
cp megahit_out/final.contigs.fa megahit-assembly.fa
~/quast-3.0/quast.py -R ecoliMG1655.fa spades-assembly.fa \
megahit-assembly.fa -o report
Let’s look at the report!
less report/report.txt
Reference-free comparison¶
Above, we’ve been using the genome reference to do assembly comparisons – but often you don’t have one. What do you do to evaluate and compare assemblies without a reference?
One interesting trick is to just run QUAST with one assembly as a reference, and the other N assemblies against it. My only suggestion is to first eliminate short, fragmented contigs from the assembly you’re going to use as a reference.
Let’s try that, using extract-long-sequences.py
from khmer:
extract-long-sequences.py -l 1000 spades-assembly.fa > spades-long.fa
and then re-run QUAST and put the output in report-noref/report.txt
:
~/quast-3.0/quast.py -R spades-long.fa spades-assembly.fa \
megahit-assembly.fa -o report-noref
When you look at the report,
less report-noref/report.txt
take particular note of the following –
Assembly spades-assembly megahit-assembly
...
Misassembled contigs length 0 814643
# local misassemblies 0 9
# unaligned contigs 9 + 0 part 7 + 14 part
Unaligned length 6453 7404
Genome fraction (%) 100.000 99.833
Challenge exercise¶
Take your assembled genome, and:
- Install BLAST;
- Grab a FASTA sequence from NCBI for an E. coli protein (e.g. CRP);
- Save it to a file;
- TBLASTN that protein against your newly assembled genome.
See Running command-line BLAST for the basics. Hint – you’ll need to format your assembly as a BLAST database.
LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- fork @ github. Presentations (PPT/PDF) and PDFs are the property of their respective owners and are under the terms indicated within the presentation.
comments powered by Disqus