Assessing and Assembling Nanopore data¶

Last week in Woods Hole, MA we used our lab’s MinION to sequence a new bacterial species isolated by Rebecca Mickol in the Microbial Diversity Course at the Marine Biological Lab.

The goals of this tutorial are to:

convert oxford nanopore data in .fast5 format to .fastq
assess a nanopore run
assemble data
evaluate the assembly

Starting an AWS instance and installing software:¶

Start a blank Amazon instance (m3.xlarge) with 50 GB and log in.

Copy/paste to update and install software on your new instance:

sudo apt-get update && \
sudo apt-get -y install build-essential ruby screen git curl gcc make g++ python-dev unzip \
    default-jre pkg-config libncurses5-dev r-base-core \
    r-cran-gplots python-matplotlib sysstat python-virtualenv \
    python-setuptools cmake cython libhdf5-serial-dev \
    python-numpy python-scipy python-pandas python-pandas-lib \
    python-biopython parallel python-h5py python-tornado \
    bioperl libxml-simple-perl default-jre

We will now install poretools, which will help to read the .fast5 raw data generated by the Oxford Nanopore MinION.

sudo pip install poretools
poretools

You should see output like this:

usage: poretools [-h] [-v]
                 {combine,fastq,fasta,stats,hist,events,readstats,tabular,nucdist,metadata,index,qualdist,qualpos,winner,squiggle,times,yield_plot,occupancy,organise}
                 ...
poretools: error: too few arguments

(Ignore the error, we’re expecting it because we have not given it any arguments!)

To install the rest of the software, we will use Linux brew: https://github.com/Linuxbrew/brew

sudo mkdir /home/linuxbrew
sudo chown $USER:$USER /home/linuxbrew
git clone https://github.com/Linuxbrew/brew.git /home/linuxbrew/.linuxbrew
export PATH=/home/linuxbrew/.linuxbrew/bin:$PATH
brew tap homebrew/science

Now install canu, samtools, bwa mem, nanopolish:

brew install jdk canu bwa samtools

We will install the new (thanks @jts!) R9 version of nanopolish:

git clone --recursive https://github.com/jts/nanopolish.git
cd nanopolish/
make -j 4 HDF5=no
./nanopolish_test
export PATH=/home/ubuntu/nanopolish:$PATH

You should see:

===============================================================================
All tests passed (421 assertions in 4 test cases)

Install prokka:

git clone https://github.com/tseemann/prokka.git
export PATH=$PWD/prokka/bin:$PATH
prokka --setupdb
prokka --version

Get Oxford Nanopore MinION data¶

Last week we collected about 46k reads from three flowcells. Download a subset of these reads:

cd
wget https://s3.amazonaws.com/ngs2016/ectocooler_subset.zip
unzip ectocooler_subset.zip
ls ectocooler_subset/

You should see a bunch of .fast5 files.

Download the fastq and fasta:

cd
wget https://s3.amazonaws.com/ngs2016/ectocooler_all_2D.fastq
wget https://s3.amazonaws.com/ngs2016/ectocooler_all_2D.fasta

Assess the Data and Convert .fast5 to .fastq and .fasta¶

As the MinION instrument is collecting raw data, it is uploaded to the Metrichor server which runs the basecalling software. Reads are then downloaded as .fast5 files. Let’s assess the run.

cd
directory="ectocooler_subset/"
poretools stats $directory

Here are the 2D reads:

poretools stats --type 2D $directory

How many reads are there total?
How many 2D?
What is the longest read?
How would you decide whether to do more sequencing?

Look at a histogram of read lengths:

poretools hist --theme-bw --min-length 1000 --max-length 40000 --saveas ecto_hist.png $directory

However, you will likely get an error like this:

_tkinter.TclError: no display name and no $DISPLAY environment variable

This happens because you have created an image and it is trying to launch from your instance. Click here to get the code to fix this problem. Let’s edit the first:

sudo nano /usr/local/lib/python2.7/dist-packages/poretools/hist.py

Delete the import library lines until “logging” then change the code so the libraries are loaded in the correct order. Edit the second script:

sudo nano /usr/local/lib/python2.7/dist-packages/poretools/yield_plot.py

Run the command again:

cd
poretools hist --theme-bw --min-length 1000 --max-length 40000 --saveas ecto_hist.png $directory

You will get another error now, but it’s OK (ignore this):

INFO:poretools:100 files processed.
/usr/lib/pymodules/python2.7/matplotlib/font_manager.py:1236: UserWarning: findfont: Font family ['sans-serif'] not found. Falling back to Bitstream Vera Sans
(prop.get_family(), self.defaultFamily[fontext]))
/usr/lib/pymodules/python2.7/matplotlib/font_manager.py:1246: UserWarning: findfont: Could not match :family=Bitstream Vera Sans:style=normal:variant=normal:weight=normal:stretch=normal:size=10.0. Returning /usr/share/matplotlib/mpl-data/fonts/ttf/cmb10.ttf
  UserWarning)
/usr/lib/pymodules/python2.7/matplotlib/font_manager.py:1246: UserWarning: findfont: Could not match :family=Bitstream Vera Sans:style=normal:variant=normal:weight=normal:stretch=normal:size=11.0. Returning /usr/share/matplotlib/mpl-data/fonts/ttf/cmb10.ttf
UserWarning)

Check your files to make sure it output an ecto_hist.png file:

ls

Download the file to your local computer and take a look at the image. What does the distribution of read lengths look like?

scp -i amazon.pem ubuntu@xxx.amazon.com:/home/ubuntu/ecto_hist.png .

This is only a subset of the reads from the whole run. (Click here for stats from the full data set.)

Convert your .fast5 to .fastq and .fasta files:

cd ~/
poretools fastq $directory > ectocooler_subset.fastq
poretools fasta $directory > ectocooler_subset.fasta

Convert only 2D reads from .fast5 to .fastq and .fasta files:

cd ~/
poretools fastq --type 2D $directory > ectocooler_subset_2D.fastq
poretools fasta --type 2D $directory > ectocooler_subset_2D.fasta

Look at the reads:

head ectocooler_subset.fasta

Look at the 2D reads:

head ectocooler_subset_2D.fasta

What is the difference between the 2D reads and all the reads?

Copy a few reads and use the web blastn to try to identify what species or closest taxa these data came from. What do you come up with?

Assemble the data¶

We will use the program canu to assemble the reads. The full data set will take several hours. So, we will only assemble the subset. Which data are better to use, 2D or a mixture of template and compliment? Pick one, assemble, and compare with your neighbor.

canu \
-p ecto_subset -d ectocooler_assembly \
genomeSize=3.0m \
-nanopore-raw ectocooler_subset_2D.fastq

From the output files, you are interested in the ecto_subset.contigs.fasta file. Let’s copy that file to the home directory:

cd
cp ectocooler_assembly/ecto_subset.contigs.fasta .

Which is a better assembly, mixture or 2D?
How many contigs do you have?
How many contigs are you expecting?

Download the pre-assembled contigs from the full data set:

wget https://raw.githubusercontent.com/ljcohen/dib_ONP_MinION/master/Ectocooler/ecto.contigs.fasta

Compare this with your assembly. How are they different?

Annotate with prokka:¶

Yesterday, you used Torsten’s program, prokka to annotate a bacterial genome. We will use this to annotate these new contigs we have assembled.

prokka --outdir anno_subset --prefix ecto_subset_prokka ecto_subset.contigs.fasta

Check the output:

cat ./anno_subset/ecto_subset_prokka.txt

How many genes did Prokka find in the contigs?
Does this meet your expectations?

Use this command to run prokka on the contigs assembled with the full data set:

prokka --outdir anno_full --prefix ecto_full_prokka ecto.contigs.fasta

Check the output:

cat ./anno_full/ecto_full_prokka.txt

Evaluate the assembly:¶

Align the reads to the assembled subset of contigs. (Or use the contigs assembled from full data set. Pick one and compare with your neighbor!)

index the reference genome - in this case the reference genome is our de novo assembly
align, converting SAM to BAM, then sorting the BAM file
index the BAM file

Index:

bwa index ecto_subset.contigs.fasta

Align

bwa mem -x ont2d -t 8 ecto_subset.contigs.fasta ectocooler_subset_2D.fasta | samtools sort > ecto_subset.sorted.bam

This will give you an indexed bam file:

samtools index ecto_subset.sorted.bam

Download the resulting ectocooler_align.sorted.bam, ectocooler_align.sorted.bam.bai, ecto.contigs.fasta to your local computer.

scp -i amazon.pem ubuntu@xxx.amazon.com:/home/ubuntu/ecto_subset.sorted.bam .
scp -i amazon.pem ubuntu@xxx.amazon.com:/home/ubuntu/ecto_subset.sorted.bam.bai .
scp -i amazon.pem ubuntu@xxx.amazon.com:/home/ubuntu/ecto_subset.contigs.fasta .

In IGV, open ecto_subset.contigs.fasta as “Genome” and ecto_subset.sorted.bam.

What does the alignment look like?
What is the coverage?
Can you spot any problems?
What is the Oxford Nanopore error profile?
Does it do badly in any regions, which ones? Why?

Fix the assembly using nanopolish¶

The program nanopolish will align your reads to the assembly and compute a consensus. This will take some time.

Run these commands using your reads and your assembly:

# Copy the nanopolish model files into the working directory
cp /path/to/nanopolish/etc/r9-models/* .

# Align the reads in event space
nanopolish eventalign -t 4 --sam -r ectocooler_subset.fasta -b ecto_subset.sorted.bam -g ecto_subset.contigs.fasta --models nanopolish_models.fofn | samtools sort > ecto_subset.eventalign.sorted.bam
samtools index ecto_subset.eventalign.sorted.bam

The next step takes a long time (several hours), so let’s run screen first:

screen

Press enter if prompted:

python /home/ubuntu/nanopolish/scripts/nanopolish_makerange.py ecto_subset.contigs.fasta | parallel --results nanopolish.results -P 4 \
nanopolish variants --consensus polished.{1}.fa -w {1} -r ectocooler_subset.fasta -b ecto_subset.sorted.bam -g ecto_subset.contigs.fasta -e ecto_subset.eventalign.sorted.bam -t 1 --min-candidate-frequency 0.1 --models nanopolish_models.fofn

Type Ctrl-A-D (press all three keys at the same time) to exit from your screen. Then screen -r to return to that screen.

Once this has finished, merge files and make a new “polished” assembly:

python /home/ubuntu/nanopolish/scripts/nanopolish_merge.py polished.*.fa > polished_ecto_subset.fa

Download this to your local computer and view in IGV. How is this different than the original assembly? Is it better?

Run prokka again:

prokka --outdir anno_subset_polished --prefix ecto_subset_polished_prokka polished_ecto_subset.fa
cat ./anno_subset_polished/ecto_subset_polished_prokka.txt

References:¶

Acknowledgements¶

This is a modified lesson by Nick Loman from 2015, contributions by Torsten Seemann, Harriet Alexander, and Lisa Cohen.

LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- fork @ github. Presentations (PPT/PDF) and PDFs are the property of their respective owners and are under the terms indicated within the presentation.