Assessing and Assembling Nanopore data

Last year (2016) in Woods Hole, MA we used our lab’s MinION to sequence a new bacterial species isolated by Rebecca Mickol in the Microbial Diversity Course at the Marine Biological Lab.

If you’re interested, you can read a blog post about it.

The goals of this tutorial are to:

  • Assess an Oxford Nanopore Technologies (ONT) sequencing run on the MinION
  • Create an assembly from raw fastq files
  • Evaluate the assembly

Start a Jetstream instance and install software:

Start a blank Jetstream instance (m1.medium) with CPU: 6, Mem: 16, Disk 60 GB RAM:

Copy/paste to update and install software on your new instance:

sudo apt-get update && \
sudo apt-get -y install build-essential ruby screen git curl gcc make g++ python-dev unzip \
    default-jre pkg-config libncurses5-dev r-base-core \
    r-cran-gplots python-matplotlib sysstat python-virtualenv \
    python-setuptools cmake cython libhdf5-serial-dev \
    python-numpy python-scipy python-pandas python-pandas-lib \
    python-biopython parallel python-h5py python-tornado \
    bioperl libxml-simple-perl default-jre gdebi-core r-base gnuplot

To install some of the software, we will use Linux brew:

cd
sudo mkdir /home/linuxbrew
sudo chown $USER:$USER /home/linuxbrew
git clone https://github.com/Linuxbrew/brew.git /home/linuxbrew/.linuxbrew
echo 'export PATH=/home/linuxbrew/.linuxbrew/bin:$PATH' >> ~/.bashrc
source ~/.bashrc
brew tap homebrew/science

Now install canu, samtools, bwa:

brew install jdk canu bwa samtools

Install poretools:

sudo pip install poretools

Install prokka:

cd
git clone https://github.com/tseemann/prokka.git
echo 'export PATH=$PWD/prokka/bin:$PATH' >> ~/.bashrc
source ~/.bashrc
prokka --setupdb
prokka --version

Install assembly-stats:

cd
git clone https://github.com/sanger-pathogens/assembly-stats.git
cd assembly-stats/
mkdir build
cd build
cmake ..
make
make test
sudo make install

Install RStudio:

cd
wget https://download2.rstudio.org/rstudio-server-1.0.143-amd64.deb
sudo gdebi -n rstudio-server-1.0.143-amd64.deb

Change your password for RStudio:

cd
sudo passwd <account name>

Install mummer and gnuplot v4:

mummer

cd
wget https://github.com/mummer4/mummer/releases/download/v3.9.4alpha/mummer-3.9.4alpha.tar.gz
tar xvzf mummer-3.9.4alpha.tar.gz
cd mummer-3.9.4alpha
./configure
make
sudo make install
echo 'export LD_LIBRARY_PATH="/usr/local/lib"' >> ~/.bashrc
source ~/.bashrc

Get Oxford Nanopore MinION data and convert it

Our data were collected from three R9.4 flowcells in 2016. Download a subset of these reads:

cd
wget https://s3.amazonaws.com/ngs2016/ectocooler_subset.zip
unzip ectocooler_subset.zip
ls ectocooler_subset/

You should see a bunch of .fast5 files.

This is only a subset of the reads from the whole run. (Click here for stats from the full data set.)

The MinION instrument collects raw data in .fast5 format. The local basecalling software, Albacore (sorry, link requires ONT MAP login access), converts .fast5 files into .fastq or .fasta files. Poretools is another method for converting .fast5 files into .fastq files.

Convert your .fast5 to .fastq and .fasta files:

cd
directory="ectocooler_subset/"
poretools fastq $directory > ectocooler_subset.fastq
poretools fasta $directory > ectocooler_subset.fasta

Take a look at the reads:

head ectocooler_subset.fastq

Copy a few reads and use the web blastn to try to identify what species or closest taxa these data came from. What do you come up with?

Download the full dataset, fastq and fasta files:

cd
wget https://s3.amazonaws.com/ngs2016/ectocooler_all_2D.fastq
wget https://s3.amazonaws.com/ngs2016/ectocooler_all_2D.fasta

Assess the Data

assembly-stats ectocooler_subset.fastq
  1. How many reads are there total?
  2. What is the longest read?

Assess the full data set:

assembly-stats ectocooler_all_2D.fastq

How does the full data set compare to the subset?

How does it compare to these results from three R9.5 flowcells of killifish (Fundulus olivaceus) data collected at Porecamp in 2017?

[ljcohen@globus-00 fastq2]$ /mnt/home/ljcohen/bin/assembly-stats/assembly-stats porecamp_killifish2.fastq
stats for porecamp_killifish2.fastq
sum = 4962626713, n = 740248, ave = 6704.01, largest = 973552
N50 = 12726, n = 117202
N60 = 10357, n = 160433
N70 = 8098, n = 214460
N80 = 5724, n = 286845
N90 = 3229, n = 400661
N100 = 5, n = 740248
N_count = 0
Gaps = 0

Run this to make a file with all read lengths:

cat ectocooler_all_2D.fastq | paste - - - - | awk -F"\t" '{print length($2)}' > lengths.txt

Grab lengths from the killifish runs to compare:

wget https://raw.githubusercontent.com/ngs-docs/angus/2017/killifish_lengths.txt

Start RStudio server:

echo My RStudio Web server is running at: http://$(hostname):8787/

Run these commands in RStudio:

setwd("~/")
lengths <- read.table("lengths.txt")[,1]
hist(lengths, xlim=c(0,30000), breaks=100, col="red")
killifish_lengths <- read.table("killifish_lengths.txt")[,1]
hist(killifish_lengths, xlim=c(0,90000), breaks=1000, col="blue")

Assemble the data

We will use the program canu to assemble the reads. The full data set will take several hours. So, we will only assemble the subset. Which data are better to use, 2D or a mixture of template and compliment? Pick one, assemble, and compare with your neighbor.

canu \
-p ecto_subset -d ectocooler_assembly \
genomeSize=3.0m \
-nanopore-raw ectocooler_subset.fastq

This will take ~15 min. So, go ahead and grab some coffee!

After the assembly has finished, the output file you are interested is ecto_subset.contigs.fasta. Let’s copy that file to the home directory:

cd
cp ectocooler_assembly/ecto_subset.contigs.fasta .

Assess the assembly:

assembly-stats ecto_subset.contigs.fasta

How many contigs do you have?

Download the pre-assembled contigs from the full data set:

wget https://raw.githubusercontent.com/ljcohen/dib_ONP_MinION/master/Ectocooler/ecto.contigs.fasta
assembly-stats ecto.contigs.fasta

Compare this with your assembly. How are they different?

All-by-all comparisons

Before running mummerplot, run this:

sudo awk -i inplace '/\$P_FORMAT .=/ { print "#" $0; next } { print }' $(which mummerplot)

because reasons

Quick assembly vs. assembly of full reads:

nucmer --maxmatch -c 100 -p ectocooler ecto.contigs.fasta ecto_subset.contigs.fasta
mummerplot --fat --filter --png --large -p ectocooler ectocooler.delta

Grab closely-related reference genome of Tenacibaculum dicentrarchi:

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/483/385/GCF_001483385.1_ASM148338v1/GCF_001483385.1_ASM148338v1_genomic.fna.gz
gunzip GCF_001483385.1_ASM148338v1_genomic.fna.gz

Do all-by-all comparison of reference genome vs. ecocooler full reads assembly

nucmer --maxmatch -c 100 -p ectocooler_Tdicent ecto.contigs.fasta GCF_001483385.1_ASM148338v1_genomic.fna
mummerplot --fat --filter --png --large -p ectocooler_Tdicent ectocooler_Tdicent.delta

Annotate with prokka:

This week, you have used Torsten’s program, prokka to annotate a bacterial genome. We will use this to annotate these new contigs we have assembled.

prokka --outdir anno_subset --prefix ecto_subset_prokka ecto_subset.contigs.fasta

Check the output:

cat ./anno_subset/ecto_subset_prokka.txt
  1. How many genes did Prokka find in the contigs?
  2. Does this meet your expectations?

Use this command to run prokka on the contigs assembled with the full data set:

prokka --outdir anno_full --prefix ecto_full_prokka ecto.contigs.fasta

Check the output:

cat ./anno_full/ecto_full_prokka.txt

References:

  • https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/Evaluating-Assemblies
  • http://www.nature.com/nmeth/journal/v12/n8/full/nmeth.3444.html
  • https://github.com/ljcohen/dib_ONP_MinION
  • http://nbviewer.jupyter.org/github/arq5x/poretools/blob/master/poretools/ipynb/test_run_report.ipynb
  • http://porecamp.github.io/2015/timetable.html
  • http://porecamp.github.io/2016/
  • http://porecamp.github.io/texas/

Acknowledgements

This is a modified lesson by Nick Loman from 2015, contributions by Torsten Seemann, Harriet Alexander, Mick Watson, Danny Miller, Jon Badalamenti, and Lisa Cohen.

canu.report stats from Fundulus olivaceus reads

If you’re interested in read length distributions from R9.5 data:

[ljcohen@dev-intel14 killifish_assembly]$ cat killifish.report
[CORRECTION/READS]
--
-- In gatekeeper store 'correction/killifish.gkpStore':
--   Found 614587 reads.
--   Found 4883910982 bases (4.43 times coverage).
--
--   Read length histogram (one '*' equals 4297.12 reads):
--        0   4999 300799 **********************************************************************
--     5000   9999 145553 *********************************
--    10000  14999  81269 ******************
--    15000  19999  40302 *********
--    20000  24999  20366 ****
--    25000  29999  11161 **
--    30000  34999   6305 *
--    35000  39999   3560
--    40000  44999   2043
--    45000  49999   1217
--    50000  54999    734
--    55000  59999    463
--    60000  64999    297
--    65000  69999    165
--    70000  74999    103
--    75000  79999     61
--    80000  84999     33
--    85000  89999     30
--    90000  94999     15
--    95000  99999     17
--   100000 104999     13
--   105000 109999      5
--   110000 114999      5
--   115000 119999      5
--   120000 124999      8
--   125000 129999      2
--   130000 134999      8
--   135000 139999      6
--   140000 144999      4
--   145000 149999      0
--   150000 154999      2
--   155000 159999      3
--   160000 164999      2
--   165000 169999      2
--   170000 174999      0
--   175000 179999      1
--   180000 184999      0
--   185000 189999      4
--   190000 194999      0
--   195000 199999      2
--   200000 204999      0
--   205000 209999      1
--   210000 214999      1
--   215000 219999      1
--   220000 224999      0
--   225000 229999      0
--   230000 234999      0
--   235000 239999      1
--   240000 244999      1
--   245000 249999      1
--   250000 254999      0
--   255000 259999      1
--   260000 264999      0
--   265000 269999      1
--   270000 274999      0
--   275000 279999      0
--   280000 284999      1
--   285000 289999      1
--   290000 294999      1
--   295000 299999      0
--   300000 304999      0
--   305000 309999      0
--   310000 314999      0
--   315000 319999      0
--   320000 324999      0
--   325000 329999      0
--   330000 334999      1
--   335000 339999      1
--   340000 344999      0
--   345000 349999      0
--   350000 354999      0
--   355000 359999      0
--   360000 364999      0
--   365000 369999      0
--   370000 374999      0
--   375000 379999      2
--   380000 384999      0
--   385000 389999      0
--   390000 394999      0
--   395000 399999      0
--   400000 404999      0
--   405000 409999      0
--   410000 414999      0
--   415000 419999      1
--   420000 424999      1
--   425000 429999      0
--   430000 434999      0
--   435000 439999      0
--   440000 444999      0
--   445000 449999      0
--   450000 454999      0
--   455000 459999      0
--   460000 464999      0
--   465000 469999      0
--   470000 474999      1
--   475000 479999      0
--   480000 484999      0
--   485000 489999      0
--   490000 494999      0
--   495000 499999      0
--   500000 504999      1
--   505000 509999      0
--   510000 514999      0
--   515000 519999      0
--   520000 524999      0
--   525000 529999      0
--   530000 534999      0
--   535000 539999      0
--   540000 544999      0
--   545000 549999      0
--   550000 554999      0
--   555000 559999      0
--   560000 564999      0
--   565000 569999      0
--   570000 574999      0
--   575000 579999      0
--   580000 584999      0
--   585000 589999      0
--   590000 594999      0
--   595000 599999      0
--   600000 604999      0
--   605000 609999      0
--   610000 614999      0
--   615000 619999      0
--   620000 624999      0
--   625000 629999      0
--   630000 634999      0
--   635000 639999      0
--   640000 644999      0
--   645000 649999      0
--   650000 654999      0
--   655000 659999      0
--   660000 664999      0
--   665000 669999      0
--   670000 674999      0
--   675000 679999      0
--   680000 684999      0
--   685000 689999      0
--   690000 694999      0
--   695000 699999      0
--   700000 704999      0
--   705000 709999      0
--   710000 714999      0
--   715000 719999      0
--   720000 724999      0
--   725000 729999      0
--   730000 734999      0
--   735000 739999      0
--   740000 744999      0
--   745000 749999      0
--   750000 754999      0
--   755000 759999      0
--   760000 764999      0
--   765000 769999      0
--   770000 774999      0
--   775000 779999      0
--   780000 784999      0
--   785000 789999      0
--   790000 794999      0
--   795000 799999      0
--   800000 804999      0
--   805000 809999      0
--   810000 814999      0
--   815000 819999      0
--   820000 824999      0
--   825000 829999      0
--   830000 834999      0
--   835000 839999      0
--   840000 844999      0
--   845000 849999      0
--   850000 854999      0
--   855000 859999      0
--   860000 864999      0
--   865000 869999      0
--   870000 874999      0
--   875000 879999      0
--   880000 884999      0
--   885000 889999      0
--   890000 894999      0
--   895000 899999      1
--   900000 904999      1
--   905000 909999      0
--   910000 914999      0
--   915000 919999      0
--   920000 924999      0
--   925000 929999      0
--   930000 934999      0
--   935000 939999      0
--   940000 944999      0
--   945000 949999      0
--   950000 954999      0
--   955000 959999      0
--   960000 964999      0
--   965000 969999      0
--   970000 974999      1

[CORRECTION/MERS]
--
--  16-mers                                                                                           Fraction
--    Occurrences   NumMers                                                                         Unique Total
--       1-     1 513587462 *******************************************************************--> 0.3659 0.1054
--       2-     2 302853120 ********************************************************************** 0.5817 0.2296
--       3-     4 297949496 ********************************************************************   0.7116 0.3419
--       5-     7 166204607 **************************************                                 0.8485 0.5152
--       8-    11  72366333 ****************                                                       0.9314 0.6771
--      12-    16  28375480 ******                                                                 0.9700 0.7901
--      17-    22  11291731 **                                                                     0.9861 0.8576
--      23-    29   4915132 *                                                                      0.9929 0.8966
--      30-    37   2381975                                                                        0.9960 0.9200
--      38-    46   1279754                                                                        0.9975 0.9351
--      47-    56    750151                                                                        0.9983 0.9454
--      57-    67    471014                                                                        0.9988 0.9529
--      68-    79    316003                                                                        0.9992 0.9587
--      80-    92    217116                                                                        0.9994 0.9633
--      93-   106    152316                                                                        0.9995 0.9671
--     107-   121    108860                                                                        0.9996 0.9701
--     122-   137     79676                                                                        0.9997 0.9726
--     138-   154     60884                                                                        0.9998 0.9747
--     155-   172     47146                                                                        0.9998 0.9765
--     173-   191     36929                                                                        0.9998 0.9780
--     192-   211     29814                                                                        0.9999 0.9794
--     212-   232     23866                                                                        0.9999 0.9806
--     233-   254     19555                                                                        0.9999 0.9817
--     255-   277     15830                                                                        0.9999 0.9826
--     278-   301     13153                                                                        0.9999 0.9835
--     302-   326     11139                                                                        0.9999 0.9843
--     327-   352      9446                                                                        0.9999 0.9850
--     353-   379      7738                                                                        0.9999 0.9856
--     380-   407      6591                                                                        1.0000 0.9862
--     408-   436      5681                                                                        1.0000 0.9867
--     437-   466      5128                                                                        1.0000 0.9872
--     467-   497      4612                                                                        1.0000 0.9877
--     498-   529      3954                                                                        1.0000 0.9882
--     530-   562      3413                                                                        1.0000 0.9886
--     563-   596      3177                                                                        1.0000 0.9890
--     597-   631      2742                                                                        1.0000 0.9893
--     632-   667      2515                                                                        1.0000 0.9897
--     668-   704      2268                                                                        1.0000 0.9900
--     705-   742      1991                                                                        1.0000 0.9903
--     743-   781      1873                                                                        1.0000 0.9906
--     782-   821      1654                                                                        1.0000 0.9909
--
--      620808 (max occurrences)
--  4361104715 (total mers, non-unique)
--   890053109 (distinct mers, non-unique)
--   513587462 (unique mers)