18. De novo genome exploration

Here we’re going to take off from a de novo prokaryotic genome assembly performed in this lesson. That involved logging into our JetStream instances as noted here and setting up a specific conda environment as noted here. If you didn’t do that lesson, be sure to set up that conda environment as shown there. If you did, then be sure to activate that conda environment with the following after logged in:

conda activate de_novo_example

18.1. The data

The data used here is described here. If you did not do the previous lesson, the following code block will download the data and copy over the required assembly. But otherwise you will have all of this already:

cd ~
curl -L https://ndownloader.figshare.com/files/16197626 -o genomics_de_novo_temp.tar.gz
tar -xzvf genomics_de_novo_temp.tar.gz
rm genomics_de_novo_temp.tar.gz

cp -r ~/genomics_de_novo_temp/downloaded_results/assemblies/spades-careful-assembly/ ~/genomics_de_novo_temp/working/spades-careful-assembly/

Now let’s change directories into our working directory:

cd ~/genomics_de_novo_temp/working/

18.2. Exploring our assembly with anvi’o

Now that we’ve selected an assembly we’re going to move forward with, we can start to take a closer look at it. To do that we’re going to use a wonderful platform called anvi’o - which stands for analysis and visualization of omics data.

anvi’o is very expansive, but to try to summarize it in a sentence, it is a powerful and user-friendly data visualization and exploration platform. There is a ton of excellent documentation at the anvi’o site, including workflows, tutorials, and blog posts.

Here we’re going to put our isolate-genome assembly into the anvi’o framework and see just a few of the ways it can help us begin to look at our assembled genome - including a visual way of checking for possible contamination (which in this case would be anything in there that’s not our target Burkholderia cepacia). Many of the major steps we’re going to be performing here are laid out in the metagenomic workflow presented here, as much of the processing is the same even though we are not working with a metagenome here.

18.2.1. Putting our assembly into anvi’o and generating some information about it

For us to get our assembly into anvi’o, first we need to generate what it calls a contigs database. This contains the contigs from our assembly and some information intrisic to them (like GC content, tetranucleotide frequency, and open-reading frame coordinates). The following command will organize our contigs in an anvi’o-friendly way, generate some basic stats about them, and use the program Prodigal to identify open-reading frames (takes < 2 min.):

anvi-gen-contigs-database -f spades-careful-assembly/contigs.fasta \
                          -o contigs.db -n B_cepacia_assembly

Now that we have our contigs.db that holds our sequences and some basic information about them, we can start adding more information. This is one of the places where the flexibility comes into play, but for now we’ll just move forward with some parts of a general anvi’o workflow, including:

  • using the program HMMER with profile hidden Markov models (HMMs) to scan for bacterial single-copy genes (from Campbell et al. 2013)
    • if new to HMMs, see the bottom of page 7 here for a good explanation of what exactly a “hidden Markov model” is in the realm of sequence data
    • to summarize, instead of treating all columns of an alignment equally (like something like BLAST does), HMMs can weight individual columns in an alignment differently. So the highly-conserved stretches within, say, a conserved domain of a protein will matter more than a stretch in between two domains that may vary more.
  • using NCBI COGs to functionally annotate the open-reading frames Prodigal predicted
  • using a tool called centrifuge for taxonomic classification of the identified open-reading frames
  • generate coverage information for our assembly so we can visualize it

The following block of code would take ~45 minutes to run, mostly because of needing to download and setup some files for their first use. So we are going to skip this code block, copy over the results, and pick up with mapping to generate the coverage information.


  # HMM searching for bacterial single-copy genes
# anvi-run-hmms -I Campbell_et_al -c contigs.db -T 6

  # functional annotation with DIAMOND against NCBI's COGs
# anvi-setup-ncbi-cogs -T 6 # only needed the first time
# anvi-run-ncbi-cogs -c contigs.db --num-threads 6

  # exporting Prodigal-identified open-reading frames from anvi'o
# anvi-get-sequences-for-gene-calls -c contigs.db -o gene_calls.fa

  # setting up centrifuge for taxonomy (only needed the first time)
# wget ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data/p_compressed+h+v.tar.gz
# tar -xzvf p_compressed+h+v.tar.gz && rm -rf p_compressed+h+v.tar.gz

  # running centrifuge taxonomy (this taxonomically classifies each identified coding sequence)
# centrifuge -f -x p_compressed+h+v gene_calls.fa -S centrifuge_hits.tsv -p 6

  # importing the taxonomy results into our anvi'o contigs database
# anvi-import-taxonomy-for-genes -p centrifuge -c contigs.db \
#                                -i centrifuge_report.tsv \
#                                centrifuge_hits.tsv

Let’s copy over the new “contigs.db” that has the information stored from running those commands:

cp ../downloaded_results/anvio/contigs.db .

The last thing we want to add right now is mapping information from recruiting our reads to the assembly we built with them. Here is generating that mapping information with bowtie2 (this is another aligner like bwa that we used) and preparing the output file for anvi’o (altogether takes < 5 min.):

  # building bowtie index from our selected assembly fasta file
bowtie2-build spades-careful-assembly/contigs.fasta \

  # mapping our reads (takes ~1 min.)
bowtie2 -q -x spades-careful-assembly.btindex \
        -1 BCep_R1_err_corr.fq.gz -2 BCep_R2_err_corr.fq.gz \
        -p 6 -S spades-careful-assembly.sam

  # converting to a bam file (takes < 1 min.)
samtools view -bS spades-careful-assembly.sam > B-cep-assembly.bam

  # sorting and indexing our bam file (takes < 1 min.)
anvi-init-bam B-cep-assembly.bam -o B-cep.bam

We can now integrate this mapping information into anvi’o with the anvi-profile program, which generates another type of database anvi’o calls a “profile database” that holds extrinsic information about our contigs. For our purposes right now, this is mostly about storing coverage information for each contig. We’ll see what that looks like and why it’s helpful soon!

  # integrating the read-mapping info into anvi'o (takes ~2 min.)
anvi-profile -i B-cep.bam -c contigs.db -M 1000 -T 6 --cluster-contigs -o B-cep-profiled/

18.2.2. Pulling out some sequences and summarizing our assembly

Ok, great, so we’ve just generated and put quite a bit of information about our assembly into our contigs and profile databases. And this being anvi’o, it’s now very easy to access specifics of that information when we want it (you can see all of the available programs by typing anvi- and hitting tab twice).

For example, one of the commands we just ran, anvi-run-hmms, searched for bacterial single-copy genes. We can ask anvi’o to give us which genes were identified using anvi-get-sequences-for-hmm-hits (and adding the flag -h will tell us all about how to use the program). Including adding the -l flag to see which HMM sets we have:

anvi-get-sequences-for-hmm-hits -c contigs.db -l

Which shows us “Campbell_et_al” which we specified above. And adding the -L flag will list the names of the genes in that set:

anvi-get-sequences-for-hmm-hits -c contigs.db --hmm-sources Campbell_et_al -L

And say we want to pull out the “RecA” gene, which is a commonly targeted bacterial single-copy gene involved in DNA repair:

anvi-get-sequences-for-hmm-hits -c contigs.db --hmm-sources Campbell_et_al \
                                --gene-names RecA \
                                --get-aa-sequences \
                                --no-wrap -o RecA.faa

If we print out what’s in that file with cat:

cat RecA.faa

Copy it and paste it into NCBI’s protein BLAST here and BLAST it, we can see that the top hit is 100% identical across the entire query to B. cepacia, and all top hits are to the same species (which is a nice little sanity check). Here are just a few from the top:

We can also generate some summary statistics on our assembly, including estimated percent completion and redundancy based on the presence/absence of the single-copy marker genes we scanned for above. Here are the two steps needed to summarize our isolate-genome assembly:

  # this is adding all contigs to a group called "DEFAULT"
anvi-script-add-default-collection -p B-cep-profiled/PROFILE.db
  # and here is our summary command
anvi-summarize -c contigs.db -p B-cep-profiled/PROFILE.db \
               -C DEFAULT -o B-cepacia-assembly-summary/

A lot was generated with that, now found in our new directory, “B-cepacia-assembly-summary/”. If we glance at the “bins_summary.txt” file, we can see some summary statistics of our assembled genome, including the completion/redundancy estimates (column -t will keep the columns lined up for us):

column -t B-cepacia-assembly-summary/bins_summary.txt
## output from above command ##
# bins        taxon         total_length  num_contigs  N50     GC_content         percent_completion  percent_redundancy
# EVERYTHING  Burkholderia  8472061       92           194480  66.44548680306582  99.28057553956835   1.4388489208633093

Which shows us, in the second column, the majority of taxonomy calls by Centrifuge were actually for the right genus, that’s always nice. Also, based on the Campbell et al. 2013 bacterial single-copy marker genes, our assembly is estimated to be ~99.3% complete with ~1.4% redundancy. This approach doesn’t actually add up to 100% completion and 0% redundancy in many, if any, organisms, so for comparison’s sake, running the ATCC 25416 reference genome through the same pipeline gives the same results based on this single-copy gene set. Though we are still ~130 Mbps short, as we saw with the QUAST output above.

But great, things look pretty much as they should so far. We can also visually inspect our assembly, and how the reads that went into it recruit to it. In theory, if all the DNA in the assembly came from the same organisms (i.e. it’s a clean assembly), there should be pretty even coverage across the whole thing. And if there is contamination in there (something other than our target B. cepacia, it will likely have very different coverage). So let’s finally take a look with anvi-interactive.

18.2.3. Visualizing with anvi-interactive

To be able to visualize our work from the server, we need to ssh in slightly different than we did earlier. So now, let’s type exit in our terminal to disconnect, press the up arrow to bring up our last login command, and modify it to look like this:

ssh -L 8080:localhost:8080 dibada@XXX.XXX.XXX.XXX

Where the -L 8080:localhost:8080 stays exactly the same as it appears above, but dibada above should be your room name, and the IP address at the end should be yours. You will also need to enter the password again.

Next we need to re-activate our conda environment and change back into our working directory:

conda activate de_novo_example
cd ~/genomics_de_novo_temp/working/

Next, let’s copy-and-paste this command to get the link we will need to view our anvi’o interactive interface:

echo "http://$(hostname -i):8080"

Now let’s start up the anvi’o interactive interface:

anvi-interactive -c contigs.db -p B-cep-profiled/PROFILE.db \
                 --server-only -P 8080

Now if we copy the link that was generated above and paste it into our browser, after the page loads and we click “Draw” in the bottom left corner, it should look something like this:

So there is a lot going on here at first glance, especially if we’re not yet familiar with how anvi’o organizes things. So here’s a quick crash course 🙂

At the center of the figure is a hierarchical clustering of the contigs from our assembly (here clustered based on tetranucleotide frequency and coverage). So each tip (leaf) represents a contig (or a fragment of a contig as each is actually broken down into a max of ~20,000bps, but for right now we’ll just be referring to them as contigs). Then radiating out from the center are layers of information (“Parent”, “Length”, “GC content”, etc.), with each layer displaying information for each contig.

One of the first things that might jump out here is the outer layer (purple in this image, but may be different on yours), labeled “Taxonomy”. There is actually a color for each contig for whatever taxonomy was assigned to the majority of genes in that particular contig. This solid bar all around tells us that the genes in almost the entire assembly were identified as Burkholderia – minus two contigs which were not classified as anything (and have no color there). The next thing that stands out is how stable the mean coverage is across almost all contigs, other than that area near the top in this image (yours may be oriented differently too). That likely holds things found in multiple copies in the genome. For instance, according to IMG, B. cepacia ATCC 25416 has 7 16S copies and 9 23S copies, which would complicate assembly if they aren’t all identical, and would inflate their coverage compared to the parts of the genome that exist in single copy. Overall this is great and shows the culture really seems to have been axenic.

Just for a quick comparison, here is the same type of figure (taxonomy at the inner-part of the circle here), but from an enrichment culture of a cyanobacterium (so “contaminant” genomes in there too), rather than axenic:

Here the highlighted contigs, labeled “Bin 1”, represent the targeted cultivar from this sequencing run, demonstrating a nice example of how anvi’o can help you manually curate bins you’re trying derive from assemblies.

While we didn’t need much (any) manual curation in this case, it was still a good idea to visually inspect the coverage of our assembly to make sure nothing weird was going on. And if we wanted, we could further explore those parts with higher coverage to find out which parts of the genome seem to exist in greater than 1 copy.

18.3. Epilogue

This and the prior lesson were all basically to get a high-quality draft of our isolate genome, that we could feel confident about investigating further. Once we feel comfortable with our assembled genome, we can go a lot of different ways. Going into individual approaches are beyond the scope of this particular page, but here are just a few examples.

18.3.1. Phylogenomics

Pull available reference genomes of close relatives and build a phylogenomic tree to get a robust estimate of where our newly acquired isolate(s) fit in evolutionarily with what is already known.

18.3.2. Distributions

Pull available metagenomes from studies and recruit the reads to a reference library containing our isolate (and its close relatives if it has any) to begin assessing their distributions across samples. This example is done with ocean samples, but the same principle can be applied to any environments.

18.3.3. Pangenomics

Start investigating differences in the genetic complement of our new isolate as compared to its known close relatives. This example figure is combining pangenomics (the core of the figure showing the presence or absence of genes within each genome) with metagenomics (distributions of the genomes across samples in the top right corner) to try to associate genomic variability with ecological delineations:

And so much more! This is where just pure data crunching slows down, and the actual science begins. The above are just some of the ways to get to the point where we can then consider our experimental design and our questions and let them guide where we go next 🙂