# De novo genome exploration Here we're going to take off from a de novo prokaryotic genome assembly performed in [this lesson](de-novo-genome-assembly.md). That involved logging into our JetStream instances as [noted here](de-novo-genome-assembly.md#accessing-our-jetstreaminstances) and setting up a specific conda environment as [noted here](de-novo-genome-assembly.md#setting-up-our-working-environment). If you didn't do that lesson, be sure to set up that conda environment as shown there. If you did, then be sure to activate that conda environment with the following after logged in: ```bash conda activate de_novo_example ``` ## The data The data used here is described [here](de-novo-genome-assembly.md#the-data). If you did not do the [previous lesson](de-novo-genome-assembly.md), the following code block will download the data and copy over the required assembly. But otherwise you will have all of this already: ```bash cd ~ curl -L https://ndownloader.figshare.com/files/16197626 -o genomics_de_novo_temp.tar.gz tar -xzvf genomics_de_novo_temp.tar.gz rm genomics_de_novo_temp.tar.gz cp -r ~/genomics_de_novo_temp/downloaded_results/assemblies/spades-careful-assembly/ ~/genomics_de_novo_temp/working/spades-careful-assembly/ ``` Now let's change directories into our working directory: ```bash cd ~/genomics_de_novo_temp/working/ ``` ## Exploring our assembly with anvi'o Now that we've selected an assembly we're going to move forward with, we can start to take a closer look at it. To do that we're going to use a wonderful platform called [anvi'o](http://merenlab.org/software/anvio/) - which stands for analysis and visualization of omics data. anvi'o is very expansive, but to try to summarize it in a sentence, it is a powerful and user-friendly data visualization and exploration platform. There is a ton of excellent documentation at the [anvi'o site](http://merenlab.org/software/anvio/), including workflows, tutorials, and blog posts. Here we're going to put our isolate-genome assembly into the anvi'o framework and see just a few of the ways it can help us begin to look at our assembled genome - including a visual way of checking for possible contamination (which in this case would be anything in there that's *not* our target *Burkholderia cepacia*). Many of the major steps we're going to be performing here are laid out in the [metagenomic workflow presented here](http://merenlab.org/2016/06/22/anvio-tutorial-v2/), as much of the processing is the same even though we are not working with a metagenome here. ### Putting our assembly into anvi'o and generating some information about it For us to get our assembly into anvi'o, first we need to generate what it calls a [contigs database](http://merenlab.org/2016/06/22/anvio-tutorial-v2/#creating-an-anvio-contigs-database). This contains the contigs from our assembly and some information intrisic to them (like GC content, tetranucleotide frequency, and open-reading frame coordinates). The following command will organize our contigs in an anvi'o-friendly way, generate some basic stats about them, and use the program [Prodigal](https://github.com/hyattpd/Prodigal) to identify [open-reading frames](https://en.wikipedia.org/wiki/Open_reading_frame) (takes < 2 min.): ```bash anvi-gen-contigs-database -f spades-careful-assembly/contigs.fasta \ -o contigs.db -n B_cepacia_assembly ``` Now that we have our `contigs.db` that holds our sequences and some basic information about them, we can start adding more information. This is one of the places where the flexibility comes into play, but for now we'll just move forward with some parts of a general anvi'o workflow, including: - using the program [HMMER](http://hmmer.org/) with profile hidden Markov models (HMMs) to scan for bacterial single-copy genes [(from Campbell et al. 2013)](http://www.pnas.org/content/110/14/5540.short) - if new to HMMs, see the bottom of page 7 [here](http://eddylab.org/software/hmmer3/3.1b2/Userguide.pdf) for a good explanation of what exactly a "hidden Markov model" is in the realm of sequence data - to summarize, instead of treating all columns of an alignment equally (like something like BLAST does), HMMs can weight individual columns in an alignment differently. So the highly-conserved stretches within, say, a conserved domain of a protein will matter more than a stretch in between two domains that may vary more. - using [NCBI COGs](https://www.ncbi.nlm.nih.gov/COG/) to functionally annotate the open-reading frames [Prodigal](https://github.com/hyattpd/Prodigal) predicted - using a tool called [centrifuge](https://ccb.jhu.edu/software/centrifuge/index.shtml) for taxonomic classification of the identified open-reading frames - generate coverage information for our assembly so we can visualize it The following block of code would take ~45 minutes to run, mostly because of needing to download and setup some files for their first use. So we are going to skip this code block, copy over the results, and pick up with mapping to generate the coverage information. ```bash ## DON'T RUN THIS CODE BLOCK; WE WILL COPY OVER THE RESULTS ## # HMM searching for bacterial single-copy genes # anvi-run-hmms -I Campbell_et_al -c contigs.db -T 6 # functional annotation with DIAMOND against NCBI's COGs # anvi-setup-ncbi-cogs -T 6 # only needed the first time # anvi-run-ncbi-cogs -c contigs.db --num-threads 6 # exporting Prodigal-identified open-reading frames from anvi'o # anvi-get-sequences-for-gene-calls -c contigs.db -o gene_calls.fa # setting up centrifuge for taxonomy (only needed the first time) # wget ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data/p_compressed+h+v.tar.gz # tar -xzvf p_compressed+h+v.tar.gz && rm -rf p_compressed+h+v.tar.gz # running centrifuge taxonomy (this taxonomically classifies each identified coding sequence) # centrifuge -f -x p_compressed+h+v gene_calls.fa -S centrifuge_hits.tsv -p 6 # importing the taxonomy results into our anvi'o contigs database # anvi-import-taxonomy-for-genes -p centrifuge -c contigs.db \ # -i centrifuge_report.tsv \ # centrifuge_hits.tsv ``` Let's copy over the new "contigs.db" that has the information stored from running those commands: ```bash cp ../downloaded_results/anvio/contigs.db . ``` The last thing we want to add right now is mapping information from recruiting our reads to the assembly we built with them. Here is generating that mapping information with [bowtie2](https://github.com/BenLangmead/bowtie2) (this is another aligner like `bwa` that we used) and preparing the output file for anvi'o (altogether takes < 5 min.): ```bash # building bowtie index from our selected assembly fasta file bowtie2-build spades-careful-assembly/contigs.fasta \ spades-careful-assembly.btindex # mapping our reads (takes ~1 min.) bowtie2 -q -x spades-careful-assembly.btindex \ -1 BCep_R1_err_corr.fq.gz -2 BCep_R2_err_corr.fq.gz \ -p 6 -S spades-careful-assembly.sam # converting to a bam file (takes < 1 min.) samtools view -bS spades-careful-assembly.sam > B-cep-assembly.bam # sorting and indexing our bam file (takes < 1 min.) anvi-init-bam B-cep-assembly.bam -o B-cep.bam ``` We can now integrate this mapping information into anvi'o with the `anvi-profile` program, which generates another type of database anvi'o calls a "profile database" that holds extrinsic information about our contigs. For our purposes right now, this is mostly about storing coverage information for each contig. We'll see what that looks like and why it's helpful soon! ```bash # integrating the read-mapping info into anvi'o (takes ~2 min.) anvi-profile -i B-cep.bam -c contigs.db -M 1000 -T 6 --cluster-contigs -o B-cep-profiled/ ``` ### Pulling out some sequences and summarizing our assembly Ok, great, so we've just generated and put quite a bit of information about our assembly into our contigs and profile databases. And this being anvi'o, it's now very easy to access specifics of that information when we want it (you can see all of the available programs by typing `anvi-` and hitting tab twice). For example, one of the commands we just ran, `anvi-run-hmms`, searched for bacterial single-copy genes. We can ask anvi'o to give us which genes were identified using `anvi-get-sequences-for-hmm-hits` (and adding the flag `-h` will tell us all about how to use the program). Including adding the `-l` flag to see which HMM sets we have: ```bash anvi-get-sequences-for-hmm-hits -c contigs.db -l ``` Which shows us "Campbell_et_al" which we specified above. And adding the `-L` flag will list the names of the genes in that set: ```bash anvi-get-sequences-for-hmm-hits -c contigs.db --hmm-sources Campbell_et_al -L ``` And say we want to pull out the "RecA" gene, which is a commonly targeted bacterial single-copy gene involved in DNA repair: ```bash anvi-get-sequences-for-hmm-hits -c contigs.db --hmm-sources Campbell_et_al \ --gene-names RecA \ --get-aa-sequences \ --no-wrap -o RecA.faa ``` If we print out what's in that file with `cat`: ```bash cat RecA.faa ``` Copy it and paste it into [NCBI's protein BLAST here](https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome) and BLAST it, we can see that the top hit is 100% identical across the entire query to *B. cepacia*, and all top hits are to the same species (which is a nice little sanity check). Here are just a few from the top: