# Variant calling¶

The goal of this tutorial is to show you the basics of variant calling using Samtools.

We’re going to be looking at variation in laboratory grown strains of Escherichia coli. We have reads from B strain REL606 and we’ll be mapping it to a reference genome from BL21(DE3). This is a different lab strain, and there’s an interesting paper where they trace the origin and transfer of all the different E. coli strains between scientisits through the decades.

Citation: Tracing Ancestors and Relatives of Escherichia coli B, and the Derivation of B Strains REL606 and BL21(DE3) Journal of Molecular Biology, Volume 394, Issue 4, 11 December 2009, Pages 634–643

## Booting an Amazon AMI¶

Start up an Amazon computer (large or xlarge) with an storage of 100Gb.

## Install software¶

sudo apt-get update
sudo apt-get install build-essential
sudo apt-get install ruby git
ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Linuxbrew/install/master/install)" export PATH="/home/ubuntu/.linuxbrew/bin:$PATH"
export MANPATH="/home/ubuntu/.linuxbrew/share/man:$MANPATH" export INFOPATH="/home/ubuntu/.linuxbrew/share/info:$INFOPATH"
brew tap homebrew/science


Now we can install anything available from linuxbrew science

brew install samtools
brew install zlib
brew install bcftools
brew install bwa


See what is installed:

brew list


curl "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=NC_012971&rettype=fasta&retmode=text" > Ecoli_BL21.fasta
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR098/SRR098038/SRR098038.fastq.gz


Note, this last URL is the “Fastq files (FTP)” link from the EBI page. Its compressed, lets decompress

gunzip SRR098038.fastq.gz


Just in case EBI is down , you can also get reads this way

curl -O ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/SRA026/SRA026813/SRX040675/SRR098038.fastq.bz2


## Rename the reference¶

The reference is named something really long and complicated. Check it out

head Ecoli_BL21.fasta


Lets shorten that for fewer headaches. Use nano to make the header look like this:

>NC_012971.2


Create the BWA index

bwa index Ecoli_BL21.fasta


Now, do the mapping of the raw reads to the reference genome

bwa aln Ecoli_BL21.fasta SRR098038.fastq > SRR098038.sai


Make a SAM file (this would be done with ‘sampe’ if these were paired-end reads)

bwa samse Ecoli_BL21.fasta SRR098038.sai SRR098038.fastq > SRR098038.sam


Next, index the reference genome with samtools

samtools faidx Ecoli_BL21.fasta


Convert the SAM into a BAM file

samtools view -bS SRR098038.sam > SRR098038.bam


Sort the BAM file

samtools sort SRR098038.bam > SRR098038.sorted.bam


And index the sorted BAM file

samtools index SRR098038.sorted.bam


## Visualizing alignments¶

At this point you can visualize with samtools tview. Other visualization software: * Tablet. * IGV

‘samtools tview’ is a text interface that you use from the command line; run it like so

samtools tview SRR098038.sorted.bam Ecoli_BL21.fasta


The ‘.’s are places where the reads align perfectly in the forward direction, and the ‘,’s are places where the reads align perfectly in the reverse direction. Mismatches are indicated as A, T, C, G, etc.

You can scroll around using left and right arrows; to go to a specific coordinate, use ‘g’ and then type in the contig name and the position. For example, type ‘g’ and then ‘NC_012971.2:553093<ENTER>’ to go to position 553093 in the BAM file. (This name is taken from the fasta reference file, you could change to something more reasonable).

Use ‘q’ to quit.

## Statistics of alignments¶

This command

samtools view -c -f 4 SRR098038.bam


will count how many reads DID NOT align to the reference (214518).

This command

samtools view -c -F 4 SRR098038.bam


will count how many reads DID align to the reference (6832113).

And this command

wc -l SRR098038.fastq


will tell you how many lines there are in the FASTQ file (28186524). Reminder: there are four lines for each sequence.

There is another package, Picard Tools, that can give you more in depth information. Lets install with linuxbrew

brew install picard-tools


And use the particular tool CollectAlignmentSummaryMetrics

picard CollectAlignmentSummaryMetrics R=Ecoli_BL21.fasta I=SRR098038.sorted.bam O=statistics.txt


More picard tools stuff here

You can see the output with cat

cat statistics.txt


The definitions of all the columns in this file.

## Calling SNPs¶

You can use samtools and bcftools to call SNPs. They have great documentation of a standard workflow for calling SNPs, you should read more about it. We’re going to do a simplified and updated version here.

samtools mpileup -uf Ecoli_BL21.fasta SRR098038.sorted.bam | bcftools call -vmO v -o SRR098038.vcf --ploidy 1 --threads 2


You can check out the VCF file by using ‘tail’ to look at the bottom

tail SRR098038.vcf


Each variant call line consists of the chromosome name (for E. coli REL606, there’s only one chromosome); the position within the reference; an ID (here always ‘.’); the reference call; the variant call; and a bunch of additional information about the variant.

The information at the end can be very useful but difficult to interpret. One way to quickly look up the label shorthand is to grep

 grep '<ID=VDB' SRR098038.vcf
grep '<ID=AC' SRR098038.vcf

samtools tview SRR098038.sorted.bam Ecoli_BL21.fasta


You can use ‘samtools tview’ again and then type (for example) ‘g’ ‘rel606:4616538’ to go visit one of the positions. The format for the address to go to with ‘g’ is ‘chr:position’.

NC_012971.2:4558366


## Using IGV for Visualization¶

Installing IGV requires registration and some patience. IGV Link.

To open your alignments, you’ll need three files on your local computer: Ecoli_BL21.fasta, SRR098038.sorted.bam, and SRR098038.sorted.bam.bai. You can copy them over using scp (secure copy), for example. You will do this from a terminal on your computer that is NOT connected to amazon.

scp -i ~/Downloads/???.pem ubuntu@???:/home/ubuntu/Ecoli_BL21.fasta ~/Downloads


To add the gene annotation, get this file as well

curl ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Bacteria/Escherichia_coli_BL21_DE3__uid161947/NC_012971.gff


## Student Exercise¶

You are eager to use some E. coli reads from a collaborator, which you can download here

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR201/004/SRR2014554/SRR2014554_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR201/004/SRR2014554/SRR2014554_2.fastq.gz


You need to quality trim them, map them to the E. coli reference, and call SNPs. How far can you get?

LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- fork @ github. Presentations (PPT/PDF) and PDFs are the property of their respective owners and are under the terms indicated within the presentation.