Variant calling

The goal of this tutorial is to show you the basics of variant calling using Samtools.

We’ll be using data from one of Rich Lenski’s LTEE papers, the one on the evolution of citrate consumption in LTEE.

Booting an Amazon AMI

Start up an Amazon computer (m1.large or m1.xlarge) using AMI ami-7607d01e (see Start up an EC2 instance and Starting up a custom operating system).

Log in with Windows or from Mac OS X.

Updating the operating system

Copy and paste the following two commands

apt-get update
apt-get -y install screen git curl gcc make g++ python-dev unzip \
        default-jre pkg-config libncurses5-dev r-base-core \
        r-cran-gplots python-matplotlib sysstat

to update the computer with all the bundled software you’ll need.

Install software

First, we need to install the BWA aligner:

cd /root
wget -O bwa-0.7.10.tar.bz2 http://sourceforge.net/projects/bio-bwa/files/bwa-0.7.10.tar.bz2/download

tar xvfj bwa-0.7.10.tar.bz2
cd bwa-0.7.10
make

cp bwa /usr/local/bin

Also install samtools:

apt-get -y install samtools

Download data

Download the reference genome and the resequencing reads:

cd /mnt

curl -O http://athyra.idyll.org/~t/REL606.fa.gz
gunzip REL606.fa.gz

curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR098/SRR098038/SRR098038.fastq.gz

Note, this last URL is the “Fastq files (FTP)” link from the European Nucleotide Archive (ENA) for this sample: http://www.ebi.ac.uk/ena/data/view/SRR098042.

Do the mapping

Now let’s map all of the reads to the reference. Start by indexing the reference genome:

cd /mnt

bwa index REL606.fa

Now, do the mapping of the raw reads to the reference genome:

bwa aln REL606.fa SRR098038.fastq.gz  > SRR098038.sai

Make a SAM file (this would be done with ‘sampe’ if these were paired-end reads):

bwa samse REL606.fa SRR098038.sai SRR098038.fastq.gz > SRR098038.sam

This file contains all of the information about where each read hits on the reference.

Next, index the reference genome with samtools:

samtools faidx REL606.fa

Convert the SAM into a BAM file:

samtools import REL606.fa.fai SRR098038.sam SRR098038.bam

Sort the BAM file:

samtools sort SRR098038.bam SRR098038.sorted

And index the sorted BAM file:

samtools index SRR098038.sorted.bam

Visualizing alignments

At this point you can visualize with samtools tview or Tablet.

‘samtools tview’ is a text interface that you use from the command line; run it like so:

samtools tview SRR098038.sorted.bam REL606.fa

The ‘.’s are places where the reads align perfectly in the forward direction, and the ‘,’s are places where the reads align perfectly in the reverse direction. Mismatches are indicated as A, T, C, G, etc.

You can scroll around using left and right arrows; to go to a specific coordinate, use ‘g’ and then type in the contig name and the position. For example, type ‘g’ and then ‘rel606:553093<ENTER>’ to go to position 553093 in the BAM file.

Use ‘q’ to quit.

For the Tablet viewer, click on the link and get it installed on your local computer. Then, start it up as an application. To open your alignments in Tablet, you’ll need three files on your local computer: REL606.fa, SRR098042.sorted.bam, and SRR098042.sorted.bam.bai. You can copy them over using Dropbox, for example.

Counting alignments

This command:

samtools view -c -f 4 SRR098038.bam

will count how many reads DID NOT align to the reference (214518).

This command:

samtools view -c -F 4 SRR098038.bam

will count how many reads DID align to the reference (6832113).

And this command:

gunzip -c SRR098038.fastq.gz | wc

will tell you how many lines there are in the FASTQ file (28186524). Reminder: there are four lines for each sequence.

Calling SNPs

You can use samtools to call SNPs like so:

samtools mpileup -uD -f REL606.fa SRR098038.sorted.bam | bcftools view -bvcg - > SRR098038.raw.bcf

(See the ‘mpileup’ docs here.)

Now convert the BCF into VCF:

bcftools view SRR098038.raw.bcf > SRR098038.vcf

You can check out the VCF file by using ‘tail’ to look at the bottom:

tail *.vcf

Each variant call line consists of the chromosome name (for E. coli REL606, there’s only one chromosome - rel606); the position within the reference; an ID (here always ‘.’); the reference call; the variant call; and a bunch of additional information about

Again, you can use ‘samtools tview’ and then type (for example) ‘g’ ‘rel606:4616538’ to go visit one of the positions. The format for the address to go to with ‘g’ is ‘chr:position’.

You can read more about the VCF file format here.

Questions/discussion items

Why so many steps?


LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- fork @ github. Presentations (PPT/PDF) and PDFs are the property of their respective owners and are under the terms indicated within the presentation.
comments powered by Disqus