Variant calling and exploration of polymorphisms¶
Now that we have some experience in R, we will check out a vcf file with polymorphisms from
## Getting the data and installing extra packages¶
Installing a bunch of stuff:
get bwa
cd /root
wget -O bwa-0.7.10.tar.bz2 http://sourceforge.net/projects/bio-bwa/files/bwa-0.7.10.tar.bz2/download
untar and compile (via make) bwa
tar xvfj bwa-0.7.10.tar.bz2
cd bwa-0.7.10
make
cp bwa /usr/local/bin
install some tools
apt-get update
apt-get -y install samtools screen git curl gcc make g++ python-dev unzip \
default-jre pkg-config libncurses5-dev r-base-core \
r-cran-gplots python-matplotlib sysstat libcurl4-openssl-dev libxml2-dev
git clone https://github.com/schimar/ngs2014_popGen.git
cd ngs2014_popGen/var_call2/
Let’s do another round of variant calling¶
index the reference genome
bwa index ref_genome.fna
map our reads to the indexed reference genome
bwa aln ref_genome.fna read_file.fq > mapped_reads.sai
Create the SAM file
bwa samse ref_genome.fna mapped_reads.sai read_file.fq > mapped_reads.sam
Index the reference genome
samtools faidx ref_genome.fna
Convert from SAM to BAM
samtools view -b -S -o mapped_reads.bam mapped_reads.sam
Sort the BAM
samtools sort mapped_reads.bam mapped_reads.sorted
And index again, but now the sorted BAM file
samtools index mapped_reads.sorted.bam
Visualize the alignment
samtools tview mapped_reads.sorted.bam ref_genome.fna
Variant exploration with Bioconductor¶
Now simply type R in the shell and:
source("http://bioconductor.org/biocLite.R")
biocLite()
biocLite("VariantAnnotation")
biocLite("SNPlocs.Hsapiens.dbSNP.20101109")
biocLite("BSgenome.Hsapiens.UCSC.hg19_1.3.1000")
Quality control¶
Now we load the VariantAnnotation package as well as the data. The objective of this exercise is to compare the quality of called SNPs that are located in dbSNP, versus those that are novel.
library(VariantAnnotation)
fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
Locate the sample data in the file system. Explore the metadata (information about the content of the file) using scanVcfHeader . Discover the ‘info’ fields VT (variant type), and RSQ (genotype imputation quality).
(hdr <- scanVcfHeader(fl))
info(hdr)[c("VT", "RSQ"),]
Input the data and peak at their locations:
(vcf <- readVcf(fl, "hg19"))
head(rowData(vcf), 3)
SNPs were called with MaCH/thunder (part of GotCloud) , for more info, see :doc: http://genome.sph.umich.edu/wiki/Thunder and http://genome.sph.umich.edu/wiki/MaCH_FAQ. Notice that the seqnames (chromosome levels) are set to ‘22’, we want to rename those
rowData(vcf) <- renameSeqlevels(rowData(vcf), c("22"="ch22"))
We now load the SNP database and discover whether our SNPs are in dbSNP
library(SNPlocs.Hsapiens.dbSNP.20101109)
destination <- tempfile()
pre <- FilterRules(list(isLowCoverageExomeSnp = function(x) {
grepl("LOWCOV,EXOME", x, fixed=TRUE)
}))
filt <- FilterRules(list(isSNP = function(x) info(x)$VT == "SNP"))
snpFilt <- filterVcf(fl, "hg19", destination, prefilters=pre, filters= filt)
vcf_filt <- readVcf(snpFilt, "hg19")
rowData(vcf)
rowData(vcf_filt)
If we compare vcf and vcf_filt, we see that of the 10376 SNPs in our initial vcf file, 794 are in the database.
inDbSNP <- rownames(vcf) %in% rownames(vcf_filt)
table(inDbSNP)
metrics <- data.frame(inDbSNP = inDbSNP, RSQ = info(vcf)$RSQ)
Let’s finally visualize it:
library(ggplot2)
ggplot(metrics, aes(RSQ, fill=inDbSNP)) +
geom_density(alpha=0.5) +
scale_x_continuous(name="MaCH / Thunder Imputation Quality") +
scale_y_continuous(name="Density") +
theme(legend.position="top")
(This won’t work in R on EC2, simply because we can’t run X11 through an ssh connection)
LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- fork @ github. Presentations (PPT/PDF) and PDFs are the property of their respective owners and are under the terms indicated within the presentation.
comments powered by Disqus