We will run a variant calling pipeline using Genome Analysis Toolkit (GATK) using a subset sample of dog WGS as a representative to large mammalian genomes.
Start up an m1.medium instance running Ubuntu 16.04 on Jetstream.
log in, and then make & change into a working directory:
mkdir ~/GATK_tutorial && cd ~/GATK_tutorial
wget https://de.cyverse.org/dl/d/3CE425D7-ECDE-46B8-AB7F-FAF07048AD42/samples.tar.gz
tar xvzf samples.tar.gz
rm samples.tar.gz
Quick notes about read trimming for variant calling:
- Trimming is data loss so be careful.
- Sequence trimming is complementary to variant filtration
- Sources of errors: a) The call is suspicious ==> low quality score (variant filtration is better than quality trimming) b) Technical problems (e.g. sequencing chemistry or physics) ==> systematic errors (can be removed by careful kmer based trimming but GATK recalibration is an alternative)
- Very mild quality trimming: SLIDINGWINDOW:4:2 ==> this means that the Base call accuracy is ~ 40%
Install bwa:
cd
curl -L https://sourceforge.net/projects/bio-bwa/files/bwa-0.7.15.tar.bz2/download > bwa-0.7.15.tar.bz2
tar xjvf bwa-0.7.15.tar.bz2
cd bwa-0.7.15
make
sudo cp bwa /usr/local/bin
echo 'export PATH=$PATH:/usr/local/bin' >> ~/.bashrc
source ~/.bashrc
change into a working directory:
cd ~/GATK_tutorial
download and prepare the reference for mapping
wget https://de.cyverse.org/dl/d/A9330898-FC54-42A5-B205-B1B2DC0E91AE/dog_chr5.fa.gz
gunzip dog_chr5.fa.gz
bwa index -a bwtsw dog_chr5.fa
Add Read group information and do mapping
Read group information is typically added during this step, but can also be added or modified after mapping using Picard AddOrReplaceReadGroups.
for R1 in *_R1_001.pe.fq.gz;do
SM=$(echo $R1 | cut -d"_" -f1) ##sample ID
LB=$(echo $R1 | cut -d"_" -f1,2) ##library ID
PL="Illumina" ##platform (e.g. illumina, solid)
RGID=$(zcat $R1 | head -n1 | sed 's/:/_/g' |cut -d "_" -f1,2,3,4) ##read group identifier
PU=$RGID.$LB ##Platform Unit
echo -e "@RG\tID:$RGID\tSM:$SM\tPL:$PL\tLB:$LB\tPU:$PU"
R2=$(echo $R1 | sed 's/_R1_/_R2_/')
echo $R1 $R2
bwa mem -t 4 -M -R "@RG\tID:$RGID\tSM:$SM\tPL:$PL\tLB:$LB\tPU:$PU" dog_chr5.fa $R1 $R2 > ${R1%_R1_001.pe.fq.gz}.sam
done
install samtools
sudo apt-get -y install samtools
generate & sort BAM file
for samfile in *.sam;do
sample=${samfile%.sam}
samtools view -bS -o $sample.bam $samfile
samtools sort $sample.bam $sample.sorted
done
rm *.sam *_L00[0-9].bam
Install Java
sudo mkdir -p /usr/local/java
cd /usr/local/java
sudo wget -c --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/8u131-b11/d54c1d3a095b4ff2b6607d096fa80163/jdk-8u131-linux-x64.tar.gz
sudo tar xvzf jdk-8u131-linux-x64.tar.gz
echo 'export PATH=$PATH:/usr/local/java/jdk1.8.0_131/jre/bin' >> ~/.bashrc
source ~/.bashrc
Download Picard tools
cd ~/GATK_tutorial
wget https://github.com/broadinstitute/picard/releases/download/2.9.4/picard.jar
merge the replicates
java -Xmx10g -jar picard.jar MergeSamFiles I=BD143_TGACCA_L005.sorted.bam I=BD143_TGACCA_L006.sorted.bam OUTPUT=BD143_TGACCA_merged.sorted.bam
check for the changes in the header
samtools view -H BD143_TGACCA_L005.sorted.bam
samtools view -H BD143_TGACCA_L006.sorted.bam
samtools view -H BD143_TGACCA_merged.sorted.bam
remove the individual replicates
rm BD143_TGACCA_L00*.sorted.bam
Duplicates:
- PCR duplicates (originating from a single fragment of DNA) or
- optical duplicates (result from a single amplification cluster, incorrectly detected as multiple clusters by the optical sensor of the sequencing instrument)
Duplicate marking should NOT be applied to amplicon sequencing data or other data types where reads start and stop at the same positions by design.
for sample in *.sorted.bam;do
name=${sample%.sorted.bam}
java -Xmx10g -jar picard.jar MarkDuplicates INPUT=$sample OUTPUT=$name.dedup.bam METRICS_FILE=$name.metrics.txt;
done
download Genome Analysis Toolkit (GATK)
wget https://de.cyverse.org/dl/d/6177B1E0-718A-4F95-A83B-C3B88E23C093/GenomeAnalysisTK-3.7-0.tar.bz2
tar xjf GenomeAnalysisTK-3.7-0.tar.bz2
Prepare GATK dictionary and index for the reference genome
java -Xmx10g -jar picard.jar CreateSequenceDictionary R=dog_chr5.fa O=dog_chr5.dict
samtools faidx dog_chr5.fa
Download known polymorphic sites
wget 'ftp://ftp.ensembl.org/pub/release-89/variation/vcf/canis_familiaris/Canis_familiaris.vcf.gz' -O canis_familiaris.vcf.gz
Select variants on chr5 and correct chr name
gunzip canis_familiaris.vcf.gz
grep "^#" canis_familiaris.vcf > canis_fam_chr5.vcf
grep "^5" canis_familiaris.vcf | sed 's/^5/chr5/' >> canis_fam_chr5.vcf
This algorithm treats every reference mismatch as an indication of error, so it is critical that a “comprehensive” database of known polymorphic sites is given to the tool in order to be masked and not counted as errors. What we can do with semi-model organisms?
Note the differences between genome annotation databases. Not only chromosome names but more importantly the coordinate system (interesting post)
download R (only to generate figures to observe the changes, but we will need it later as well)
sudo apt-get update && sudo apt-get -y install gdebi-core r-base
After that finishes, download and install RStudio:
wget https://download2.rstudio.org/rstudio-server-1.0.143-amd64.deb
sudo gdebi -n rstudio-server-1.0.143-amd64.deb
Install some packages
sudo Rscript -e "install.packages('ggplot2', contriburl=contrib.url('http://cran.r-project.org/'))"
sudo Rscript -e "install.packages('gplots', contriburl=contrib.url('http://cran.r-project.org/'))"
sudo Rscript -e "install.packages('reshape', contriburl=contrib.url('http://cran.r-project.org/'))"
sudo Rscript -e "install.packages('gsalib', contriburl=contrib.url('http://cran.r-project.org/'))"
sudo Rscript -e "install.packages('Biobase', contriburl=contrib.url('http://bioconductor.org/packages/release/bioc/'))"
Add a password to your instance
sudo passwd tx160085
You will be prompted to enter a new password. Make a password you can remember:
Enter new UNIX password:
Retype new UNIX password:
Get the address of your own RStudio web server
echo My RStudio Web server is running at: http://$(hostname):8787/
Copy the link to a new tab of your browser and hit enter
.
Username: `tx160085`
Password: `The one you just created`
Keep this tab open and will come back to it in a min. ** Now go to your web shell **
run recalibration
for sample in *.dedup.bam;do
name=${sample%.dedup.bam}
samtools index $sample
java -Xmx10g -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R dog_chr5.fa -I $sample -knownSites canis_fam_chr5.vcf -o $name.1st.table
java -Xmx10g -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R dog_chr5.fa -I $sample -knownSites canis_fam_chr5.vcf -BQSR $name.1st.table -o $name.2nd.table
java -Xmx10g -jar GenomeAnalysisTK.jar -T PrintReads -R dog_chr5.fa -I $sample -BQSR $name.2nd.table -o $name.recal.bam
java -Xmx10g -jar GenomeAnalysisTK.jar -T AnalyzeCovariates -R dog_chr5.fa -before $name.1st.table -after $name.2nd.table -plots $name.BQSR.pdf
done
More details about the tool and interpretation of the output figures
per-sample calling
for sample in *.recal.bam;do
name=${sample%.recal.bam}
java -Xmx10g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R dog_chr5.fa --dbsnp canis_fam_chr5.vcf -I $sample --emitRefConfidence GVCF -nct 3 -o $name.g.vcf
done
Joint Genotyping
java -Xmx10g -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R dog_chr5.fa --dbsnp canis_fam_chr5.vcf \
--variant BD143_TGACCA_merged.g.vcf \
--variant BD174_CAGATC_L005.g.vcf \
--variant BD225_TAGCTT_L007.g.vcf \
-o raw_variants.vcf
The best way to filter the raw variant callset is to use variant quality score recalibration (VQSR). However this requires high-quality sets of known variants for training, which for many organisms are not yet available. It also requires a lot of data, so it can be difficult or even impossible to use on small datasets that involve only one or a few samples, on targeted sequencing data, or on RNAseq.
Hard filtering flat thresholds for specific annotations: GATK uses VariantFiltration for hard filtering. The documentation page provides links to all possible annotation modules. You can get some recommendations here.
Split variants into SNPs and INDELs
java -Xmx10g -jar GenomeAnalysisTK.jar -T SelectVariants -R dog_chr5.fa -V raw_variants.vcf -selectType SNP -o raw_SNP.vcf
java -Xmx10g -jar GenomeAnalysisTK.jar -T SelectVariants -R dog_chr5.fa -V raw_variants.vcf -selectType INDEL -o raw_INDEL.vcf
Explore the distribution of different annotations
wget https://raw.githubusercontent.com/drtamermansour/angus/2017/densityCurves.R
for var in "SNP" "INDEL";do
for ann in "QD" "MQRankSum" "FS" "SOR" "ReadPosRankSum";do
annFile=$var.$ann; echo $annFile;
awk -v k="$ann=" '!/#/{n=split($8,a,";"); for(i=1;i<=n;i++) if(a[i]~"^"k) {sub(k,$3" ",a[i]); print a[i]}}' raw_$var.vcf > $annFile
grep -v "^\." $annFile > known.$annFile
grep "^\." $annFile > novel.$annFile
Rscript densityCurves.R "$annFile"
rm $annFile known.$annFile novel.$annFile
done; done
Apply the filters
java -Xmx10g -jar GenomeAnalysisTK.jar -T VariantFiltration -R dog_chr5.fa -V raw_SNP.vcf \
--filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0" \
--filterName "snp_filter" \
-o filtered_SNP.vcf
java -Xmx10g -jar GenomeAnalysisTK.jar -T VariantFiltration -R dog_chr5.fa -V raw_INDEL.vcf \
--filterExpression "QD < 2.0 || FS > 200.0" \
--filterName "indel_filter" \
-o filtered_INDEL.vcf