Here we will use the BWA aligner to map short reads to a reference genome, and then call variants (differences between the reads and the reference).
Start up an m1.medium instance running Ubuntu 16.04 on Jetstream.
log in, and then install samtools:
sudo apt-get -y update && \
sudo apt-get -y install trimmomatic fastqc python-pip \
zlib1g-dev ncurses-dev python-dev
Goal: get the sequence data!
Run:
mkdir ~/data
cd ~/data
curl -O http://dib-training.ucdavis.edu.s3.amazonaws.com/2017-ucdavis-igg201b/SRR2584857.fq.gz
Goal: execute a basic mapping
Run the following commands to install bwa:
cd
curl -L https://sourceforge.net/projects/bio-bwa/files/bwa-0.7.15.tar.bz2/download > bwa-0.7.15.tar.bz2
tar xjvf bwa-0.7.15.tar.bz2
cd bwa-0.7.15
make
sudo cp bwa /usr/local/bin
echo 'export PATH=$PATH:/usr/local/bin' >> ~/.bashrc
source ~/.bashrc
Make & change into a working directory:
mkdir ~/work
cd ~/work
Copy and gunzip the reference:
wget https://github.com/ctb/2017-ucdavis-igg201b/raw/master/lab1/ecoli-rel606.fa.gz
gunzip ecoli-rel606.fa.gz
Prepare it for mapping:
bwa index ecoli-rel606.fa
Map! (This will take about 2 minutes.)
bwa mem -t 6 ecoli-rel606.fa ~/data/SRR2584857.fq.gz > SRR2584857.sam
Observe!
head SRR2584857.sam
Goal: make it possible to go look at a specific bit of the genome.
Install samtools:
sudo apt-get -y install samtools
Convert the SAM file into a BAM file that can be sorted and indexed:
samtools view -hSbo SRR2584857.bam SRR2584857.sam
Sort the BAM file by position in genome:
samtools sort SRR2584857.bam SRR2584857.sorted
Index the BAM file so that we can randomly access it quickly:
samtools index SRR2584857.sorted.bam
Visualize with tview
:
samtools tview SRR2584857.sorted.bam ecoli-rel606.fa
tview
commands of relevance:
q
to quitg ecoli:3931002
will take you to a specific location.Goal: find places where the reads are systematically different from the genome.
Now we can call variants using samtools mpileup:
samtools mpileup -uD -f ecoli-rel606.fa SRR2584857.sorted.bam | \
bcftools view -bvcg - > variants.raw.bcf
This will take a few minutes... and output a file that is not human readable! But we can quickly convert it into the ‘variant call format’ that is human readable:
bcftools view variants.raw.bcf > variants.vcf
The output VCF file contains a list of all the variants that samtools thinks are there. What’s in it?
The official VCF specification is a great read...if you’re suffering from insomnia. Let’s skip this and just take a quick look at the file.
Look at the non-commented lines along with the header:
grep -v ^## variants.vcf
The first five columns: CHROM POS ID REF ALT
.
It’s a little easier to see if you run
grep -v ^## variants.vcf | less -S
Use your left and right arrows to scroll, and ‘q’ to quit.
Examine one of the variants with tview:
samtools tview SRR2584857.sorted.bam ecoli-rel606.fa -p ecoli:920514
‘q’ to quit, left arrow to scroll a bit left.
Well, at least that variant looks real...
Download and build bedtools:
cd ~/
curl -O -L https://github.com/arq5x/bedtools2/releases/download/v2.26.0/bedtools-2.26.0.tar.gz
tar -xzf bedtools-2.26.0.tar.gz
cd bedtools2
make
sudo make install
Go back to work:
cd ~/work
Download a GFF3 file with annotations for E. coli:
wget https://github.com/ctb/2017-ucdavis-igg201b/raw/master/lab3/ecoli-rel606.gff.gz .
Run bedtools intersect:
bedtools intersect -a ecoli-rel606.gff.gz -b variants.vcf -wa -u
Execute:
samtools view SRR2584857.sorted.bam 'ecoli:920514-920514' > out.sam
wc -l out.sam
and this will give you the coverage of the relevant position.