Install Bedtools:
# go to the code installation directory
cd /mnt
# download the code
wget https://bedtools.googlecode.com/files/BEDTools.v2.17.0.tar.gz
# unpack it
tar zxvf BEDTools.v2.17.0.tar.gz
# move into the tool folder
cd bedtools-2.17.0/
# compile the program
make
# copy the program to be universally accessible
cp bin/bedtools /usr/local/bin/
Prepare the Dropbox work folders:
# this will switch back to
# the home directory
cd
# create a new work directory
mkdir Dropbox/intervals
# switch to that directory
cd Dropbox/intervals
You can now create files on your local Dropbox folder and once the file are saved they will appear on the remote (Amazon cloud) computer.
With the editor of your choice create a fasta/fastq file in your Dropbox folder that contains the following:
Now the pipeline to generate a mapping file will look like this. You can also put this into a separate file and re-run it via the shell command:
# align the reads
bwa aln /mnt/dmel-all-chromosome-r5.37.fasta read1.fq > read1.sai
# format the alignment as a SAM file
bwa samse /mnt/dmel-all-chromosome-r5.37.fasta read1.sai read1.fq > temp.sam
# convert sam to bam format, then sort and index it
samtools view -Sb temp.sam > temp.bam
samtools sort -f temp.bam read1.bam
samtools index read1.bam
# delete the temporary SAM files
#rm temp.sam temp.bam
# display the SAM file with the header
samtools view -h read1.bam
Add other sequences, modify the input, verify and study the output. Understanding the SAM format is an fundamental requirement for most applications.
Novelty: The bwa MEM (Maximal Exact Matches) algorithm can produce local maximal alignments. Modify your query sequence to the extent that the normal aln method does not match then try the MEM method:
bwa mem /mnt/dmel-all-chromosome-r5.37.fasta read1.fa > results.sam
(it can do spliced alignments!)
Now let’s investigate your previous drosophila SAM file. First link it into the current folder. This creates a shortcut:
# create a symbolic link (shortcut) to the file
ln -s /mnt/RAL357_bwa.sorted.bam ral.bam
# index the file
samtools index ral.bam
# view your file
samtools view ral.bam | more
SamTools is a phenomenal tool that allows one to filter and view any part of the file at very high speeds, this shows only the region of 1-10000 of the XHet chromosome:
samtools view ral.bam XHet:1-10000
Counting how much data falls in a certain range is so common it has its own flag, -c
samtools view -c ral.bam XHet:1-10000
Filtering by flags at: http://picard.sourceforge.net/explain-flags.html
The -f flag keeps reads with a certain property, the -F flag removes reads with a certain property.
Task: find out if the for chromosome 2R and the first one million bases the number of reads on the positive strand are reasonably equal to the number of reads on the negative strand.
Visit the FlyBase http://flybase.org/ and download the gene annotations for the drosphila genome as a GFF file and put it into your Dropbox folder:
# download a single annotation file
wget ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/current/gff/dmel-3RHet-r5.51.gff.gz
# decompress the file that was downloaded
gunzip dmel-3RHet-r5.51.gff.gz
# count the number of line in the file
wc -l dmel-3RHet-r5.51.gff
# this file contains all annotations but we want to filter
# for lines that are annotated as gene
# but we can't just directly match for the word gene as this
# word may be present in other context
# we need to search for lines that match
# the word FlyBase \tab gene
# below the capital \W indicates a whitespace character
grep -E 'FlyBase\Wgene' dmel-3RHet-r5.51.gff > genes.gff
Previously we have created a SAM alignment file and you can now have combine it with genes via the bedtools:
bedtools intersect -a ral.bed -b genes.gff -wo > results.txt