========================== Interval Analysis Tutorial ========================== Requirements ------------- Install Bedtools:: # go to the code installation directory cd /mnt # download the code wget https://bedtools.googlecode.com/files/BEDTools.v2.17.0.tar.gz # unpack it tar zxvf BEDTools.v2.17.0.tar.gz # move into the tool folder cd bedtools-2.17.0/ # compile the program make # copy the program to be universally accessible cp bin/bedtools /usr/local/bin/ Prepare the Dropbox work folders:: # this will switch back to # the home directory cd # create a new work directory mkdir Dropbox/intervals # switch to that directory cd Dropbox/intervals You can now create files on your local Dropbox folder and once the file are saved they will appear on the remote (Amazon cloud) computer. With the editor of your choice create a fasta/fastq file in your Dropbox folder that contains the following: 1. an exact subsequence from the drosophila genome 2. a subsequence from the drosophila genome with one mismatch (or any other variation) 3. a spliced junction file Now the pipeline to generate a mapping file will look like this. You can also put this into a separate file and re-run it via the shell command:: # align the reads bwa aln /mnt/dmel-all-chromosome-r5.37.fasta read1.fq > read1.sai # format the alignment as a SAM file bwa samse /mnt/dmel-all-chromosome-r5.37.fasta read1.sai read1.fq > temp.sam # convert sam to bam format, then sort and index it samtools view -Sb temp.sam > temp.bam samtools sort -f temp.bam read1.bam samtools index read1.bam # delete the temporary SAM files #rm temp.sam temp.bam # display the SAM file with the header samtools view -h read1.bam Add other sequences, modify the input, verify and study the output. Understanding the SAM format is an fundamental requirement for most applications. Novelty: The bwa MEM (Maximal Exact Matches) algorithm can produce local maximal alignments. Modify your query sequence to the extent that the normal aln method does not match then try the MEM method:: bwa mem /mnt/dmel-all-chromosome-r5.37.fasta read1.fa > results.sam (it can do spliced alignments!) Now let's investigate your previous drosophila SAM file. First link it into the current folder. This creates a shortcut:: # create a symbolic link (shortcut) to the file ln -s /mnt/RAL357_bwa.sorted.bam ral.bam # index the file samtools index ral.bam # view your file samtools view ral.bam | more SamTools is a phenomenal tool that allows one to filter and view any part of the file at very high speeds, this shows only the region of 1-10000 of the XHet chromosome:: samtools view ral.bam XHet:1-10000 Counting how much data falls in a certain range is so common it has its own flag, -c :: samtools view -c ral.bam XHet:1-10000 Filtering by flags at: http://picard.sourceforge.net/explain-flags.html The ``-f`` flag keeps reads with a certain property, the ``-F`` flag removes reads with a certain property. Task: find out if the for chromosome 2R and the first one million bases the number of reads on the positive strand are reasonably equal to the number of reads on the negative strand. Interval Operations =================== Visit the FlyBase http://flybase.org/ and download the gene annotations for the drosphila genome as a GFF file and put it into your Dropbox folder:: # download a single annotation file wget ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/current/gff/dmel-3RHet-r5.51.gff.gz # decompress the file that was downloaded gunzip dmel-3RHet-r5.51.gff.gz # count the number of line in the file wc -l dmel-3RHet-r5.51.gff # this file contains all annotations but we want to filter # for lines that are annotated as gene # but we can't just directly match for the word gene as this # word may be present in other context # we need to search for lines that match # the word FlyBase \tab gene # below the capital \W indicates a whitespace character grep -E 'FlyBase\Wgene' dmel-3RHet-r5.51.gff > genes.gff Previously we have created a SAM alignment file and you can now have combine it with genes via the bedtools:: bedtools intersect -a ral.bed -b genes.gff -wo > results.txt