Teach me interval based analysis

Premise: a participant in the course solves a problem at the podium while taking input from the audience when necessary.

The problem to be solved

Create mutations in the yeast genome and simulate sequencing reads from this genome. Apply bioinformatics tools on this data and find those genes that contain mutations in their promoter regions.

The input data will consist of the yeast genome in FASTA format and the gene annotations as a GFF file. Download the data from http://bcc.bx.psu.edu/data/teachme.zip (3.6Mb)

The installation for bedtools is described at the beginning of Interval Analysis Tutorial. We assume that you have already installed the bwa and samtools packages.

The following is a annotated step-by-step history of the commands that have been performed at the podium.

First we download the teachme data:

wget http://bcc.bx.psu.edu/data/teachme.zip
unzip teachme.zip
cd teachme/

Investigate what exactly is stored in these files:

less genes.gff
cd refs/
less sc.fa

Our first task is simulate a genome and create sequencing reads from it. What this means is that we simulate the genome of an organism, then create a library from it and we simulate sequencing this library via a sequencing instrument. The output of the process will be data similar to what one would get back from a sequencing facility.

There are many sequence simulators, we will use wgsim written by Heng Li. This code is currently hosted on GitHub:

#clone and install wgsim (mutation simulator)
git clone https://github.com/lh3/wgsim

# move to the folder
cd wgsim/

# compile the code (see the README for info)
gcc -g -O2 -Wall -o wgsim wgsim.c -lz -lm

# copy the code to a path that makes it available on the computer
cp wgsim /usr/bin/

Now lets run the simulator and generate the reads. Both the README and just running the tool on its own produces information on the usage:

# go to refs folder and run wgsim on sc_fa to create NGS reads that contains mutations based on the sc.fa reference genome
cd ..
wgsim -N 100000 sc.fa sc_mut1.fq sc_mut2.fq > sc_mut_screen.txt

#look at mutation summary and mutated read (just sc_mut1.fq)
less sc_mut_screen.txt
less sc_mut1.fq

By default the tool creates paired-end reads (two files) but for simpliciy we are going to use only one file.

At this point the student at the podium reorganized the files, created a new directory called work and moved all data into that folder:

#make work directory and copy genes.gff and mutated reads
cd ..
mkdir work
mv genes.gff work/
mv refs/sc_mut* work/
cd work/

From this point on she worked off the work directory that will collect all files. The reference will be located at the relative path ../refs/sc.fa.

We have the simulated data in sc_mut1.fq and the next goal is to generate an alignment file from this data. We use a standard mapping protocol:

#index reference genome for alignment
bwa index ../refs/sc.fa

#perform bwa alignment
bwa aln ../refs/sc.fa sc_mut1.fq > sc_mut1.sai

#create SAM files
bwa samse ../refs/sc.fa sc_mut1.sai sc_mut1.fq > sc_mut1.sam

# always take a quick peek at what you just did, looks good!
less sc_mut1.sam

The next step is transforming the SAM file to a BAM format. The BAM file is one of the final results of the analysis, it is a fully self contained format that represents all the original data plus the alignment information:

# transform the sam file into a bam file
samtools view -Sb sc_mut1.sam > tmp.bam

# sort the file and create a new file
samtools sort tmp.bam sc_mut1

# index the bam file
samtools index sc_mut1.bam

Whew! We are half way through. We now have the information on where the reads align and each alignment contains information on wether the match is exact or wether it has mismatches/insertions or deletions. See the CIGAR column and/or the MD tag in the SAM file.

The next step is to create the promoter region for each gene. We start with gene information, although one could directly obtain the promoter information from another source.

The gene.gff file contains coordinates that we want to transform to represent promoter regions. A promoter region is an extension of the gene in the so called 5’ direction (upstream) of the start coordinate.

As an example the 100 bp long promoter region of an interval of 1000-2000 would be 900-1000 if the region lies on the positive strand and it would be 2000-2100 if the region lies on the negative strand.

You have to be careful here though as the above applies for a zero based representation. If the same interval 1000-2000 were specified with a one based coordinate system the promoter coordinates would be 900-999 if the gene were on the positive strand and 2001-2100 for the negative strand.

Interval operations with bedtools

Bedtools offers a series commands that can alter intervals in various ways. Moreover it can perform these operations in a strand specific and moreover it will recognize and respect the coordinate system that is being used.

To find the promoter regions we will first extend each gene upstream then we subtract the gene file from the extended file. Here is a visualization:

..........GGGGGGGGGGG........ # original gene
.....GGGGGGGGGGGGGGGG........ # extended gene
.....GGGGG................... # subtracted

# for the reverse strand the operation willbe

.....GGGGGGGGGGGG............. # original gene
.....GGGGGGGGGGGGGGGGG........ # extended gene
.................GGGGG........ # subtracted

As it happens the interval extension program needs to know how long the chromosomes are, this is to avoid extending the intervals past the actual chromosome size. So it needs a new tab delimited file that contains the chromosome name and the chromosome lenghts.

This information could be extracted from the SAM file header:

# head sc_mut1.sam
@SQ SN:chrI LN:230218
@SQ SN:chrII        LN:813184
@SQ SN:chrIII       LN:316620
@SQ SN:chrIV        LN:1531933
@SQ SN:chrV LN:576874
@SQ SN:chrVI        LN:270161
@SQ SN:chrVII       LN:1090940
... more lines ...

copy this information into a separete file and edit it to keep only the chromosome names and lenghts.

There is a second (possibly easier) way to do this is via indexing the the fasta file and looking at the .fai file that it creates:

# fasta index the reference
samtools faidx ../refs/sc.fa

# look at the file
head ../refs/sc.fa.fai
chrI        230218  6       80      81
chrII       813184  233109  80      81
chrIII      316620  1056466 80      81
chrIV       1531933 1377051 80      81
chrV        576874  2928140 80      81
chrVI       270161  3512232 80      81
chrVII      1090940 3785779 80      81
chrVIII     562643  4890365 80      81
chrIX       439888  5460049 80      81
chrX        745751  5905442 80      81

This file too would need to be edited to remove the extra columns:

cut -f 1,2 ../refs/sc.fa.fai > genome.txt

Now let’s extend the genes, the command to do this is called slop:

#'Extend' genes.gff intervals to 1000 bases upstream positive
# strand reads / 1000 downstream negative strand reads -
# this is to include a hypothetical 1000 bases promoter
bedtools slop -l 1000 -r 0 -s -i genes.gff -g genome.txt > slop.gff

# check if it worked, look at both files
head -3 genes.gff
head -3 slop.gff

Subtract the genes from the slop files. The order of files matter!:

#subtract the rest of the genes.gff file ("genome") from the slop.gff file to keep only the promoter regions
bedtools subtract -a slop.gff -b genes.gff -s > promoter.gff

#check if it worked
head -3 genes.gff
head -3 promoter.gff

We’re almost there. We have a BAM file with alignments and a promoter file. We just need to lay over one over the other. Now for simplicity we convert the BAM file to an easier to interpret text file. Moreover we will add an extra piece of information into the BED file. Normally the score value column contains the mapping quality but what we could also put there is the edit-distance: how many changes were in the alignment. The -ed command instructs bedtools to add that information:

#create BED files (containing interval information of the alingments)
bedtools bamtobed -ed -i sc_mut1_bwa.bam > sc_mut1_bwa.bed

Finally the result is near! We just need to intersect the alignments with the promoters. The -wo flag will make rows from both files be displayed next to one another:

# intersect the promoter.gff file (which contains
# annotation and interval information of the promoters)
# with the BED file created from the aligment to get the reads
# that mapped to the promoter regions
bedtools intersect -a promoter.gff -b sc_mut1_bwa.bed -wo > answer.txt

The answer.txt file has a lot of columns 9 from BED and 12 from SAM file. All we need are the columns for gene name and edit distance:

cut -f 9,14 answer.txt > edits.txt

To interpret it better we can sort this file to see which promoters have the most reads indicating edit distances. The sort needs to operate in the second column -k 2 it must use numerical values -n and we want reverse order -r:

sort -k 2 -r edits.txt | head

Voila, the promoter regions for these genes have variation with and edit distance of 4:

ID=YPR200C  4
ID=YPR198W  4
ID=YPR193C  4
ID=YPR187W  4
comments powered by Disqus

Table Of Contents

This Page