# Mapping and variant calling on yeast transcriptome Learning objectives: * define and explore the concepts and implications of shotgun sequencing; * explore coverage; * understand the basics of mapping-based variant calling; * learn basics of actually calling variants & visualizing. ## Boot up a Jetstream You should still have your jetstream instance running, you can following the instructions [here](jetstream/boot.html) to log in to [JetStream](https://use.jetstream-cloud.org/application/dashboard) and find your instance. Then `ssh` into it following the instructions [here](jetstream/boot.html#ssh-secure-login). ## Install software ``` conda install -y bwa samtools bcftools ``` ## Change to a new working directory and map data After installing the necessary software, we will create the working directory for the mapping as follows: ``` cd ~/ mkdir mapping cd mapping ``` Next, we will create links from the previously downloaded and quality-trimmed yeast dataset: ``` ln -fs ~/quality/*.qc.fq.gz . ls ``` ## Variant Calling Workflow
## Visualize mapping Goal: make it possible to vizualize some of our mapping results. ### Index the reference: Before we indexed the reference for BWA, now we need to index the reference for samtools. To see the manual for `samtools` we can type: ``` samtools ``` Although both tools use different indexing methods, they both allow the tools to find specific sequences within the reference quickly. We can see that an indexing function is `samtools faidx`. ``` samtools faidx orf_coding.fasta ``` ### Convert the SAM file into a BAM file: Next, we will convert our file format to a `.bam` file with the `samtools view` command. Let's see the different parameters for this function: ``` samtools view ``` We can see that: - `-S`: ignored (input format is auto-detected) - `-b`: output BAM So let's convert our file format for sample `ERR458493`: ``` samtools view -S -b ERR458493.sam > ERR458493.bam ``` ### Sort the BAM file by position in genome: You can sort on many different columns within a sam or bam file. After mapping, our files are sorted by read number. Most programs require mapping files to be sorted by position in the reference. You can sort a file using the `samtools sort` command. ``` samtools sort ERR458493.bam -o ERR458493.sorted.bam ``` ### Index the BAM file so that we can randomly access it quickly: ``` samtools index ERR458493.sorted.bam ``` ### Call variants with `bcftools`! **Goal:** find places where the reads are systematically different from the reference. A variant call is a conclusion that there is a nucleotide difference vs. some reference at a given position in an individual genome or transcriptome, often referred to as a Single Nucleotide Polymorphism (**SNP**). The call is usually accompanied by an estimate of variant frequency and some measure of confidence. Similar to other steps in this workflow, there are number of tools available for variant calling. In this workshop we will be using `bcftools`, but there are a few things we need to do before actually calling the variants. First, let's look at the manual page for `bcftools`, making note of the `mpileup` and `call` commands: ``` bcftools ``` Let's see the different parameters for the `bcftools mpileup` command by typing: ``` bcftools mpileup ``` **Note:** The parameters we will use for `bcftools mpileup` are: - `-O`: Output file type, which can be: - 'b' **compressed BCF. We will use this output format!** - 'u' uncompressed BCF - 'z' compressed VCF - 'v' uncompressed VCF - `-f`: faidx indexed reference sequence file And to find help for the `bcftools call` function: ``` bcftools call ``` **Note:** The parameters for `bcftools call` that we will use: - `-m`: alternative model for multiallelic and rare-variant calling (conflicts with -c) - `-v`: output variant sites only - `-o`: write output to a file Now, let's go ahead and run our command on our sample `ERR458493`! ``` bcftools mpileup -O b -f orf_coding.fasta ERR458493.sorted.bam | \ bcftools call -m -v -o variants.vcf ``` Next, we will use a perl script from `samtools` called `vcfutils.pl` that will filter out our variants and we can write the output to a new file. ``` vcfutils.pl varFilter variants.vcf > variants_filtered.vcf ``` To look at the entire `variants.vcf` file you can do `cat variants.vcf`; all of the lines starting with `#` are comments. You can use `tail variants.vcf` to see the last ~10 lines, which should be all of the called variants. The first few columns of the VCF file represent the information we have about a predicted variation: | column | info | | ------ | ------ | | CHROM | contig location where the variation occurs | | POS | position within the contig where the variation occurs | | ID | a `.` until we add annotation information | | REF | reference genotype (forward strand) | | ALT | sample genotype (forward strand) | | QUAL | Phred-scaled probablity that the observed variant exists at this site (higher is better) | | FILTER | a `.` if no quality filters have been applied, PASS if a filter is passed, or the name of the filters this variant failed | The Broad Institute’s [VCF guide](https://www.broadinstitute.org/gatk/guide/article?id=1268) is an excellent place to learn more about VCF file format. ### Visualize with `tview`: Now that we know the locations of our variants, let's view them with `samtools`! Samtools implements a very simple text alignment viewer called tview. This alignment viewer works with short indels and shows MAQ consensus. It uses different colors to display mapping quality or base quality, subjected to users' choice. Samtools viewer is known to work with an 130 GB alignment swiftly. Due to its text interface, displaying alignments over network is also very fast. In order to visualize our mapped reads we use tview, giving it the sorted bam file and the reference file: ``` samtools tview ERR458493.sorted.bam orf_coding.fasta ``` `tview` commands of relevance: * left and right arrows scroll * `q` to quit * CTRL-h and CTRL-l do "big" scrolls * Typing `g` allows you to go to a specific location, in this format chromosome:location. Here are some locations you can try out: - `YLR162W:293` (impressive pileup, shows two clear variants and three other less clear) - `YDR034C-A:98` (impressive pileup, shows two clear variants) - `YDR366C:310` (impressive pileup, less clear variants) - `YLR256W:4420` (impressive pileup, less clear variants) - `YBL105C:2179` (less depth, shows two clear variants) - `YDR471W:152` (impressive pileup, shows one clear variant) Get some summary statistics as well: ``` samtools flagstat ERR458493.sorted.bam ```PRACTICE! Using thebwa
command we ran above as the foundation, construct a for loop to generate ".sam" alignment files for all of our quality controlled samples!Solutionfor filename in *.qc.fq.gz do name=$(basename $filename .qc.fq.gz) echo Working on: $name bwa mem -t 4 orf_coding.fasta $filename > ${name}.sam done