Learning objectives:
Boot an m1.medium Jetstream instance and log in.
conda install -y bwa samtools bcftools
After installing the necessary software, we will create the working directory for the mapping as follows:
cd ~/
mkdir -p mapping
cd mapping
Next, we will create links from the previously downloaded and quality-trimmed yeast dataset:
ln -fs ~/quality/*.qc.fq.gz .
ls
Goal: execute a basic mapping
curl -O https://downloads.yeastgenome.org/sequence/S288C_reference/orf_dna/orf_coding.fasta.gz
gunzip orf_coding.fasta.gz
and look at it:
head orf_coding.fasta
bwa index orf_coding.fasta
bwa mem -t 4 orf_coding.fasta ERR458493.qc.fq.gz > ERR458493.sam
Goal: make it possible to go look at a specific bit of the genome.
samtools faidx orf_coding.fasta
samtools import orf_coding.fasta.fai ERR458493.sam ERR458493.bam
samtools sort ERR458493.bam -o ERR458493.sorted.bam
samtools index ERR458493.sorted.bam
tview
:¶samtools tview ERR458493.sorted.bam orf_coding.fasta
tview
commands of relevance:
q
to quitg
allows you to go to a specific location, in this format chromosome:location. Here are some locations you can try out:YLR162W:293
(impressive pileup, shows two clear variants and three other less clear)YDR034C-A:98
(impressive pileup, shows two clear variants)YDR366C:310
(impressive pileup, less clear variants)YLR256W:4420
(impressive pileup, less clear variants)YBL105C:2179
(less depth, shows two clear variants)YDR471W:152
(impressive pileup, shows one clear variant)Get some summary statistics as well:
samtools flagstat ERR458493.sorted.bam
Goal: find places where the reads are systematically different from the genome.
Now we can call variants using samtools mpileup:
samtools mpileup -u -t DP -f orf_coding.fasta ERR458493.sorted.bam | \
bcftools call -mv -Ov > variants.vcf
To look at the entire variants.vcf
file you can do cat variants.vcf
; all of the lines starting with #
are comments. You
can use tail variants.vcf
to see the last ~10 lines, which should
be all of the called variants.