# Exercise¶

## Before you guys got here¶

Started with data from: “Genomic islands of speciation separate cichlid ecomorphs in an East African crater lake”, Malinsky et al 2015.

Made the VCF smaller so we could analyze it in this lifetime: 36 individuals and no indels.:

vcftools --gzvcf Massoko_Dryad_VCF_final.vcf.gz --keep inds_to_keep.txt --stdout --recode --recode-INFO-all --remove-indels | gzip -c > Massoko_Dryad_VCF_final_subset_noIndels.vcf.gz


Made the VCF smaller still to remove low frequency sites and then local linkage disequilibrium. We will use these files for many of our analyses.:

vcftools --gzvcf Massoko_Dryad_VCF_final_subset_noIndels.vcf.gz --maf 0.05 --max-maf 0.95 --stdout --recode --recode-INFO-all | gzip -c > Massoko_Dryad_VCF_final_subset_noIndels_maf05.vcf.gz
vcftools --gzvcf Massoko_Dryad_VCF_final_subset_noIndels_maf05.vcf.gz --thin 1000 --stdout --recode --recode-INFO-all | gzip -c > Massoko_Dryad_VCF_final_subset_noIndels_maf05_thinned1K.vcf.gz


Used the thinned VCF to make input files for phylogenetic inference and population structure analyses.:

python vcf_to_phy.py --infile Massoko_Dryad_VCF_final_subset_noIndels_maf05_thinned1K.vcf.gz --thin 5
python vcf_to_geno.py
EIG-6.1.3/bin/convertf -p convert.par
cat Massoko_Dryad_VCF_final_subset_noIndels_maf05_thinned1K.map | awk -F'\\\s+' '{print $1,$2, $3,$4}' > map


## Now you start¶

Start your instance with at least 50 Gb.:

sudo apt-get update
sudo apt-get install build-essential
sudo apt-get install ruby git
ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Linuxbrew/install/master/install)" export PATH="/home/ubuntu/.linuxbrew/bin:$PATH"
export MANPATH="/home/ubuntu/.linuxbrew/share/man:$MANPATH" export INFOPATH="/home/ubuntu/.linuxbrew/share/info:$INFOPATH"
export LD_LIBRARY_PATH="/home/ubuntu/.linuxbrew/lib/"
brew tap homebrew/science


Install many useful packages.:

brew install zlib
brew install gsl
brew install homebrew/science/raxml
brew install homebrew/science/vcftools
brew install homebrew/science/openblas


Install some other software.:

wget https://data.broadinstitute.org/alkesgroup/EIGENSOFT/EIG-6.1.3.tar.gz
tar -xvzf EIG-6.1.3.tar.gz


You can download the smaller data set and ancillary files from here.:

wget https://www.dropbox.com/s/ra4yqix0jfe1fgn/tutorial_files.tar.gz
tar -xzvf tutorial_files.tar.gz
cd tutorial_files


Calculate nucleotide diversity ($$\pi$$). Use VCFtools to figure out how to calculate it. We want to calculate it for ‘benthic’ and ‘littoral’ morphs separately.:

vcftools --gzvcf Massoko_Dryad_VCF_final_subset_noIndels.vcf.gz --keep littoral.txt --window-pi 100000 --out littoral_pi
vcftools --gzvcf Massoko_Dryad_VCF_final_subset_noIndels.vcf.gz --keep benthic.txt --window-pi 100000 --out benthic_pi


vcftools --gzvcf Massoko_Dryad_VCF_final_subset_noIndels.vcf.gz --keep littoral.txt --ld-window-bp 500000 --chr scaffold_0 --hap-r2 --out littoral_scaffold_0_ld --min-r2 0.001
vcftools --gzvcf Massoko_Dryad_VCF_final_subset_noIndels.vcf.gz --keep benthic.txt --ld-window-bp 500000 --chr scaffold_0 --hap-r2 --out benthic_scaffold_0_ld --min-r2 0.001


Summarize linkage disequilibrium data files so that they are smaller and easier to plot.:

python summarize_ld.py --infile littoral_scaffold_0_ld.hap.ld --win 10
python summarize_ld.py --infile benthic_scaffold_0_ld.hap.ld --win 10


Calculate $$F_{ST}$$ between benthic and limnetic forms.:

vcftools --gzvcf Massoko_Dryad_VCF_final_subset_noIndels.vcf.gz --weir-fst-pop littoral.txt --weir-fst-pop benthic.txt --fst-window-size 100000 --out benthic_limnetic_fst


Make a phylogeny.:

raxmlHPC-PTHREADS -T 12 -m GTRGAMMA -n Massoko -s Massoko_Dryad_VCF_final_subset_noIndels_maf05_thinned1K_thin5.phy -p 123 -o A_calliptera_Chitimba,A_calliptera_Bua,A_calliptera_Chizumulu


Run ADMIXTURE for up to 6 populations.:

~/admixture_linux-1.3.0/admixture Massoko_Dryad_VCF_final_subset_noIndels_maf05_thinned1K.ped 1
...


Run smartpca.:

~/EIG-6.1.3/bin/smartpca -p Massoko_smartpca.par > Massoko_smartpca.out


Now that we have all the different pieces, let’s start to plot the data and see what we find. Put all the results into one folder and download them locally so that we can plot and visualize them using R.

Just to be sure, here are all the files you should have. Should things be taking too long, you can borrow my results that I generated earlier: https://www.dropbox.com/s/czrru76ku2kqwt2/results.tar.gz?dl=0

• benthic_limnetic_fst.windowed.weir.fst
• benthic_pi.windowed.pi
• benthic_scaffold_0_ld.hap.ld_summary_window10
• littoral_pi.windowed.pi
• littoral_scaffold_0_ld.hap.ld_summary_window10
• RAxML_bestTree.Massoko

We have the following data types.

1. Genetic diversity.
2. Genetic differentiation. ($$F_{ST}$$)
3. Decay of linkage disequilibrium.
4. A tree.
5. PCA results.
6. ADMIXTURE population clustering results.

We will be using R to plot all these results. I will get you started on how to start thinking about some of these. I would recommend setting your working directory to be the directory that has all your results. For example,:

setwd("/Users/sonal/Desktop/results/")


Note that this is generally considered bad programming practice for scripts that will be publicly shared, but it is convenient when doing exploratory data analysis.

## Genetic diversity¶

To load the genetic diversity results,:

b = read.table("benthic_pi.windowed.pi", header=T)


Look at how the data is structured and summarize it quickly,:

head(b)
summarize(b)


To answer some of the questions below, it might be useful to combine across both data-frames:

x = merge(b, l, by=c("CHROM", "BIN_START", "BIN_END"))


To answer some of the questions below, it might be useful to combine across both data-frames in another way:

all = data.frame(c(b$PI, l$PI), c(rep("benthic", nrow(b)), c(rep("littoral", nrow(l)))))
names(all) = c("PI", "MORPH")


You might want to also explore the following functions to answer the questions:

cor.test()
boxplot()
aov()
# if you store the results of aov() in a variable and then run summary() on the variable, you get more info


Some questions:

1. What is min, max, and mean levels of genetic diversity in each morph?
2. Is genetic diversity between the two morphs significantly different?
3. Why might genetic diversity be higher in one morph than another? How could you test this?
4. How correlated is genetic diversity between the two morphs?
5. Why would genetic diversity be correlated between the two morphs?

## Genetic differentiation¶

To load the genetic differentiation results,:

fst = read.table("benthic_limnetic_fst.windowed.weir.fst", header=T)


To select rows that have certain values,:

x = fst[fst$CHROM == 'scaffold_0', ] x = fst[fst$WEIGHTED_FST >= 0.1, ]


You might want to explore the functions:

dim()
nrow()


Which allow you to quickly figure out how big these dataframes are.

Some questions:

1. What is the mean $$F_{ST}$$ between these two morphs?
2. Is there a correlation between the number of variants in a window and $$F_{ST}$$? If so, it would suggest we need to be cautious of these results.
3. In this paper, the authors emphasize the importance of genomic regions that are highly differentiated. How many 100 kb windows are differentiated above $$F_{ST}$$>0.1? $$F_{ST}$$>0.2? $$F_{ST}$$>0.3?
4. Plot $$F_{ST}$$ along BIN_START on scaffold_15. The authors originally identified 10 peaks (see Fig. 3D) that are highly differentiated. How many do you identify? Why might our results be different?
5. How might you determine if windows with high $$F_{ST}$$ are significant?

## Decay of Linkage Disequilibrium¶

You can read in the tables for linkage disequilibrium just like you did for nucleotide diversity.

Having done that, we can now plot the data. Plot distance on the x-axis and $$r^2$$ on the y-axis (a measure of linkage disequilibrium that looks at the correlation coefficient between pairs of loci – higher values means that two loci “travel” together more than you would expect under random assortment).

Try plotting both morphs at once. You will want to use the points() function.

Some questions:

1. Do the two morphs have different decay patterns?
2. A key aspect of linkage disequilibrium is how quickly it decays. At what physical distance is the level of linkage disequilbrium halved? You can estimate this visually or using R.
3. These points are very very noisy. How might you do this exercise again to reduce some of this noise? If you have time, try it!

## Plot the phylogeny¶

To plot the phylogeny, you will need to install the library ape.:

install.packages("ape")
library(ape)


Then, you can read in and plot tree.:

t = read.tree("RAxML_bestTree.Massoko")
# makes the tree easier to visualize by ladderizing it
plot(t)


Some questions:

1. What do you think is going on with the “small” morph?
2. Looking at this tree, would you say that the “littoral” and “benthic” morphs are differentiated? Why or why not?
3. Before we use this tree for any formal analysis, what else might you want to check about the tree?

## Plot the PCA¶

To read in the PCA data:

d = read.table("Massoko_Dryad_VCF_final_subset_noIndels_maf05_thinned1K.evec")


Note that the eval file has the data we would need to calculate the eigenvalues for each PCA axis.

Look at the data file using head() – how is it structured? What does each column mean?

You can plot it by:

plot(d$V2, d$V3, col=as.factor(d$V12), pch=16)  This isn’t such an informative plot. Why? How would you subset the data to make it more informative? Hint: look at column V12.: s = d[d$V12 %in% c("Massoko_benthic", "Massoko_littoral", "Massoko_small"),]


This still isn’t as informative as it could be. It likely would have been much more informative if we removed the outgroups before doing the PCA. That said, are these morphs differentiated? How do these results compare to what we saw with the phylogeny? Why might these results be different?

d1 = read.table("Massoko_Dryad_VCF_final_subset_noIndels_maf05_thinned1K.1.Q")


To plot the results:

par(mfrow=c(6,1), mar=c(1,4,1,1))
barplot(t(as.matrix(d1)), col=rainbow(1), border=NA)
barplot(t(as.matrix(d2)), col=rainbow(2), border=NA)
barplot(t(as.matrix(d3)), col=rainbow(3), border=NA)
barplot(t(as.matrix(d4)), col=rainbow(4), border=NA)
barplot(t(as.matrix(d5)), col=rainbow(5), border=NA)
par(mar=c(3,4,1,1))
x = barplot(t(as.matrix(d6)), col=rainbow(6), border=NA)
inds = c(rep('A_cal', 3), rep('Ita', 3), rep('B', 10), rep('L', 10), rep('S', 10))
mtext(inds, 1, at=x, las=2)


What’s going on here? Based on all the results you have seen from the phylogeny, the PCA, and this, how would you characterize the differentiation between these morphs?

# Resources¶

## Papers on population genomics¶

• A framework for variation discovery and genotyping using next-generation DNA sequencing data, DePristo et al 2010; 10.1038/ng.806
• Genome sequencing and population genomics in non-model organisms, Ellegren 2014; 10.1016/j.tree.2013.09.008
• Genotype and SNP calling from next-generation sequencing data, Nielsen et al 2011; 10.1038/nrg2986
• Methods and models for unravelling human evolutionary history, Schraiber and Akey 2015; 10.1038/nrg4005
• Population Genomics: Whole-Genome Analysis of Polymorphism and Divergence in Drosophila simulans, Begun et al 2007; 10.1371/journal.pbio.0050310
• The power and promise of population genomics: from genotyping to genome typing, Luikart et al 2003; 10.1038/nrg1226

## Learn Perl¶

LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- fork @ github. Presentations (PPT/PDF) and PDFs are the property of their respective owners and are under the terms indicated within the presentation.