These should be the same Salmon counts you generated last week, but now we’re going to do a bit more with them. They come from six yeast RNA-seq samples.
This is where we load packages and data, and where I like to set up global variables if I need to.
# default to not showing code, you can change it for chunks as you like
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)
# We don't actually use these next two, but they don't hurt anything, and you might want them for a real RNA-seq analysis
# Setting a reasonable p-value threshold to be used throughout
p_cutoff <- 0.1
# this is a fold change cutoff
FC_cutoff <- log2(1.1)
I also have it print my session info once all my packages are loaded so there’s a record of my versions:
## R version 3.4.4 (2018-03-15)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] edgeR_3.20.9 limma_3.34.9 data.table_1.11.4 tidyr_0.8.1
## [5] dplyr_0.7.6 plyr_1.8.4 pheatmap_1.0.10 ggplot2_3.0.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.17 pillar_1.2.3 compiler_3.4.4
## [4] RColorBrewer_1.1-2 bindr_0.1.1 tools_3.4.4
## [7] digest_0.6.15 lattice_0.20-35 evaluate_0.10.1
## [10] tibble_1.4.2 gtable_0.2.0 pkgconfig_2.0.1
## [13] rlang_0.2.1 yaml_2.1.19 bindrcpp_0.2.2
## [16] withr_2.1.2 stringr_1.3.1 knitr_1.20
## [19] locfit_1.5-9.1 rprojroot_1.3-2 grid_3.4.4
## [22] tidyselect_0.2.4 glue_1.2.0 R6_2.2.2
## [25] rmarkdown_1.10 purrr_0.2.5 magrittr_1.5
## [28] backports_1.1.2 scales_0.5.0 htmltools_0.3.6
## [31] assertthat_0.2.0 colorspace_1.3-2 stringi_1.2.3
## [34] lazyeval_0.2.1 munsell_0.5.0
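The setup chunk itself isn’t shown (we set `echo = FALSE`), but based on the attached packages in the record above, it would look something like this. This is a sketch, not the original chunk; the load order may differ:

```r
# Load the packages recorded in the session info above
library(ggplot2)
library(pheatmap)
library(plyr)
library(dplyr)
library(tidyr)
library(data.table)
library(limma)
library(edgeR)

# Print a record of R and package versions for reproducibility
sessionInfo()
```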
From the DESeq2 vignette (https://bioconductor.org/packages/3.7/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#htseq): “The values in the matrix should be un-normalized counts or estimated counts of sequencing reads (for single-end RNA-seq) or fragments (for paired-end RNA-seq).”
This is where we check that the data looks the way we expected in general.
Dimensions:
## [1] 35502 3
Structure:
## 'data.frame': 35502 obs. of 3 variables:
## $ transcript: Factor w/ 5917 levels "Q0045","Q0050",..: 24 25 26 27 28 29 30 31 32 33 ...
## $ count : int 61 44 1009 1864 86 40 28 32 28 660 ...
## $ file : chr "ERR458493" "ERR458493" "ERR458493" "ERR458493" ...
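The loading code isn’t shown, but a hypothetical sketch that would produce a long data.frame with this shape follows. The file layout, directory name (`counts/`), and column names here are assumptions, not the original code:

```r
# Hypothetical: one two-column count file per sample in a counts/ directory
files <- list.files("counts", pattern = "^ERR", full.names = TRUE)
yeast <- do.call(rbind, lapply(files, function(f) {
  counts <- read.table(f, col.names = c("transcript", "count"))
  counts$file <- sub("\\..*$", "", basename(f))  # sample name from filename
  counts
}))

# Sanity checks: 5917 transcripts x 6 samples = 35502 rows
dim(yeast)
str(yeast)
```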
Let’s look at the data!
These plots show us the overall trends in the data.
count vs transcript:
The same plot, but colored by sample:
That’s still a little hard to see. So, let’s split the samples into their own plots:
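A ggplot2 sketch of those three plots, assuming the long data.frame is called `yeast` (a hypothetical name) with `transcript`, `count`, and `file` columns as above:

```r
# All samples overplotted: count vs transcript
ggplot(yeast, aes(x = transcript, y = count)) +
  geom_point()

# The same plot, colored by sample
ggplot(yeast, aes(x = transcript, y = count, color = file)) +
  geom_point()

# One panel per sample
ggplot(yeast, aes(x = transcript, y = count, color = file)) +
  geom_point() +
  facet_wrap(~ file)
```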
These are our library sizes (both raw, and in millions of reads)
## ERR458493 ERR458494 ERR458495 ERR458500 ERR458501 ERR458502
## 963028 955258 950704 1620091 1616974 1611768
## ERR458493 ERR458494 ERR458495 ERR458500 ERR458501 ERR458502
## 0.963028 0.955258 0.950704 1.620091 1.616974 1.611768
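Library sizes are just the per-sample totals of the count column. A sketch, again assuming the long data.frame is called `yeast`:

```r
# Total counted reads per sample, then the same in millions
lib_sizes <- with(yeast, tapply(count, file, sum))
lib_sizes
lib_sizes / 1e6
```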
Here is a summary of our samples
A useful first step in an RNA-seq analysis is often to assess overall similarity between samples: Which samples are similar to each other, which are different? Does this fit the expectation from the experiment’s design? To draw a heatmap of individual RNA-seq samples, we suggest using moderated log-counts-per-million. This can be calculated by cpm with positive values for prior.count:
We can also look at how well the samples separate just by their expression levels.
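A sketch of both steps, assuming the long data.frame is called `yeast` and that the group assignment (first three samples vs. last three) matches the experimental design; both names and the grouping are assumptions here:

```r
# Reshape long -> wide and build an edgeR DGEList
yeast_wide <- spread(yeast, file, count)
counts <- as.matrix(yeast_wide[, -1])
rownames(counts) <- yeast_wide$transcript
groups <- factor(c("A", "A", "A", "B", "B", "B"))  # assumed design
y <- DGEList(counts = counts, group = groups)
y <- calcNormFactors(y)

# Moderated log-CPM; a positive prior.count damps the log of small counts
logcpm <- cpm(y, log = TRUE, prior.count = 2)

# Heatmap of sample-to-sample correlations
pheatmap(cor(logcpm))

# Hierarchical clustering: do the samples separate by expression?
plot(hclust(dist(t(logcpm))))
```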
This is a bit of exploration of what dispersion parameters we should use. edgeR defaults to a prior.n of 10:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.141e-05 6.491e-04 2.115e-03 7.073e-03 6.271e-03 4.674e-02
What happens if we increase the shrinkage/squeezing toward the common dispersion?
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.141e-05 6.491e-04 2.115e-03 7.073e-03 6.271e-03 4.674e-02
Looks like nothing changed! Let’s keep the original.
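A sketch of that comparison, assuming a DGEList called `y` (a hypothetical name). Note that in edgeR versions like the one loaded here, the argument is called `prior.df` (prior degrees of freedom); older documentation calls it `prior.n`:

```r
# Common dispersion first, then tagwise with the default prior (10)
y <- estimateCommonDisp(y)
y10 <- estimateTagwiseDisp(y, prior.df = 10)
summary(y10$tagwise.dispersion)

# More squeezing of tagwise estimates toward the common dispersion
y25 <- estimateTagwiseDisp(y, prior.df = 25)
summary(y25$tagwise.dispersion)
```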
We can also look at how the mean and variance are related in this particular data set. This tells you whether the negative binomial distribution edgeR assumes is a reasonable model for your data:
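edgeR’s `plotMeanVar` draws this directly; a one-line sketch, again assuming a DGEList `y` with dispersions already estimated:

```r
# Per-gene means against raw and shrunk (tagwise) variances,
# compared to the fitted negative binomial mean-variance trend
plotMeanVar(y, show.raw.vars = TRUE, show.tagwise.vars = TRUE)
```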
### Analysis
Finally, we can get a list of differentially expressed genes!
## Coefficient: yeast.groupsB
## logFC logCPM LR PValue FDR
## YHR215W -5.434680 7.393140 1741.131 0 0
## YML123C -5.019035 9.024432 5272.580 0 0
## YGR234W -4.357990 7.840829 2050.401 0 0
## YDR077W -3.771987 11.605155 16001.861 0 0
## YMR120C -3.725649 8.217999 2281.826 0 0
## YJR029W -3.615481 9.053180 3795.265 0 0
## YDR033W -3.569033 8.250825 2210.102 0 0
## YDR342C -3.375454 11.163823 11147.923 0 0
## YFR015C -3.254098 9.244759 3699.434 0 0
## YAR009C -3.154745 8.880019 2821.314 0 0
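The calls that produce a table like this aren’t shown, but here is a sketch in edgeR’s GLM idiom. It assumes a DGEList `y` (hypothetical name); the coefficient label `yeast.groupsB` above implies the group factor was stored in a variable called `yeast.groups`:

```r
# Model: expression ~ group
yeast.groups <- y$samples$group
design <- model.matrix(~ yeast.groups)

# Estimate dispersions, fit the GLM, and run a likelihood-ratio test
y <- estimateDisp(y, design)
fit <- glmFit(y, design)
lrt <- glmLRT(fit)

# Top 10 genes by adjusted p-value (FDR)
topTags(lrt, n = 10)
```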
Let’s look at the counts-per-million for the top genes, ordered by p-value:
## ERR458493 ERR458494 ERR458495 ERR458500 ERR458501 ERR458502
## YAL005C 2242.1211 2318.7800 2159.7499 789.7740 777.3009 798.52767
## YAL038W 15113.8686 15320.1651 14900.4494 5658.0025 5574.7552 5673.35406
## YAR009C 1167.9719 896.7724 1189.9918 126.3852 107.9882 130.85343
## YBR118W 18614.1756 18465.5249 18264.7922 11156.0238 10910.0119 10949.85757
## YCL030C 953.8637 973.0163 929.6050 143.4498 137.3909 142.65169
## YDL229W 3846.7291 3745.6285 3724.5040 1709.6660 1770.0436 1714.50165
## YDR033W 756.5956 767.2790 744.6574 65.5923 62.0130 63.28157
## YDR077W 6339.0440 6431.1049 6415.9782 453.8134 455.4748 494.99062
## YDR342C 4769.3187 4761.0024 4555.5514 468.2117 434.0910 454.76928
## YDR345C 4481.8364 4376.1527 4369.3870 2283.4653 2299.2925 2266.87472
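A sketch of how such a table is produced, following the pattern in the edgeR user’s guide and assuming the likelihood-ratio test result is stored in `lrt` (a hypothetical name):

```r
# Order genes by p-value and show counts-per-million for the top 10
o <- order(lrt$table$PValue)
cpm(y)[o[1:10], ]
```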
And here’s a high-level summary of our differential expression results:
## yeast.groupsB
## Down 1377
## NotSig 3331
## Up 1209
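This tally comes from classifying each gene as significantly up, down, or not significant. A sketch, assuming the test result is stored in `lrt`:

```r
# Classify each gene at the default FDR cutoff of 0.05, then tally
summary(decideTestsDGE(lrt))
```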
With the models run, we can also plot some diagnostics of the analyzed data, to judge whether our analysis is reasonable.
The function plotMDS draws a multi-dimensional scaling plot of the RNA samples in which distances correspond to leading log-fold-changes between each pair of RNA samples:
This shows the log-fold change against log-counts per million, with DE genes highlighted:
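A sketch of both diagnostic plots, assuming a DGEList `y` and test result `lrt` (hypothetical names):

```r
# MDS plot: distances approximate leading log-fold-changes between samples
plotMDS(y)

# Smear (MA-style) plot with significantly DE genes highlighted
de_genes <- rownames(y)[as.logical(decideTestsDGE(lrt))]
plotSmear(lrt, de.tags = de_genes)
```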
This plot in particular looks pretty fishy. Nearly all our genes show up as DE…that’s not normal. It may be that our fake metadata threw things off, or we didn’t do enough filtering of our data. We skipped filtering out low read counts, for example, which can break the shrinkage estimator.
Last but not least, let’s look at how our dispersion changed. This one is pretty crazy looking too. Maybe we should filter and try again. From the edgeR manual: “Usually a gene is required to have a count of 5-10 in a library to be considered expressed in that library. Users should also filter with count-per-million (CPM) rather than filtering on the counts directly, as the latter does not account for differences in library sizes between samples.”
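A sketch of that filtering, using the manual’s rule of thumb translated into CPM (with these library sizes, roughly 1 CPM corresponds to a count of 1-2 reads, so the exact threshold is a judgment call). It assumes a DGEList `y` and model matrix `design` as before:

```r
# Keep genes with at least 1 CPM in at least 3 of the 6 samples
keep <- rowSums(cpm(y) > 1) >= 3
y_filtered <- y[keep, , keep.lib.sizes = FALSE]

# Renormalize, re-estimate dispersions, and inspect the BCV plot again
y_filtered <- calcNormFactors(y_filtered)
y_filtered <- estimateDisp(y_filtered, design)
plotBCV(y_filtered)
```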