1 RNAseq exploration

These should be the same Salmon counts you generated last week, but now we’re going to do a bit more with them. They are six yeast RNAseq samples.

1.1 Setup

This is where we load packages and data, and where I like to set up global variables if I need to.

# Default to not showing code; you can override this per chunk as you like
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)

# We don't actually use these next two, but they don't hurt anything, and you
# might want them for a real RNAseq analysis

# A reasonable p-value threshold to be used throughout
p_cutoff <- 0.1

# A fold-change cutoff
FC_cutoff <- log2(1.1)

I also have it print my session info once all my packages are loaded so there’s a record of my versions:

## R version 3.4.4 (2018-03-15)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04 LTS
## 
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] edgeR_3.20.9      limma_3.34.9      data.table_1.11.4 tidyr_0.8.1      
## [5] dplyr_0.7.6       plyr_1.8.4        pheatmap_1.0.10   ggplot2_3.0.0    
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.17       pillar_1.2.3       compiler_3.4.4    
##  [4] RColorBrewer_1.1-2 bindr_0.1.1        tools_3.4.4       
##  [7] digest_0.6.15      lattice_0.20-35    evaluate_0.10.1   
## [10] tibble_1.4.2       gtable_0.2.0       pkgconfig_2.0.1   
## [13] rlang_0.2.1        yaml_2.1.19        bindrcpp_0.2.2    
## [16] withr_2.1.2        stringr_1.3.1      knitr_1.20        
## [19] locfit_1.5-9.1     rprojroot_1.3-2    grid_3.4.4        
## [22] tidyselect_0.2.4   glue_1.2.0         R6_2.2.2          
## [25] rmarkdown_1.10     purrr_0.2.5        magrittr_1.5      
## [28] backports_1.1.2    scales_0.5.0       htmltools_0.3.6   
## [31] assertthat_0.2.0   colorspace_1.3-2   stringi_1.2.3     
## [34] lazyeval_0.2.1     munsell_0.5.0
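The setup chunk behind that session info is hidden by `echo = FALSE`; a minimal sketch of it, with the package list taken from the attached packages above, would be:

```r
# Load the packages attached in the session info above
library(ggplot2)
library(pheatmap)
library(plyr)
library(dplyr)
library(tidyr)
library(data.table)
library(limma)
library(edgeR)

# Print a record of package versions for reproducibility
sessionInfo()
```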

As the DESeq2 vignette puts it: “The values in the matrix should be un-normalized counts or estimated counts of sequencing reads (for single-end RNA-seq) or fragments (for paired-end RNA-seq).” (https://bioconductor.org/packages/3.7/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#htseq)
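One hedged sketch of loading those counts (the file layout and the object name `sf_counts` are assumptions; Salmon writes one `quant.sf` per sample with `Name` and `NumReads` columns):

```r
samples <- c("ERR458493", "ERR458494", "ERR458495",
             "ERR458500", "ERR458501", "ERR458502")

# Stack each sample's quant.sf into one long data frame:
# one row per transcript per sample
sf_counts <- do.call(rbind, lapply(samples, function(s) {
  quant <- read.delim(file.path(s, "quant.sf"), stringsAsFactors = FALSE)
  data.frame(transcript = quant$Name,
             count      = as.integer(round(quant$NumReads)),
             file       = s)
}))
```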

1.2 Sanity Checks

This is where we check that the data looks the way we expected in general.
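The hidden chunk behind the next two outputs is just the usual pair of checks (assuming the long count table is called `sf_counts`):

```r
dim(sf_counts)  # number of rows and columns
str(sf_counts)  # column types plus the first few values
```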

Dimensions:

## [1] 35502     3

Structure:

## 'data.frame':    35502 obs. of  3 variables:
##  $ transcript: Factor w/ 5917 levels "Q0045","Q0050",..: 24 25 26 27 28 29 30 31 32 33 ...
##  $ count     : int  61 44 1009 1864 86 40 28 32 28 660 ...
##  $ file      : chr  "ERR458493" "ERR458493" "ERR458493" "ERR458493" ...

1.3 Exploration of our data

Let’s look at the data!

1.3.1 Scatter plots

These show us overall trends in the data.

count vs transcript:

The same plot, but colored by sample:

That’s still a little hard to see. So, let’s split the samples into their own plots:
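A sketch of those three plots with ggplot2 (the `sf_counts` object name is assumed):

```r
# Count against transcript, all samples together
ggplot(sf_counts, aes(x = transcript, y = count)) +
  geom_point()

# The same plot, colored by sample
ggplot(sf_counts, aes(x = transcript, y = count, colour = file)) +
  geom_point()

# One panel per sample; transcript labels are too dense to read,
# so drop the x-axis text entirely
ggplot(sf_counts, aes(x = transcript, y = count, colour = file)) +
  geom_point() +
  facet_wrap(~ file) +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())
```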

1.3.2 Summary statistics

These are our library sizes, both raw and in millions of reads:

## ERR458493 ERR458494 ERR458495 ERR458500 ERR458501 ERR458502 
##    963028    955258    950704   1620091   1616974   1611768
## ERR458493 ERR458494 ERR458495 ERR458500 ERR458501 ERR458502 
##  0.963028  0.955258  0.950704  1.620091  1.616974  1.611768
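Those sums can be reproduced from the long table with one line per printed vector (names assumed):

```r
# Total reads per sample, then the same in millions
lib_sizes <- with(sf_counts, tapply(count, file, sum))
lib_sizes
lib_sizes / 1e6
```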
Here is a summary of our samples.

1.3.3 Clustering plots

1.3.4 Heatmap

A useful first step in an RNA-seq analysis is often to assess overall similarity between samples: Which samples are similar to each other, which are different? Does this fit the expectation from the experiment’s design? To draw a heatmap of individual RNA-seq samples, we suggest using moderated log-counts-per-million. This can be calculated by cpm with positive values for prior.count:
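A sketch, assuming the counts have been reshaped into a matrix and wrapped in a DGEList called `dge`:

```r
# Moderated log2 counts-per-million; prior.count > 0 damps the
# log-CPM of very low counts
log_cpm <- cpm(dge, log = TRUE, prior.count = 2)

# Heatmap of between-sample correlations on the moderated values
pheatmap(cor(log_cpm))
```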

1.3.4.1 PCA for overall expression

We can also look at how well the samples separate just by their expression levels.
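One way to sketch that PCA (again assuming a DGEList named `dge`):

```r
# PCA expects samples in rows, so transpose the log-CPM matrix
log_cpm <- cpm(dge, log = TRUE, prior.count = 2)
pca <- prcomp(t(log_cpm))

# Plot the first two components and label each point by sample
plot(pca$x[, "PC1"], pca$x[, "PC2"], xlab = "PC1", ylab = "PC2")
text(pca$x[, "PC1"], pca$x[, "PC2"], labels = rownames(pca$x), pos = 3)
```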

1.4 edgeR exploration and analysis

1.4.1 Exploration

1.4.1.1 Dispersion

This is a bit of exploration of what dispersion parameters we should use. edgeR defaults to a prior.n of 10:
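The hidden chunk is roughly this (names assumed; note that recent edgeR releases express the prior as `prior.df` rather than the older `prior.n`):

```r
# Estimate a single common dispersion, then per-gene (tagwise)
# dispersions shrunk toward it with the default prior
dge <- estimateCommonDisp(dge)
dge <- estimateTagwiseDisp(dge, prior.df = 10)
summary(dge$tagwise.dispersion)
```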

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 1.141e-05 6.491e-04 2.115e-03 7.073e-03 6.271e-03 4.674e-02

What happens if we increase the shrinkage/squeezing toward the common dispersion?
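A sketch of that experiment, bumping the prior well above the default:

```r
# A larger prior squeezes the tagwise dispersions harder
# toward the common value
dge_squeezed <- estimateTagwiseDisp(dge, prior.df = 50)
summary(dge_squeezed$tagwise.dispersion)
```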

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 1.141e-05 6.491e-04 2.115e-03 7.073e-03 6.271e-03 4.674e-02

Looks like nothing! Let’s keep the original.

1.4.1.2 Mean and variance modeling

We can also look at how the mean and variance spread for this particular data set. This tells you whether edgeR’s default negative binomial model is a reasonable fit for your data:
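edgeR ships a plot for exactly this; a sketch (DGEList name assumed):

```r
# Gene-wise means vs variances, with the negative binomial
# mean-variance line overlaid for comparison
plotMeanVar(dge, show.raw.vars = TRUE, show.tagwise.vars = TRUE,
            NBline = TRUE)
```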

1.4.1.3 Analysis

Finally, we can get a list of differentially expressed genes!
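Given the `yeast.groupsB` coefficient in the output, the hidden chunk presumably fit a two-level factor called `yeast.groups`; a sketch under that assumption:

```r
# Design matrix for a two-group comparison (levels A and B assumed)
design <- model.matrix(~ yeast.groups)

# Fit gene-wise negative binomial GLMs and test the group effect
dge <- estimateDisp(dge, design)
fit <- glmFit(dge, design)
lrt <- glmLRT(fit, coef = "yeast.groupsB")

# Top ten genes, with FDR-adjusted p-values
topTags(lrt, n = 10)
```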

## Coefficient:  yeast.groupsB 
##             logFC    logCPM        LR PValue FDR
## YHR215W -5.434680  7.393140  1741.131      0   0
## YML123C -5.019035  9.024432  5272.580      0   0
## YGR234W -4.357990  7.840829  2050.401      0   0
## YDR077W -3.771987 11.605155 16001.861      0   0
## YMR120C -3.725649  8.217999  2281.826      0   0
## YJR029W -3.615481  9.053180  3795.265      0   0
## YDR033W -3.569033  8.250825  2210.102      0   0
## YDR342C -3.375454 11.163823 11147.923      0   0
## YFR015C -3.254098  9.244759  3699.434      0   0
## YAR009C -3.154745  8.880019  2821.314      0   0

Let’s order it by p-value:
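The counts-per-million for the top genes can be pulled out like this (assuming `lrt` holds the likelihood-ratio test results and `dge` the DGEList):

```r
# Index of genes sorted by p-value, then CPM for the top ten
o <- order(lrt$table$PValue)
cpm(dge)[o[1:10], ]
```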

##          ERR458493  ERR458494  ERR458495  ERR458500  ERR458501   ERR458502
## YAL005C  2242.1211  2318.7800  2159.7499   789.7740   777.3009   798.52767
## YAL038W 15113.8686 15320.1651 14900.4494  5658.0025  5574.7552  5673.35406
## YAR009C  1167.9719   896.7724  1189.9918   126.3852   107.9882   130.85343
## YBR118W 18614.1756 18465.5249 18264.7922 11156.0238 10910.0119 10949.85757
## YCL030C   953.8637   973.0163   929.6050   143.4498   137.3909   142.65169
## YDL229W  3846.7291  3745.6285  3724.5040  1709.6660  1770.0436  1714.50165
## YDR033W   756.5956   767.2790   744.6574    65.5923    62.0130    63.28157
## YDR077W  6339.0440  6431.1049  6415.9782   453.8134   455.4748   494.99062
## YDR342C  4769.3187  4761.0024  4555.5514   468.2117   434.0910   454.76928
## YDR345C  4481.8364  4376.1527  4369.3870  2283.4653  2299.2925  2266.87472

And here’s a high level summary of our differential expression results:
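That table is the standard edgeR summary (names assumed):

```r
# Classify each gene as Down / NotSig / Up at the default 5% FDR
summary(decideTestsDGE(lrt))
```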

##        yeast.groupsB
## Down            1377
## NotSig          3331
## Up              1209

1.4.2 DE plots

With the models run, we can also plot some diagnostics of the analyzed data, to decide whether our analysis is sound.

The function plotMDS draws a multi-dimensional scaling plot of the RNA samples in which distances correspond to leading log-fold-changes between each pair of RNA samples:
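That plot needs only the DGEList (name assumed):

```r
# Distances between samples approximate leading log2-fold-changes
plotMDS(dge)
```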

This shows the log-fold change against log-counts per million, with DE genes highlighted:
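A sketch of that smear plot, highlighting the genes flagged by decideTestsDGE (names assumed):

```r
# Flag DE genes, then mark them on the log-FC vs log-CPM smear
de_calls <- decideTestsDGE(lrt)
de_genes <- rownames(dge)[de_calls != 0]
plotSmear(lrt, de.tags = de_genes)
```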

This plot in particular looks pretty fishy. Nearly half of our genes show up as DE; that’s not normal. It may be that our fake metadata threw things off, or that we didn’t do enough filtering of our data. We skipped filtering out low read counts, for example, which can break the shrinkage estimator.

Last but not least, let’s look at how our dispersion changed. This one is pretty crazy looking too; maybe we should filter and try again. From the edgeR manual: “Usually a gene is required to have a count of 5-10 in a library to be considered expressed in that library. Users should also filter with count-per-million (CPM) rather than filtering on the counts directly, as the latter does not account for differences in library sizes between samples.”
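Following that advice, a hedged filtering sketch (the thresholds are guesses; with library sizes of roughly 1-1.6 million, 5 CPM corresponds to about 5-8 reads):

```r
# Keep genes with at least 5 CPM in at least half the libraries,
# and recompute the library sizes afterwards
keep <- rowSums(cpm(dge) > 5) >= 3
dge_filtered <- dge[keep, , keep.lib.sizes = FALSE]
```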

1.4.2.1 DE plots after some filtering