# Visualizing BLAST score distributions in RStudio Learning objectives: * learn the basics of the RStudio interface. * explore plotting in RStudio. * explore some characteristics of the data resulting from your BLAST search. ## Getting started Connect to RStudio by setting your password (note, password will not be visible on the screen): ``` sudo passwd $USER ``` figuring out your username: ``` echo My username is $USER ``` and finding YOUR RStudio server interface Web address: ``` echo http://$(hostname):8787/ ``` Now go to that Web address in your Web browser, and log in with the username and password from above. ## Where are we? We've moved from using the command line to using R. Much like the command line allows us to do text and database like operations on large files, R is a statistical programming language that allows us to do Excel-like operations on *any* size dataset. R can do anything from act like a simple interactive calculator, all the way up to automatically analyzing thousands of files and outputting the analysis as website of interactive charts and graphs. For instance, this [Near Earth Object tracker](https://kvistrup.shinyapps.io/neo_fireball_tracking/) is an R script that pulls data from NASA, and will let you do exploratory analysis by just pointing and clicking on a website! R is it's own coding language, so it will take a little time before you can build your own NEO tracker, but we're going to keep coming back to R and learning new features over the course of the workshop. So, it's okay if you don't feel like you 'get' it today. ## Enter some R commands (Enter the below commands into RStudio, not the command line.) Load the data you created with BLAST: ``` blast_out <- read.table('blast/mm-second.x.zebrafish.tsv', sep='\t') ``` If you run `View(blast_out)`, you'll see the same information as in the previous section, *but* loaded into R for plotting and manipulation. The only problem is that the column names are kind of opaque - what does V1 mean!? To fix this, we can reset the column names like so, using the information from [the BLAST outfmt documentation](http://www.metagenomics.wiki/tools/blast/blastn-output-format-6): ``` colnames(blast_out) <- c("qseqid", "sseqid", "pident", "length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore") View(blast_out) ``` `blast_out` is called a dataframe, which is a sort of R-ish version of a spreadsheet with named columns. `View` can be used to present it nicely, and `head(blast_out)` can be used to look at just the first few rows. Another useful command is `dim` which will tell you the DIMENSIONS of this data frame: ``` dim(blast_out) ``` That's a big data frame! 14,524 rows (and 12 columns!) Let's do some data visualization to get a handle on what our blast output looked like. First, let's look at the evalue: ``` hist(blast_out$evalue) ``` This is telling us that MOST of the values in the `evalue` column are quite low. What does this mean? How do we figure out what this is? (You can also try plotting the distribution of `-log(blast_out$evalue)` - why is this more informative?) So these are a lot of *low* e-values. Is that good or bad? Should we be happy or concerned? We can take a look at some more stats -- let's look at the `bitscore` column: ``` hist(blast_out$bitscore) ``` what are we looking for here? (And how would we know?) (Hint: longer bitscores are better, but even bitscores of ~200 mean a nucleotide alignment of 200 bp - which is pretty good, no? Here we really want to rescale the x axis to look at the distribution of bitscores in the 100-300 range.) Another question - if 'bitscore' is a score of the match, and 'pident' is the percent identity - is there a relationship between bitscore and pident? Well, we can ask this directly with `plot`: ``` plot(blast_out$pident, blast_out$bitscore) ``` why does this plot look the way it does? (This may take a minute to show up, note!) The answer is that bitscores are only *somewhat* related to pident; they take into account not only the percent identity but the length. You can get a napkin sketch estimate of this by doing the following: ``` plot(blast_out$pident * (blast_out$qend - blast_out$qstart), blast_out$bitscore) ``` which constructs a *new* variable, the percent identity times the length of the match, and then plots it against bitscore; *this* correlation looks much better. ### Summary points This is an example of initial *exploratory data analysis*, in which we poke around with data to see roughly what it looks like. This is opposed to other approaches where we might be trying to do statistical analysis to confirm a hypothesis. Typically with small replicate sizes (n < 5) it is hard to do confirmatory data analysis or hypothesis testing, so a lot of NGS work is done for *hypothesis generation* and then confirmed via additional experimental work. ## Some questions for discussion/points to make: * Why are we using R for this instead of the UNIX command line, or Excel? One important thing to note here is that we're looking at a pretty large data set - with ease. It would be much slower to do this in Excel. * What other things could we look at?