gain hands-on exposure to the UNIX command line or shell,
understand how data is turned into results by programs run at the command line.
bridge BLAST data output to R/RStudio visualizations.
Boot an m1.medium instance on Jetstream and connect to your shell prompt.
Make sure you are starting in your home directory:
and let’s make a new subdirectory to work in:
mkdir -p ~/blast
Creating a subdirectory will allow us to keep our home directory tidy and help keep us organized. Staying organized will make it easier to locate important files and prevent us from being overwhelmed. As you will find, we will create and use many files.
Now, install some software. We’ll need NCBI BLAST for the below tutorial:
conda install -y blast
BLAST is the Basic Local Alignment Search Tool. It uses an index to rapdily search large sequence databases; it starts by finding small matches between the two sequences and extending those matches. For more information on how BLAST works and the different BLAST functionality, check out the summary on Wikipedia or the NCBI’s list of BLAST resources.
BLAST can be helpful for identifying the source of a sequence, or finding a similar sequence in another organism. In this lesson, we will use BLAST to find zebrafish proteins that are similar to a small set of mouse proteins.
BLAST has a very nice graphical interface for searching sequences in NCBI’s database. However, running BLAST through the commmand line has many benefits:
It’s much easier to run many BLAST queries using the command line than the GUI
Running BLAST with the command line is reproducible and can be documented in a script
The results can be saved in a machine-readable format that can be analyzed later on
You can create your own databases to search rather than using NCBI’s pre-built databases
It allows the queries to be automated
It allows you to use a remote computer to run the BLAST queries
Later on in the workshop we will talk more about these advantages and have a more in-depth explanation of the shell.
We need some data! Let’s grab the mouse and zebrafish RefSeq protein data sets from NCBI, and put them in our home directory.
Now, we’ll use
curl to download the files from a Web site onto our
computer; note, these files originally came from the
NCBI FTP site
curl -o mouse.1.protein.faa.gz -L https://osf.io/v6j9x/download
curl -o mouse.2.protein.faa.gz -L https://osf.io/j2qxk/download
curl -o zebrafish.1.protein.faa.gz -L https://osf.io/68mgf/download
If you look at the files in the current directory:
You should now see these 3:
-rw-rw-r-- 1 titus titus 12553742 Jun 29 08:41 mouse.1.protein.faa.gz
-rw-rw-r-- 1 titus titus 4074490 Jun 29 08:41 mouse.2.protein.faa.gz
-rw-rw-r-- 1 titus titus 13963093 Jun 29 08:42 zebrafish.1.protein.faa.gz
The three files you just downloaded are the last three on the list - the
All three of the files are FASTA protein files (that’s what the .faa
suggests) that are compressed with
gzip (that’s what the .gz means).
and let’s look at the first few sequences in the file:
These are protein sequences in FASTA format. FASTA format is something many of you have probably seen in one form or another – it’s pretty ubiquitous. It’s a text file, containing records; each record starts with a line beginning with a ‘>’, and then contains one or more lines of sequence text.
Let’s take those first two sequences and save them to a file. We’ll do this using output redirection with ‘>’, which says “take all the output and put it into this file here.”
head -n 11 mouse.1.protein.faa > mm-first.faa
So now, for example, you can do
cat mm-first.faa to see the contents of
that file (or
less mm-first.faa). TIP: if you try
less mm-first.faa you will need to exit by pressing the
q key in your keyboard.
Now let’s BLAST these two sequences against the entire zebrafish protein data set. First, we need to tell BLAST that the zebrafish sequences are (a) a database, and (b) a protein database. That’s done by calling ‘makeblastdb’:
makeblastdb -in zebrafish.1.protein.faa -dbtype prot
Next, we call BLAST to do the search:
blastp -query mm-first.faa -db zebrafish.1.protein.faa
This should run pretty quickly, but you’re going to get a lot of output!!
To save it to a file instead of watching it go past on the screen,
ask BLAST to save the output to a file that we’ll name
blastp -query mm-first.faa -db zebrafish.1.protein.faa -out mm-first.x.zebrafish.txt
and then you can ‘page’ through this file at your leisure by typing:
(Type spacebar to move down, and ‘q’ to get out of paging mode.)
Let’s do some more sequences (this one will take a little longer to run):
head -n 498 mouse.1.protein.faa > mm-second.faa
blastp -query mm-second.faa -db zebrafish.1.protein.faa -out mm-second.x.zebrafish.txt
will compare the first 96 sequences. You can look at the output file with:
(and again, type ‘q’ to get out of paging mode.)
you can copy/paste multiple commands at a time, and they will execute in order;
why did it take longer to BLAST
Things to mention and discuss:
blastp options and -help.
command line options, more generally - why so many?
Last, but not least, let’s generate a more machine-readable version of that last file –
blastp -query mm-second.faa -db zebrafish.1.protein.faa -out mm-second.x.zebrafish.tsv -outfmt 6
You can open the file with
less mm-second.x.zebrafish.tsv to see how the file looks like.
See this link for a description of the possible BLAST output formats.