Running command-line BLAST¶
The goal of this tutorial is to run you through a demonstration of the command line, which you may not have seen or used much before.
Start up an m1.xlarge Amazon EC2 instance.
All of the commands below can copy/pasted.
Updating the software on the machine¶
Copy and paste the following commands
sudo apt-get update && sudo apt-get -y install python ncbi-blast+
(make sure to hit enter after the paste – sometimes the last line doesn’t paste completely.)
This updates the software list and installs the Python programming language and NCBI BLAST+.
Running BLAST¶
First! We need some data. Let’s grab the mouse and zebrafish RefSeq protein data sets from NCBI, and put them in /mnt, which is the scratch disk space for Amazon machines
sudo chmod a+rwxt /mnt
cd /mnt
curl -O ftp://ftp.ncbi.nih.gov/refseq/M_musculus/mRNA_Prot/mouse.1.protein.faa.gz
curl -O ftp://ftp.ncbi.nih.gov/refseq/M_musculus/mRNA_Prot/mouse.2.protein.faa.gz
curl -O ftp://ftp.ncbi.nih.gov/refseq/M_musculus/mRNA_Prot/mouse.3.protein.faa.gz
curl -O ftp://ftp.ncbi.nih.gov/refseq/D_rerio/mRNA_Prot/zebrafish.1.protein.faa.gz
If you look at the files in the current directory, you should see four files, along with a directory called lost+found which is for system information:
ls -l
should show you:
total 28904
drwx------ 2 root root 16384 Nov 2 2014 lost+found
-rw-rw-r-- 1 ubuntu ubuntu 8132346 Aug 10 21:44 mouse.1.protein.faa.gz
-rw-rw-r-- 1 ubuntu ubuntu 8091255 Aug 10 21:44 mouse.2.protein.faa.gz
-rw-rw-r-- 1 ubuntu ubuntu 565224 Aug 10 21:44 mouse.3.protein.faa.gz
-rw-rw-r-- 1 ubuntu ubuntu 12735506 Aug 10 21:44 zebrafish.1.protein.faa.gz
All four of the files are FASTA protein files (that’s what the .faa suggests) that are compressed by gzip (that’s what the .gz suggests).
Uncompress them
gunzip *.faa.gz
and let’s look at the first few sequences in the file:
head mouse.1.protein.faa
These are protein sequences in FASTA format. FASTA format is something many of you have probably seen in one form or another – it’s pretty ubiquitous. It’s just a text file, containing records; each record starts with a line beginning with a ‘>’, and then contains one or more lines of sequence text.
Let’s take those first two sequences and save them to a file. We’ll do this using output redirection with ‘>’, which says “take all the output and put it into this file here.”
head -11 mouse.1.protein.faa > mm-first.fa
So now, for example, you can do ‘cat mm-first.fa’ to see the contents of that file (or ‘less mm-first.fa’).
Now let’s BLAST these two sequences against the entire zebrafish protein data set. First, we need to tell BLAST that the zebrafish sequences are (a) a database, and (b) a protein database. That’s done by calling ‘makeblastdb’
makeblastdb -in zebrafish.1.protein.faa -dbtype prot
Next, we call BLAST to do the search
blastp -query mm-first.fa -db zebrafish.1.protein.faa
This should run pretty quickly, but you’re going to get of output!! To save it to a file instead of watching it go past on the screen, do:
blastp -query mm-first.fa -db zebrafish.1.protein.faa -out mm-first.x.zebrafish.txt
and then you can ‘page’ through this file at your leisure by typing:
less mm-first.x.zebrafish.txt
(Type spacebar to move down, and ‘q’ to get out of paging mode.)
Let’s do some more sequences:
head -500 mouse.1.protein.faa > mm-second.fa
blastp -query mm-second.fa -db zebrafish.1.protein.faa -out mm-second.x.zebrafish.txt
will compare the first 83 sequences. You can look at the output file with:
less mm-second.x.zebrafish.txt
(and again, type ‘q’ to get out of paging mode.)
Notes:
you can execute multiple commands at a time;
You should see a warning -
Selenocysteine (U) at position 310 replaced by X
what does this mean?
why did it take longer to BLAST
mm-second.fa
thanmm-first.fa
?
Things to mention and discuss:
- blastp options and -help.
- command line options, more generally - why??
- automation rocks!
Reminder: shut down your instance!
LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- fork @ github. Presentations (PPT/PDF) and PDFs are the property of their respective owners and are under the terms indicated within the presentation.
comments powered by Disqus