# Six glorious commands

There are **a lot** of things we can do at the command line, and there is unfortunately no *one* place to go to find all of the things that would be helpful. But here we are going to introduce six standard and powerful commands that are worth being aware of. Again, we don't need to remember the details of any of them, but having an idea they exist means we might think of them when we have a problem to solve, and then we can learn what we need. **Remember, this is just about exposure right now** 🙂

If you'd like to follow along, but need to pull up the proper working environment again, revisit [here](shell-getting-started-01.html#how-to-access-the-shell-for-now) and then come back 🙂

---
To be sure we are starting in the same place, let's run:

```bash
cd ~/shell_intro
```

---

We'll mostly be working with a file here called "gene_annotations.tsv", which is a tab-delimited table holding genes, their annotations, and their amino acid sequences. To help orient us, here is a peek at it in Excel:

*(Screenshot of "gene_annotations.tsv" open in Excel.)*
## grep

**`grep`** (**g**lobal **r**egular **e**xpression **p**rint) is a search tool. It looks through text files for strings (sequences of characters). In its default usage, **`grep`** will look for whatever string of characters you give it (1st positional argument), in whichever file you specify (2nd positional argument), and then print out the lines that contain what you searched for. Let's try it:

```bash
head colors.txt
grep blue colors.txt
```

If there are multiple lines that match, **`grep`** will print them all:

```bash
grep re colors.txt
```

If what we are looking for is not in the file, we will just get our prompt back with nothing printed out:

```bash
grep black colors.txt
```

Back to our gene annotations file, remember it holds KO-annotation information in the 3rd and 4th columns:

```bash
head -n 1 gene_annotations.tsv
```

For the moment, let's pretend we're interested in genes predicted to encode the enzyme epoxyqueuosine reductase. If we search at the [KO website](https://www.genome.jp/kegg/ko.html) for this, it tells us that there are [2 KO_IDs](https://www.genome.jp/dbget-bin/www_bfind_sub?mode=bfind&max_hit=1000&dbkey=orthology&keywords=epoxyqueuosine+reductase) associated with it: K09765 and K18979. **`grep`** is a super-quick way to see if they are in our annotations file:

```bash
grep K09765 gene_annotations.tsv
grep K18979 gene_annotations.tsv
```

It seems the first one wasn't found in our genomes, but the second one is in there twice!

**PRACTICE!** From our tab-delimited file, "gene_annotations.tsv", try to make a new file that has just 2 columns: the gene_ID and KO_annotation columns (remember the **`>`** redirector). Name the new file "IDs_and_annotations.tsv".

**Solution**

```bash
cut -f 1,4 gene_annotations.tsv | head
cut -f 1,4 gene_annotations.tsv > IDs_and_annotations.tsv
head IDs_and_annotations.tsv
```

And to make sure it holds all 101 lines and not just the first 10:

```bash
wc -l IDs_and_annotations.tsv
```
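As a small aside to the two **`grep`** searches for K09765 and K18979 above, both IDs can also be looked for in one command by giving each search string with **`-e`**. This is just a sketch on a made-up, 4-line table: the "toy_" gene IDs, the genome names, the header names, and the K00001 row are all placeholders, not values from "gene_annotations.tsv".

```bash
# Build a tiny, made-up annotation table in a temporary directory
demo_dir=$(mktemp -d)
toy_table="$demo_dir/toy_annotations.tsv"
{
  printf 'gene_ID\tgenome\tKO_ID\tKO_annotation\n'
  printf 'toy_1\tGenomeA\tK18979\tepoxyqueuosine reductase\n'
  printf 'toy_2\tGenomeA\tK00001\tplaceholder annotation\n'
  printf 'toy_3\tGenomeB\tK18979\tepoxyqueuosine reductase\n'
} > "$toy_table"

# Each -e adds one more search string; a line is printed if it contains any of them
hits=$(grep -e K09765 -e K18979 "$toy_table")
printf '%s\n' "$hits"
```

Only the two K18979 rows come back here, since the toy table has no K09765 line.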
We're just scratching the surface of what **`grep`** can do, but one thing worth mentioning is the **`-c`** flag. This tells **`grep`** to just report how many lines matched, instead of printing them to the screen:

```bash
grep -c K18979 gene_annotations.tsv
```

## paste

Like **`cut`**, **`paste`** also works with columns. It pastes things together horizontally with a delimiter in between them (a tab by default). We have another file in our working directory that holds some color names in Spanish:

```bash
head colores.txt
```

For a quick example of how **`paste`** works, let's paste this file to our "colors.txt" file:

```bash
paste colors.txt colores.txt
```

For a more practical example, let's look at another file in our directory that holds the amino acid lengths and sequences of our genes:

```bash
head genes_and_seqs.tsv
```

If the lines are longer than the terminal window, then they will wrap like this and look kind of messy. We can take a look without line wraps with the **`less`** program by adding the **`-S`** option:

```bash
less -S genes_and_seqs.tsv
```

In this view things run off the screen, but each line is one row. Note that the terminal doesn't automatically line up columns for us. **`q`** will exit **`less`**.

Let's say we want to add these protein lengths and sequences to our "gene_annotations.tsv" file. We can **`paste`** the two files together, but then we'll have two columns for gene_ID (columns 1 and 5):

```bash
paste gene_annotations.tsv genes_and_seqs.tsv | head -n 1
```

>**Note:** If a "paste: write error: Broken pipe" message pops up here, it can be ignored. It is just happening because the **`head`** command is finishing before the **`paste`** command, and then **`paste`** is telling us it had nowhere to send the output anymore. But since all we care about is the first line here, it does not affect what we're doing. (Not all systems do things this way, but the one we're working on does.)
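To see the doubled-up ID column on something small enough to read at a glance, here is a sketch with two made-up files. Everything in them is a placeholder: the column layouts (4 columns and 3 columns), the header names other than "gene_ID", and the values are assumptions for illustration, not taken from our real tables.

```bash
demo_dir=$(mktemp -d)

# Toy stand-in for the annotations table: 4 columns
printf 'gene_ID\tgenome\tKO_ID\tKO_annotation\n' > "$demo_dir/toy_annotations.tsv"
printf 'toy_1\tGenomeA\tK00001\tplaceholder annotation\n' >> "$demo_dir/toy_annotations.tsv"

# Toy stand-in for the lengths-and-sequences table: 3 columns, same row order
printf 'gene_ID\taa_length\taa_sequence\n' > "$demo_dir/toy_seqs.tsv"
printf 'toy_1\t5\tMKLVT\n' >> "$demo_dir/toy_seqs.tsv"

# paste joins line 1 with line 1, line 2 with line 2, and so on, with a tab in between
pasted=$(paste "$demo_dir/toy_annotations.tsv" "$demo_dir/toy_seqs.tsv")
printf '%s\n' "$pasted"

# Keep just the header line so the repeated column is easy to spot
pasted_header=$(printf '%s\n' "$pasted" | head -n 1)
printf '%s\n' "$pasted_header"
```

The pasted header has 7 columns, with "gene_ID" sitting in both column 1 and column 5, the same situation as with our real files.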
If we wanted to take everything except the fifth column (the second "gene_ID" column), we could do it like this:

```bash
paste gene_annotations.tsv genes_and_seqs.tsv | cut -f 1-4,6- | head -n 2
```

Notice that by putting the dash after the 6, and nothing else, we are specifying that column and all that follow.

>**NOTE:** **`paste`** is a super-useful command. But it does **not** check to make sure what we are doing makes sense. If these files were out of order from each other, **`paste`** would still be just as happy to stick them together and then our merged file would hold mismatched information. So it's important to make sure things we are pasting together are in the appropriate order. It's a little too far off the path for now, but just to note them, useful commands to look into for doing this would be **`sort`** and **`comm`** 🙂

## sed

**`sed`** (for **s**tream **ed**itor) is our "search and replace" command, just like in something like Excel or Word, but much more powerful. Like many of the commands here, **`sed`** is useful in just general usage, but you can also learn to do a lot more with it if you need/want to at some point. For now, let's look at the general usage.

Let's imagine a totally-not-real, never-happened scenario where co-authors waited until our paper was accepted (and we've even approved the proofs already) to then tell us they want to change the name of one of the new genomes in it 🤦 So now we need to change all instances of "UW179A" to "UW277". This genome happens to be at the end of our file, so we can check it with **`tail`** if we'd like:

```bash
tail gene_annotations.tsv
```

The syntax of **`sed`** is a little strange at first, so let's run it and then break it down (don't forget, feel free to copy and paste things):

```bash
sed 's/UW179A/UW277/' gene_annotations.tsv | tail
```

>Here, the **`sed`** command is followed by an expression within single quotes.
>This expression holds 4 items separated by the 3 forward slashes in there: the 1st is the letter "s", which is for "substitute"; the 2nd is what we'd like to find and replace, "UW179A"; the 3rd is what we'd like to replace it with, "UW277"; and the 4th is actually empty in this case (the next example will use that slot).

Now that we've previewed this, we can remove the **`tail`** and write the new version to a file with a redirector:

```bash
sed 's/UW179A/UW277/' gene_annotations.tsv > modified_gene_annotations.tsv
tail modified_gene_annotations.tsv
```

And note that this did not alter the original file:

```bash
tail gene_annotations.tsv
```

**One important thing to know about `sed` is that by default it will only change the first occurrence of something in a line.** For example, let's say we need to change all occurrences of "NA" to "\

**PRACTICE!** Using a combination of **`grep`** and **`cut`**, try to print out just the genomes (column 2) that have the "K18979" annotation.

**Solution**

```bash
grep K18979 gene_annotations.tsv | cut -f 2
```
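Going back to the point that **`sed`** only changes the first occurrence on each line by default: adding a "g" (for "global") in that 4th slot tells it to change every occurrence on the line. Here is a minimal sketch on a made-up line, where the "gene_x" ID and the replacement text "missing" are placeholders chosen just for illustration:

```bash
# A made-up, tab-separated line holding "NA" twice
toy_line=$(printf 'gene_x\tNA\tNA')

# Default behavior: only the first "NA" on the line is replaced
first_only=$(printf '%s\n' "$toy_line" | sed 's/NA/missing/')
printf '%s\n' "$first_only"

# With "g" in the 4th slot: every "NA" on the line is replaced
all_of_them=$(printf '%s\n' "$toy_line" | sed 's/NA/missing/g')
printf '%s\n' "$all_of_them"
```

One thing to keep in mind with a short search string like "NA" is that **`sed`** will also match it inside longer words, so it is worth previewing the result (for example with **`head`** or **`tail`**) before writing a new file.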
## awk

| query | qlen | subject | slen | pident | al_length |
|---|---|---|---|---|---|
| Te_4133 | 1470 | 3R_1087 | 8642 | 100.0 | 200 |

Query sequence "Te_4133" hit a reference sequence with 100% identity, **but** the alignment length is only 200 while the input sequence (the query) length is 1470. Depending on what we are doing, this might not be what we want. It is common to filter out hits like this by requiring some minimum fraction of the query sequence to have successfully aligned. Here is how we can tell **`awk`** to only keep the hits that are greater than 95% identical (column 5, "pident") AND where more than 90% of the query sequence aligned (column 6, "al_length", compared against 0.9 times column 2, "qlen"):

```bash
awk ' $5 > 95 && $6 > $2 * 0.9 ' blast_output.tsv
```

**Again, `awk` can seem pretty tricky, especially at first, but fortunately we don't need to remember *how* to do these things, just that they can be done. And then we can look it up when we need it 🙂**

## tr

The last one we're going to look at is **`tr`** (for **tr**anslate). **`tr`** changes one character into another character. It seems to become more useful with time, but it's worth knowing early if for no other reason than it deals with special characters really well: the type of special characters that many Excel versions put in exported tables that can ruin working with them at the command line 🤬

For example, when exporting a table as tab-delimited or as a csv file from many versions of Excel, there will be odd newline characters (newline characters tell the computer to end one line and start a new one). The typical newline character is represented like this **`\n`**, but Excel likes to put in **`\r`** characters. We can see this messing with things on the Excel-exported file if we open it with **`less`**:

```bash
less gene_annotations_excel_exported.tsv
```

Everywhere there should be a line break, there is an odd `^M` thing going on (hit **`q`** to exit **`less`**).
We can also see it if we try to count the number of lines:

```bash
wc -l gene_annotations_excel_exported.tsv
```

**`wc -l`** actually just counts the newline characters in a file (normally **`\n`**), so here it finds none. But this is where **`tr`** comes to the rescue. These characters can be swapped so that working between Excel and the command line is no longer a problem and we can enjoy both worlds 🙂

The **`tr`** command does not accept the file you want to work on as a positional argument like many of the other commands we've seen. Instead we need to use a new *redirector*, **`<`**. While **`>`** as we've seen handles the output, **`<`** handles the input. It gives the file following it to the program in front of it:

```bash
tr "\r" "\n" < gene_annotations_excel_exported.tsv > gene_annotations_fixed.tsv
```

Here we are specifying the **`tr`** command; the first positional argument is what we want to replace, **`"\r"`**; the second positional argument is what we want to replace it with, **`"\n"`**; then our input file follows the **`<`** redirector; and the output file we want to make follows the **`>`** redirector.

Now we can see the new file we made is ready for the command line (**`q`** exits **`less`**):

```bash
less gene_annotations_fixed.tsv
```

## Summary

As mentioned, this page is just a first introduction to some great commands that are worth having in our toolkit. Each of them has much more functionality that we can dig into further as needed 🙂

Next we're going to look at [variables and for loops!](shell-for-loops-05.md)