# Automating workflows using bash ## Objectives + Learn what a bash/shell script is and how to write one + Incorporate for loops within a shell script --- We have now worked through two workflows, RNA-seq analysis and variant calling. In both cases, we pasted all of our commands into the terminal one command at a time, and used for loops to run each step of the workflow over our six samples. This is necessary at first, and a great way to understand how each tool works and establish a workflow we want to use. Let's imagine now that we decided to download all 672 samples from this dataset instead of only working with our six samples. Since we have already figured out how each tool in our workflow works, and we know how we want to run them, one way we can automate (and document) our workflow is with a bash script (remember that *bash* is just the most common programming language used in our terminal environment). Here we are going to automate the quality assessment and trimming we did [here](quality-and-trimming.md) and the read quantification we performed [here](salmon-quant.md) in one script. But first, let's see what a bash script is! ## Accessing our JetStream instances You should still have your jetstream instance running, you can follow the instructions [here](jetstream/boot.html) to log in to JetStream and find your instance. Then `ssh` into it following the instructions [here](jetstream/boot.html#ssh-secure-login). Next, we're going to access RStudio again on our instances. RStudio will be convenient for being able to edit a text file alongside an active terminal window (command-line window). We'll see this in action in a second. As earlier, run the following link to produce a url that navigates to RStudio when entered in your browser. ```bash echo http://$(hostname -i):8787/ ``` ## Setting up our environment Previously, we've been writting R code in the RStudio text-editor at the top left, but we can also write any kind of files there. So that will be where we build up our bash script. To start with a clean new document, select "**File**" from the top menu bar inside RStudio, then "**New File**", then "**Text File**". At the bottom left window is our R "console" right now. But if we select "**Tools**" from the top menu bar within RStudio, then "**Terminal**", and then "**New Terminal**", we will open a command-line terminal in that bottom left window that looks something like this:
Lastly, let's add on read quantification with Salmon like we did [here](salmon-quant.md#index-the-yeast-transcriptome). We only need to run the indexing command once, and then we can put a for loop to run the quantification command on our quality trimmed fastq files just like the one used [here](salmon-quant.md#run-salmon-on-all-the-samples). And then let's add a message at the end for our future selves: ```bash echo "Running FastQC..." fastqc *.fastq.gz echo "Moving FastQC results..." mkdir fastqc_untrimmed/ mv *.zip fastqc_untrimmed/ mv *.html fastqc_untrimmed/ echo "Running Trimmomatic..." for filename in *.fastq.gz do # first, getting the unique part of the file name alone by removing ".fastq.gz" base=$(basename $filename .fastq.gz) echo "On sample: $base" trimmomatic SE ${base}.fastq.gz \ ${base}.qc.fq.gz \ ILLUMINACLIP:TruSeq2-SE.fa:2:0:15 \ LEADING:15 TRAILING:15 \ SLIDINGWINDOW:10:20 \ MINLEN:25 done echo "Running FastQC on trimmed files..." fastqc *.qc.fq.gz echo "Moving trimmed FastQC results..." mkdir fastqc_trimmed/ mv *.zip fastqc_trimmed/ mv *.html fastqc_trimmed/ echo "Indexing ref. transcriptome for Salmon..." salmon index --index sc_index --type quasi --transcripts GCA_000146045.2_R64_rna_from_genomic.fna.gz echo "Quantifying with Salmon..." for filename in *.qc.fq.gz do # getting the unique part of the file name alone by removing ".qc.fq.gz" base=$(basename $filename .qc.fq.gz) salmon quant -i sc_index --libType A \ -r ${filename} -o ${base}_quant --seqBias \ --gcBias --validateMappings done echo "Job Done!! (That's a nice-lookin' script!)" ``` Stellar! Don't forget to save it, and now let's run it! ```bash bash workflow.sh ``` Bash scripting can be very powerful!!PRACTICE! Add some commands at the end of our script that will:
1) run FastQC on the output files from Trimmomatic (those that end with ".qc.fq.gz")
2) create a new directory called "fastqc_trimmed"
3) move the output files from FastQC into "fastqc_trimmed".Solution
Our entire file should now look something like this:echo "Running FastQC..." fastqc *.fastq.gz
echo "Moving FastQC results..." mkdir fastqc_untrimmed/
mv *.zip fastqc_untrimmed/ mv *.html fastqc_untrimmed/
echo "Running Trimmomatic..." for filename in *.fastq.gz do
# first, getting the unique part of the file name by removing ".fastq.gz" base=$(basename $filename .fastq.gz) echo On sample: $base
trimmomatic SE ${base}.fastq.gz \ ${base}.qc.fq.gz \ ILLUMINACLIP:TruSeq2-SE.fa:2:0:15 \ LEADING:15 TRAILING:15 \ SLIDINGWINDOW:10:20 \ MINLEN:25
done
echo "Running FastQC on trimmed files..." fastqc *.qc.fq.gz
echo "Moving trimmed FastQC results..." mkdir fastqc_trimmed/
mv *.zip fastqc_trimmed/ mv *.html fastqc_trimmed/