Table of Contents

Next-Gen Sequence Analysis Workshop (2019)

These are the schedule and classroom materials for the ANGUS workshop at UC Davis, which will run from July 1st to July 13th, 2019. ANGUS is part of the 2019 Data Intensive Biology Summer Institute.

This workshop runs under a Code of Conduct. Please respect it and be excellent to each other!

We hope that we can respect each other by following these guidelines & expectations.

Please submit your feedback here on any aspect of the workshop. We encourage feedback from everyone involved in the workshop, in any capacity.

Twitter hash tag: #dibsi2019

3 Rooms & lead instructors!

Room Mexía will be led by Rocio Martinez-Nunez and Taylor Reiter, in 1043 Valley.

Room Lovelace will enjoy the leadership of Chissa Rivaldi and Mike Lee, in 1041 Valley.

Room Conway will be led by Sateesh Peri & Marian Schmidt, in 2020 Valley.

Schedule, in brief:

Monday, July 1st, 2019:

Tuesday, July 2nd, 2019:

Wednesday, July 3rd, 2019:

  • 9:00-9:15am: Meet in Lecture Room (1030) for quick meeting

  • 9:20-10:20am: Wrapping up Introduction to the Shell

    • 10:20-10:40am: BREAK

  • 10:40am-12:00pm: Wrapping up Introduction to the Shell

    • 12:00-1:00pm: LUNCH

  • 1:00-2:00pm: Amanda Charbonneau: How to statistics Slides - view recording

    • 2:00-2:10pm: BREAK

  • 2:15-3:30pm: Logging onto the Cloud

    • 3:30-3:50pm: BREAK

  • 3:50-5:00pm: Conda

    • 5:00pm: DONE

Thursday, July 4th, 2019

Friday, July 5th, 2019

Second week: July 8-13, 2019.

Monday, July 8th, 2019

Tuesday, July 9th, 2019

Wednesday, July 10th, 2019

Thursday, July 11th, 2019

Friday, July 12th, 2019

Saturday, July 13th, 2019

(check out of your rooms and bring luggage over. Please be sure to bring your computer!)

  • 9:00am-9:30am: Cloud Resources with Sateesh Peri

  • 9:30am-10:00am: 5 things I wish I knew by Mike Lee

  • 10:00am-10:20am: What to Slack after DIBSI? by Shannon Joslin

    • 10:20am-10:40am: BREAK

  • 10:40am-11:00am: Survey in Room 1030

  • 11:00am-11:30am: Classroom feedback–1up/1down (in classrooms) & TA / Instructor meeting (in Rm1030)

  • 11:30am-12:00pm: Titus Wrap up & Outro

    • 12:00pm: THANK YOU & HAPPY TRAVELS!

Some contact info 😊

Workshop Code of Conduct

Introduction

All DIBSI attendees are expected to agree with the following code of conduct. We will enforce this code as needed. We expect cooperation from all attendees to help ensure a safe environment for everybody.

Please see the Event Guidelines and Expectations for more information.

Enforcement may result in one or more behavioral modifications intended to ensure that this workshop continues to be a safe and welcoming teaching and learning environment.

The Quick Version

DIBSI events are neither a dating scene nor an intellectual contest.

DIBSI is dedicated to providing a harassment-free experience for everyone, regardless of gender, gender identity and expression, age, sexual orientation, disability, physical appearance, body size, race, or religion (or lack thereof). We do not tolerate harassment of participants in any form. Sexual language and imagery are generally not appropriate for any DIBSI venue.

The Less Quick Version

Harassment includes offensive verbal comments related to gender, gender identity and expression, age, sexual orientation, disability, physical appearance, body size, race, religion, sexual images in public spaces, deliberate intimidation, stalking, following, harassing photography or recording, sustained disruption of talks or other events, inappropriate physical contact, and unwelcome sexual attention.

Attendees asked to stop any behavior are expected to comply immediately.

If you are being harassed, notice that someone else is being harassed, or if you are uncertain whether someone’s behavior might constitute harassment or have any other concerns, please contact any of the helpers, lead instructors, or organizers immediately. For more information, see the Attendee Procedure for Reporting Code of Conduct Incidents.

We expect participants to follow these guidelines at any DIBSI-organized event.

Original source and credit: http://2012.jsconf.us/#/about & The Ada Initiative. Please help by translating or improving: http://github.com/leftlogic/confcodeofconduct.com. This work is licensed under a Creative Commons Attribution 3.0 Unported License.

Guidelines & Expectations for ANGUS 2019 Events

Introduction

Everyone who participates in our ANGUS 2019 workshop shares the responsibility of creating a welcoming environment for learning and networking. As such, we would like to outline Workshop Guidelines for everyone and expectations for people based on their participation role below.

Note: All DIBSI participants must follow the event Code of Conduct. Our first priority is to provide and sustain a safe and welcoming environment for learning and teaching!

Workshop Guidelines

  1. Our primary commitment is to learn from each other. We will listen to each other and not talk at each other. We acknowledge differences amongst us in backgrounds, skills, interests, and values. We realize that it is these very differences that will increase our awareness and understanding through this process.

  2. Share responsibility for including everyone in the discussion. Be mindful of taking up much more space than others. Empower yourself to speak up when others are dominating the conversation.

  3. Confidentiality. We want to create an atmosphere for open, honest exchange. Unless it is an actionable item or specifically indicated on notecards, we will not share beyond this room without permission.

  4. Do not demean, devalue, or “put down” people for their experiences, lack of experiences, difference in interpretation of those experiences, or choice of computer or software.

  5. Expect a range of emotions to be present, though not always visible. The workshop has long days and can be very intense for those around us.

  6. Trust that people are always doing the best they can. We will give each other the benefit of the doubt. We will assume we are all trying our hardest and that our intentions are good even when the impact is not. We will be open to the possibility of being wrong, and we will accept responsibility for the consequences of our actions, regardless of our intentions.

  7. Challenge the idea and not the person and expect to have your perspectives challenged by others. If we wish to challenge something that has been said, we will challenge the idea or the practice referred to, not the individual sharing this idea or practice.

  8. Speak your discomfort. If something is bothering you, please share this with the group, your helper/instructor/coordinators, or submit a response through our online form. Often our emotional reactions to this process offer the most valuable learning opportunities.

  9. Expect and accept a lack of closure. We will cover a myriad of topics throughout this workshop and not all questions will be answered or clear.

  10. We are all a work in progress. We will not assume that one comment or one opinion made at one time captures the whole of a person’s character. We are all in the process of learning, including the facilitators.

Expectations of event learners

  • Attendees are an active part of the workshop and help set the event atmosphere!

  • Ask questions if they have them & be aware of how their questions may impact others

    • Raise their hand

    • Ask a helper

    • Type it in hackMD

    • Write it on a sticky at the end of each lesson

    • Submit your question to the feedback form

  • Have an open mind and treat others as if they have something to offer.

  • Be ready and willing to learn from instructors, helpers, and each other.

  • If you have relevant expertise, we invite you to contribute to the conversation, especially when it relates directly to your own experiences. This can be done through the:

    • Stickies

    • Group notes on the hackMD

    • Telling an instructor/helper

    • The feedback form

  • Actively participate in events they attend

    • Use stickies

    • Contribute to group notes

    • If you have the capacity, help the people around you

  • Respect others at the event

  • When receiving help from others during a lesson:

    • Use a whisper voice to lessen the impact on others

    • Be open to “driving” the computer

    • If more extensive help is necessary, be open to the helper/instructor “driving” your computer

    • Communicate personal space boundaries when necessary

  • Avoid disruptive behavior including:

    • Speaking to those around them during the lesson

    • Asking project-specific questions during the lesson (save these for outside of formal instruction or for BYOD in week 2)

  • Be an active and responsive bystander by either:

    • Speaking up in the moment of an event.

    • Submitting a climate response form.

    • Talking to instructors and course coordinators regarding events that you may witness.

Expectations of instructors and helpers

  • Have an open mind and treat others as if they have something to offer.

  • Help event attendees learn and grow in a safe/brave and welcoming environment.

    • Welcome and encourage questions

    • When helping someone at their computer:

      • Be kind when approaching a learner at their computer and give them the benefit of the doubt that they are doing the best they can.

      • Be aware of the physical space between those that surround them.

      • Praise errors as opportunities for learners

      • Prioritize the opportunity for the learner to “drive” the computer.

      • If it makes more sense for the instructor/helper to “drive” the learner’s computer, always ask for permission before touching someone else’s computer.

    • Be an active and responsive bystander by either:

      • Speaking up in the moment.

      • Submitting a climate response form.

      • Talking to instructors and course coordinators regarding events that you may witness.

  • Teach attendees

    • Ask leading questions and help learners get to answers on their own

    • When actively teaching, use a microphone.

    • Instructors will repeat questions from learners

  • Provide individualized or group help as needed

  • Avoid disruptive behavior including:

    • Intellectually aggressive behavior, such as challenging other people (including instructors/helpers/attendees) on their content

    • Be mindful of the language you use:

      • Avoid minimizing language including “just”, “only”, “simply”, “easy”, “obviously”, etc.

      • Instead, consider motivating language such as “This gets easier with practice” & “I remember when it was hard for me too.”

  • Comfort & empathize with others who may be stressed

  • Some other potential roles:

    • Ensure breaks happen when they are scheduled

      • Give the instructor who is teaching a few minutes’ warning before the lesson is due to finish

    • Note takers/supporters

    • Slack monitoring

    • Room coordinating

    • Actively watching for sticky notes - and promoting sticky awareness for other instructors & helpers

    • Writing challenge questions

    • Assisting with curriculum tweaks

Expectations of workshop organizers

  • Provide the infrastructure needed for a safe and welcoming environment

  • Coordinate the overall event

  • Address individual attendee, instructor, and helper needs

  • Lay out clear and unambiguous guidelines for participant conduct, and provide incident response guidelines and training.

  • Be an active and responsive bystander by either:

    • Speaking up in the moment

    • Submitting a climate response form

    • Talking to the lead-instructors and other coordinators regarding events that you may witness

Expectations in the lecture hall

  • Always use the microphone.

  • Speakers should repeat questions from the audience.

Intro to cloud computing

Learning Objectives:

  • Understand what cloud computing is

  • Get to know some popular cloud compute resources

  • Perform a command-line BLAST

Rationale

Researchers worldwide connect daily to cloud computing services to perform their bioinformatics analyses. Biology - like lots of areas - is dealing with a deluge of data due to the rapid advancement of data collection methods. It is now common that the data collected for an experiment doesn’t fit on a researcher’s laptop, and that the resources needed for running an analysis can far exceed a desktop computer’s computing power. Cloud computing is a powerful avenue for dealing with this.

What is the Cloud?


The cloud is basically lots of servers (think big computers) stacked together in a giant, powerful infrastructure. You can borrow part of this infrastructure for your computing needs.

There are two main approaches to accessing computing time and power: 1) you can rent computing time and resources from a commercial provider; or 2) you may have access to a computing infrastructure through your institution or elsewhere. Some countries have built national infrastructures where you can apply for computing time for your research projects. Most academic institutions or departments also have their own computing resources.

Let’s connect to the cloud

Each of us has an IP address for connecting to our cloud computer; these are stored in this spreadsheet here.

That will open a screen where you need to enter a password. (The password will be told to you in the room or in slack.) Then a screen like this will open (minus the blue arrows):


Now click the files tab at the top-left (that the smaller blue arrow points to above) and then click the “Terminal” icon at the bottom, and we’ll be in our appropriate command-line environment:


Installing blast and creating a working directory:

cd ~/
conda install -y -c bioconda blast
mkdir opening_example
cd opening_example/

Note: the install takes 5-10 minutes

Command Line BLAST

Basic Local Alignment Search Tool

Given one or more query sequences (usually in FASTA format), BLAST looks for matching sequence regions between them and a subject set.


A sufficiently close match between subsequences (denoted by arrows in the figure above, though matches are usually longer than illustrated here) is called a high-scoring pair (HSP), while a query sequence is said to hit a target sequence if they share one or more HSPs.

Running command-line BLAST

The goal of this tutorial is to run you through a demonstration of the command line, which you may not have seen or used much before.

First, we need some data and an R script to visualize the BLAST results later. Let’s grab the mouse and zebrafish RefSeq protein data sets from NCBI, and put them in our home directory. We’ll use curl to download the files; these originally came from the NCBI Web site: ftp://ftp.ncbi.nih.gov/refseq/M_musculus/mRNA_Prot.

curl -o mouse.1.protein.faa.gz -L https://osf.io/v6j9x/download
curl -o mouse.2.protein.faa.gz -L https://osf.io/j2qxk/download
curl -o zebrafish.1.protein.faa.gz -L https://osf.io/68mgf/download
curl -o blastviz.R -L https://osf.io/e548g/download

To look at the files in your current directory:

ls -l

All three of the sequence files are FASTA protein files (that’s what the .faa suggests) that are compressed with gzip (that’s what the .gz means).

Uncompress them:

gunzip *.faa.gz

And let’s look at the first few sequences in the file:

head mouse.1.protein.faa 

These are protein sequences in FASTA format. FASTA format is something many of you have probably seen in one form or another – it’s pretty ubiquitous. It’s a text file, containing records; each record starts with a line beginning with a ‘>’, and then contains one or more lines of sequence text.
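Since every record starts with a ‘>’, we can count how many sequences are in each file with grep. This is just an optional sanity check; the exact counts will depend on the RefSeq release these particular files came from:

grep -c '^>' mouse.1.protein.faa mouse.2.protein.faa zebrafish.1.protein.faa

The -c flag tells grep to report the number of matching lines per file rather than printing the lines themselves.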

Let’s take those first two sequences and save them to a file. We’ll do this using output redirection with ‘>’, which says “take all the output and put it into this file here.”

head -11 mouse.1.protein.faa > mm-first.fa

Now let’s BLAST these two sequences against the entire zebrafish protein data set. First we need to create a database. That’s done by calling ‘makeblastdb’:

makeblastdb -in zebrafish.1.protein.faa -dbtype prot

Next, we call BLAST to do the search:

blastp -query mm-first.fa -db zebrafish.1.protein.faa

This should run pretty quickly, but you’re going to get a lot of output!! To save it to a file instead of watching it go past on the screen, ask BLAST to save the output to a file that we’ll name mm-first.x.zebrafish.txt:

blastp -query mm-first.fa -db zebrafish.1.protein.faa -out mm-first.x.zebrafish.txt

and then you can ‘page’ through this file at your leisure by typing:

less mm-first.x.zebrafish.txt

(Type spacebar to move down, and ‘q’ to get out of paging mode.)


Let’s do some more sequences (this one will take a little longer to run):

head -498 mouse.1.protein.faa > mm-second.fa

blastp -query mm-second.fa -db zebrafish.1.protein.faa -out mm-second.x.zebrafish.txt

will compare the first 96 sequences. You can look at the output file with:

less mm-second.x.zebrafish.txt

(and again, type ‘q’ to get out of paging mode.)

Notes:

  • you can copy/paste multiple commands at a time, and they will execute in order;

  • why did it take longer to BLAST mm-second.fa than mm-first.fa?


Last, but not least, let’s generate a more machine-readable version of that last file –

blastp -query mm-second.fa -db zebrafish.1.protein.faa -out mm-second.x.zebrafish.tsv -outfmt 6

See this link for a description of the possible BLAST output table formats.
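If you’d like to peek at that tabular output, or choose your own columns, the -outfmt string can list specific fields. The field keywords below (qseqid, sseqid, pident, evalue, bitscore) are standard BLAST+ format specifiers; this is an optional sketch, not part of the main tutorial:

head -2 mm-second.x.zebrafish.tsv

blastp -query mm-first.fa -db zebrafish.1.protein.faa -outfmt '6 qseqid sseqid pident evalue bitscore' | head

The first command shows the default 12-column format; the second re-runs the smaller search and keeps only the query ID, subject ID, percent identity, e-value, and bit score.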

Now we’ll run an R script to visualize the blast results:

Rscript blastviz.R

A pdf will be generated with the results. We can view it by clicking on the folder icon at the left of our screen, and then double-clicking on the file at the top to open the pdf:


Things to mention and discuss:

  • blastp options and -help.

  • command line options, more generally - why so many?

  • automation rocks!

An introduction to the shell


IMPORTANT NOTICE!!

Maybe the most important thing to keep in mind while going through these pages is that this is all about exposure, not memorization or mastering anything. Don't worry about the details 🙂

Some terminology

First, here are some terms that are often used interchangeably when referring to the text-based environment we were working in yesterday. It’s not important to remember the differences, but they’re here just so we see them at least once 🙂

  • shell: what we use to talk to the computer; anything where you are pointing and clicking with a mouse is a Graphical User Interface (GUI) shell; something with text only is a Command Line Interface (CLI) shell

  • command line: a text-based environment capable of taking input and providing output (a "terminal" is the same idea)

  • Unix: a family of operating systems

  • bash: the most common programming language used at a Unix command line

Why learn the command line?

(modified from the carpentries)

  • Many bioinformatics tools can only be used through a command-line interface, and/or have extra capabilities in the command-line version that are not available in the GUI (e.g. BLAST).

  • It makes our work less error-prone. When humans do the same thing a hundred different times (or even ten times), we’re likely to make a mistake. Your computer can do the same thing a thousand times with no mistakes. This also frees us up to do other things we can’t automate, like the science part.

  • The shell makes our work more reproducible. When we carry out our work in a command-line interface (rather than a GUI), we can keep an exact record of all steps, which we can use to re-do our work when we need to. It also gives us a way to unambiguously communicate what we’ve done to others.

  • Many bioinformatic tasks require large amounts of computing power and can’t realistically be run on our own machines. These tasks are best performed using remote computers or cloud computing, which can only be accessed through a shell.

So let’s get to it!

  1. Getting started

  2. Working with files and directories

  3. Redirectors and wildcards

  4. Six glorious commands

  5. Variables and For loops


Accessing The Jetstream Cloud

Learning Objectives:

  • Launching Jetstream instances

  • SSH Secure Shell login to instances

  • Jetstream instance maintenance

What we’re going to do here is walk through starting up a running computer (an “instance”) on the Jetstream service.

Jetstream is an NSF/XSEDE resource designed to promote and provide configurable cyberinfrastructure in the form of cloud computing to both novice and experienced users.

If you would like to read more about cloud computing, see this Carpentry Cloud Computing lesson.

NOTE: WINDOWS users will need to install a UNIX-ready terminal to SSH-login (if you haven’t already).

Options:

  • Later versions of Windows 10 support an Ubuntu terminal. Read more here

  • Git-bash. Download here

Login & Launch Instance

  • Login to Jetstream by clicking the “login” button towards the right-upper corner of the screen.



  • Fill in your Jetstream username and password and click “LOGIN”


  • Select the “Projects” tab and then click the “CREATE NEW PROJECT” button


  • Give your Project folder a name; your first and last name would be best here (Description is optional). Then click “CREATE”.


  • Click on your newly created project, then click “NEW” and select “Instance” from the drop-down menu to start up a new virtual machine.


  • To select an image, click on the “Show All” tab, search for “ANGUS 2019”, and choose the “ANGUS 2019” image created by ‘titus’.


NOTE: Here “Image” refers to the template of a virtual machine containing an installed operating system, software, and configuration stored as a file on disk. Think of it like apps that come with your phone before you add new ones on your own.

You will be presented with options to choose and configure your virtual machine here:

  • Instance Name: e.g., “ANGUS 2019 base image”, or you can leave the default, which is the image name.

  • Base Image Version: choose the latest version (not what’s in the below image)

  • Project: select your project folder

  • Allocation Source: TG-MCB190088

  • Provider: “Jetstream Indiana University”

  • Instance size: We recommend “m1.medium” (CPU: 6, Mem: 16GB, Disk: 60GB) for this tutorial; though depending on your allocations, choose the most suitable one.

NOTE: Choose the latest “Base Image Version” (not what’s in this below image), it should be the default.


  • Launch instance and wait for the build to be deployed (~ 5-10 minutes).

Note: During the build process the instance will move through scheduling-->building-->spawning-->deploying-->networking-->N/A. Once the virtual machine is ready, the “Activity” column will show “N/A” and the “Status” column will turn green and show “Active”.


  • Navigate back to ‘Projects’ and click on your new instance’s name to see more information about the instance you just created, then copy its IP address.


Great! We have now built our very own remote virtual machine with all the software pre-installed. Next we will use SSH secure login to access the remote instance from our laptops.

SSH Secure-Login


  • macOS & LINUX users can open a Terminal window now.

  • Windows 10 users with ubuntu can open a terminal

  • If your Windows doesn’t come with an Ubuntu distribution, start a new session in Git Bash

    • Start a new session; fill in “remote host” with the IP address of your virtual machine; select “specify username” and enter your Jetstream username; click OK.

  • Establish a secure-login to the instance by typing the following:

$ ssh your_Jetstreamusername@ip_address

  • This should log you into Jetstream and you should see a screen like this; Enter ‘yes’ and then enter your Jetstream password.

Your cursor will not move or indicate you are typing as you enter your password. If you make a mistake, hit enter and you will be prompted again.


Success! We have established a connection to our instance. Proceed to the Tutorial section.

Instance Maintenance

To end your current session on an instance and close the SSH connection, type ‘exit’.

Jetstream Dashboard


Instance Actions


  • Report: is the instance exhibiting unexpected behavior? Report it here

  • Image: request an image (a type of template for a virtual machine) of a running instance

  • Suspend: suspending an instance frees up resources for other users and allows you to safely preserve the state of your instance without imaging. Your time allocation no longer counts against you while suspended

  • Shelve: shelving safely preserves the state of your instance for later use, and it frees up resources for other users. In fact, it is the best way to reduce resource usage when compared with other actions, like "suspend" and "stop". Your time allocation no longer counts against you while shelved

  • Stop: stop an instance

  • Reboot: reboot an instance

  • Redeploy: redeploying an instance allows you to fix instances that show up as 'active - deploy_error'

  • Delete: the instance will be shut down and all data will be permanently lost

Additional Features

You can access a shell terminal and a web desktop via your browser!


Environment management with Conda

Learning objectives

  • Why should I use a package and environment management system as part of my research workflow?

  • What is Conda? Why use Conda?

  • Manage Conda environments

  • Conda channels & packages

Why should I use a package and environment management system?


Installing software is hard. Installing scientific software (including all required dependencies of said software!) is often even more challenging.

Installing software system-wide creates complex dependencies between your research projects that shouldn’t really exist!

Rather than installing software system-wide, wouldn’t it be great if we could install software separately for each research project?

What is an environment?

An environment is a directory that contains a specific collection of packages/tools that you have installed. For example, you may have one environment with Python 2.7 and its dependencies, and another environment with Python 3.4 for legacy testing. If you change one environment, your other environments are not affected. You can easily activate or deactivate environments, which is how you switch between them.


What is Conda?

Conda is an open-source package and environment management system that runs on Windows, macOS, and Linux.

  • Conda can quickly install, run, and update packages and their dependencies.

  • Conda can create, save, load, and switch between project specific software environments on your local computer.


Conda as a package manager helps you find and install packages. If you need a package that requires a different version of Python, you do not need to switch to a different environment manager, because Conda is also an environment manager. With just a few commands, you can set up a totally separate environment to run that different version of Python, while continuing to run your usual version of Python in your normal environment.

How does Conda work


Benefits of Conda


Installing & Activating Conda

We have already installed conda on these instances, but please see the stellar conda documentation after the class if you’d like to install it on your personal computer or elsewhere.

Adding a new environment

If you need to re-access your terminal environment, see here.

Check your current version of Python by executing python --version.

To create a new environment named, for instance, mynewenv (you can name it whatever you like), run:

conda create --name mynewenv
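If you wanted an environment to come with a specific Python version pre-installed, you can request it in the create command. The environment name py34env below is just an example (not something we use later), and any Python version available in your channels would work:

conda create --name py34env python=3.4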

Activating and leaving (deactivating) an environment

Inside a new Conda installation, the root environment is activated by default, so you can use it without activation.

In other cases, if you want to use an environment (for instance manage packages, or run Python scripts inside it) you need to first activate it.

conda activate mynewenv

The command prompt changes upon the environment’s activation; it now contains the active environment’s name.

The directories of the active environment’s executable files are added to the system path (this means that you can now access them more easily). You can leave an environment with this command:

conda deactivate

It needs to be mentioned that upon deactivating an environment, the base environment becomes active automatically.

What are Conda channels?

Channels are the locations of the repositories (directories) online containing Conda packages. Upon Conda’s installation, Continuum’s (Conda’s developer) channels are set by default, so without any further modification, these are the locations where your Conda will start searching for packages.

Channels in Conda are ordered. The channel with the highest priority is the first one that Conda checks, looking for the package you asked for. You can change this order, and also add channels to it (and set their priority as well).


If multiple channels contain a package, and one channel contains a newer version than the other, the order of the channels determines which of the two versions is going to be installed, even if the higher-priority channel contains the older version.
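You can check which channels are currently configured, and in what order, at any time. This is just an inspection command and doesn’t change anything:

conda config --get channels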


Bioconda Channel

See the bioconda paper and the bioconda web site

Bioconda is a community-enabled repository of 3,000+ bioinformatics packages, installable via the conda package manager. Note: Bioconda is not available for Windows systems.

Add channels

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

Searching, installing and removing packages

To list out all the installed packages in the currently active environment, run:

conda list

Try running a program pre-installed on this instance:

fastqc

To search for all the available versions of a certain package, you can use the search command. For instance, to list all the available versions of samtools (a suite of tools for working with sequence alignment files), run:

conda search samtools

Similarly to the conda list command, this one results in a list of the matching package names, versions, and channels:

Loading channels: done
# Name                       Version           Build  Channel             
samtools                      0.1.12               0  bioconda            
samtools                      0.1.12               1  bioconda            
samtools                      0.1.12               2  bioconda            
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""  
samtools                         1.9      h91753b0_3  bioconda            
samtools                         1.9      h91753b0_4  bioconda            
samtools                         1.9      h91753b0_5  bioconda            
samtools                         1.9      h91753b0_8  bioconda  

To install a package (for example, samtools) that is inside a channel that is on your channel list, run this command (if you don’t specify which version you want, it’ll automatically install the latest available version from the highest-priority channel):

conda install samtools

You can also specify the package’s version:

conda install samtools=1.9
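As an aside, you can also install into an environment other than the currently active one by naming it with -n. The example below targets the mynewenv environment we created earlier; it’s optional and just illustrates the flag:

conda install -n mynewenv samtools=1.9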

Freezing an environment

This will save the list of conda-installed software you have in a particular environment to the file packages.txt:

conda list --export > packages.txt

(it will not record the software versions for software not installed by conda.)

conda install --file=packages.txt

will install those packages in your local environment.
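You could also use that same file to build a fresh environment somewhere else, for example on another machine. The environment name below is just an illustration:

conda create --name my-copied-env --file packages.txt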

Conda Commands

  • conda install: install a package

  • conda list: list all the installed packages in the currently active environment

  • conda search: search for a package

  • conda info: show information about the current Conda installation and environment

  • conda remove: remove a conda package

  • conda config --get channels: list the active channels and their priorities

  • conda update --all: update all the installed packages

  • conda config --remove channels unwanted_channel: remove unwanted_channel

  • conda env list: list the different environments you have set up

  • conda activate myNewEnvironment: activate the myNewEnvironment Conda environment (this also works for activating our base environment)

  • conda info --envs: list the locations of your Conda environments

A note on managing Conda environments

Conda environments are exceptionally useful! However, they can become quite large depending on how many packages we load into them. We can check how large any of our Conda environments are by finding the path to the environment directory and then estimating the file space usage of that directory.

First, let’s find where we put our mynewenv environment:

conda info --envs

This will print out a list of the locations of our Conda environments.

# conda environments:
#
base                  *  /opt/miniconda
mynewenv                 /opt/miniconda/envs/mynewenv

Next, let’s use the command du to estimate the space our mynewenv directory is taking up!

du -sh /opt/miniconda/envs/mynewenv/

We can see our mynewenv environment is taking up about 12K of space.

12K	/opt/miniconda/envs/mynewenv/

QUESTION: How much space is your base environment taking up?
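One way to answer that, assuming your environments live under /opt/miniconda as in the listing above (adjust the paths if conda info --envs shows something different):

# the base environment lives at /opt/miniconda (this total also includes everything under envs/)
du -sh /opt/miniconda/

# and each named environment individually
du -sh /opt/miniconda/envs/*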

More Reading on Conda

Short read quality and trimming

Learning objectives:

  • Install software (fastqc, multiqc, trimmomatic) via conda

  • download data

  • visualize read quality

  • quality filter and trim reads

If you don’t have one running currently, start up a Jetstream m1.medium or larger instance (as detailed here), and then ssh login through your terminal (as shown here).


You should now be logged into your Jetstream computer! You should see something like this:

(base) dibada@js-171-9:~$ 

Getting started

Change to your home directory:

cd ~/

Let’s make sure our conda channels are loaded, just in case.

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

and install FastQC, MultiQC, and trimmomatic:

conda install -y -c bioconda fastqc multiqc trimmomatic

Data source

Make a “data” directory:

cd ~/
mkdir data
cd data

and download some data from the Schurch et al, 2016 yeast RNAseq study. We will download three wildtype yeast datasets, and three mutant datasets (SNF2 knockout; SNF2 is a global transcription regulator).

curl -L https://osf.io/5daup/download -o ERR458493.fastq.gz
curl -L https://osf.io/8rvh5/download -o ERR458494.fastq.gz
curl -L https://osf.io/2wvn3/download -o ERR458495.fastq.gz
curl -L https://osf.io/xju4a/download -o ERR458500.fastq.gz
curl -L https://osf.io/nmqe6/download -o ERR458501.fastq.gz
curl -L https://osf.io/qfsze/download -o ERR458502.fastq.gz
curl -L https://osf.io/9c5jz/download -o err_md5sum.txt

We also downloaded a file holding some md5sum values for our sequence files (“err_md5sum.txt”). An md5sum hash is like a unique fingerprint for a file. If we compute the md5sum for the files we downloaded, and they don’t match what is specified in the “err_md5sum.txt” file that came with them, then we would know if any of the files were corrupted or cut short while we were downloading them.

If we look at what’s in that file:

head err_md5sum.txt

We can see a value for each of our sequence files:

2b8c708cce1fd88e7ddecd51e5ae2154  ERR458493.fastq.gz
36072a519edad4fdc0aeaa67e9afc73b  ERR458494.fastq.gz
7a06e938a99d527f95bafee77c498549  ERR458495.fastq.gz
107aad97e33ef1370cb03e2b4bab9a52  ERR458500.fastq.gz
fe39ff194822b023c488884dbf99a236  ERR458501.fastq.gz
db614de9ed03a035d3d82e5fe2c9c5dc  ERR458502.fastq.gz

Those are the md5sums they are supposed to have if they transferred properly. We can have the computer calculate md5sums for the files we have now and check that they match up like this:

md5sum -c err_md5sum.txt

And if all is well, it will print out “OK” for each of the files:

ERR458493.fastq.gz: OK
ERR458494.fastq.gz: OK
ERR458495.fastq.gz: OK
ERR458500.fastq.gz: OK
ERR458501.fastq.gz: OK
ERR458502.fastq.gz: OK

Often the sequencing facility will provide an md5sum file with your data, or you may see them with large datasets in public repositories.

If we wanted to create our own md5sum file to keep with our data, we could do so like this for one file:

md5sum ERR458493.fastq.gz > my_md5sum.txt

Or we could use the * wildcard to help specify all of our .fastq.gz files here (overwriting the previous file):

md5sum *.fastq.gz > my_md5sum.txt

Back to our data, if we type:

ls -l *.fastq.gz

We should see something like:

-rw-rw-r-- 1 dibada dibada  59532325 Jul  4 03:23 ERR458493.fastq.gz
-rw-rw-r-- 1 dibada dibada  58566854 Jul  4 03:24 ERR458494.fastq.gz
-rw-rw-r-- 1 dibada dibada  58114810 Jul  4 03:24 ERR458495.fastq.gz
-rw-rw-r-- 1 dibada dibada 102201086 Jul  4 03:24 ERR458500.fastq.gz
-rw-rw-r-- 1 dibada dibada 101222099 Jul  4 03:24 ERR458501.fastq.gz
-rw-rw-r-- 1 dibada dibada 100585843 Jul  4 03:24 ERR458502.fastq.gz

These are six data files from the yeast study.

One problem with these files is that they are writeable – which means we can accidentally edit or delete them. By default, UNIX makes files writeable by their owner, which leaves raw data open to typos or other unintended changes. Let’s fix that before we go on any further:

chmod a-w *

Here we are using the chmod command (for change mode, which handles permissions of files/directories), and providing the argument a-w, which means for all users (a), take away (-) the permission to write (w).
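Later, if you ever do need to modify or delete one of these files, the reverse operation gives write permission back; u+w means add (+) write permission (w) for the file’s owner/user (u). There’s no need to run this now, it’s shown for a single file just for reference:

chmod u+w ERR458493.fastq.gz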

If we take a look at their permissions now:

ls -l

We can see that the ‘w’ in the original permission string (-rw-rw-r--) has been removed from each file and now it should look like -r--r--r--.

We’ll talk about what these files are below.

1. Linking data to our working location

First, make a new working directory:

mkdir ~/quality
cd ~/quality

Now, we’re going to make a “link” to our raw sequencing data in our current working directory:

ln -fs ~/data/* .

and you will see that they are now linked in the current directory when you do an ls. These links save us from having to specify the full path (address) to their location on the computer, without us needing to actually move or copy the files. But note that changing these files here still changes the original files!
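If you’d like to check where a link points, ls -l will show the target after a little arrow (->). For example:

ls -l ERR458493.fastq.gz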

These are FASTQ files – let’s take a look at them:

gunzip -c ERR458493.fastq.gz | head

FASTQ files are sequencing files that contain reads and quality information about each base-call within each read. A score of 0 is the lowest quality score, and a score of 40 is the highest. Instead of typing the scores as numbers, they are each encoded as a character.

Quality encoding: !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHI
                  |         |         |         |         |
Quality score:    0........10........20........30........40  

A score of 10 indicates a 1 in 10 chance that the base call is inaccurate. A score of 20 is a 1 in 100 chance. A score of 30 is 1 in 1,000, and 40 is 1 in 10,000.
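Under the hood these characters are just ASCII codes offset by 33 (so-called Phred+33 encoding). If you’re curious, here is a little optional one-liner that decodes any single quality character; you don’t need it for the rest of the lesson:

# 'I' is ASCII 73, and 73 - 33 = 40, the top of the scale above
printf 'I' | od -An -tu1 | awk '{print $1 - 33}'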

Links:

2. FastQC

We’re going to use FastQC to summarize the data. We already installed ‘fastqc’ above, with the conda command.

Now, run FastQC on two files:

fastqc ERR458493.fastq.gz
fastqc ERR458500.fastq.gz

Now let’s use ls:

ls *fastqc.zip

to list the files, and you should see:

ERR458493_fastqc.zip
ERR458500_fastqc.zip

Inside each of those zip files are directories holding reports from the fastqc program. But these are also nicely summarized in HTML files:

ls *.html

Let’s transfer the HTML files to our local computers so we can open and view them in our browsers. We’ll use the command scp to do this.

scp stands for secure copy. We can use this command to transfer files between two computers. We need to run this command on our local computers (i.e. from your laptop).

The scp command looks like this:

scp <file I want to move> <where I want to move it>

First we will make a new directory on our local computer to store the HTML files we’re transferring. Let’s put it on our Desktop for now. Make sure the terminal program you used this morning to ssh onto your instance is open. If you’re in it now, open a new tab (use the pull-down menu at the top of your screen, or the Cmd + t / Ctrl + t keyboard shortcut). Type:

mkdir ~/Desktop/fastqc_html 

Now we can transfer our HTML files to our local computer using scp, which will look something like this:

scp username@ipaddress:~/quality/*.html ~/Desktop/fastqc_html/

The first part of the command is the address for your remote computer. Make sure you replace everything after username@ with your IP address (the one you used to log in with the ssh command).

The second part starts with a : and then gives the absolute path of the files you want to transfer from your remote computer. Don’t forget the :. We use a wildcard (*.html) to indicate that we want all of the HTML files.

The third part of the command gives the absolute path of the location you want to put the files. This is on your local computer and is the directory we just created ~/Desktop/fastqc_html.

You will be prompted for your password. Enter the password we set for your instance during the morning session.

You should see a status output like this:

ERR458493_fastqc.html                      100%  642KB 370.1KB/s   00:01    
ERR458500_fastqc.html                      100%  641KB 592.2KB/s   00:01  

Now we can go to our new directory and open the HTML files.

You can also download and look at these copies of them:

These files contain a lot of information! Depending on your application, some of these modules are more helpful than others. For example, for RNA-seq data, failing duplication levels aren’t necessarily a problem – we expect to have duplicated sequences given that some transcripts are very abundant.

Questions:

  • What should you pay attention to in the FastQC report?

  • Is one sample better than another?

Links:

There are some important things to know about FastQC - the main one is that it only calculates certain statistics (like duplicated sequences) for subsets of the data (e.g. duplicate sequences are only analyzed within the first 100,000 sequences in each file). You can read more about each module at their docs site.

3. Trimmomatic

Now we’re going to do some trimming! Let’s switch back to our instance terminal, as we will be running these commands on our remote computers. We’ll be using Trimmomatic, which (as with fastqc) we’ve already installed via conda.

The first thing we’ll need is a file holding the adapters we need to trim off. These are artificial and added as part of the sequencing process. The trimmomatic program comes with standard adapter files, so here we are going to copy them over to our current working directory.

cp /opt/miniconda/pkgs/trimmomatic-*/share/trimmomatic-*/adapters/TruSeq2-SE.fa .

(you can look at the contents of this file with cat TruSeq2-SE.fa)

Now, let’s run Trimmomatic on a couple samples:

trimmomatic SE ERR458493.fastq.gz \
        ERR458493.qc.fq.gz \
        ILLUMINACLIP:TruSeq2-SE.fa:2:0:15 \
        LEADING:2 TRAILING:2 \
        SLIDINGWINDOW:4:2 \
        MINLEN:25
        
trimmomatic SE ERR458500.fastq.gz \
        ERR458500.qc.fq.gz \
        ILLUMINACLIP:TruSeq2-SE.fa:2:0:15 \
        LEADING:2 TRAILING:2 \
        SLIDINGWINDOW:4:2 \
        MINLEN:25
        

CODE BREAKDOWN

Note that trimmomatic is actually a Java program that we call from the command line, so its syntax is a little different from what we’ve been using. The arguments are followed by a colon, and then what you are specifying for them.

  • SE - this is specified because we are working with single-ended data, rather than paired-end data where there are forward and reverse reads

  • ERR458493.fastq.gz - the first positional argument we are providing is the input fastq file (note it is okay to provide compressed .gz files to trimmomatic)

  • ERR458493.qc.fq.gz - the second positional argument specifies the name of the output files the program will generate

  • ILLUMINACLIP:TruSeq2-SE.fa:2:0:15 - here we are specifying the argument we’re addressing “ILLUMINACLIP”, first specifying the file holding the adapter sequences, then 3 numbers: “2” which states how many mismatches between the adapter sequence and what’s found are allowed; “0” which is only relevant when there are forward and reverse reads; and “15” which specifies how accurate the match must be

  • LEADING:2 TRAILING:2 - both of these are stating the minimum quality score at the start or end of the read, if lower, it will be trimmed off

  • SLIDINGWINDOW:4:2 - this looks at 4 basepairs at a time, and if the average quality of that window of 4 drops below 2, the read is trimmed there

  • MINLEN:25 - after all the above trimming steps, if the read is shorter than 25 bps, it will be discarded

You should see output that looks like this:

...
Input Reads: 1093957 Surviving: 1092715 (99.89%) Dropped: 1242 (0.11%)
TrimmomaticSE: Completed successfully
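If you’d like to see exactly how many reads were kept, one optional way is to count lines in the FASTQ files before and after trimming; since FASTQ stores each read as 4 lines, we divide the line counts by 4:

gunzip -c ERR458493.fastq.gz | wc -l | awk '{print $1/4, "reads before trimming"}'
gunzip -c ERR458493.qc.fq.gz | wc -l | awk '{print $1/4, "reads after trimming"}'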

We can also run the same process for all 6 samples more efficiently using a for loop, as follows. We’re going to use a new command called basename in the for loop, which we introduce below.

Using basename in for loops

basename is a function in UNIX that is helpful for removing a uniform part of a name from a list of files. In this case, we will use basename to remove the .fastq.gz extension from the files that we’ve been working with.

basename ERR458493.fastq.gz .fastq.gz

We see that this returns just the ERR accession, without the .fastq.gz file extension.

ERR458493

If we try the same thing but use .fasta as the file extension instead, nothing is removed. This is because basename only strips the suffix when it exactly matches the end of the file name.

basename ERR458493.fastq.gz .fasta
ERR458493.fastq.gz

Basename is really powerful when used in a for loop. It allows us to access just the file prefix, which we can use to name things. Let’s try this.

Inside our for loop, we create a new name variable. We call the basename function inside $( ), give it our for loop variable, in this case ${filename}, and finally state that .fastq.gz should be removed from the file name. It’s important to note that we’re not changing the actual files; we’re creating a new variable called name. The line echo ${name} will print the value of name to the terminal each time the for loop runs. Because we are iterating over six files, we expect to see six lines of output.

for filename in *.fastq.gz
do
  name=$(basename ${filename} .fastq.gz)
  echo ${name}
done

A note on variables: We can refer to a variable as either $filename or ${filename}. They mean the same thing, but using {} explicitly tells bash where the variable name ends. This makes it safer to refer to variables, especially when we’re combining them with other text. This is different from $(), which tells bash that we want to execute something.
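Here is a tiny example of why the braces matter; the file name is hypothetical and just for illustration:

sample=ERR458493
echo "$sample_trimmed.fq"    # bash looks for a variable called "sample_trimmed", which doesn't exist, so this prints only ".fq"
echo "${sample}_trimmed.fq"  # the braces make it clear the variable is "sample", so this prints "ERR458493_trimmed.fq"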

Trimming files using basename and for loops

Now we will use basename inside of our for loop to run trimmomatic on all 6 of our files.

for filename in *.fastq.gz
do
        # first, make the base by removing fastq.gz
        base=$(basename $filename .fastq.gz)
        echo $base

        trimmomatic SE ${base}.fastq.gz \
                ${base}.qc.fq.gz \
                ILLUMINACLIP:TruSeq2-SE.fa:2:0:15 \
                LEADING:2 TRAILING:2 \
                SLIDINGWINDOW:4:2 \
                MINLEN:25
done

The for loop above will go through each of the filenames that end with .fastq.gz and run Trimmomatic on it.

Questions:

  • How do you figure out what the parameters mean?

  • How do you figure out what parameters to use?

  • What adapters do you use?

  • What version of Trimmomatic are we using here? (And FastQC?)

  • Do you think parameters are different for RNAseq and genomic data sets?

  • What’s with these annoyingly long and complicated filenames?

For a discussion of optimal trimming strategies, see MacManes, 2014 – it’s about RNAseq but similar arguments should apply to metagenome assembly.

Links:

4. FastQC again

Let’s take a look at the output files:

gunzip -c ERR458493.qc.fq.gz | head                                                        

It’s hard to get a feel for how different these files are by just looking at them, so let’s run FastQC again on the trimmed files:

fastqc ERR458493.qc.fq.gz
fastqc ERR458500.qc.fq.gz

And now view my copies of these files:

Alternatively, if there’s time, you can use scp to view the files from your local computer.

5. MultiQC

MultiQC is a great tool that aggregates multiple fastqc results into a single report for easy comparison.

Run MultiQC on both the untrimmed and trimmed files:

multiqc .

And now you should see output that looks like this:

[INFO   ]         multiqc : This is MultiQC v1.0
[INFO   ]         multiqc : Template    : default
[INFO   ]         multiqc : Searching '.'
Searching 15 files..  [####################################]  100%
[INFO   ]          fastqc : Found 4 reports
[INFO   ]         multiqc : Compressing plot data
[INFO   ]         multiqc : Report      : multiqc_report.html
[INFO   ]         multiqc : Data        : multiqc_data
[INFO   ]         multiqc : MultiQC complete

You can view the output html file multiqc_report.html by going to RStudio, selecting the file, and saying “view in Web browser.”

Questions:

  • is the quality trimmed data “better” than before?

  • Does it matter that you still have adapters!?

RNA-seq read quantification with salmon

Learning objectives:

  • Install read quantification program Salmon

  • Learn quantification of RNA-seq data

Boot up a Jetstream

Boot an m1.medium Jetstream instance and log in.

You should have the quality trimmed data on your instance in the ~/quality/ folder. If not, you can get this data by running:

cd ~
mkdir -p quality
cd quality
curl -L https://osf.io/wfz34/download -o ERR458493.qc.fq.gz
curl -L https://osf.io/jxh4d/download -o ERR458494.qc.fq.gz
curl -L https://osf.io/zx7n3/download -o ERR458495.qc.fq.gz
curl -L https://osf.io/96mrj/download -o ERR458500.qc.fq.gz
curl -L https://osf.io/wc8yn/download -o ERR458501.qc.fq.gz
curl -L https://osf.io/sdtz3/download -o ERR458502.qc.fq.gz

Introduction to Salmon (adapted from salmon documentation)

Salmon is a tool for fast transcript quantification from RNA-seq data. It requires a set of target transcripts (either from a reference or de-novo assembly) to quantify and FASTA/FASTQ file(s) containing your reads.

Salmon runs in two phases, indexing and quantification. The indexing step is independent of the reads, and only needs to be run once for a particular set of reference transcripts. The quantification step is specific to the set of RNA-seq reads and is thus run more frequently.

Install software

Salmon is installed through conda.

Let’s make sure our conda channels are loaded.

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda install -y -c bioconda salmon

Download the yeast transcriptome:

curl -O ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/146/045/GCA_000146045.2_R64/GCA_000146045.2_R64_rna_from_genomic.fna.gz

Index the yeast transcriptome:

salmon index --index sc_index --type quasi --transcripts GCA_000146045.2_R64_rna_from_genomic.fna.gz

Run salmon on all the samples:

for i in *.qc.fq.gz
do
   salmon quant -i sc_index --libType A -r ${i} -o ${i}_quant --seqBias --gcBias --validateMappings
done

What do all of these flags do?

  • -i: path to index folder

  • --libType: the library type of the reads you are quantifying. A allows salmon to automatically detect the library type of the reads you are quantifying.

  • -r: input file (for single-end reads)

  • -o: output folder

  • --seqBias: learn and correct for sequence-specific biases in the input data

  • --gcBias: learn and correct for fragment-level GC biases in the input data

  • --validateMappings: enables selective alignment, which improves salmon's sensitivity

As Salmon is running, a lot of information is printed to the screen. For example, we can see the mapping rate as the sample finishes:

[2019-06-29 18:39:18.367] [jointLog] [info] Mapping rate = 86.2687%

When it finished, Salmon outputs a folder for each input RNA-seq sample. Let’s take a look at one of the output folders.

ls ERR458493.qc.fq.gz_quant

You should see output like this:

aux_info/               lib_format_counts.json
cmd_info.json           logs/
libParams/              quant.sf

The information we saw scroll through our screens is captured in a log file in aux_info/.

less ERR458493.qc.fq.gz_quant/aux_info/meta_info.json

We see information about our run parameters and performance. To exit out of the file, press q.

We can also look at the count file:

less -S ERR458493.qc.fq.gz_quant/quant.sf

We see our transcript names, as well as the number of reads that aligned to each transcript.
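The quant.sf file is plain tab-separated text; the standard Salmon columns are Name, Length, EffectiveLength, TPM, and NumReads. To see the header plus the first couple of records lined up neatly (an optional check):

head -3 ERR458493.qc.fq.gz_quant/quant.sf | column -t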

PRACTICE!

How can we use the grep command to find the percent of reads mapped in ALL our json files? Make sure to print out the name of each file before you print the percent of reads mapped!

Solution
cd ~/quality/
for infile in *_quant/aux_info/meta_info.json
do
    echo ${infile}
    grep "percent_mapped" ${infile}
done

First we use echo to print the name of the file. Then, we use grep to find and print the line containing the percent of reads mapped.

In our next lesson, we will be reading these quant files into R and performing differential expression with them. “Salmon provides accurate, fast, and bias-aware transcript expression estimates using dual-phase inference” Patro et al., 2016.

Also see seqanswers and biostars.

R and RStudio introduction


Using RStudio on Jetstream

We will be using R and RStudio on our cloud instances. We have pre-installed RStudio on the workshop image. Start up a Jetstream m1.medium or larger using the ANGUS 2019 image as per Jetstream startup instructions. Connect to your instance.

We will need our instance username to connect to RStudio. To determine what your username is, run:

echo My username is $USER

The link for the RStudio server interface is built with our instance’s IP address (by default in port 8787), which we can get from the terminal by running the following command:

echo http://$(hostname -i):8787/

hostname is a command in the terminal that will get various information about our computer. Using it with the -i option returns our computer’s IP address. Surrounding it with $( ) means the command line will execute that command first, put the output of that command in its place (our IP address), and then run the echo command, which prints out the proper link to our RStudio server 😊

Now go to the web address printed to your terminal in your Web browser, which should look something like this:


and log in with the username and password we will provide.

Installing on your own computer:

We will be using R and RStudio in the cloud. To use RStudio on your computer, you can download R and RStudio. If you do not have R and RStudio installed:

Introduction to R and RStudio

R is a programming language and free software environment for statistical computing and graphics. Many bioinformatics programs are implemented in R. RStudio is a graphical integrated development environment (IDE) that displays and organizes more information about R. R is the underlying programming language. Once you start RStudio, you see this layout (it will be similar either on your instance or on your computer):


Scripts are like a digital lab book. They record everything you run, as well as notes to yourself about how and why you ran it. To start a new script you can click on the icon that looks like a page with a ‘+’ and select R Script (or go to File, New File, R Script). This generates a .R file. .R files are plain text files that use the file extension .R so that humans can remember that the file contains R code.


Although scripts are a record of our code, they also include comments about our code. These comments are very important as they remind your future self (and whoever you pass your script) why you did what you did. Think of comments like the notes you take in your lab book when you do experiments.


When you execute a line of your script, it is sent to the console. While the script is a permanent record of everything you ran, the console is the engine that runs those things. You can run code from a script or the console, but it is only recorded in scripts.

There are two additional panes in RStudio:

  1. Environment/History/Connections pane

    • Environment Pane: This pane records the objects that are currently loaded in our RStudio environment. This includes any data object that we created and the basic information about them.

    • History Pane: This has output similar to the history command that we used in the shell. It records all of the commands that we have executed in the R console.

  2. Files/Plots/Packages/Help/Viewer pane provides a space for us to interact with plots and files we create and with built-in help files in R.

    • Packages Pane: A place to organize the packages within RStudio. Allows easy installation or loading (library()) of the packages we have.

And now we’re ready to start the fun part!

Basic concepts

Objects

An R object is something that you create and assign a value or function to. You do this using the assignment operator which is <- (NOTE: shortcut Alt + -). For example you can assign a numeric value to an object by typing in your script (remember to comment what you are doing):

a <- 34  # assign the number 34 to the object a

You can also assign a character value to an object:

b <- 'mouse' # assign the word 'mouse' to the object b

To run a line of code, put your cursor on the line that you want to run and press Ctrl+Enter (or Cmd+Enter on a Mac). Or select the line and click the Run button in the top right corner of your script.


Now check your Console and Environment tabs, you will see that the commands have run and that you have created new objects!

You can do all sorts of stuff with your objects: operations, combinations, use them in functions. Below we show how to perform multiplication using an object. Because a stores the value 34, we can multiply it by 2:

a2 <- a*2 # multiply a by 2

Functions and Arguments

Functions are “canned scripts” that automate more complicated sets of commands, including operations, assignments, etc. Many functions are predefined, or can be made available by importing R packages (more on that later). A function usually takes one or more inputs called arguments. Functions often (but not always) return a value. A typical example would be the function sqrt(). The input (the argument) must be a number, and the return value (in fact, the output) is the square root of that number. Executing a function (‘running it’) is called calling the function. An example of a function call is:

sqrt(a) ## calculate the square root of a

Here, the value of a is given to the sqrt() function, the sqrt() function calculates the square root, and returns the value, which is printed to the console (we could also assign it to a new object). This function is very simple, because it takes just one argument.

The return ‘value’ of a function need not be numerical (like that of sqrt()), and it also does not need to be a single item: it can be a set of things, or even a dataset. We’ll see that when we read data files into R.

Similar to the shell, arguments can be anything, not only numbers or filenames, but also other objects. Exactly what each argument means differs per function, and must be looked up in the documentation (see below). Some functions take arguments which may either be specified by the user, or, if left out, take on a default value: these are called options. Options are typically used to alter the way the function operates, such as whether it ignores ‘bad values’, or what symbol to use in a plot. However, if you want something specific, you can specify a value of your choice which will be used instead of the default.

Let’s try a function that can take multiple arguments: round().

round(3.141592) ## round number

Arguments are the ‘conditions’ of a function. For example, the function round() has an argument to determine the number of decimals, and its default in R is 0. Here, we’ve called round() with just one argument, 3.141592, and it has returned the value 3. That’s because the default is to round to the nearest whole number. If we want more digits, we can see how to do that by getting information about the round function. We can use args(round) to find what arguments it takes, or look at the help for this function using ?round.

args(round) ## show the arguments of the function 'round'
?round ## shows the information about 'round'

We can change this to two decimals by changing the digits argument:

round(3.141592, digits = 2)

Note that we do not need to specify the name of the argument as R detects this by the position within the function:

round(3.141592, 2)

Though, it is preferable to show the argument names. Plus, we have the added benefit of being able to change the order of the arguments when they are named:

round(digits = 2, x= 3.141592)

It’s best practice to put the non-optional arguments (the data that you want the function to work on) first, and then specify the names of all optional arguments. This helps others interpret your code more easily.

Vectors

A vector is the most common and basic data type in R, and is pretty much the workhorse of R. A vector is composed of a series of values, which can be either numbers or characters. We can assign a series of values to a vector using the c() function.

yeast_strain <- c('WT','snf2') 
concentration <- c(43.903, 47.871)

There are many functions to inspect a vector and get information about its contents:

length(yeast_strain) # number of elements in a vector
class(concentration) # type of the object concentration
str(yeast_strain) # inspect the structure of yeast_strain

Data types

An atomic vector is the simplest R data type, and it is a vector of a single type. The 6 main atomic vector types that R uses are:

  • 'character' letters or words

  • 'numeric' (or double) for numbers

  • 'logical' for TRUE and FALSE (booleans)

  • 'integer' for integer numbers (e.g., 2L, the L indicates to R that it’s an integer)

  • 'complex' to represent complex numbers with real and imaginary parts, for example 2 + 4i

  • 'raw' for bitstreams that we will not use or discuss further

  • factor (not an atomic type, but built on integers) represents categorical data (more below!)

You can check the type of vector you are using with the typeof() function.

typeof(concentration)

Challenge: Atomic vectors can be of type character, numeric (or double), integer, or logical. But what happens if we try to mix these types in a single vector?

Solution
yeast_strain_concentration <- c(yeast_strain, concentration) # combine two vectors
yeast_strain_concentration
## [1] "WT"     "snf2"   "43.903" "47.871"
# Solution: all elements are "coerced" to the same type, here character.
As we saw in our challenge problem, objects of different types get converted into a single type. This is called coercion and it follows a hierarchy:

logical → numeric → character ← logical
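
For example, here is a quick illustration of coercion in the console (typeof() reports the resulting type):

typeof(c(TRUE, 1L))       # "integer": the logical value is coerced to integer
typeof(c(TRUE, 1L, 2.5))  # "double": logicals and integers are coerced to numeric
typeof(c(1, 2, 'three'))  # "character": everything is coerced to character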

R uses many data structures, the main ones are:

  • vector contain elements of the same type in one dimension

  • list can contain elements of different types and multiple data structures

  • data.frame contains rows and columns. Columns can contain different types from other columns.

  • matrix with rows and columns that are of the same type

  • array stores data in more than 2 dimensions

We will use vectors, matrices, data.frames, and factors today.

Subsetting

We can extract values from a vector based on position or conditions: this is called subsetting. One way to subset in R is to use []. Indices start at 1.

strains <-  c('WT', 'snf2', 'snf1', 'snf1_snf2') 
strains[2] # subsets the second element
strains[c(2,3)] # subsets the second and third elements
more_strains <- strains[c(1, 2, 3, 1, 4, 3)] # create an object with more elements than the original one
more_strains

Conditional subsetting

We can also subset based on conditions, using a logical vector with TRUE and FALSE or with operators:

strains[c(TRUE, FALSE, FALSE, TRUE)] 
concentration >= 43
concentration > 40
concentration < 47
concentration <= 47
concentration[ concentration < 45 | concentration > 47] # | is OR
concentration[ concentration < 45 & concentration == 47] # & is AND, == is 'exactly equal to'; no value satisfies both conditions, so this returns an empty vector
more_strains[more_strains %in% strains]
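
The %in% operator in the last line checks, element by element, whether each value on the left is present anywhere in the vector on the right. A small illustrative example ('snf5' is a made-up strain name that is not in strains):

c('WT', 'snf5') %in% strains # returns TRUE FALSE: 'WT' is in strains, 'snf5' is not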

Missing data

As R was designed to analyze datasets, it includes the concept of missing data (which is uncommon in other programming languages). Missing data are represented in vectors as NA.

There are in-built functions and arguments to deal with missing data. For example, this vector contains missing data:

concentration <- c(43.903, 47.871, NA, 48.456, 53.435)

We can use the following functions to remove or ignore it:

mean(concentration, na.rm = TRUE) # mean of concentration removing your NA
concentration[!is.na(concentration)] # extract all elements that are NOT NA
na.omit(concentration) # returns a 'numeric' atomic vector with incomplete cases removed
concentration[complete.cases(concentration)] # returns a 'numeric' atomic vector with elements that are complete cases

Note: if your data includes missing values, you can code them as blanks ("") or as "NA" before reading them into R.

Factors

Factors represent categorical data. They are stored as integers associated with labels and they can be ordered or unordered. While factors look (and often behave) like character vectors, they are actually treated as integer vectors by R. So you need to be very careful when treating them as strings.

strains <-  factor(c('WT', 'snf2', 'snf1', 'snf1_snf2')) # We make our strains object a factor
class(strains)
nlevels(strains)
strains # R orders alphabetically
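
Because factors are stored as integers, converting a factor that looks numeric can give surprising results. A short illustrative example (doses is a made-up object, not part of our dataset):

doses <- factor(c('10', '1', '100'))
as.numeric(doses)               # 2 1 3 -- the underlying level codes, not the numbers!
as.numeric(as.character(doses)) # 10 1 100 -- convert to character first to get the numbers back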

Once created, factors can only contain a pre-defined set of values, known as levels. By default, R always sorts levels in alphabetical order. We can reorder them using the levels argument.

strains <- factor(strains, levels = c('WT', 'snf1', 'snf2', 'snf1_snf2')) # we reorder them as we want them
strains 

We convert objects using as.XXX functions:

strains <- as.character(strains)
class(strains)
strains <- as.factor(strains)
class(strains)

And we can also rename factors based on position:

levels(strains)[4] <- 'wild_type'

Starting with tabular data

This tutorial is a continuation of our Salmon lesson – we are using sequencing data from European Nucleotide Archive dataset PRJEB5348. This is an RNA-seq experiment comparing two yeast strains, SNF2 and WT. Although we’re only working with 6 samples, this was a much larger study. They sequenced 96 samples in 7 different lanes, with 45 WT and 45 mutant strains. Thus, there are 45 biological replicates and 7 technical replicates present, with a total of 672 samples! The datasets were compared to test the underlying statistical distribution of read counts in RNA-seq data, and to test which differential expression programs work best.

We will get the information about the RNA quality and quantity that the authors measured for all of their samples, and we will explore how the authors made their data available and ‘linkable’ between technical and biological replicates.

We are going to explore those datasets using RStudio, combining both base R and an umbrella package called tidyverse.

library('tidyverse')

First we are going to read the files from a url and assign that to experiment_info, using read_tsv():

experiment_info <- read_tsv(file = 'https://osf.io/pzs7g/download/')

Let’s investigate your new data using some exploratory functions:

class(experiment_info) # type of object that experiment_info is
dim(experiment_info) # dimensions
summary(experiment_info) # summary stats
head(experiment_info) # shows the first 6 rows and all columns
tail(experiment_info) # last 6 rows and all columns
str(experiment_info) # structure

What do you notice? The last 4 columns are empty! We can first check that they really are empty, and then get rid of them by subsetting:

sum(is.na(experiment_info$X10)) # checks how many NAs the column X10 has so that you know all your rows are NAs (or not!)
experiment_info <- experiment_info[ , 1:9] # subset all rows and columns 1 to 9

Challenge: How many columns are in the experiment_info data frame now?

Solution
dim(experiment_info) # dimensions
The 9th column has no name, but we can change this using `rename()`:
experiment_info <- rename(experiment_info, units = X9)

Manipulating data with dplyr

We can now explore this table and reshape it using dplyr. We will cover in this lesson:

  • %>%: pipes, they allow ‘concatenating’ functions

  • select(): subsets columns

  • filter(): subsets rows on conditions

  • mutate(): create new columns using information from other columns

  • group_by() and summarize(): create summary statistics on grouped data

  • arrange(): sort results

  • count(): counts discrete values

Selecting columns and filtering rows

We now have a table with many columns – but what if we don’t need all of them? We can select columns using select(), which requires the names of the columns that you want to keep as arguments:

# Let's confirm the arguments by looking at the help page 
?select # Notice how the help page is different from the shell manual pages?  

# Now lets select specific columns
select(experiment_info, Sample, Yeast Strain, Nucleic Acid Conc., Sample, 260/280, Total RNA)

What happened? R cannot parse the spaces in the column names properly. We can add back-ticks (`) around the names of the columns; these tell R that the words between them belong together:

select(experiment_info, Sample, `Yeast Strain`, `Nucleic Acid Conc.`, 260/280, `Total RNA`)

Also, R gets confused when column names start with a number, so we need to wrap those column names in back-ticks too:

select(experiment_info, Sample, `Yeast Strain`, `Nucleic Acid Conc.`, `260/280`, `Total RNA`)

So let’s store that in an object named experiment_info:

experiment_info <- select(experiment_info, Sample, `Yeast Strain`, `Nucleic Acid Conc.`, A260, A280, `260/280`, `Total RNA`)

NOTE: When choosing names for columns, make sure that you are consistent and choose something simple: all lowercase, all uppercase, or CamelCase. Having spaces in column names makes them hard for R to deal with, and can lead to unexpected results or silent errors.

We can select all columns except certain ones by putting - before the names of the columns we do not want to include, for example all columns except A260 and A280:

# Remove two columns: A260 and A280
select(experiment_info, -A260, -A280)

The filter function works on rows, so we can also filter the data based on conditions in a column:

filter(experiment_info, `Nucleic Acid Conc.` > 1500)

Exercise 1

Select the columns Sample, Yeast Strain, A260 and A280 and assign that to a new object called experiment_data.

Solution #1
experiment_data <-  select(experiment_info, Sample, `Yeast Strain`, A260, A280)
head(experiment_data)
## # A tibble: 6 x 4
##   Sample `Yeast Strain`  A260  A280
##    <dbl> <chr>          <dbl> <dbl>
## 1      1 snf2            43.9  19.9
## 2      2 snf2            47.9  22.1
## 3      3 snf2            34.5  15.6
## 4      4 snf2            34.0  15.4
## 5      5 snf2            28.1  12.7
## 6      6 snf2            48.2  22.1

Exercise 2

Filter rows for only WT strains that had more than 1500 ng/uL concentration and make a new tibble called wild_type.

Solution #2
wild_type <- filter(experiment_info, `Yeast Strain` %in% 'WT' & `Nucleic Acid Conc.` > 1500)
head(wild_type)
## # A tibble: 6 x 7
##   Sample `Yeast Strain` `Nucleic Acid Co…  A260  A280 `260/280` `Total RNA`
##    <dbl> <chr>                      <dbl> <dbl> <dbl>     <dbl>       <dbl>
## 1     49 WT                         2052.  51.3  23.6      2.17        61.6
## 2     50 WT                         1884.  47.1  21.6      2.18        56.5
## 3     51 WT                         2692.  67.3  30.6      2.2         80.8
## 4     52 WT                         3022   75.5  34.3      2.2         90.7
## 5     53 WT                         3235   80.9  36.9      2.19        97.0
## 6     54 WT                         3735.  93.4  43.2      2.16       112.

Pipes

What if we want to select and filter at the same time? We use a “pipe” in R, which is represented by %>%. Note: this is the R version of the shell pipe |!

Pipes are available via the magrittr package, which is installed automatically with dplyr. In RStudio, you can type a pipe with the shortcut Ctrl + Shift + M on a PC or Cmd + Shift + M on a Mac.

We can combine the two subsetting activities from Exercise 1 and 2 using pipes %>% :

experiment_info_wt <- experiment_info %>% 
  filter(`Yeast Strain` == 'WT' & `Nucleic Acid Conc.` > 1500) %>% 
  select(Sample, `Yeast Strain`, A260, A280)

Exercise 3

Using pipes, subset experiment_info to include all the samples that had a concentration less than 1500 and where total RNA was greater than or equal to 40 ug, and retain only the columns that tell you the sample, yeast strain, nucleic acid concentration and total RNA. Assign that to a new table called samples_sequenced.

Solution #3
samples_sequenced <- experiment_info %>% 
  filter(`Nucleic Acid Conc.` < 1500 & `Total RNA` >= 40) %>% 
  select(Sample, `Yeast Strain`,`Nucleic Acid Conc.`,`Total RNA`)  

samples_sequenced
## # A tibble: 5 x 4
##   Sample `Yeast Strain` `Nucleic Acid Conc.` `Total RNA`
##    <dbl> <chr>                         <dbl>       <dbl>
## 1      3 snf2                          1379.        41.4
## 2      4 snf2                          1359.        40.8
## 3     22 snf2                          1428.        42.8
## 4     24 snf2                          1386.        41.6
## 5     90 WT                            1422.        42.7

Mutate

What if we want to create a new column? We use the function mutate() for this. Let’s convert our concentration units into micrograms/microliter:

experiment_info %>% 
  mutate(concentration_ug_uL = `Nucleic Acid Conc.` / 1000)

Or create more than one column! And pipe that into head() so that we can take a peek:

experiment_info %>% 
  mutate(concentration_ug_uL = `Nucleic Acid Conc.` / 1000,
          half_concentration = concentration_ug_uL / 2) %>% 
  head()

Exercise 4

Create a new table called library_start that includes the columns sample and yeast strain, plus two new columns: one called RNA_100 with the number of microliters needed to have 100 ng of RNA, and another called water with how many microliters of water we need to add to reach 50 uL.

Solution #4
library_start <- experiment_info %>% 
  mutate(RNA_100 = 100/ `Nucleic Acid Conc.`,
          water = 50 - RNA_100) %>% 
  select(Sample, `Yeast Strain`, RNA_100, water)  

head(library_start)
## # A tibble: 6 x 4
##   Sample `Yeast Strain` RNA_100 water
##    <dbl> <chr>            <dbl> <dbl>
## 1      1 snf2            0.0569  49.9
## 2      2 snf2            0.0522  49.9
## 3      3 snf2            0.0725  49.9
## 4      4 snf2            0.0736  49.9
## 5      5 snf2            0.0891  49.9
## 6      6 snf2            0.0519  49.9

Exercise 5

Pretty difficult to pipette! Can you redo the table assuming that you have done a 1:10 dilution of your samples? Include the A260 and A280 columns in addition to sample, yeast strain, the diluted RNA volume and water.

Solution #5
library_start_diluted <-  experiment_info %>% 
  mutate(diluted_RNA = 1000/ `Nucleic Acid Conc.`,
          water = 50 - diluted_RNA) %>% 
  select(Sample, `Yeast Strain`, `A260`, `A280`, diluted_RNA, water)  

head(library_start_diluted)
## # A tibble: 6 x 6
##   Sample `Yeast Strain`  A260  A280 diluted_RNA water
##    <dbl> <chr>          <dbl> <dbl>       <dbl> <dbl>
## 1      1 snf2            43.9  19.9     0.00569  50.0
## 2      2 snf2            47.9  22.1     0.00522  50.0
## 3      3 snf2            34.5  15.6     0.00725  50.0
## 4      4 snf2            34.0  15.4     0.00736  50.0
## 5      5 snf2            28.1  12.7     0.00891  50.0
## 6      6 snf2            48.2  22.1     0.00519  50.0

Exercise 6

Based on the tibble library_start_diluted, create a new tibble called seq_samples that includes only samples with an A260/A280 ratio of 2.2 or more, and that only contains the columns for sample, yeast strain, A260/A280 ratio, diluted RNA and water.

Solution #6
seq_samples <- library_start_diluted %>%
  mutate(A260_A280 = `A260`/`A280`) %>% 
  filter(A260_A280 >= 2.2) %>% 
  select(Sample, `Yeast Strain`, A260_A280, diluted_RNA, water)

head(seq_samples)
## # A tibble: 6 x 5
##   Sample `Yeast Strain` A260_A280 diluted_RNA water
##    <dbl> <chr>              <dbl>       <dbl> <dbl>
## 1      1 snf2                2.21     0.00569  50.0
## 2      3 snf2                2.21     0.00725  50.0
## 3      4 snf2                2.21     0.00736  50.0
## 4      5 snf2                2.22     0.00891  50.0
## 5      7 snf2                2.21     0.00991  50.0
## 6     11 snf2                2.20     0.00963  50.0

Split-apply-combine

The split-apply-combine approach allows you to split the data into groups, apply a function/analysis, and then combine the results. We can do this using group_by() and summarize().

group_by() takes as argument the column names that contain categorical variables that we want to calculate summary statistics for. For example, to determine the average RNA concentration per strain:

experiment_info %>% 
  group_by(`Yeast Strain`) %>% 
  summarize(mean_concentration = mean(`Nucleic Acid Conc.`))

or summarize using more than one column:

experiment_info %>% 
  group_by(`Yeast Strain`) %>% 
  summarize(mean_concentration = mean(`Nucleic Acid Conc.`),
            mean_total_RNA = mean(`Total RNA`))

NOTE: Our table does not have missing values. However, we have built-in functions in R that can deal with this. We can filter values that are not NA:

experiment_info %>% 
  group_by(`Yeast Strain`) %>% 
  summarize(mean_concentration = mean(`Nucleic Acid Conc.`, na.rm = TRUE), # remove NA values 
            mean_total_RNA = mean(`Total RNA`))

experiment_info %>%  
  filter(!is.na(`Nucleic Acid Conc.`)) %>%   # filter to keep only the values that are not NAs
  group_by(`Yeast Strain`) %>%  
  summarize(mean_concentration = mean(`Nucleic Acid Conc.`),
            mean_total_RNA = mean(`Total RNA`))

We can now arrange the data in ascending order (default of arrange()) or descending using desc():

experiment_info %>% 
  filter(!is.na(`Nucleic Acid Conc.`)) %>%  # filter to keep only the values that are not NAs
  group_by(`Yeast Strain`) %>% 
  summarize(mean_concentration = mean(`Nucleic Acid Conc.`),
            mean_total_RNA = mean(`Total RNA`)) %>% 
  arrange(mean_concentration) # arrange new table in ascending mean concentrations

experiment_info %>% 
  filter(!is.na(`Nucleic Acid Conc.`)) %>%  # filter to keep only the values that are not NAs
  group_by(`Yeast Strain`) %>% 
  summarize(mean_concentration = mean(`Nucleic Acid Conc.`),
            mean_total_RNA = mean(`Total RNA`)) %>% 
  arrange(desc(mean_concentration)) # arrange new table in descending mean concentrations

Another useful function is count() to count categorical values:

experiment_info %>% 
  count(`Yeast Strain`)

Combining Datasets using join

On many occasions we will have more than one table that we need to extract information from. We can easily do this in R using the family of join functions.

We are first going to download extra information about the samples that we are investigating and assign it to objects called ena_info and sample_mapping:

ena_info <- read_tsv(file = 'https://osf.io/6s4cv/download')
sample_mapping <- read_tsv(file = 'https://osf.io/uva3r/download')

Exercise 7

Explore your new dataset ena_info and sample_mapping. What commands can you use?

Solution #7
head(ena_info)
## # A tibble: 6 x 14
##   study_accession run_accession tax_id scientific_name instrument_model
##   <chr>           <chr>          <dbl> <chr>           <chr>           
## 1 PRJEB5348       ERR458493       4932 Saccharomyces … Illumina HiSeq …
## 2 PRJEB5348       ERR458494       4932 Saccharomyces … Illumina HiSeq …
## 3 PRJEB5348       ERR458495       4932 Saccharomyces … Illumina HiSeq …
## 4 PRJEB5348       ERR458496       4932 Saccharomyces … Illumina HiSeq …
## 5 PRJEB5348       ERR458497       4932 Saccharomyces … Illumina HiSeq …
## 6 PRJEB5348       ERR458498       4932 Saccharomyces … Illumina HiSeq …
## # … with 9 more variables: library_layout <chr>, read_count <dbl>,
## #   experiment_alias <chr>, fastq_bytes <dbl>, fastq_md5 <chr>,
## #   fastq_ftp <chr>, submitted_ftp <chr>, sample_alias <chr>,
## #   sample_title <chr>
tail(ena_info)
## # A tibble: 6 x 14
##   study_accession run_accession tax_id scientific_name instrument_model
##   <chr>           <chr>          <dbl> <chr>           <chr>           
## 1 PRJEB5348       ERR459201       4932 Saccharomyces … Illumina HiSeq …
## 2 PRJEB5348       ERR459202       4932 Saccharomyces … Illumina HiSeq …
## 3 PRJEB5348       ERR459203       4932 Saccharomyces … Illumina HiSeq …
## 4 PRJEB5348       ERR459204       4932 Saccharomyces … Illumina HiSeq …
## 5 PRJEB5348       ERR459205       4932 Saccharomyces … Illumina HiSeq …
## 6 PRJEB5348       ERR459206       4932 Saccharomyces … Illumina HiSeq …
## # … with 9 more variables: library_layout <chr>, read_count <dbl>,
## #   experiment_alias <chr>, fastq_bytes <dbl>, fastq_md5 <chr>,
## #   fastq_ftp <chr>, submitted_ftp <chr>, sample_alias <chr>,
## #   sample_title <chr>
str(ena_info)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 672 obs. of  14 variables:
##  $ study_accession : chr  "PRJEB5348" "PRJEB5348" "PRJEB5348" "PRJEB5348" ...
##  $ run_accession   : chr  "ERR458493" "ERR458494" "ERR458495" "ERR458496" ...
##  $ tax_id          : num  4932 4932 4932 4932 4932 ...
##  $ scientific_name : chr  "Saccharomyces cerevisiae" "Saccharomyces cerevisiae" "Saccharomyces cerevisiae" "Saccharomyces cerevisiae" ...
##  $ instrument_model: chr  "Illumina HiSeq 2000" "Illumina HiSeq 2000" "Illumina HiSeq 2000" "Illumina HiSeq 2000" ...
##  $ library_layout  : chr  "SINGLE" "SINGLE" "SINGLE" "SINGLE" ...
##  $ read_count      : num  1093957 1078049 1066501 985619 846040 ...
##  $ experiment_alias: chr  "ena-EXPERIMENT-DUNDEE-14-03-2014-09:53:55:631-1" "ena-EXPERIMENT-DUNDEE-14-03-2014-09:53:55:631-2" "ena-EXPERIMENT-DUNDEE-14-03-2014-09:53:55:631-3" "ena-EXPERIMENT-DUNDEE-14-03-2014-09:53:55:631-4" ...
##  $ fastq_bytes     : num  59532325 58566854 58114810 53510113 46181127 ...
##  $ fastq_md5       : chr  "2b8c708cce1fd88e7ddecd51e5ae2154" "36072a519edad4fdc0aeaa67e9afc73b" "7a06e938a99d527f95bafee77c498549" "fa059eab6996118828878c30bd9a2b9d" ...
##  $ fastq_ftp       : chr  "ftp.sra.ebi.ac.uk/vol1/fastq/ERR458/ERR458493/ERR458493.fastq.gz" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR458/ERR458494/ERR458494.fastq.gz" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR458/ERR458495/ERR458495.fastq.gz" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR458/ERR458496/ERR458496.fastq.gz" ...
##  $ submitted_ftp   : chr  "ftp.sra.ebi.ac.uk/vol1/run/ERR458/ERR458493/120903_0219_D0PT7ACXX_1_SA-PE-039.sanfastq.gz" "ftp.sra.ebi.ac.uk/vol1/run/ERR458/ERR458494/120903_0219_D0PT7ACXX_2_SA-PE-039.sanfastq.gz" "ftp.sra.ebi.ac.uk/vol1/run/ERR458/ERR458495/120903_0219_D0PT7ACXX_3_SA-PE-039.sanfastq.gz" "ftp.sra.ebi.ac.uk/vol1/run/ERR458/ERR458496/120903_0219_D0PT7ACXX_4_SA-PE-039.sanfastq.gz" ...
##  $ sample_alias    : chr  "wt_sample1" "wt_sample2" "wt_sample3" "wt_sample3" ...
##  $ sample_title    : chr  "great RNA experiment data" "great RNA experiment data" "great RNA experiment data" "great RNA experiment data" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   study_accession = col_character(),
##   ..   run_accession = col_character(),
##   ..   tax_id = col_double(),
##   ..   scientific_name = col_character(),
##   ..   instrument_model = col_character(),
##   ..   library_layout = col_character(),
##   ..   read_count = col_double(),
##   ..   experiment_alias = col_character(),
##   ..   fastq_bytes = col_double(),
##   ..   fastq_md5 = col_character(),
##   ..   fastq_ftp = col_character(),
##   ..   submitted_ftp = col_character(),
##   ..   sample_alias = col_character(),
##   ..   sample_title = col_character()
##   .. )
class(ena_info)
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"
dim(ena_info)
## [1] 672  14
summary(ena_info)
##  study_accession    run_accession          tax_id     scientific_name   
##  Length:672         Length:672         Min.   :4932   Length:672        
##  Class :character   Class :character   1st Qu.:4932   Class :character  
##  Mode  :character   Mode  :character   Median :4932   Mode  :character  
##                                        Mean   :4932                     
##                                        3rd Qu.:4932                     
##                                        Max.   :4932                     
##  instrument_model   library_layout       read_count     
##  Length:672         Length:672         Min.   : 821888  
##  Class :character   Class :character   1st Qu.:1252120  
##  Mode  :character   Mode  :character   Median :1457518  
##                                        Mean   :1499348  
##                                        3rd Qu.:1656280  
##                                        Max.   :2804885  
##  experiment_alias    fastq_bytes         fastq_md5        
##  Length:672         Min.   : 44752808   Length:672        
##  Class :character   1st Qu.: 68068787   Class :character  
##  Mode  :character   Median : 78775952   Mode  :character  
##                     Mean   : 81343477                     
##                     3rd Qu.: 89792915                     
##                     Max.   :151800385                     
##   fastq_ftp         submitted_ftp      sample_alias      
##  Length:672         Length:672         Length:672        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##  sample_title      
##  Length:672        
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
# and the same with sample_mapping!
The `join` functions allow you to merge tables based on shared columns, so that you can combine information across tables (see the dplyr cheat sheet for an overview of the different join types).

We have run accession numbers in both our tibbles sample_mapping and ena_info. We want to merge both datasets to link the information present in both tables in one big table that contains everything. We need a column that has the same name in both tables.
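
To see how the different joins behave, here is a tiny toy example (the tibbles x and y are made up for illustration and are not part of our dataset):

x <- tibble(run_accession = c('a', 'b', 'c'), reads = c(10, 20, 30))
y <- tibble(run_accession = c('b', 'c', 'd'), strain = c('WT', 'snf2', 'WT'))
inner_join(x, y, by = 'run_accession') # keeps only the rows present in both tables: b and c
left_join(x, y, by = 'run_accession')  # keeps all rows of x: a, b, c (strain is NA for a)
full_join(x, y, by = 'run_accession')  # keeps all rows of both tables: a, b, c, d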

Challenge: Which function would you use if the column names were not the same?

Solution
sample_mapping <- rename(sample_mapping, run_accession = RunAccession) # rename the column RunAccession to run_accession so that it matches the column run_accession in ena_info
yeast_metadata_right <- right_join(ena_info, sample_mapping, by = "run_accession") # join the two tables, keeping all rows of sample_mapping and adding the matching rows from ena_info

That is a big table!

Exercise 8

How would you merge both tables so that only the rows that are common between both tables are preserved?

Solution #8
yeast_metadata_inner <- inner_join(ena_info, sample_mapping, by = "run_accession")
head(yeast_metadata_inner)
## # A tibble: 6 x 17
##   study_accession run_accession tax_id scientific_name instrument_model
##   <chr>           <chr>          <dbl> <chr>           <chr>           
## 1 PRJEB5348       ERR458493       4932 Saccharomyces … Illumina HiSeq …
## 2 PRJEB5348       ERR458494       4932 Saccharomyces … Illumina HiSeq …
## 3 PRJEB5348       ERR458495       4932 Saccharomyces … Illumina HiSeq …
## 4 PRJEB5348       ERR458496       4932 Saccharomyces … Illumina HiSeq …
## 5 PRJEB5348       ERR458497       4932 Saccharomyces … Illumina HiSeq …
## 6 PRJEB5348       ERR458498       4932 Saccharomyces … Illumina HiSeq …
## # … with 12 more variables: library_layout <chr>, read_count <dbl>,
## #   experiment_alias <chr>, fastq_bytes <dbl>, fastq_md5 <chr>,
## #   fastq_ftp <chr>, submitted_ftp <chr>, sample_alias <chr>,
## #   sample_title <chr>, Lane <dbl>, Sample <chr>, BiolRep <dbl>
yeast_metadata_left <- left_join(ena_info, sample_mapping, by = "run_accession") 
head(yeast_metadata_left)
## # A tibble: 6 x 17
##   study_accession run_accession tax_id scientific_name instrument_model
##   <chr>           <chr>          <dbl> <chr>           <chr>           
## 1 PRJEB5348       ERR458493       4932 Saccharomyces … Illumina HiSeq …
## 2 PRJEB5348       ERR458494       4932 Saccharomyces … Illumina HiSeq …
## 3 PRJEB5348       ERR458495       4932 Saccharomyces … Illumina HiSeq …
## 4 PRJEB5348       ERR458496       4932 Saccharomyces … Illumina HiSeq …
## 5 PRJEB5348       ERR458497       4932 Saccharomyces … Illumina HiSeq …
## 6 PRJEB5348       ERR458498       4932 Saccharomyces … Illumina HiSeq …
## # … with 12 more variables: library_layout <chr>, read_count <dbl>,
## #   experiment_alias <chr>, fastq_bytes <dbl>, fastq_md5 <chr>,
## #   fastq_ftp <chr>, submitted_ftp <chr>, sample_alias <chr>,
## #   sample_title <chr>, Lane <dbl>, Sample <chr>, BiolRep <dbl>
We will work from now onwards with the tibble `yeast_metadata_inner`.

Exercise 9

We do not want all the columns; we want to create a new tibble called yeast_metadata that contains only run accession, experiment alias, read count, fastq bytes and md5, lane, yeast strain and biological replicate. Rename the column names so that all of the column names are lower_case (lowercase followed by underscore). And include only data from lane number 1. Use pipes!

Solution #9
yeast_metadata <-  yeast_metadata_inner %>% 
  rename(yeast_strain = Sample, lane = Lane, biol_rep = BiolRep) %>% 
  filter(lane == 1) %>% 
  select(run_accession, experiment_alias, read_count, fastq_bytes, fastq_md5, lane, yeast_strain, biol_rep) 

head(yeast_metadata)
## # A tibble: 6 x 8
##   run_accession experiment_alias read_count fastq_bytes fastq_md5  lane
##   <chr>         <chr>                 <dbl>       <dbl> <chr>     <dbl>
## 1 ERR458493     ena-EXPERIMENT-…    1093957    59532325 2b8c708c…     1
## 2 ERR458500     ena-EXPERIMENT-…    1885330   102201086 107aad97…     1
## 3 ERR458507     ena-EXPERIMENT-…    1594160    86382961 93bca10a…     1
## 4 ERR458514     ena-EXPERIMENT-…    1594695    86462515 48fa134b…     1
## 5 ERR458521     ena-EXPERIMENT-…    1977253   106992945 98a9d336…     1
## 6 ERR458528     ena-EXPERIMENT-…    1371825    74704461 e4bf39d6…     1
## # … with 2 more variables: yeast_strain <chr>, biol_rep <dbl>
We also want to create a table called `salmon_samples` that will include the 6 samples that we used in `salmon`, namely samples 'ERR458493', 'ERR458494', 'ERR458495', 'ERR458500', 'ERR458501' and 'ERR458502'.
samples <- c('ERR458493', 'ERR458494', 'ERR458495', 'ERR458500', 'ERR458501', 'ERR458502') # create a vector that includes the samples that we want to subset
salmon_samples <- yeast_metadata_inner %>% 
  filter(run_accession %in% samples) # filter rows based on these sample names

Making plots with ggplot2

We have learned how to manipulate datasets and combine them. We are now going to learn the basics of constructing a plot in ggplot2, which is part of the tidyverse.

The basic syntax of ggplot2 is:

ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) +
  <GEOM_FUNCTION>()

NOTE: The “<” and “>” used here don’t mean anything special. They are just placeholders to signify where your input goes.

We are going to explore our yeast_metadata tibble.

We build plots in ‘layers’:

ggplot(data = yeast_metadata)

Specifying data binds the plot to a specific data frame…

ggplot(data = yeast_metadata, mapping = aes(x = read_count, y = fastq_bytes))

defines the mapping using aes (aesthetics of the plot) by selecting the variables to be plotted and how they will be plotted (e.g. x/y, size, shape, color…)…

ggplot(data = yeast_metadata, mapping = aes(x = read_count, y = fastq_bytes)) +
  geom_point()

sets what type of plot we want to have (boxplot, lines, bars):

  • geom_point() for scatter plots, dot plots, etc.;

  • geom_boxplot() for boxplots;

  • geom_line() for trend lines, time series, etc.


We can modify plots to extract more information. We can add transparency:

ggplot(data = yeast_metadata, mapping = aes(x = read_count, y = fastq_bytes)) +
  geom_point(alpha = 0.1)

or change the color of the points:

ggplot(data = yeast_metadata, mapping = aes(x = read_count, y = fastq_bytes)) +
  geom_point(alpha = 0.1, color = "red")

Or color each strain differently:

ggplot(data = yeast_metadata, mapping = aes(x = read_count, y = fastq_bytes)) +
  geom_point(alpha = 0.5, aes(color = yeast_strain))

We can also specify the color inside the mapping:

ggplot(data = yeast_metadata, mapping = aes(x = read_count, y = fastq_bytes, color = yeast_strain)) +
  geom_point(alpha = 0.5)

We can use boxplots to see the distribution of reads within strains:

ggplot(data = yeast_metadata, mapping = aes(x = yeast_strain, y = read_count)) +
  geom_boxplot()

This is useful, but if we add dots to the boxplots we will have a better idea of the number of measurements and their distribution:

ggplot(data = yeast_metadata, mapping = aes(x = yeast_strain, y = read_count)) +
  geom_boxplot(alpha = 0.25) + geom_point(alpha = 0.25)

With categorical bins on the x-axis, we can also use geom_jitter() to spread out the points so they are easier to see:

ggplot(data = yeast_metadata, mapping = aes(x = yeast_strain, y = read_count)) +
  geom_boxplot(alpha = 0.1) + geom_jitter(alpha = 0.7, color = "tomato")

Exercise 10

Replace the boxplot with a violin plot.

Solution #10
ggplot(data = yeast_metadata, mapping = aes(x = yeast_strain, y = read_count)) +
  geom_violin(alpha = 0.1) +
  geom_jitter(alpha = 1, color = "tomato")

Exercise 11

Represent the read_count on log10 scale. Check out scale_y_log10().

Solution #11
ggplot(data = yeast_metadata, mapping = aes(x = yeast_strain, y = read_count)) +
  scale_y_log10() + 
  geom_boxplot(alpha = 0.1) +
  geom_jitter(alpha = 1, color = "tomato")

Exercise 12

Try to make a histogram plot for the read counts, coloring each yeast strain.

Solution #12
# Basic histogram
ggplot(data = yeast_metadata, aes(x=read_count)) + 
  geom_histogram()

# Change colors
p<-ggplot(data = yeast_metadata, aes(x=read_count)) + 
  geom_histogram(color="black", fill="white")
p

# Change colors based on yeast strain
ggplot(data = yeast_metadata, aes(x=read_count, fill = yeast_strain)) + 
  geom_histogram(color="black")

# Facet based on yeast strain
ggplot(data = yeast_metadata, aes(x=read_count, fill = yeast_strain)) + 
  geom_histogram(color="black") + 
  facet_grid(yeast_strain~.)

# Change to custom colors that are colorblind friendly
ggplot(data = yeast_metadata, aes(x=read_count, fill = yeast_strain)) + 
  geom_histogram(color="black") + 
  facet_grid(yeast_strain~.) +
  # Add blue and yellow colors that are more colorblind friendly for plotting
  scale_fill_manual(values = c("cornflowerblue", "goldenrod2"))

# Density based on yeast strain
ggplot(data = yeast_metadata, aes(x=read_count, fill = yeast_strain)) + 
  facet_grid(yeast_strain~.) + 
  geom_density(alpha = 0.5) +
  scale_fill_manual(values = c("cornflowerblue", "goldenrod2"))

# A white background might be preferable for the yellow color
ggplot(data = yeast_metadata, aes(x=read_count, fill = yeast_strain)) + 
  facet_grid(yeast_strain~.) + 
  geom_density(alpha = 0.5) +
  scale_fill_manual(values = c("cornflowerblue", "goldenrod2")) +
  theme_bw()

Exercise 13

What if we want to add the mean read counts in a vertical line?

Solution #13
# To do so, we need to calculate the mean_read_count in a new data frame
mean_yeast_data <- yeast_metadata %>%
  group_by(yeast_strain) %>%
  summarise(mean_read_count = mean(read_count))
head(mean_yeast_data)
## # A tibble: 2 x 2
##   yeast_strain mean_read_count
##   <chr>                  <dbl>
## 1 SNF2                1665717.
## 2 WT                  1596525.
# Add the mean read_count with vline
ggplot(data = yeast_metadata, aes(x=read_count, fill = yeast_strain)) + 
  geom_density(alpha = 0.5) + 
  facet_grid(yeast_strain~.) + 
  geom_vline(data = mean_yeast_data, mapping = aes(xintercept = mean_read_count),
             linetype="dashed", size=0.5) +
  theme_bw() +
  scale_fill_manual(values = c("cornflowerblue", "goldenrod2"))

Acknowledgements

This lesson was inspired by and adapted from the Data Carpentry Ecology R lessons

More material

=============

There are many amazing resources and cheat sheets to continue learning R, including:

Differential Expression and Visualization in R

Learning objectives:

  • Create a gene-level count matrix of Salmon quantification using tximport

  • Perform differential expression of a single factor experiment in DESeq2

  • Perform quality control and exploratory visualization of RNA-seq data in R

Getting started on Jetstream

You should still have your Jetstream instance running. You can follow the instructions here to log in to JetStream and find your instance, then ssh into it following the instructions here.

You should have the salmon counts on your instance in the ~/quant/ folder. If not, you can get this data by running:

cd ~
mkdir -p quant
cd quant
wget -O salmon_quant.tar.gz https://osf.io/5dbrh/download
tar -xvf salmon_quant.tar.gz

Importing gene-level counts into R using tximport

When we finished running salmon, we had a quant folder that contained transcript-level read counts. So far in our RNA-seq pipeline, we have been working from the command line in bash. For the rest of the RNA-seq tutorial, we will be working in R. As we saw in the R introduction lesson, R is really good at working with data in table format. When we produced counts for our reads, we essentially transformed our data to this format. Most of the software tools written to analyze RNA-seq data in this format are written in R.

We first need to read our data into R. To do that, we will use a package called tximport. This package takes transcript-level counts and summarizes them to the gene level. It is compatible with many count input formats, including salmon.

First, launch RStudio from your instance. Run the following command to produce a URL that navigates to RStudio when entered in your browser.

echo http://$(hostname -i):8787/

From R, start a new script. Load the libraries tximport, DESeq2, and tidyverse so we have access to the functions. We’ve pre-installed these packages for you, but we’ll go over how to install them in the future.

library(tximport)
library(DESeq2)
library(tidyverse)

We need a couple files to import our counts.

First, we need a file that tells tximport where our files are. We’ve added other information to this file about sample names and condition, which we will also use later for differential expression. We can use the read_csv() function to read this file into R as a dataframe directly from a url.

# read in the file from url
samples <- read_csv("https://osf.io/cxp2w/download")
# look at the first 6 lines
head(samples)

you should see output like this,

# A tibble: 6 x 3
  sample    quant_file                                condition
  <chr>     <chr>                                     <chr>    
1 ERR458493 ~/quant/ERR458493.qc.fq.gz_quant/quant.sf wt       
2 ERR458494 ~/quant/ERR458494.qc.fq.gz_quant/quant.sf wt       
3 ERR458495 ~/quant/ERR458495.qc.fq.gz_quant/quant.sf wt       
4 ERR458500 ~/quant/ERR458500.qc.fq.gz_quant/quant.sf snf2     
5 ERR458501 ~/quant/ERR458501.qc.fq.gz_quant/quant.sf snf2     
6 ERR458502 ~/quant/ERR458502.qc.fq.gz_quant/quant.sf snf2  

Second, we need a file that specifies which transcripts are associated with which genes. In this case, we have associated transcript names with yeast ORF names.

# read in the file from url
tx2gene_map <- read_tsv("https://osf.io/a75zm/download")
# look at the first 6 lines
head(tx2gene_map)

the output should look something like this,

# A tibble: 6 x 2
  TXNAME                GENEID   
  <chr>                 <chr>    
1 lcl|BK006935.2_mrna_1 YAL068C  
2 lcl|BK006935.2_mrna_2 YAL067W-A
3 lcl|BK006935.2_mrna_3 YAL067C  
4 lcl|BK006935.2_mrna_4 YAL065C  
5 lcl|BK006935.2_mrna_5 YAL064W-B
6 lcl|BK006935.2_mrna_6 YAL064C-A

We can now use tximport() to read in our count files

txi <- tximport(files = samples$quant_file, type = "salmon", tx2gene = tx2gene_map)

Let’s look at this new object:

summary(txi)
                    Length Class  Mode     
abundance           36774  -none- numeric  
counts              36774  -none- numeric  
length              36774  -none- numeric  
countsFromAbundance     1  -none- character

txi is a bit like a list, where it has multiple objects in it. Let’s take a look at the counts:

head(txi$counts)

We have all of our counts in one place! One thing they’re missing is informative column names. We can set these using our samples data frame, which will also ensure that we assign the correct names to each column.

colnames(txi$counts) <- samples$sample

Why do we need to normalize and transform read counts

Given a uniform sampling of a diverse transcript pool, the number of sequenced reads mapped to a gene depends on:

  • its own expression level,

  • its length,

  • the sequencing depth,

  • the expression of all other genes within the sample.

In order to compare the gene expression between two conditions, we must therefore calculate the fraction of the reads assigned to each gene relative to the total number of reads and with respect to the entire RNA repertoire which may vary drastically from sample to sample. While the number of sequenced reads is known, the total RNA library and its complexity is unknown and variation between samples may be due to contamination as well as biological reasons. The purpose of normalization is to eliminate systematic effects that are not associated with the biological differences of interest.

Normalization aims at correcting systematic technical biases in the data, in order to make read counts comparable across samples. The normalization proposed by DESeq2 relies on the hypothesis that most features are not differentially expressed. It computes a scaling factor for each sample. Normalized read counts are obtained by dividing raw read counts by the scaling factor associated with the sample they belong to.
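
As a minimal sketch (assuming the dds object we create and run through DESeq() in the next section), you can inspect the normalized counts that result from this scaling:

norm_counts <- counts(dds, normalized = TRUE) # raw counts corrected by each sample's normalization factor
head(norm_counts)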

Differential Expression with DESeq2


Image credit: Paul Pavlidis, UBC

Differential expression analysis with DESeq2 involves multiple steps as displayed in the flowchart below. Briefly,

  • DESeq2 will model the raw counts, using normalization factors (size factors) to account for differences in library depth.

  • Then, it will estimate the gene-wise dispersions and shrink these estimates to generate more accurate estimates of dispersion to model the counts.

  • Finally, DESeq2 will fit the negative binomial model and perform hypothesis testing using the Wald test or Likelihood Ratio Test.


We’re now ready to use DESeq2, the package that will perform differential expression.

We’ll start with a function called DESeqDataSetFromTximport which will transform our txi object into something that other functions in DESeq2 can work on. This is where we also give information about our samples, contained in the samples data.frame, and where we provide our experimental design. A design formula tells the statistical software the known sources of variation to control for, as well as the factor of interest to test during differential expression testing. Here our experimental design has one factor with two levels.

dds <- DESeqDataSetFromTximport(txi = txi, 
                                colData = samples, 
                                design = ~condition)

Note: DESeq2 stores virtually all information associated with your experiment in one specific R object, called a DESeqDataSet. This is, in fact, a specialized object of the class “SummarizedExperiment”. This, in turn, is a container where rows (rowRanges()) represent features of interest (e.g. genes, transcripts, exons) and columns represent samples (colData()). The actual count data is stored in the assay() slot.

After running this command, you should see red output messages that look something like this:

using counts and average transcript lengths from tximport
Warning message:
In DESeqDataSet(se, design = design, ignoreRank) :
  some variables in design formula are characters, converting to factors

The first thing we notice is that both our counts and average transcript length were used to construct the DESeq object. We also see a warning message, where our condition was converted to a factor. Both of these messages are ok to see!
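
Before moving on, we can peek inside the DESeqDataSet with the accessors mentioned in the note above (a quick sketch; your output may differ slightly):

colData(dds)     # sample information, one row per sample
head(assay(dds)) # the count matrix itself
dim(dds)         # number of genes and number of samples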

Now that we have a DESeq2 object, we can perform differential expression.

dds <- DESeq(dds)

Everything from normalization to linear modeling was carried out by the use of a single function! This function will print out a message for the various steps it performs:

estimating size factors
using 'avgTxLength' from assays(dds), correcting for library size
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing

And look at the results! The results() function lets you extract the base means across samples, log2-fold changes, standard errors, test statistics etc. for every gene.

res <- results(dds)
head(res)
log2 fold change (MLE): condition wt vs snf2 
Wald test p-value: condition wt vs snf2 
DataFrame with 6 rows and 6 columns
               baseMean     log2FoldChange             lfcSE               stat               pvalue                 padj
              <numeric>          <numeric>         <numeric>          <numeric>            <numeric>            <numeric>
ETS1-1 150.411380876596 -0.212379164913443 0.226307045512225 -0.938455824177912    0.348010209041573    0.670913203685731
ETS2-1                0                 NA                NA                 NA                   NA                   NA
HRA1   1.08884236321432   3.97189996253477  3.31146245569429   1.19943982928292    0.230356968252411                   NA
ICR1   28.8713313668145 -0.179654414034232 0.547534328578008 -0.328115343015677    0.742824453568867    0.907583917807416
IRT1   29.6053852845893  0.347498338858063 0.523744585645556   0.66348817416364    0.507017950920217    0.794457978516333
ITS1-1 23.8676180072013  -7.25640067170996  1.28178177600513  -5.66118258782368 1.50333365625958e-08 2.62112769016311e-07

The first thing that prints to the screen is information about the “contrasts” in the differential expression experiment. By default, DESeq2 selects the alphabetically first factor level to be the “reference” level. Here that doesn’t make much of a difference. However, it does change how we interpret the log2FoldChange values. We can read these results as: “compared to the SNF2 mutant, WT has a log2 fold change of -0.21 for this gene (i.e., slightly lower expression).”
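
If you want a specific reference level rather than the alphabetical default, one common approach (shown here as a sketch, and not required for this lesson) is to relevel the condition factor and re-fit the model:

dds_releveled <- dds
dds_releveled$condition <- relevel(dds_releveled$condition, ref = "wt") # make 'wt' the reference level
dds_releveled <- DESeq(dds_releveled)                                   # re-fit the model
results(dds_releveled)                                                  # log2 fold changes are now snf2 vs wt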

Speaking of log2fold change, what do all of these columns mean?

  • baseMean: mean expression across all samples

  • log2FoldChange: log2 fold change of gene expression from one condition to another; reflects how different the expression of a gene in one condition is from the expression of the same gene in the other condition

  • lfcSE: standard error of the log2 fold change (used to calculate the p value)

  • stat: test statistic (used to calculate the p value)

  • pvalue: p-value for the log fold change

  • padj: adjusted p-value

We see that the default differential expression output is sorted the same way as our input counts. Instead, it can be helpful to sort and filter by adjusted p value or log2 Fold Change:

res_sig <- subset(res, padj<.05)
res_lfc <- subset(res_sig, abs(log2FoldChange) > 1) 
head(res_lfc)

Visualization of RNA-seq and Differential Expression Results

Looking at our results is great, but visualizing them is even better!

MA Plot

The MA plot provides a global view of the relationship between the expression change between conditions (log ratios, M), the average expression strength of the genes (average mean, A), and the ability of the algorithm to detect differential gene expression: genes that pass the significance threshold are colored in red.

plotMA(res)

Question
Why are more genes grey on the left side of the axis than on the right side?

Plotting individual genes

Although it’s helpful to plot many (or all) genes at once, sometimes we want to see how counts change for a specific gene. We can use the following code to produce such a plot:

plotCounts(dds, gene=which.min(res$padj), intgroup="condition")

Question
What gene is plotted here (i.e., what criteria did we use to select a single gene to plot)?

Normalization & Transformation

DESeq2 automatically normalizes our count data when it runs differential expression. However, for certain plots, we need to normalize and transform our raw count data ourselves. One way to do that is to use the vst() function, which performs a variance stabilizing transformation on the count data while controlling for the library size of the samples.

vsd <- vst(dds)

MDS Plot

An MDS (multi-dimensional scaling) plot gives us an idea of how our samples relate to each other. The closer two samples are on a plot, the more similar all of their counts are. To generate this plot in DESeq2, we need to calculate “sample distances” and then plot them.

# calculate sample distances
sample_dists <- assay(vsd) %>%
  t() %>%
  dist() %>%
  as.matrix() 

head(sample_dists)

Next, let’s calculate the MDS values from the distance matrix.

mdsData <- data.frame(cmdscale(sample_dists))
mds <- cbind(mdsData, as.data.frame(colData(vsd))) # combine with sample data
head(mds)

And plot with ggplot2!

ggplot(mds, aes(X1, X2, shape = condition)) + 
  geom_point(size = 3) +
  theme_minimal()

Question
How similar are the samples between conditions?

Heatmap

Heatmaps are a great way to look at gene counts. To do that, we can use a function in the pheatmap package. Here we demonstrate how to install a package from the CRAN repository and then load it into our environment.

install.packages("pheatmap")
library(pheatmap)

Next, we can select a subset of genes to plot. Although we could plot all ~6000 yeast genes, let’s choose the 20 genes with the largest positive log2fold change.

genes <- rownames(res_lfc)[order(res_lfc$log2FoldChange, decreasing=TRUE)[1:20]] # names of the 20 genes with the largest positive log2 fold change

We can also make a data.frame that contains information about our samples that will appear in the heatmap. We will use our samples data.frame from before to do this.

annot_col <- samples %>%
  column_to_rownames('sample') %>%
  select(condition) %>%
  as.data.frame()

head(annot_col)

And now plot the heatmap!

pheatmap(assay(vsd)[genes, ], cluster_rows=TRUE, show_rownames=TRUE,
         cluster_cols=FALSE, annotation_col=annot_col)

We see that our samples do cluster by condition, but that by looking at just the counts, the patterns aren’t very strong. How does this compare to our MDS plot?

Question
When are heatmaps useful?
What other types of heatmaps have you seen in the wild?

Further Notes

Here are some helpful notes or resources for anyone performing differential expression.

  • Introduction to differential gene expression analysis using RNA-seq (Written by Friederike Dündar, Luce Skrabanek, Paul Zumbo). Click here

  • Introduction to DGE - click here

Making a tx2gene file

Making a tx2gene file is often a little different for each organism. If your organism has a transcriptome (or *rna_from_genomic.fna.gz file) on RefSeq, you can often get the information from the *feature_table.txt.gz file. You might also be able to parse a GTF or GFF file to produce the information you need. This information is sometimes also found in the fasta headers of the transcriptome file itself.
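
As one possible sketch (not the method used to build the tx2gene file above), the GenomicFeatures package can parse a GTF/GFF annotation into a transcript-to-gene table; "my_annotation.gtf" is a hypothetical file name:

library(GenomicFeatures)
txdb <- makeTxDbFromGFF("my_annotation.gtf", format = "gtf")  # build a transcript database from the annotation
k <- keys(txdb, keytype = "TXNAME")                           # all transcript names
tx2gene <- AnnotationDbi::select(txdb, keys = k, keytype = "TXNAME", columns = "GENEID") # map transcripts to genes
head(tx2gene)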

Mapping and variant calling on yeast transcriptome

Learning objectives:

  • define and explore the concepts and implications of shotgun sequencing;

  • explore coverage;

  • understand the basics of mapping-based variant calling;

  • learn basics of actually calling variants & visualizing.

Boot up a Jetstream

You should still have your Jetstream instance running. You can follow the instructions here to log in to JetStream and find your instance, then ssh into it following the instructions here.

Install software

conda install -y bwa samtools bcftools

Change to a new working directory and map data

After installing the necessary software, we will create the working directory for the mapping as follows:

cd ~/
mkdir mapping
cd mapping

Next, we will create links from the previously downloaded and quality-trimmed yeast dataset:

ln -fs ~/quality/*.qc.fq.gz .
ls

Variant Calling Workflow


Map data

Goal: perform read alignment (also known as mapping) to determine where our reads originated from.

Download and gunzip the reference:

Here we are going to align our transcripts to the reference’s open reading frames to look for single-nucleotide variants. It’s important to think about what reference is appropriate for your experiment. Many biologically important variants exist in non-coding regions, so if we were looking at genomic sequences, it would be important to use a different reference such as the whole genome.

curl -O https://downloads.yeastgenome.org/sequence/S288C_reference/orf_dna/orf_coding.fasta.gz
gunzip orf_coding.fasta.gz

Let’s take a look at our reference:

head orf_coding.fasta

Finding Help with bwa

To bring up the manual page for bwa we can type:

bwa

Notice that it brings up a description of the different commands within the bwa software. We will be using index and mem.

We can also have a look at the bwa options page.

Indexing: Prepare reference for mapping:

Our first step is to index the reference sequences for use by BWA. Indexing allows the aligner to quickly find potential alignment sites for query sequences in a reference, which saves time during alignment. Indexing the reference only has to be run once. The only reason you would want to create a new index is if you are working with a different reference or you are using a different tool for alignment.

bwa index orf_coding.fasta

Mapping

We use an algorithm called bwa mem to perform mapping. To find more information via the help page for the bwa mem function we can type the command below. Without any input, the help page will come up by default:

bwa mem

Then, when we are ready to perform mapping with our sample ERR458493, we can type:

bwa mem -t 4 orf_coding.fasta ERR458493.qc.fq.gz  > ERR458493.sam

While we are running bwa with the default parameters here, your use case might require different parameters. NOTE: Always read the manual page for a tool before using it, and make sure the options you choose are appropriate for your data.

What is the difference between Salmon and bwa mem?

  • Standard alignment tools (Hisat2, STAR, BWA) try to find the read origin by FULL-ALIGNMENT of reads to a genome or transcriptome.

  • Ultra-fast alignment-free methods, such as Sailfish, Salmon and Kallisto, have been developed by exploiting the idea that precise alignments are not required to assign reads to their origins.

  • Salmon’s “quasi-mapping” approach uses exact matching of k-mers (sub-strings in reads) to approximate which transcript a read originated from. The idea behind it is that it may not be important to know exactly where within a transcript a certain read originated from. Instead, it may be enough to simply know which transcript the read represents.

  • Salmon therefore does not generate a BAM file because it does not worry about finding the best possible alignment. Instead, it yields a (probabilistic) measure of how many reads originated from each transcript. This is enough information for read quantification, and is really fast.

  • However, BWA mem produces an alignment, where an entire read is mapped exactly against a reference sequence. This produces more information that is important for things like variant calling.

Observe!

SAM/BAM File formats

We can peek at our “.sam” file:

head ERR458493.sam
tail ERR458493.sam

The SAM file is a tab-delimited text file that contains information for each individual read and its alignment to the reference. While we do not have time to go into detail about the features of the SAM format, the paper by Heng Li et al. provides a lot more detail on the specification.

The compressed binary version of SAM is called a BAM file. We use this version to reduce size and to allow for indexing, which enables efficient random access of the data contained within the file.

PRACTICE!

Using the bwa command we ran above as the foundation, construct a for loop to generate “.sam” alignment files for all of our quality controlled samples!

Solution

for filename in *.qc.fq.gz
do
  name=$(basename $filename .qc.fq.gz)
  echo Working on: $name
  bwa mem -t 4 orf_coding.fasta $filename > ${name}.sam
done

Visualize mapping

Goal: make it possible to visualize some of our mapping results.

Index the reference:

Earlier we indexed the reference for BWA; now we need to index it for samtools. To see the manual for samtools we can type:

samtools

Although the two tools use different indexing methods, both indexes allow specific sequences within the reference to be found quickly. We can see that the indexing command is samtools faidx.

samtools faidx orf_coding.fasta

Convert the SAM file into a BAM file:

Next, we will convert our file format to a .bam file with the samtools view command. Let’s see the different parameters for this function:

samtools view

We can see that:

  • -S: ignored (input format is auto-detected)

  • -b: output BAM

So let’s convert our file format for sample ERR458493:

samtools view -S -b ERR458493.sam > ERR458493.bam
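As a quick optional check of how much space the binary format saves (not part of the original workflow), we can compare the sizes of the two files:

# the BAM file should be noticeably smaller than the SAM file
ls -lh ERR458493.sam ERR458493.bam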

Sort the BAM file by position in genome:

You can sort on many different columns within a sam or bam file. After mapping, our files are sorted by read number. Most programs require mapping files to be sorted by position in the reference. You can sort a file using the samtools sort command.

samtools sort ERR458493.bam -o ERR458493.sorted.bam

Index the BAM file so that we can randomly access it quickly:

samtools index ERR458493.sorted.bam

Call variants with bcftools!

Goal: find places where the reads are systematically different from the reference.

A variant call is a conclusion that there is a nucleotide difference vs. some reference at a given position in an individual genome or transcriptome, often referred to as a Single Nucleotide Polymorphism (SNP). The call is usually accompanied by an estimate of variant frequency and some measure of confidence. Similar to other steps in this workflow, there are a number of tools available for variant calling. In this workshop we will be using bcftools, but there are a few things we need to do before actually calling the variants.

First, let’s look at the manual page for bcftools, making note of the mpileup and call commands:

bcftools

Let’s see the different parameters for the bcftools mpileup command by typing:

bcftools mpileup

Note: The parameters we will use for bcftools mpileup are:

  • -O: Output file type, which can be:

    • ‘b’ compressed BCF. We will use this output format!

    • ‘u’ uncompressed BCF

    • ‘z’ compressed VCF

    • ‘v’ uncompressed VCF

  • -f: faidx indexed reference sequence file

And to find help for the bcftools call function:

bcftools call

Note: The parameters for bcftools call that we will use:

  • -m: alternative model for multiallelic and rare-variant calling (conflicts with -c)

  • -v: output variant sites only

  • -o: write output to a file

Now, let’s go ahead and run our command on our sample ERR458493!

bcftools mpileup -O b -f orf_coding.fasta ERR458493.sorted.bam | \
    bcftools call -m -v -o variants.vcf

Next, we will use a perl script from samtools called vcfutils.pl that will filter our variants, and we can write the output to a new file.

vcfutils.pl varFilter variants.vcf  > variants_filtered.vcf

To look at the entire variants.vcf file you can do cat variants.vcf; all of the lines starting with # are comments. You can use tail variants.vcf to see the last ~10 lines, which should be all of the called variants.
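As another quick, optional aside (not part of the original workflow), we can count how many variants were called before and after filtering by skipping the header lines:

# count variant records only, ignoring header lines that start with "#"
grep -v "^#" variants.vcf | wc -l
grep -v "^#" variants_filtered.vcf | wc -l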

The first few columns of the VCF file represent the information we have about a predicted variation:

column info
CHROM contig location where the variation occurs
POS position within the contig where the variation occurs
ID a . until we add annotation information
REF reference genotype (forward strand)
ALT sample genotype (forward strand)
QUAL Phred-scaled probability that the observed variant exists at this site (higher is better)
FILTER a . if no quality filters have been applied, PASS if a filter is passed, or the name of the filters this variant failed

The Broad Institute’s VCF guide is an excellent place to learn more about VCF file format.

Visualize with tview:

Now that we know the locations of our variants, let’s view them with samtools!

Samtools implements a very simple text alignment viewer called tview. This alignment viewer works with short indels and shows MAQ consensus. It uses different colors to display mapping quality or base quality, subject to the user’s choice. The samtools viewer is known to work swiftly with a 130 GB alignment. Due to its text interface, displaying alignments over a network is also very fast.

In order to visualize our mapped reads we use tview, giving it the sorted bam file and the reference file:

samtools tview ERR458493.sorted.bam orf_coding.fasta

tview commands of relevance:

  • left and right arrows scroll

  • q to quit

  • CTRL-h and CTRL-l do “big” scrolls

  • Typing g allows you to go to a specific location, in the format chromosome:location (an example of jumping straight to a position when launching tview follows this list). Here are some locations you can try out:

    • YLR162W:293 (impressive pileup, shows two clear variants and three other less clear)

    • YDR034C-A:98 (impressive pileup, shows two clear variants)

    • YDR366C:310 (impressive pileup, less clear variants)

    • YLR256W:4420 (impressive pileup, less clear variants)

    • YBL105C:2179 (less depth, shows two clear variants)

    • YDR471W:152 (impressive pileup, shows one clear variant)
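Instead of typing g inside the viewer, we can also jump straight to one of these positions when launching tview with its -p option (an optional shortcut; the position used here is the first one from the list above):

samtools tview -p YLR162W:293 ERR458493.sorted.bam orf_coding.fasta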

Get some summary statistics as well:

samtools flagstat ERR458493.sorted.bam

Automating workflows using bash

Objectives

  • Learn what a bash/shell script is and how to write one

  • Incorporate for loops within a shell script


We have now worked through two workflows, RNA-seq analysis and variant calling. In both cases, we pasted all of our commands into the terminal one command at a time, and used for loops to run each step of the workflow over our six samples. This is necessary at first, and a great way to understand how each tool works and establish a workflow we want to use.

Let’s imagine now that we decided to download all 672 samples from this dataset instead of only working with our six samples. Since we have already figured out how each tool in our workflow works, and we know how we want to run them, one way we can automate (and document) our workflow is with a bash script (remember that bash is just the most common programming language used in our terminal environment).

Here we are going to automate the quality assessment and trimming we did here and the read quantification we performed here in one script. But first, let’s see what a bash script is!

Accessing our JetStream instances

Your jetstream instance should still be running; you can follow the instructions here to log in to JetStream and find your instance. Then ssh into it following the instructions here.

Next, we’re going to access RStudio again on our instances. RStudio will be convenient for being able to edit a text file alongside an active terminal window (command-line window). We’ll see this in action in a second. As earlier, run the following command to produce a url that opens RStudio when entered in your browser.

echo http://$(hostname -i):8787/

Setting up our environment

Previously, we’ve been writing R code in the RStudio text-editor at the top left, but we can also write any kind of files there. So that will be where we build up our bash script. To start with a clean new document, select “File” from the top menu bar inside RStudio, then “New File”, then “Text File”.

At the bottom left window is our R “console” right now. But if we select “Tools” from the top menu bar within RStudio, then “Terminal”, and then “New Terminal”, we will open a command-line terminal in that bottom left window that looks something like this:


This is the same as our terminal that we used to ssh into and then get the link to the RStudio environment, but we are just accessing it through RStudio here. Running commands like cd, ls, and others we’ve been using at the command line all work the same as if we were in a terminal outside of RStudio. If something isn’t working there, double-check it is on the “Terminal” tab, and not the “Console” tab (the “Console” tab is the R environment, while the “Terminal” tab is our command-line environment).

Our first bash script!

First, in the Terminal of RStudio, let’s run these commands:

ls
echo "I'm here to show something random being done"
pwd

Those 3 commands just execute as we are used to when running them at the command line one-by-one. Now let’s copy those 3 commands, and put them into our text editor window at the top-left:


And save the file in our home directory (which should be where it starts) as “my-first-bash-script.sh”:


Now if we check at the command line, (our terminal window at the bottom left), we can see this file is there:

ls *.sh

And we can run it by providing the command bash, and then the file we just made as a positional argument like so:

bash my-first-bash-script.sh

And all three lines are executed as though we entered them one at a time! This is all a bash/shell script is 🙂

Next let’s build up a larger script!

Grabbing some data

For the sake of speed while running things in this tutorial, we are going to be working with only 2 samples. But it would be the same process even if we had 200.

In the Terminal window of RStudio (bottom left), let’s make a new directory to work in, download just two of our data files and the reference transcriptome we used for read quantification, and copy over the adapter file we used for Trimmomatic:

mkdir scripting/
cd scripting/

curl -L https://osf.io/5daup/download -o ERR458493.fastq.gz
curl -L https://osf.io/8rvh5/download -o ERR458494.fastq.gz
curl -O ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/146/045/GCA_000146045.2_R64/GCA_000146045.2_R64_rna_from_genomic.fna.gz

cp /opt/miniconda/pkgs/trimmomatic-*/share/trimmomatic-*/adapters/TruSeq2-SE.fa .

Constructing a bash script

When we write a bash script, we need to add all the commands that we would run in our workflow if we were doing it in real-time. This includes making and changing directories, moving files, and running analysis programs like fastqc.

Let’s start by just running fastqc inside a script.

Don’t enter this command in your Terminal, instead copy and paste it into a new text document in the top left panel (File -> New File -> Text File).

fastqc *fastq.gz

And then save the file as “workflow.sh”, and be sure to save it in the “scripting/” directory we are working in:


Now we have a bash script that automates fastqc! Let’s run the bash script from our terminal in RStudio:

bash workflow.sh

Let’s add more to our script. Next we’ll organize our output files, and add some echo commands to tell us where we are up to if we happen to be watching. Add this text onto our file in the text-editor panel where our script is building, not in the Terminal window.

echo "Running FastQC..."
fastqc *.fastq.gz

echo "Moving FastQC results..."
mkdir fastqc_untrimmed/

mv *.zip fastqc_untrimmed/
mv *.html fastqc_untrimmed/

And don’t forget to save the file after making changes, cmd + s or ctrl + s should do the trick, and the file name at the top should change from red to black when saved:


Now, let’s run this again.

bash workflow.sh

We now see that our echo messages print to the terminal and tell us where we are in the workflow. After it finishes, if we run ls on the “fastqc_untrimmed” directory, we see our fastqc output files are in there:

ls fastqc_untrimmed/

Now let’s add our Trimmomatic command in a for loop like we did here:

echo "Running FastQC..."
fastqc *.fastq.gz

echo "Moving FastQC results..."
mkdir fastqc_untrimmed/

mv *.zip fastqc_untrimmed/
mv *.html fastqc_untrimmed/

echo "Running Trimmomatic..."
for filename in *.fastq.gz
do
  # first, getting the unique part of the file name alone by removing ".fastq.gz"
  base=$(basename $filename .fastq.gz)
  
  echo "On sample: $base"

  trimmomatic SE ${base}.fastq.gz \
              ${base}.qc.fq.gz \
              ILLUMINACLIP:TruSeq2-SE.fa:2:0:15 \
              LEADING:15 TRAILING:15 \
              SLIDINGWINDOW:10:20 \
              MINLEN:25
done

And don’t forget to save the file again:


And now we are ready to run it in our terminal window:

bash workflow.sh

PRACTICE!

Add some commands at the end of our script that will:
  1) run FastQC on the output files from Trimmomatic (those that end with “.qc.fq.gz”)
  2) create a new directory called “fastqc_trimmed”
  3) move the output files from FastQC into “fastqc_trimmed”.

Solution

Our entire file should now look something like this:
echo "Running FastQC..."
fastqc *.fastq.gz
echo "Moving FastQC results..." mkdir fastqc_untrimmed/
mv *.zip fastqc_untrimmed/ mv *.html fastqc_untrimmed/
echo "Running Trimmomatic..." for filename in *.fastq.gz do
  # first, getting the unique part of the file name by removing ".fastq.gz"   base=$(basename $filename .fastq.gz)   echo On sample: $base
  trimmomatic SE ${base}.fastq.gz \               ${base}.qc.fq.gz \               ILLUMINACLIP:TruSeq2-SE.fa:2:0:15 \               LEADING:15 TRAILING:15 \               SLIDINGWINDOW:10:20 \               MINLEN:25
done
echo "Running FastQC on trimmed files..." fastqc *.qc.fq.gz
echo "Moving trimmed FastQC results..." mkdir fastqc_trimmed/
mv *.zip fastqc_trimmed/ mv *.html fastqc_trimmed/

Lastly, let’s add on read quantification with Salmon like we did here. We only need to run the indexing command once, and then we can put a for loop to run the quantification command on our quality trimmed fastq files just like the one used here. And then let’s add a message at the end for our future selves:

echo "Running FastQC..."
fastqc *.fastq.gz

echo "Moving FastQC results..."
mkdir fastqc_untrimmed/

mv *.zip fastqc_untrimmed/
mv *.html fastqc_untrimmed/

echo "Running Trimmomatic..."
for filename in *.fastq.gz
do
  # first, getting the unique part of the file name alone by removing ".fastq.gz"
  base=$(basename $filename .fastq.gz)
  
  echo "On sample: $base"

  trimmomatic SE ${base}.fastq.gz \
              ${base}.qc.fq.gz \
              ILLUMINACLIP:TruSeq2-SE.fa:2:0:15 \
              LEADING:15 TRAILING:15 \
              SLIDINGWINDOW:10:20 \
              MINLEN:25
done

echo "Running FastQC on trimmed files..."
fastqc *.qc.fq.gz

echo "Moving trimmed FastQC results..."
mkdir fastqc_trimmed/

mv *.zip fastqc_trimmed/
mv *.html fastqc_trimmed/

echo "Indexing ref. transcriptome for Salmon..."
salmon index --index sc_index --type quasi --transcripts GCA_000146045.2_R64_rna_from_genomic.fna.gz

echo "Quantifying with Salmon..."
for filename in *.qc.fq.gz
do
  # getting the unique part of the file name alone by removing ".qc.fq.gz"
  base=$(basename $filename .qc.fq.gz)
  
  salmon quant -i sc_index --libType A \
               -r ${filename} -o ${base}_quant --seqBias \
               --gcBias --validateMappings
done

echo "Job Done!! (That's a nice-lookin' script!)"

Stellar! Don’t forget to save it, and now let’s run it!

bash workflow.sh

Bash scripting can be very powerful!!

Workflow Management using Snakemake

Objectives

  • Identify cases where workflow managers are helpful for automation

  • Understand the components of a Snakefile: rules, inputs, outputs, and actions.

  • Write and run a Snakefile.

Accessing our JetStream instances

Your jetstream instance should still be running; you can follow the instructions here to log in to JetStream and find your instance. Then ssh into it following the instructions here.

You should now be logged into your Jetstream computer! You should see something like this:

diblynn@js-17-71:~$

Make sure you’re in your home directory:

cd ~

Install snakemake using conda.

conda install -y -c conda-forge -c bioconda snakemake-minimal

Type the following in your terminal to display a link to the RStudio web-server for your instance (built from its IP address, $(hostname -i)):

echo http://$(hostname -i):8787/

Copy and paste the generated link into your browser to open RStudio and log in with your room’s jetstream username and password. We’re going to again work with the text editor and the terminal of RStudio.

Automation with BASH

In both our RNA-seq workflow and our mapping and variant calling workflow, we performed many steps. We performed steps like quality assessment and filtering using fastqc and trimmomatic. We performed these steps on 6 files using for loops.

In our last lesson, we automated these steps using a bash script. We put all of our commands into one file so that we only had to run one command to orchestrate our quality control workflow. Bash scripting for automation is really powerful!

Let’s revisit the first part of our bash script for running and organizing our fastqc results:

cd ~/data/

echo "Running FastQC ..."
fastqc *.fastq.gz 

mkdir -p ~/fastqc_untrimmed

echo "Saving FastQC results..."
mv *.zip ~/fastqc_untrimmed/
mv *.html ~/fastqc_untrimmed/

cd ~/fastqc_untrimmed/

echo "Running multiqc..."
multiqc .

We can run it using this command:

bash qc.sh

Oh crap! We realize just after we’ve finished running FastQC that we wanted to move our summary files to a subdirectory! Quick, press CTRL-C and cancel the run!

Even if we made this change though, we’re in a bit of a pickle. We want to re-start our script that automates the runs and the file movements for us, but we already ran the first part of our file! Let’s add comment characters to the lines we know already ran and then re-run the script:

cd ~/data/

# echo "Running FastQC ..."
# fastqc *.fastq.gz 

mkdir -p ~/data/fastqc_untrimmed

echo "Saving FastQC results..."
mv *.zip ~/data/fastqc_untrimmed/
mv *.html ~/data/fastqc_untrimmed/

cd ~/data/fastqc_untrimmed/

echo "Running multiqc..."
multiqc .

Now we can re-run the script:

bash qc.sh 

This (maybe) worked out ok this time. However, it’s hard to be sure we know where we stopped our command. For this reason and many others, we use workflow managers to keep track of the things that have and haven’t run yet!

We’ll be using snakemake for automation.

Introduction to Snakemake

The Snakemake workflow management system is a tool to create reproducible and scalable data analyses. It orchestrates and keeps track of all the different steps of workflows that have been run so you don’t have to! It has a lot of wonderful features that can be invoked for different applications, making it very flexible while maintaining human interpretability.

There are many different tools that researchers use to automate computational workflows. We selected snakemake for the following reasons:

  • It was written by a bioinformatician for bioinformatics workflows.

  • It’s free, open-source, and conda-installable

  • Snakemake works cross-platform (Windows, MacOS, Linux) and is compatible with all HPC schedulers. It works on laptops, the cloud, and clusters without modification to the main workflow (as long as you have enough compute resources!).

  • Snakemake is written using Python, but supports bash and R code as well.

  • Anything that you can do in Python, you can do with Snakemake (since you can pretty much execute arbitrary Python code anywhere).

Like other workflow management systems, Snakemake allows you to:

  • Keep a record of how your scripts are used and what their input dependencies are

  • Run multiple steps in sequence, parallelising where possible

  • Automatically detect if something changes and then reprocess data if needed

Our goal is to automate the first two steps (FastQC and MultiQC) of our example workflow using snakemake!

Starting with Snakemake

Snakemake workflows are built around rules. The diagram below shows the anatomy of a snakemake rule:


Let’s make a rule to run fastqc on one of our samples below. We’ll put this rule in a file called Snakefile.

# This rule will run fastqc on the specified input file.
rule fastqc_raw:
    input: "data/ERR458493.fastq.gz"
    output: 
        "fastqc_raw/ERR458493_fastqc.html",
        "fastqc_raw/ERR458493_fastqc.zip"
    shell:'''
    fastqc -o fastqc_raw {input}
    '''

Let’s try and run our Snakefile! Return to the command line and run snakemake.

snakemake

You should see output that starts like this:

Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
	count	jobs
	1	fastqc_raw
	1

[Tue Jul  2 19:10:26 2019]
rule fastqc_raw:
    input: data/ERR458493.fastq.gz
    output: fastqc_raw/ERR458493_fastqc.html, fastqc_raw/ERR458493_fastqc.zip
    jobid: 0

Let’s check that the output file is there:

ls fastqc_raw/*fastqc*

Yay! Snakemake ran the thing!

We can also use better organization. Let’s specify a different output folder for our fastqc results:

# This rule will run fastqc on the specified input file
# (replace the prior fastqc_raw rule with this new rule)
rule fastqc_raw:
    input: "data/ERR458493.fastq.gz"
    output: 
        "fastqc_raw/ERR458493_fastqc.html",
        "fastqc_raw/ERR458493_fastqc.zip"
    shell:'''
    fastqc -o fastqc_raw {input}
    '''

If we look in our directory, we should now see a fastqc_raw directory, even though we didn’t create it:

ls

Snakemake created this directory for us. We can look inside it to see if it really ran our command:

ls fastqc_raw

Creating a pipeline with snakemake

We told snakemake to do something, and it did it. Let’s add another rule to our Snakefile telling snakemake to do something else. This time, we’ll run multiqc.

# Run fastqc on the specified input file
rule fastqc_raw:
    input: "data/ERR458493.fastq.gz"
    output: 
        "fastqc_raw/ERR458493_fastqc.html",
        "fastqc_raw/ERR458493_fastqc.zip"
    shell:'''
    fastqc -o fastqc_raw {input}
    '''

# Run multiqc on the results of the fastqc_raw rule
rule multiqc_raw:
    input: "fastqc_raw/ERR458493_fastqc.zip"
    output: "fastqc_raw/multiqc_report.html"
    shell:'''
    multiqc -o fastqc_raw fastqc_raw
    '''

If we run snakemake again, we see output like this:

Building DAG of jobs...
Nothing to be done.
Complete log: /Users/tr/2019_angus/.snakemake/log/2019-07-02T191640.002744.snakemake.log

However, when we look at the output directory fastqc_raw, we see that our multiqc file does not exist! Bad Snakemake! Bad!!

Snakemake looks to a rule all in the Snakefile for the final file(s) it needs to produce in a workflow. Once these targets are defined, it works back through all other rules looking for the ordered sequence of rules that will produce all of the files necessary to generate the final file(s) specified in rule all. At this point in our workflow, that final file is our multiqc report. Let’s add a rule all.

rule all:
    """
    The "all" rule is the first rule that Snakemake will run.
    Snakemake will automatically figure out which rules need to 
    be run to generate the specified input file.
    """
    input:
        "fastqc_raw/multiqc_report.html"

rule fastqc_raw:
    input: "data/ERR458493.fastq.gz"
    output: 
        "fastqc_raw/ERR458493_fastqc.html",
        "fastqc_raw/ERR458493_fastqc.zip"
    shell:'''
    fastqc -o fastqc_raw {input}
    '''

rule multiqc_raw:
    input: "fastqc_raw/ERR458493_fastqc.html"
    output: "fastqc_raw/multiqc_report.html"
    shell:'''
    multiqc -o fastqc_raw fastqc_raw
    '''

And it worked! Now we see output like this:

Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
	count	jobs
	1	all
	1	multiqc_raw
	2

Snakemake now has two processes it’s keeping track of.


Using Snakemake to process multiple files

So far we’ve been using snakemake to process one sample. However, we have 6! Snakemake can be flexibly extended to more samples using wildcards.

We already saw wildcards previously.

When we used {input} in our shell command, {input} acted as a wildcard: it takes on the value we specified in the rule’s input directive.

rule fastqc_raw:
    input: "data/ERR458493.fastq.gz"
    output: 
        "fastqc_raw/ERR458493_fastqc.html",
        "fastqc_raw/ERR458493_fastqc.zip"
    shell:'''
    fastqc -o fastqc_raw {input}
    '''

We can create our own wildcard too. This is really handy for running our workflow on all of our samples.

# Create a list of strings containing all of our sample names
SAMPLES=['ERR458493', 'ERR458494', 'ERR458495', 'ERR458500', 'ERR458501', 
'ERR458502']

rule all:
    input:
        "fastqc_raw/multiqc_report.html"

rule fastqc_raw:
    input: "data/{sample}.fastq.gz"
    output: 
        "fastqc_raw/{sample}_fastqc.html",
        "fastqc_raw/{sample}_fastqc.zip"
    shell:'''
    fastqc -o fastqc_raw {input}
    '''

rule multiqc_raw:
    input: expand("fastqc_raw/{sample}_fastqc.html", sample = SAMPLES)
    output: "fastqc_raw/multiqc_report.html"
    shell:'''
    multiqc -o fastqc_raw fastqc_raw
    '''

We can run this again at the terminal.

snakemake

And we have now run these rules for each of our samples!

Note that we added new syntax here as well. We define a variable at the top of the snakefile called SAMPLES. Snakemake resolves the values for the wildcard {sample} the last time it sees that wildcard. However, we need to expand the wildcard using the expand function, and tell snakemake in which variable to look for the values.


Helpful guidelines

  • Indentation is important, use two or four spaces for each indentation.

  • Define your target (final output) files in rule all

  • Use unique extensions or directories for each rule to avoid wildcard collisions

Snakemake Additional Features

dry-run, print shell commands and reason for execution

snakemake -n -p -r

execute the workflow with 8 cores

snakemake --cores 8

run the workflow on a SLURM cluster

snakemake --cluster-config cluster.yml --cluster \ 
  "sbatch -A {cluster.account} -t {cluster.time}"

Visualize entire workflow diagram

First we need to install a tool called Graphviz, which provides the dot command:

conda install -c conda-forge graphviz
snakemake --dag | dot -Tpng > dag.png

The resulting dag.png file should look something like the diagram shown above.

Snakemake Report

Snakemake can automatically generate detailed self-contained HTML reports that encompass runtime statistics, provenance information, workflow topology and results.

To create the report, run

snakemake --report report.html

View sample report here

Specifying software required for a rule

You can specify software on a per-rule basis! This is really helpful when you have incompatible software requirements for different rules, want to run on a cluster, or want to make your workflow repeatable.

For example, if you create a file env_fastqc.yml with the following content:

channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - fastqc==0.11.8

and then change the fastqc rule to look like this:

rule fastqc_raw:
    input: "data/{sample}.fastq.gz"
    output: 
        "fastqc_raw/{sample}_fastqc.html",
        "fastqc_raw/{sample}_fastqc.zip"
    conda:
    	"env_fastqc.yml"
    shell:'''
    fastqc -o fastqc_raw {input}
    '''

you can now run snakemake like so,

snakemake --use-conda

and for that rule, snakemake will install the specified software and its dependencies in their own environment, at the specified version.

This aids in reproducibility, in addition to the practical advantages of isolating software installs from each other.

Resources

Adding onto our snakemake workflow

In our previous lesson, we used snakemake to automate the first two rules of our quality control/quality analysis. Given the things you learned during that lesson, we will now automate the rest of the quality control workflow using snakemake. For this lesson, we would like you to work with others in your room to make the snakefile. We have provided the basic structure for each of the rules, and would like you to fill in the remaining details necessary for the full workflow to run on all samples.

We’ve added in the rules we created from our previous lesson as well.

Let’s make sure we’re in a good working directory and we have the necessary software installed. If you’ve already run the installation command on your current instance, you don’t need to run it again.

cd ~
conda install -y fastqc multiqc trimmomatic
SAMPLES=['ERR458493', 'ERR458494', 'ERR458495', 'ERR458500', 'ERR458501', 
'ERR458502']

rule all:
    input:
        "" # add the path to the final file 

rule fastqc_raw:
    input: "data/{sample}.fastq.gz"
    output: 
        "fastqc_raw/{sample}_fastqc.html",
        "fastqc_raw/{sample}_fastqc.zip"
    shell:'''
    fastqc -o fastqc_raw {input}
    '''

rule multiqc_raw:
    input: expand("fastqc_raw/{sample}_fastqc.html", sample = SAMPLES)
    output: "fastqc_raw/multiqc_report.html"
    shell:'''
    multiqc -o fastqc_raw fastqc_raw
    '''

# Add the trimmomatic commands from our trimming lesson. 
rule trim:
    input:
        reads = "",
        adapters = ""
    output: "" # the output should be the trimmed file
    shell: '''
    trimmomatic SE 
    '''

# Use the commands above as a template to fill in these rules, this time
# running the analyses on the trimmed reads.
rule fastqc_trimmed:

rule multiqc_trimmed:

When complete, this snakemake workflow automates all of the bash steps of our quality control workflow!

You can generate a dag of your workflow using the following command:

snakemake --dag | dot -Tpng > dag.png

Remember, “dag” stands for Directed Acyclic Graph. It shows the steps of your workflow that are executed by Snakemake. This dag can be helpful to communicate your workflow to others, or to visualize the decisions snakemake is making as you are constructing your workflow.

You can view it by using the “Files” tab in the RStudio pane and opening the dag.png file you created.

De novo genome assembly

Here we’re going to run through some of the typical steps for taking a newly sequenced bacterial isolate genome from raw fastq files through to some assemblies and discussing how we might choose which assembly to move forward with 🙂

Before we get started, a public service announcement:

ATTENTION!

This is not an authoritative or exhaustive workflow for working with a newly sequenced genome. No such thing exists! All genomes, datasets, and goals are different, and new tools are constantly being developed. The point of this page is just to give examples of some of the things we can do. Don’t let anything here, or anywhere, constrain your science to doing only what others have done!

Now that that’s out of the way, let’s get to it!

Accessing our JetStream instances

We should still have our jetstream instances running. You can follow the instructions here to log in to the JetStream website and find your instance and IP address. Then ssh into it from your laptop following the instructions here, so we are working in a terminal on our JetStream instance.

Setting up our working environment

Throughout this process we’ll be using a variety of tools. They are going to be installed using conda with the specific versions used at the time this was put together.

First, we will create an environment, activate it, and then install all the needed programs:

conda create -y -n de_novo_example
conda activate de_novo_example

conda install -y -c bioconda -c conda-forge fastqc=0.11.5 \
              trimmomatic=0.36 spades=3.11.1 megahit=1.1.1 \
              quast=5.0.2 bowtie2=2.2.5 anvio=5.5.0 \
              centrifuge=1.0.4

CODE BREAKDOWN

  • conda create - creating our conda environment

    • -y - say yes to all questions automatically (just do it and leave me alone)

    • -n de_novo_example - names the environment we are creating “de_novo_example”

  • conda activate de_novo_example - activates our new environment (puts us in it)

  • conda install - installs the programs we need in our current conda environment

    • -y - same as above, just do it, no questions

    • -c bioconda and -c conda-forge - specifies to search these channels for the following programs

    • fastqc=0.11.5 and rest of positional arguments - names the program wanted, followed by an equals sign preceding the version we want (the version used when this was put together)

The data

The practice data we’re going to use here was provided by colleagues at the J. Craig Venter Institute. In working out the details for a rather large-scale project, which in part involves sequencing a bunch of Burkholderia isolates from the International Space Station and performing de novo genome assemblies, Aubrie O’Rourke and her team put an already sequenced isolate – Burkholderia cepacia (ATCC 25416) – through their pipeline in order to test things out and to have something to benchmark their expectations against. Here we are going to be assembling that genome. The sequencing was done on Illumina’s Nextseq platform as paired-end 2x150 bps, with about a 350-bp insert size.

The data files are rather large for this example, at 1.7 GB compressed, and 2.6 GB uncompressed. This is because the download comes with starting and output files, so that we can skip some of the more time-consuming steps and just copy over results.

Copying and pasting this code block will get us the data:

cd ~
curl -L https://ndownloader.figshare.com/files/16197626 -o genomics_de_novo_temp.tar.gz
tar -xzvf genomics_de_novo_temp.tar.gz
rm genomics_de_novo_temp.tar.gz

cd genomics_de_novo_temp/

CODE BREAKDOWN

  • cd ~ - this moves us into our home directory in case we aren’t already there, so we are all starting in the same place

  • curl - this lets us download the data

    • -L - this specifies to follow “soft links” if there are any (these are like redirects when we click to download a file and are sent somewhere else first)

    • https:... - this is the link to download the data

  • -o - lets us specify the name of the output file, which we are naming “genomics_de_novo_temp.tar.gz”

  • tar - program that unpacks “tarballs” (.tar) files, and can also optionally decompress them if they are gzipped (.gz)

  • -x - extract files from a .tar

    • -z - decompress if gzipped (.gz)

    • -v - print verbose output

  • -f genomics_de_novo_temp.tar.gz - specifies the file we want to act on

Then we delete the downloaded file with rm and change into our new directory with the last cd command

There are two subdirectories here, we are going to change into the “working/” directory:

cd working/

Quality trimming/filtering

Assessing the quality of our sequence data and filtering appropriately should pretty much always be one of the first things we do with our data. A great tool we’ve already used for this is FastQC. Here we’ll start with that on the raw reads.

FastQC

Remember that FastQC scans the fastq files we give it to generate a broad overview of some summary statistics, and it has several screening modules that test for commonly occurring problems. But as we talked about before, its modules expect random sequence data, and any warning or failure notices the program generates should be interpreted within the context of our experiment (e.g. high duplication levels are to be expected with RNAseq or amplicon data, but FastQC would still flag them as something possibly wrong). If a module “fails” or throws a warning, it’s a good idea to look into what might be expected to cause that by looking through their manual here. Here we’ll provide our forward and reverse reads as 2 positional arguments:

fastqc B_cepacia_raw_R1.fastq.gz B_cepacia_raw_R2.fastq.gz -t 6

Looking at our output from the forward reads (B_cepacia_raw_R1_fastqc.html), not too much stands out other than the quality scores are pretty mediocre from about 2/3 of the way through the read on:

FASTQC LEGEND

The read length is stretched across the x-axis, quality scores on the y-axis; the blue line is the mean quality score of all reads at the corresponding positions; red line is the median; the yellow boxplots represent the interquartile range; and the whiskers represent the 10th and 90th percentiles.

The reverse reads look pretty similar:


Trimmomatic

Also as we’ve seen, Trimmomatic is a pretty flexible tool that enables us to trim up our sequences based on several quality thresholds. The summary from FastQC didn’t look all that terrible other than semi-low quality scores towards the end of each read, so for a first pass let’s run Trimmomatic with pretty generic, but stringent settings (particularly the minimum length being our full read size), and see what we get (takes < 1 min.):

trimmomatic PE B_cepacia_raw_R1.fastq.gz B_cepacia_raw_R2.fastq.gz \
            BCep_R1_paired.fastq.gz BCep_R1_unpaired.fastq.gz \
            BCep_R2_paired.fastq.gz BCep_R2_unpaired.fastq.gz \
            LEADING:20 TRAILING:20 SLIDINGWINDOW:5:20 MINLEN:151 \
            -threads 6

CODE BREAKDOWN

  • PE - specifies our data are paired-end

  • B_cepacia_raw_R1.fastq.gz and B_cepacia_raw_R2.fastq.gz - first and second positional arguments are our input forward and reverse reads

  • BCep_R1_paired.fastq.gz and BCep_R1_unpaired.fastq.gz - third and fourth positional arguments are our output files for our forward reads: the third is for those that remain paired (i.e. their partner in the reverse files also passed the filtering thresholds); the fourth is for those that passed filtering but their partner did not 🥺 (gah, that emoji really tugs at the heart strings!)

  • BCep_R2_paired.fastq.gz and BCep_R2_unpaired.fastq.gz - fifth and sixth positional arguments are the same as 3rd and 4th above, but now for the reverse reads

  • LEADING:20 - this states, starting from the left, if a base has a quality score less than 20, cut it off (will stop at first base above 20)

  • TRAILING:20 - same as above, except working from the end of the read in

  • SLIDINGWINDOW:5:20 - starting at base 1, look at a window of 5 bps, and if the average quality score of those 5 is below 20, cut the read at the first position of that window and only keep up to that point

  • MINLEN:151 - after performing all steps above, if the read is now shorter than 151, throw it away. For our purposes this means that if any trimming happened, the read will be removed. (Note that not all programs apply their options in the order we provide them at the command line, but Trimmomatic does work that way.)

  • -threads 6 - specifies how many threads will be used, here 6 (“threads” are complicated)

The output printed to our terminals from Trimmomatic states that only about 14% of the read pairs (both forward and reverse from the same fragment) passed, leaving us with ~350,000 read pairs. That sounds low, but since we know what we’re working with here (meaning it’s an isolate of a known genus and not a metagenome or something completely unknown), we can pretty quickly estimate if this could even possibly be enough coverage for us to assemble the genome we’re expecting.

Assuming those reads were perfect quality and perfectly evenly distributed (which they’re not), that would be 350,000 read pairs * 300 bps per paired read = 105 Mbps covered. Most Burkholderia are around 8.5 Mbps, meaning we’d roughly expect around 12X coverage right now, if all were perfect (105,000,000 / 8,500,000). This confirms that this is a little low and we should probably adjust our stringency on filtering – there aren’t solid rules on a minimum level of coverage for all situations, but a general target to shoot for might be around >50X coverage for de novo assembly of a typical prokaryotic genome.
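If you’d like to double-check that back-of-the-envelope arithmetic yourself, here is one way to do it right in the shell (purely an optional aside, using awk as a calculator):

# ~350,000 surviving read pairs * 300 bps per pair, over an ~8.5 Mbp genome
awk 'BEGIN { print (350000 * 300) / 8500000 }'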

So, let’s go back and alter how we are filtering a bit.

Since the worst part of the reads quality-wise is at the end, and this is likely getting chopped off in a lot of reads during the sliding window scan, when we had the highly stringent MINLEN:151 above, those would all be completely thrown out. So let’s try allowing a shorter minimum length of 140 in order to retain some of those reads (takes < 1 min.):

trimmomatic PE B_cepacia_raw_R1.fastq.gz B_cepacia_raw_R2.fastq.gz \
            BCep_R1_paired.fastq.gz BCep_R1_unpaired.fastq.gz \
            BCep_R2_paired.fastq.gz BCep_R2_unpaired.fastq.gz \
            LEADING:20 TRAILING:20 SLIDINGWINDOW:5:20 \
            MINLEN:140 -threads 6

These settings allowed ~49% of paired reads through (~1.2 million read pairs), which by the same quick estimation we did above suggests we have around an estimated 40X coverage (1,200,000 * 280 bps per read pair ≈ 336 Mbps covered; 336,000,000 / 8,500,000 ≈ 40X coverage). While “throwing away” half the data here may still feel a bit uncomfortable, it’s always important to think about what we’re doing with it when making these decisions. Here, we are trying to assemble a good genome. If we can successfully do that with less “good data”, that’s better than doing it with more “bad data” – especially when bad sequence data could severely inhibit our assembly efforts.

And just for a peek at the FastQC output after our trimming:

fastqc BCep_R1_paired.fastq.gz BCep_R2_paired.fastq.gz -t 6

These are the forward reads; the reverse looked basically the same. Things still don’t look perfect, but they look much cleaner than before – now our interquartile boxes (yellow) are sitting much more snugly up top, telling us the distribution of qualities across the ends of the reads is much better.

Read-error correction

Even though we’ve done our best to quality trim and filter our reads, there can still be errors in there that may hurt our assembly attempts. Conceptually, read-error correction tools try to find bases that have a lot of coverage where the base is the same, but maybe in a tiny fraction of reads that base is different, and they will change that base to be what the others are. This paper by Heydari et al. goes a bit into trying to systematically evaluate Illumina read-error correction tools, which might be a good place to start if you’d like to look into this more. In our experience, employing a read-error correction step seems to consistently improve assembly, but as with most of these things, that isn’t necessarily always the case – which is discussed a bit in the Heydari et al. paper. Here we’ll be trying 4 different assemblies, all with error-corrected reads, but feel free to experiment with that too 🙂

The read-error correction we are going to use here is employed by the SPAdes assembler, and is called BayesHammer (initial publication here if you’d like to look into it more sometime). So even though we want to try different assemblies, we are still going to run the error-correction step within SPAdes first, and take the error-corrected reads from that to be used with whatever assemblers we try. If we run the SPAdes program with the --only-error-correction flag specified, it will stop after that point and not go into assembly.

NOTICE

This was the most computationally intensive step, taking about 45 minutes with the resources specified in the following command (which is more than are available on our instances). So for the sake of this tutorial, instead of running this command (which is commented out below), we are just going to copy over the result files in the next code block:

   ## DON'T RUN THIS CODE BLOCK; WE WILL COPY OVER THE RESULTS ##

# spades.py -1 BCep_R1_paired.fastq.gz -2 BCep_R2_paired.fastq.gz \
#           -o spades_error_corrected_reads -t 50 -m 500 \
#           --only-error-correction

CODE BREAKDOWN

  • -1 and -2 - specify the input forward and reverse reads

  • -o - specifies the output directory where the results will be stored

  • -t - specifies how many threads will be used (“threads” are complicated)

  • -m - specifies the maximum amount of memory to possibly be used in gigabytes; program would quit if it needed more than that (our instances have 16 GB available to them)

  • --only-error-correction - telling SPAdes to only run the error-correction step, and not perform an assembly

And here we are going to copy over the result files:

cp ../downloaded_results/BCep_R?_err_corr.fq.gz .

Assembly

Now that we have our reads quality filtered and we have run an error-correction step, we’re ready to move on to assembling them! There are lots of assembly programs out there, and once again, there is no one-size-fits-all. The reason each new assembly paper shows it doing better than all the others is not because everyone is lying about things, it’s just that the data have a lot to say about which assembler is going to work the “best” – and what’s “best” is not really a straightforward criterion to shoot for either.

Two of the most commonly used prokaryotic assemblers today are SPAdes and MEGAHIT. SPAdes uses much more memory than MEGAHIT, so it is often more suitable for working with one or a few genomes (like from an isolate or enrichment culture). But if working with high-diversity metagenomic samples, sometimes the memory requirements for SPAdes get too high, and MEGAHIT (which uses much less memory) can handle things. This is not to say SPAdes is always better when both can be run, it’s not necessarily.

When working with a new dataset, it’s not uncommon to generate a few assemblies testing different programs with different parameters and then compare the results to try to feel a little confident we are doing the best that can currently be done with the data.

Here we’ll run a couple with SPAdes and a couple with MEGAHIT.

NOTICE

We’re going to be comparing a total of 4 different assembly variations here, with each taking about 5-10 minutes to run. So for the sake of time while doing this tutorial, we will only run the first one together, and for the rest we’ll copy over the results like we did with the error-correction step above.

SPAdes

As mentioned above, SPAdes tends to do well with isolate or enrichment cultures that aren’t too diverse. Since this is an isolate culture that was sequenced, it should run just fine. We’re going to try one SPAdes run with default settings and then one with an additional option set, for a total of 2 assemblies from SPAdes.

SPAdes with default settings

Let’s first run a SPAdes assembly with default settings. Note that we are providing the --only-assembler flag because we already ran the read-error correction step above, and we are using the outputs from that here as the input (takes < 5 min.):

spades.py -1 BCep_R1_err_corr.fq.gz -2 BCep_R2_err_corr.fq.gz \
          -o spades-default-assembly/ -t 6 --only-assembler

While that’s running, let’s talk about some of the others 🙂

SPAdes in careful mode

In the SPAdes documentation, they have this note suggesting specific assembly settings when using 2x150 paired-end Illumina data as we are here. Most of the suggestions there are defaults in the version we are using, and were run with our command above, but one (running things in --careful mode) was not. This tries to find and fix mismatches after the initial assembly is finished. Here’s how we would run one with careful mode on (though again we’re skipping it for time and will copy over the results once our other assembly finishes):

   ## DON'T RUN THIS CODE BLOCK; WE WILL COPY OVER THE RESULTS ##
   
# spades.py -1 BCep_R1_err_corr.fq.gz -2 BCep_R2_err_corr.fq.gz \
#           -o spades-careful-assembly -t 6 --only-assembler \
#           --careful

MEGAHIT

Next we’ll run some assemblies with MEGAHIT.

MEGAHIT with default settings

Here’s how we could run one with default MEGAHIT settings:

   ## DON'T RUN THIS CODE BLOCK; WE WILL COPY OVER THE RESULTS ##

# megahit -1 BCep_R1_err_corr.fq.gz -2 BCep_R2_err_corr.fq.gz \
#         -o megahit-default-assembly -t 6

MEGAHIT adjusted min-count parameter

The MEGAHIT documentation has an assembly-tips page that notes here that the default settings are largely tuned for metagenomic assembly, and that for a generic assembly (like our case here with an isolate) when there is greater than 40X coverage it is suggested to set the --min-count parameter (which deals with the frequency of kmers and filtering out reads with unique kmers) to 3. The coverage we estimated was around 40X, so let’s also try a run with that parameter set:

   ## DON'T RUN THIS CODE BLOCK; WE WILL COPY OVER THE RESULTS ##

# megahit -1 BCep_R1_err_corr.fq.gz -2 BCep_R2_err_corr.fq.gz \
#         -o megahit-min-count-3-assembly/ -t 6 --min-count 3

Once our first assembly finishes, let’s copy over the results for the other 3 with this code block:

cp -r ../downloaded_results/assemblies/spades-careful-assembly/ spades-careful-assembly/

cp -r ../downloaded_results/assemblies/megahit-default-assembly/ megahit-default-assembly/

cp -r ../downloaded_results/assemblies/megahit-min-count-3-assembly/ megahit-min-count-3-assembly/

Now that we have a handful of assemblies, let’s see how they compare 🙂

Comparing assemblies

Let’s just get this out there right off the bat: there is no individual metric that exists to determine whether we have a good assembly or not, especially if we have no reference, and especially especially if we’re working with a metagenomic assembly.

There are, however, some general things we can look at, like N50 (half of the total assembly size can be found in contigs >= this size) or largest contig, or fraction of reads that successfully recruit to your assembly, or how many genes we can identify, and more. But: 1) these don’t really have any context to tell us if they’re “good” or not unless we’re comparing multiple assemblies of the same data; and 2) it’s possible to have an assembly with “worse” overall summary statistics compared to some other assembly, but that may actually enable us to do whatever it is we want to do better than the assembly with more appealing summary metrics. So, as usual, we have to keep in mind what our goals are, and know that picking the “best” assembly we can get out of our data is not necessarily a trivial or straightforward task. Having a reference genome like we do in this case makes things a lot easier as we’ll see next, and then we’ll talk a little bit about what we could do if we didn’t have a reference 🙂

QUAST

QUAST is a really nice tool for comparing multiple assemblies, and for metagenome assemblies there is a comparable MetaQUAST. We can provide QUAST with all of our assemblies, a fasta file of our reference genome, and a .gff (general feature format) file of our reference genome that contains information about its genes. Then QUAST will compare our assemblies to the reference, and also generate several reference-independent summary statistics. Our two reference files for our Bulkholderia cepacia ATCC 25416 came from NCBI here. And here’s how we can run QUAST (takes < 1 min.):

quast -o quast-B-cep-out -r reference_genome/BCep_ref.fna \
      -g reference_genome/BCep_ref.gff -t 6 -m 1000 \
      -l "SPAdes-default, SPAdes-careful, MEGAHIT-default, MEGAHIT-min-count-3" \
      spades-default-assembly/contigs.fasta \
      spades-careful-assembly/contigs.fasta \
      megahit-default-assembly/final.contigs.fa \
      megahit-min-count-3-assembly/final.contigs.fa

CODE BREAKDOWN

Remember that the backslashes (\) don’t have anything to do with the actual command; they are just there so this is not one long line and we can copy and paste it (otherwise the “enter” character that comes after them would try to execute the command on each line when pasted into our terminal).

  • -o quast-B-cep-out - setting the output directory

  • -r reference_genome/BCep_ref.fna - providing a fasta file of our reference genome

  • -g reference_genome/BCep_ref.gff - providing the gene coordinates of our reference (this is so QUAST can see if it finds them in our assemblies)

  • -l "..." - within these quotes are labels for each of the input assemblies (so the output has something cleaner than file names), these need to be provided here in the same order in which their input fasta files are entered next as positional arguments

  • the end is 4 positional arguments for our 4 input assemblies

There is more information about QUAST in its documentation here. The output directory contains lots of information, but it is also nicely summarized in an html file called “report.html” (we can access this from here for now). Here’s a portion of it:


The columns here hold information about each of our 4 assemblies, and each row presents a different metric. The majority of the rows starting from the top are in relation to the reference we provided, then the last few starting with “# contigs” are reference-independent. In the interactive html page, we can highlight the row names to get some help on what they mean, and there is more info in the QUAST manual. The cells are shaded across the assemblies for each row from red to blue, indicating “worst” to “best”, but this is only a loose guide to help your eye, as differences can be negligible or up to interpretation, and some rows are more important than others.

It seems all of them did pretty well. They each reconstructed over 98% of the reference genome, which is pretty stellar, but none of them got 100%. This is common when using solely short Illumina reads because they aren’t able to assemble repetitive regions that extend longer than the paired-read fragment length. We can also see that our assemblies aligned across about 8,180 genomic features (genes, transcripts, tRNAs, and such) out of the 8,418 that are annotated in the reference genome. And looking at mismatches per 100 kbp we can see we’re down around 2-4 mismatches per 100 kbp.

QUICK QUESTION!

What might be causing the 2-4 mismatches per 100 kbps?

Solution

This could be a mix of sequencing and/or assembly error, but it is also very likely some actual biological variation too. There is no reason to expect the genome to be exactly identical to the reference even after just 1 culturing transfer.

Moving down to the reference-independent section in the table we can see which assemblies cover the most bps with the fewest contigs. This doesn’t mean everything either, but fewer contigs covering about the same amount of bps does provide more information on synteny, which can be very helpful for functional annotation.

Read recruitment

Another metric to assess the quality of an assembly is to see how well the reads that went into the assembly recruit back to it. It’s a way of checking how much of the data that went in actually ended up getting used. This tends to be more informative with mixed-community metagenomes – e.g. if one assembly enables recruitment of 85% of our input reads, and another enables only 10% of the input reads to recruit, it is safe to say the first one does a better job of representing what was recovered with our metagenomic sequencing. But in this case – with an isolate genome, where the few assemblies tested performed pretty similarly – it turns out that they all recruited reads just about as efficiently (> 99% with the default settings of bowtie2 v2.2.5, at least).
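
If we wanted to generate that number for one of our assemblies ourselves, it might look something like the following sketch (this isn’t one of the steps of this lesson; the index name is arbitrary, and the error-corrected read files are the same ones we map with bowtie2 in the next lesson). The value we’re talking about is the “overall alignment rate” that bowtie2 prints to the screen when it finishes:

  # building an index from one of our assemblies (sketch only)
bowtie2-build spades-careful-assembly/contigs.fasta recruitment-check.btindex

  # mapping the input reads back, discarding the alignments themselves and
  # keeping only the mapping summary that bowtie2 writes to stderr
bowtie2 -q -x recruitment-check.btindex \
        -1 BCep_R1_err_corr.fq.gz -2 BCep_R2_err_corr.fq.gz \
        -p 6 -S /dev/null 2> recruitment-stats.txt

grep "overall alignment rate" recruitment-stats.txt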


As far as selecting which of our assemblies to move forward with: since they all performed reasonably well, and mostly for the reason of maximizing insight into synteny as mentioned above, based on the QUAST output we might choose the SPAdes assembly done with the --careful setting, as that gave us the most genes on the fewest contigs.

What about when we don’t have a reference?

It’s pretty nice here because we have a known reference genome. When that’s not the case, it’s much harder to be confident we are making the best decisions we can. If we know something about the organism, even if we don’t have an ideal reference, we might have a rough idea of what size genome to expect. We can also use the detected copy numbers of genes known to most often be present in single copy in its nearest relatives to get a rough estimate of percent completion (and percent redundancy, which gives us one window into whether there may be contamination).

Also, with no reference, we might go further down the path of processing and analysis with multiple assemblies. Some might be better than others depending on our particular purpose, and that might not always be apparent from overall summary statistics. For instance, if recovering representative genomes from a metagenome is the goal, we might want to start that process and see which assembly is better suited for it. As mentioned above, in some cases it is possible to better recover representative genomes from an assembly that has seemingly worse overall summary statistics, and if we just chose based on the assembly-level summary statistics, we wouldn’t find that out. Or, if our question is more about the functional potential of the whole community, or maybe looking for specific genes in a metagenomic assembly, then seeing which assembly yields more predicted open-reading frames, or better annotation results, might help us decide (see the short sketch below). The bottom line is it’s difficult, and there is no one answer, but we do the best we can 🙂
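
As one example of that last point, a rough way to compare how many open-reading frames get predicted from two assemblies might look like the following sketch (assuming prodigal is installed, e.g. via conda; the output file names here are just for illustration, and we aren’t running this as part of the lesson):

  # predicting ORFs from two of our assemblies ("-p meta" would be added for a metagenome assembly)
prodigal -i spades-careful-assembly/contigs.fasta -a spades-careful-genes.faa -q > /dev/null
prodigal -i megahit-default-assembly/final.contigs.fa -a megahit-default-genes.faa -q > /dev/null

  # counting how many protein sequences were predicted from each
grep -c ">" spades-careful-genes.faa megahit-default-genes.faa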

Epilogue

That’s as far as we can go for now. Some more things to do to check the quality of this assembly and begin exploring it – along with a mention of some common tasks after we’re happy with our new genome – are presented in this lesson 🙂

De novo genome exploration

Here we’re going to take off from a de novo prokaryotic genome assembly performed in this lesson. That involved logging into our JetStream instances as noted here and setting up a specific conda environment as noted here. If you didn’t do that lesson, be sure to set up that conda environment as shown there. If you did, then be sure to activate that conda environment with the following after logging in:

conda activate de_novo_example

The data

The data used here is described here. If you did not do the previous lesson, the following code block will download the data and copy over the required assembly; otherwise, you will already have all of this:

cd ~
curl -L https://ndownloader.figshare.com/files/16197626 -o genomics_de_novo_temp.tar.gz
tar -xzvf genomics_de_novo_temp.tar.gz
rm genomics_de_novo_temp.tar.gz

cp -r ~/genomics_de_novo_temp/downloaded_results/assemblies/spades-careful-assembly/ ~/genomics_de_novo_temp/working/spades-careful-assembly/

Now let’s change directories into our working directory:

cd ~/genomics_de_novo_temp/working/

Exploring our assembly with anvi’o

Now that we’ve selected an assembly we’re going to move forward with, we can start to take a closer look at it. To do that we’re going to use a wonderful platform called anvi’o - which stands for analysis and visualization of omics data.

anvi’o is very expansive, but to try to summarize it in a sentence, it is a powerful and user-friendly data visualization and exploration platform. There is a ton of excellent documentation at the anvi’o site, including workflows, tutorials, and blog posts.

Here we’re going to put our isolate-genome assembly into the anvi’o framework and see just a few of the ways it can help us begin to look at our assembled genome - including a visual way of checking for possible contamination (which in this case would be anything in there that’s not our target Burkholderia cepacia). Many of the major steps we’re going to be performing here are laid out in the metagenomic workflow presented here, as much of the processing is the same even though we are not working with a metagenome here.

Putting our assembly into anvi’o and generating some information about it

For us to get our assembly into anvi’o, first we need to generate what it calls a contigs database. This contains the contigs from our assembly and some information intrinsic to them (like GC content, tetranucleotide frequency, and open-reading frame coordinates). The following command will organize our contigs in an anvi’o-friendly way, generate some basic stats about them, and use the program Prodigal to identify open-reading frames (takes < 2 min.):

anvi-gen-contigs-database -f spades-careful-assembly/contigs.fasta \
                          -o contigs.db -n B_cepacia_assembly

Now that we have our contigs.db that holds our sequences and some basic information about them, we can start adding more information. This is one of the places where the flexibility comes into play, but for now we’ll just move forward with some parts of a general anvi’o workflow, including:

  • using the program HMMER with profile hidden Markov models (HMMs) to scan for bacterial single-copy genes (from Campbell et al. 2013)

    • if new to HMMs, see the bottom of page 7 here for a good explanation of what exactly a “hidden Markov model” is in the realm of sequence data

    • to summarize, instead of treating all columns of an alignment equally (like something like BLAST does), HMMs can weight individual columns in an alignment differently. So the highly-conserved stretches within, say, a conserved domain of a protein will matter more than a stretch in between two domains that may vary more.

  • using NCBI COGs to functionally annotate the open-reading frames Prodigal predicted

  • using a tool called centrifuge for taxonomic classification of the identified open-reading frames

  • generate coverage information for our assembly so we can visualize it

The following block of code would take ~45 minutes to run, mostly because it needs to download and set up some files for their first use. So we are going to skip this code block, copy over the results, and pick up with mapping to generate the coverage information.

   ## DON'T RUN THIS CODE BLOCK; WE WILL COPY OVER THE RESULTS ##

  # HMM searching for bacterial single-copy genes
# anvi-run-hmms -I Campbell_et_al -c contigs.db -T 6

  # functional annotation with DIAMOND against NCBI's COGs
# anvi-setup-ncbi-cogs -T 6 # only needed the first time
# anvi-run-ncbi-cogs -c contigs.db --num-threads 6

  # exporting Prodigal-identified open-reading frames from anvi'o
# anvi-get-sequences-for-gene-calls -c contigs.db -o gene_calls.fa

  # setting up centrifuge for taxonomy (only needed the first time)
# wget ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data/p_compressed+h+v.tar.gz
# tar -xzvf p_compressed+h+v.tar.gz && rm -rf p_compressed+h+v.tar.gz

  # running centrifuge taxonomy (this taxonomically classifies each identified coding sequence)
# centrifuge -f -x p_compressed+h+v gene_calls.fa -S centrifuge_hits.tsv -p 6

  # importing the taxonomy results into our anvi'o contigs database
# anvi-import-taxonomy-for-genes -p centrifuge -c contigs.db \
#                                -i centrifuge_report.tsv \
#                                centrifuge_hits.tsv

Let’s copy over the new “contigs.db” that has the information stored from running those commands:

cp ../downloaded_results/anvio/contigs.db .

The last thing we want to add right now is mapping information from recruiting our reads to the assembly we built with them. Here we generate that mapping information with bowtie2 (another aligner, like the bwa we used previously) and prepare the output file for anvi’o (altogether takes < 5 min.):

  # building bowtie index from our selected assembly fasta file
bowtie2-build spades-careful-assembly/contigs.fasta \
              spades-careful-assembly.btindex

  # mapping our reads (takes ~1 min.)
bowtie2 -q -x spades-careful-assembly.btindex \
        -1 BCep_R1_err_corr.fq.gz -2 BCep_R2_err_corr.fq.gz \
        -p 6 -S spades-careful-assembly.sam

  # converting to a bam file (takes < 1 min.)
samtools view -bS spades-careful-assembly.sam > B-cep-assembly.bam

  # sorting and indexing our bam file (takes < 1 min.)
anvi-init-bam B-cep-assembly.bam -o B-cep.bam

We can now integrate this mapping information into anvi’o with the anvi-profile program, which generates another type of database anvi’o calls a “profile database” that holds extrinsic information about our contigs. For our purposes right now, this is mostly about storing coverage information for each contig. We’ll see what that looks like and why it’s helpful soon!

  # integrating the read-mapping info into anvi'o (takes ~2 min.)
anvi-profile -i B-cep.bam -c contigs.db -M 1000 -T 6 --cluster-contigs -o B-cep-profiled/

Pulling out some sequences and summarizing our assembly

Ok, great, so we’ve just generated and put quite a bit of information about our assembly into our contigs and profile databases. And this being anvi’o, it’s now very easy to access specifics of that information when we want it (you can see all of the available programs by typing anvi- and hitting tab twice).

For example, one of the commands we just ran, anvi-run-hmms, searched for bacterial single-copy genes. We can ask anvi’o which genes were identified using anvi-get-sequences-for-hmm-hits (adding the flag -h will tell us all about how to use the program). Let’s start by adding the -l flag to see which HMM sets we have:

anvi-get-sequences-for-hmm-hits -c contigs.db -l

Which shows us “Campbell_et_al” which we specified above. And adding the -L flag will list the names of the genes in that set:

anvi-get-sequences-for-hmm-hits -c contigs.db --hmm-sources Campbell_et_al -L

And say we want to pull out the “RecA” gene, which is a commonly targeted bacterial single-copy gene involved in DNA repair:

anvi-get-sequences-for-hmm-hits -c contigs.db --hmm-sources Campbell_et_al \
                                --gene-names RecA \
                                --get-aa-sequences \
                                --no-wrap -o RecA.faa

If we print out what’s in that file with cat:

cat RecA.faa

If we copy it and paste it into NCBI’s protein BLAST here and run it, we can see that the top hit is 100% identical across the entire query to B. cepacia, and all of the top hits are to the same species (which is a nice little sanity check). Here are just a few from the top:


We can also generate some summary statistics on our assembly, including estimated percent completion and redundancy based on the presence/absence of the single-copy marker genes we scanned for above. Here are the two steps needed to summarize our isolate-genome assembly:

  # this is adding all contigs to a group called "DEFAULT"
anvi-script-add-default-collection -p B-cep-profiled/PROFILE.db
  # and here is our summary command
anvi-summarize -c contigs.db -p B-cep-profiled/PROFILE.db \
               -C DEFAULT -o B-cepacia-assembly-summary/

A lot was generated with that, now found in our new directory, “B-cepacia-assembly-summary/”. If we glance at the “bins_summary.txt” file, we can see some summary statistics of our assembled genome, including the completion/redundancy estimates (column -t will keep the columns lined up for us):

column -t B-cepacia-assembly-summary/bins_summary.txt
## output from above command ##
# bins        taxon         total_length  num_contigs  N50     GC_content         percent_completion  percent_redundancy
# EVERYTHING  Burkholderia  8472061       92           194480  66.44548680306582  99.28057553956835   1.4388489208633093

Which shows us, in the second column, that the majority of taxonomy calls by Centrifuge were actually for the right genus – that’s always nice. Also, based on the Campbell et al. 2013 bacterial single-copy marker genes, our assembly is estimated to be ~99.3% complete with ~1.4% redundancy. This approach doesn’t actually add up to 100% completion and 0% redundancy in many, if any, organisms; for comparison’s sake, running the ATCC 25416 reference genome through the same pipeline gives the same results based on this single-copy gene set. Though we are still ~130 kbps short, as we saw with the QUAST output above.

But great, things look pretty much as they should so far. We can also visually inspect our assembly, and how the reads that went into it recruit to it. In theory, if all the DNA in the assembly came from the same organism (i.e. it’s a clean assembly), there should be pretty even coverage across the whole thing. And if there is contamination in there (something other than our target B. cepacia), it will likely have very different coverage. So let’s finally take a look with anvi-interactive.

Visualizing with anvi-interactive

To be able to visualize our work from the server, we need to ssh in slightly differently than we did earlier. So now, let’s type exit in our terminal to disconnect, press the up arrow to bring up our last login command, and modify it to look like this:

ssh -L 8080:localhost:8080 dibada@XXX.XXX.XXX.XXX

Where the -L 8080:localhost:8080 stays exactly the same as it appears above, but dibada above should be your room name, and the IP address at the end should be yours. You will also need to enter the password again.

Next we need to re-activate our conda environment and change back into our working directory:

conda activate de_novo_example
cd ~/genomics_de_novo_temp/working/

Next, let’s copy-and-paste this command to get the link we will need to view our anvi’o interactive interface:

echo "http://$(hostname -i):8080"

Now let’s start up the anvi’o interactive interface:

anvi-interactive -c contigs.db -p B-cep-profiled/PROFILE.db \
                 --server-only -P 8080

Now if we copy the link that was generated above and paste it into our browser, after the page loads and we click “Draw” in the bottom left corner, it should look something like this:


So there is a lot going on here at first glance, especially if we’re not yet familiar with how anvi’o organizes things. So here’s a quick crash course 🙂

At the center of the figure is a hierarchical clustering of the contigs from our assembly (here clustered based on tetranucleotide frequency and coverage). So each tip (leaf) represents a contig (or a fragment of a contig as each is actually broken down into a max of ~20,000bps, but for right now we’ll just be referring to them as contigs). Then radiating out from the center are layers of information (“Parent”, “Length”, “GC content”, etc.), with each layer displaying information for each contig.

One of the first things that might jump out here is the outer layer (purple in this image, but may be different on yours), labeled “Taxonomy”. There is actually a color for each contig for whatever taxonomy was assigned to the majority of genes in that particular contig. This solid bar all around tells us that the genes in almost the entire assembly were identified as Burkholderia – minus two contigs which were not classified as anything (and have no color there). The next thing that stands out is how stable the mean coverage is across almost all contigs, other than that area near the top in this image (yours may be oriented differently too). That likely holds things found in multiple copies in the genome. For instance, according to IMG, B. cepacia ATCC 25416 has 7 16S copies and 9 23S copies, which would complicate assembly if they aren’t all identical, and would inflate their coverage compared to the parts of the genome that exist in single copy. Overall this is great and shows the culture really seems to have been axenic.

Just for a quick comparison, here is the same type of figure (taxonomy at the inner-part of the circle here), but from an enrichment culture of a cyanobacterium (so “contaminant” genomes in there too), rather than axenic:


Here the highlighted contigs, labeled “Bin 1”, represent the targeted cultivar from this sequencing run, demonstrating a nice example of how anvi’o can help you manually curate bins you’re trying to derive from assemblies.

While we didn’t need much (any) manual curation in this case, it was still a good idea to visually inspect the coverage of our assembly to make sure nothing weird was going on. And if we wanted, we could further explore those parts with higher coverage to find out which parts of the genome seem to exist in greater than 1 copy.

Epilogue

This and the prior lesson were all basically to get a high-quality draft of our isolate genome that we could feel confident about investigating further. Once we feel comfortable with our assembled genome, we can go a lot of different ways. Going into individual approaches is beyond the scope of this particular page, but here are just a few examples.

Phylogenomics

Pull available reference genomes of close relatives and build a phylogenomic tree to get a robust estimate of where our newly acquired isolate(s) fit in evolutionarily with what is already known.


Distributions

Pull available metagenomes from studies and recruit the reads to a reference library containing our isolate (and its close relatives if it has any) to begin assessing their distributions across samples. This example is done with ocean samples, but the same principle can be applied to any environments.


Pangenomics

Start investigating differences in the genetic complement of our new isolate as compared to its known close relatives. This example figure is combining pangenomics (the core of the figure showing the presence or absence of genes within each genome) with metagenomics (distributions of the genomes across samples in the top right corner) to try to associate genomic variability with ecological delineations:


And so much more! This is where just pure data crunching slows down, and the actual science begins. The above are just some of the ways to get to the point where we can then consider our experimental design and our questions and let them guide where we go next 🙂

Some more practical use of Unix

In many of our lessons we have been working with RNASeq data from Saccharomyces cerevisiae. In the transcriptome assembly lesson, which we didn’t run through this year (yet, at least), we would have assembled these RNA reads into contigs using Trinity, a popular transcriptome assembler. Here we will compare the results of that assembly to the reference RNA sequences from NCBI (specifically the GCF000146045.2_R64_rna.fna.gz file from here) while getting some more practice at the command line!


Objectives

  • Explore some more bash commands in practical usage examples (e.g. grep, sed, cut, comm, cat)

  • Practice stringing together multiple commands with pipes (|) step-by-step

  • Use blast to compare an assembled transcriptome to its reference transcriptome

  • Run a “remote” blast from the command line

Accessing our JetStream instances

You should still have your JetStream instance running; you can follow the instructions here to log in to JetStream and find your instance. Then ssh into it following the instructions here.

Setting up a new conda environment

We will be using blast here, and we need one more package to be able to BLAST remotely (send sequences to their servers from the command line). So let’s make a new environment, install those, and enter our new environment:

conda create -y -n blast_env -c conda-forge -c bioconda blast gnutls
conda activate blast_env

Setting up our working directory

Let’s set up a new working directory and download the files we’ll be working with:

cd ~
mkdir more-unix-fun/
cd more-unix-fun/

curl -L https://ndownloader.figshare.com/files/16200176 -o our-transcriptome-assembly.fa
curl -L https://ndownloader.figshare.com/files/16218335 -o ref-transcripts.fa

Out of curiosity, let’s see how many contigs we assembled vs. how many transcripts there are in the reference we have:

grep -c ">" our-transcriptome-assembly.fa
grep -c ">" ref-transcripts.fa
QUICK QUESTION!

What might be some of the reasons for the large difference between the number of transcripts we assembled and the number of reference transcripts?

Possible Causes
  • not all transcripts may have been expressed at the time of sampling
  • even expressed transcripts at low relative abundance may not have been amplified and sequenced
  • not all transcripts that were expressed and sequenced may have assembled successfully
(This is not an exhaustive list of possibilities.)

Even though we assembled fewer transcripts than the reference holds, let’s use BLAST and the command line to try to find out if all of what we did assemble can actually be found in our reference!

Making a blast database of our reference

Here we are going to make our blast database using our original reference fasta. Then we are going to blast the assembled transcripts against it.

makeblastdb -dbtype nucl -in ref-transcripts.fa -out ref-transcripts.blastdb

CODE BREAKDOWN

  • makeblastdb - this is our command to make a blast database out of our reference fasta file

    • -dbtype nucl - here we are specifying there are nucleotides

    • -in - specifying the input file

    • -out - naming the prefix of the output database files

Blasting our assembly against the reference

Now we’re going to try to align our assembled contigs against the reference. There are a lot of options for blastn that can be seen by running blastn -help, and there’s an explanation of what we’ve done here in the following code breakdown block.

blastn -query our-transcriptome-assembly.fa -db ref-transcripts.blastdb \
       -max_target_seqs 1 -max_hsps 1 \
       -out assembly-to-ref-blastout.tsv \
       -outfmt "6 qseqid qlen sseqid slen length pident evalue bitscore"

CODE BREAKDOWN

  • blastn - this is our base command, to blast nucleotide queries against a nucleotide database (that’s what a blastn is)

    • -query - specifies our input fasta

    • -db - specifies our database we just made

    • -max_target_seqs 1 - sets the maximum number of reference sequences returned to 1

    • -max_hsps 1 - sets maximum highest-scoring pairs returned (since BLAST is a local alignment, you can get more than one alignment returned between one query and one reference)

    • -out - specifies name of output file

    • -outfmt - specifies the output format

Now let’s take a look (remember we can exit less by pressing the q key):

column -t assembly-to-ref-blastout.tsv | less -NS

Code breakdown:

  • column – the column command with the -t flag will attempt to keep columns together based on the tabs. This can be convenient sometimes.

  • less – since there are a lot of lines here, we are piping (|) the output into less

    • the -NS flags we are providing add line numbers and prevent lines from softwrapping at the edge of the terminal.

Let’s see how many of our assembled contigs successfully aligned to the reference fasta.

wc -l assembly-to-ref-blastout.tsv

Remember that wc -l tells us how many lines there are in the file. 3,262 of our contigs successfully aligned to the reference fasta. But we had a total of 3,323 contigs from our assembly, as we saw with grep -c ">" our-transcriptome-assembly.fa, so some didn’t successfully align to the reference.

Here is one way we can do a quick calculation at the command line:

echo "3323-3262" | bc

This tells us we have 61 contigs from our assembly that did not successfully align to the reference. Let’s find out what they are!

What didn’t align?

Here are the steps we’ll take to get the sequences that didn’t align, and then try to find out what they are:

  1. get all the names of the contigs from our assembly

  2. get all the names of the contigs from our assembly that were reported as “hits” in the blast output

  3. compare these to figure out which contigs from our assembly are not in the list of contigs reported as successfully aligning to the reference

  4. use this new list of contig names to pull out their sequences in fasta format

  5. blast the sequences against NCBI’s nr database “remotely” (from the command line, sending our sequences to the NCBI servers)


1. Get all the names of the contigs from our assembly into a file

grep ">" our-transcriptome-assembly.fa | tr -d ">" | cut -f1 -d " " | sort > all-assembly-contig-IDs.txt

Code breakdown:

  • grep - Just like we used grep above to count how many sequences we had by providing the -c flag, here we are leaving off the -c flag so that it will pull out the lines that match. Since fasta format dictates that the only place the “>” character appears is in front of sequence headers, we can pull out all header lines with that.

  • tr - We then “pipe” (|) that output into a command called tr, which is useful for manipulating individual characters. Here we are using it to delete all of the “>” characters in the file (so our names are just names without the “>” in front of them, like they are in the blast output file).

  • cut - We then pipe that into the cut command, which is good for manipulating columns. Here, we’re going to use it to cut the first column (-f1) setting the delimiter to a space (-d " ")

    • this is because blast automatically trims our sequence headers at the first space (keeping just the ID), and we want to match that format

    • you can see this by comparing the output of head assembly-to-ref-blastout.tsv and head our-transcriptome-assembly.fa

  • sort - We use sort here because the command we’re going to use later to compare our two lists of headers needs them to be sorted in the same fashion, and running this on both will ensure that

  • > – We then redirect the output to a new file called “all-assembly-contig-IDs.txt”

2. Get all the names of the contigs from our assembly that were reported as “hits” in the blast output

cut -f1 assembly-to-ref-blastout.tsv | sort > all-assembly-contig-IDs-with-hits.txt

Code breakdown:

  • cut – Here we are using cut to cut the first column (the sequence name) from the blast output file.

    • Note that here we didn’t need to provide the -d flag to set the delimiter. This is because by default cut sets the delimiter to tabs.

  • sort – We then pipe the output from cut into sort

    • as mentioned above, to compare the two lists of names the way we are going to below requires that the lists be sorted in the same fashion, and running sort on both of them ensures that

  • > – We then redirect the output to a new file called “all-assembly-contig-IDs-with-hits.txt”

3. Compare these to figure out which contigs from our assembly are not in the list of contigs reported as successfully aligning to the reference

We can use the comm command (compare with an extra “m” for some reason…) to quickly find out which sequences we assembled that didn’t successfully align to the reference transcriptome.

comm -23 all-assembly-contig-IDs.txt all-assembly-contig-IDs-with-hits.txt > all-assembly-contig-IDs-that-did-not-hit-ref.txt

Code breakdown:

  • comm – this command compares the lines in two files and by default returns 3 columns:

    1. The lines unique to file 1 (first file given)

    2. The lines unique to file 2 (second file given)

    3. The lines common to both files

    • Looking at the manual for it with man comm, we can see that you can suppress columns by providing them as arguments

    • So by providing the flags -23 we are suppressing columns 2 and 3; all we want are the lines (contig names) in file 1 that are not in file 2. This gives us all the contig names that we assembled but that did not successfully align to the reference.

  • > – We then redirect the output to a new file called “all-assembly-contig-IDs-that-did-not-hit-ref.txt”

And this should hold the 61 contig names that we’re looking for (sanity check):

wc -l all-assembly-contig-IDs-that-did-not-hit-ref.txt

Stellar 🙂

4. Use this new list of contig names to pull out their sequences in fasta format

Now let’s use grep to pull out the sequences!

NOTE: This requires that the fasta file is in “true” fasta form – meaning that each entry (sequence) takes up two lines. Sometimes, fasta files are formatted with what are known as softwraps. There is a one-liner at the end of this page to convert them if that is the case with your data.

If we wanted to pull one sequence, we could do it like this:

grep -A1 "TRINITY_DN501_c0_g1_i1" our-transcriptome-assembly.fa
for header in $(cat all-assembly-contig-IDs-that-did-not-hit-ref.txt)
do
  grep -A1 "$header" our-transcriptome-assembly.fa
done

Code breakdown: The special words of the loop are for, in, do, and done.

  • for – here we are specifying the variable name we want to use (“header”), this could be anything

  • in – here we are specifying what we are going to loop through, in this case it is every line of the “all-assembly-contig-IDs-that-did-not-hit-ref.txt” file

    • $(cat ...) – this is a special type of notation in shell. The operation within the parentheses here will be performed and the output of that replaces the whole thing. It would be the same as if we typed out each line in the file with a space in between them

  • do – indicates we are about to state the command(s) we want to be performed on each item we are looping through

    • grep – just like with the individual example above, for each sequence name in our file we are pulling out the header line and the following line containing the sequence

  • done – tells the loop it is finished

This just printed to the screen for right now, but let’s grab one and take it to NCBI and do a web blast to see what we get.

>TRINITY_DN939_c0_g1_i1 len=900 path=[0:0-899]
TTCGACTATGATACTAATTCGCTGTGCACCCAGTGGGAGGATGTTATTTCGCTTACATGCCATATGCTGTATATTCTTGCGTTAACGGTTTCCGCGTGCGATCTCTCTTGTGTTCAGACGAGGCCCAATTGAGCACCATCCCCCTCGGGTAGTTTCCCGATCAAACTGGAAGATAGGCGTCTTTACTTACGCGCTCCTCTTGACCGAGACCCCCAATGCGCGATGTATCGAACCTTCACTAACCCTAGAAATTAGTGGTGGGAATCAGCGAAGTTACAATGTGGGGTTGGACCCAGGATGTTAGCCTGCAAGCTATACAATTCTCTTAGATTAGACGAGAACGGAGAATTTAACCCCTGCAGCATTGGAGGTATGGTCTTGGGCATACCCGATACATGCAACGCAGCTCGGGATGTTCATGGTAGCACCTAACTGTATGGCATAGTTATGCAGAAGTGCGCTGCTTAAGAGCGATACCCCATAAAGAACGATTTTGGTGGTATTGCCCAAAGATAATGTCCCACGTTATCATCTGGTCAACGATGAGGTGGGTTGTTTTGTGATTGTTTGAGATGCTGAGTGCTGTTTAATGCGGGACATAAGGAAGGATATTAGTAGGGAGAAACGCTTGATGCCGGAAATATCCTTGCCTGGTTAACTGCTCGAAGTTAATCTGCGACGCTCGCCCTCATTCGGATGCATCGAAGGGCTCCCCTGCAGTTGCAAAGTCTTTGTTCTGCGAACTCGTAAAGTCGTAATGCCGTTGGTGGACCGTGCTTGTTAGGGATATTAAATGTTTCCTGGCCTTTAAAGCTATTGGCACGGCGGTTTAGATGGGACACCCTATCTCGTTTTCTACTTGCGCTTCAAGCGTCCCAACGAAACGAAATTGCGGACCGG

Now let’s write these all to a file instead of printing them to a screen. We’re just adding the > redirector and a file name at the end of the loop to save the output.

for header in $(cat all-assembly-contig-IDs-that-did-not-hit-ref.txt)
do
  grep -A1 "$header" our-transcriptome-assembly.fa
done > contigs-not-in-ref.fa

5. Blast the sequences against NCBI’s nr database “remotely” (from the command line, sending our sequences to the NCBI servers)

We can send our sequences to NCBI through the command line to blast them instead of doing it at their website. But in order to include taxonomic information with the output, we need to have another small blast database that holds that information.

We can download and unpack that database with these commands:

curl -L https://ndownloader.figshare.com/files/16219079 -o taxdb.tar.gz
tar -xzvf taxdb.tar.gz

Now we have the taxonomy database too so we can add taxonomy to our blast output. Let’s blast them!

nohup blastn -query contigs-not-in-ref.fa -remote -db nr \
       -max_target_seqs 1 -max_hsps 1 \
       -out contigs-not-in-ref-blast-to-nr.tsv \
       -outfmt "6 qseqid qlen sseqid slen length pident evalue bitscore ssciname" &

CODE BREAKDOWN

The new things added here different from the previous blast are:

  • nohup – this tells the computer not to stop the process if we disconnect from the server (very handy)

  • -db – here we are specifying “nr”, which is NCBI’s “non-redundant” database (nucleotide in this case, which it knows because we used blastn)

  • -remote – this tells the computer to send the BLAST job to the NCBI servers

  • & - this at the end tells the computer to run the process in the background so we can still work in our terminal. We can check on the job with the command jobs. This will finish when it does and it won’t be interrupted disconnect or sign off before then.

We can press enter to get our regular prompt back.

This may take a few minutes (the amount of traffic the blast servers deal with fluctuates). So we can download a results file as follows and move on while that continues in the background:

curl -L https://ndownloader.figshare.com/files/16219238 -o contigs-not-in-ref-blast-to-nr-DL.tsv

What did we get?

The 9th column contains the taxonomy information, so we can use the cut command to look at just that.

cut -f 9 contigs-not-in-ref-blast-to-nr-DL.tsv | sort | uniq -c

CODE BREAKDOWN

  • cut – this crops out the specified columns, here we’re cutting column 9 (taxonomy column, labeled as “ssciname” by blast)

  • sort – sorting is required in order for the following uniq command to function properly

  • uniq – this removes all duplicates, leaving only single copies of the originals

    • the -c flag provided to uniq also tells us how many of each were found

      1 Bacillus subtilis subsp. subtilis str. NCIB 3610
      2 Cloning vector pBAD-FLP
      1 Methanocaldococcus jannaschii
     33 Saccharomyces cerevisiae
      1 Yeast expression vector pSCUDT2MFE6H
     23 synthetic construct

We see there are quite a few synthetic constructs in here; we might want to remove them and look further into these other sequences if this were a real project. And there are 33 S. cerevisiae sequences that didn’t successfully align to our reference cDNA file.

QUICK QUESTION!

Why might there be S. cerevisiae sequences assembled from our RNA that are not aligning to the reference cDNA file?

Possible Causes

It’s possible some of these might be non-coding RNAs, and therefore aren’t in the cDNA reference. There may be (and probably are) other reasons too.

Isn’t Unix grand 🙂



One-liner that removes softwraps from fasta files:

Download and unzip example file with softwraps:

curl -LO ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_rna.fna.gz
gunzip GCF_000146045.2_R64_rna.fna.gz

Look at the file before removing softwraps:

head GCF_000146045.2_R64_rna.fna

Complicated command to fix this (but easy to save somewhere for when needed, so no need to remember):

awk '!/^>/ { printf "%s", $0; n="\n" } /^>/ { print n $0; n = "" } END { printf "%s", n }' GCF_000146045.2_R64_rna.fna > GCF_000146045.2_R64_rna_no_softwraps.fna

Looking at new file:

head GCF_000146045.2_R64_rna_no_softwraps.fna

Now we’d be able to grep out sequences if we wanted 🙂

Version Control with Github

Learning objectives

  • Learn about version Control

  • Learn about Github repositories

  • Create local repositories

  • Backup your work online using git

Setup

First, you’ll need to sign up for a free account on GitHub.com. It’s as simple as signing up for any other social network. Keep the email you picked handy; we’ll be referencing it again in the lesson.

We will be following this lesson on our Jetstream instances which comes with Git installed. However, if you want to work on your project on your local computer, you need to have Git installed. Instructions to install Git for Windows, Mac or Linux can be found here.

What is Github?

GitHub is a code hosting platform for version control and collaboration. It lets you and others work together on projects from anywhere. GitHub is now the largest online storage space of collaborative works in the world.

What Is Git?

Why use something like Git? Say you and a coworker are both updating pages on the same website. You make your changes, save them, and upload them back to the website. So far, so good. The problem comes when your coworker is working on the same page as you at the same time. One of you is about to have your work overwritten and erased.

A version control application like Git keeps that from happening. You and your coworker can each upload your revisions to the same page, and Git will save two copies. Later, you can merge your changes together without losing any work along the way. You can even revert to an earlier version at any time, because Git keeps a “snapshot” of every change ever made.

Git terms

Repository:

A directory or storage space where your projects can live. Sometimes GitHub users shorten this to “repo.” It can be local to a folder on your computer, or it can be a storage space on GitHub or another online host. You can keep code files, text files, image files, you name it, inside a repository.

Version Control:

Basically, the purpose Git was designed to serve. When you have a Microsoft Word file, you either overwrite every saved file with a new save, or you save multiple versions. With Git, you don’t have to. It keeps “snapshots” of every point in time in the project’s history, so you can never lose or overwrite it.

Commit:

This is the command that gives Git its power. When you commit, you are taking a “snapshot” of your repository at that point in time, giving you a checkpoint to which you can reevaluate or restore your project to any previous state.

Branch:

How do multiple people work on a project at the same time without Git getting them confused? Usually, they “branch off” of the main project with their own versions full of changes they themselves have made. After they’re done, it’s time to “merge” that branch back with the “master,” the main directory of the project.


Git-Specific Commands

git init: Initializes a new Git repository. Until you run this command inside a repository or directory, it’s just a regular folder. Only after you input this does it accept further Git commands.

git config: Short for “configure,” this is most useful when you’re setting up Git for the first time.

git help: Forgot a command? Type this into the command line to bring up the 21 most common git commands. You can also be more specific and type “git help init” or another term to figure out how to use and configure a specific git command.

git status: Check the status of your repository. See which files are inside it, which changes still need to be committed, and which branch of the repository you’re currently working on.

git add: This does not add new files to your repository. Instead, it brings new files to Git’s attention. After you add files, they’re included in Git’s “snapshots” of the repository.

git commit: Git’s most important command. After you make any sort of change, you input this in order to take a “snapshot” of the repository. Usually it goes git commit -m “Message here.” The -m indicates that the following section of the command should be read as a message.

git branch: Working with multiple collaborators and want to make changes on your own? This command will let you build a new branch, or timeline of commits, of changes and file additions that are completely your own. Your title goes after the command. If you wanted a new branch called “cats,” you’d type git branch cats.

git checkout: Literally allows you to “check out” a repository that you are not currently inside. This is a navigational command that lets you move to the repository you want to check. You can use this command as git checkout master to look at the master branch, or git checkout cats to look at another branch.

git merge: When you’re done working on a branch, you can merge your changes back to the master branch, which is visible to all collaborators. git merge cats would take all the changes you made to the “cats” branch and add them to the master. (A short example putting branch, checkout, and merge together follows this list.)

git push: If you’re working on your local computer, and want your commits to be visible online on GitHub as well, you “push” the changes up to GitHub with this command.

git pull: If you’re working on your local computer and want the most up-to-date version of your repository to work with, you “pull” the changes down from GitHub with this command.
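
To see how a few of these fit together, here is a minimal sketch of the branch/checkout/merge cycle described above (the branch name “cats” and the file are just examples, and this assumes you are already inside a Git repository):

git checkout -b cats                  # create the "cats" branch and switch to it
echo "notes about cats" > cats.txt    # make a change on the new branch
git add cats.txt
git commit -m "Add cat notes"

git checkout master                   # switch back to the master branch
git merge cats                        # bring the changes from "cats" into master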

Setting Up GitHub And Git For The First Time

Login to your Jetstream computer! You should see something like this:

diblynn@js-17-71:~$

Make sure you’re in your home directory:

cd ~

It’s time to introduce yourself to Git. Type in the following code:

git config --global user.name "Your Name Here"

Next, tell it your email and make sure it’s the same email you used when you signed up for a GitHub.com account

git config --global user.email "your_email@youremail.com"

Creating Your Online Repository

Now that you’re all set up, it’s time to create a place for your project to live. Both Git and GitHub refer to this as a repository, or “repo” for short, a digital directory or storage space where you can access your project, its files, and all the versions of its files that Git saves.

  • On your GitHub profile, click the plus button and select “New Repository”.


  • Give your repository a name & fill out the necessary information for your repository to be distinct and recognizable.

  • Don’t worry about clicking the checkbox next to “Initialize this repository with a README.” A Readme file is usually a text file that explains a bit about the project. But we can make our own Readme file locally for practice.

  • Click the green “Create Repository” button and you’re set. You now have an online space for your project to live in.


Creating Your Local Repository

To begin, let’s create a new directory called MyProject.

mkdir ~/MyProject

Then we will move into this new directory.

cd ~/MyProject

To create a local repository, we will first initialize a new repository for “MyProject” by entering the following command:

git init

touch is a multi-purpose command, but one of its key uses is to create new, empty files. In our case, we will create a new file called Readme.txt.

touch Readme.txt

We can check the status of our new repository by using git status.

git status

When we want Git to track a file, we use git add followed by the file we want Git to “see”. If we do not use git add, Git will not “see” this file.

git add Readme.txt

Lastly, to have Git track the current “snapshot” of our file, we enter git commit. The -m flag allows us to add a personal message with the files we are committing. In the following example, our message is “Add Readme.txt”. Examples of other messages could include version information, changes made to a document, document descriptions, etc.

git commit -m "Add Readme.txt"

Now Git has a “snapshot” of this version of Readme.txt which you can return to at any time in the future!


Connect Your Local Repository To Your GitHub Repository Online


This setup also makes it easy to have multiple collaborators working on the same project. Each of you can work alone on your own computers, but upload or “push” your changes up to the GitHub repository when they’re ready.

To tell Git the address of your remote repo on GitHub, type the following, replacing the address of the repo with your own:

git remote add origin https://github.com/username/myproject.git

Git now knows there’s a remote repository and it’s where you want your local repository changes to go. To confirm, type this to check:

git remote -v

Great, Git is able to connect with our remote on GitHub. So let’s go ahead and push our files to GitHub:

git push origin master

You will be prompted for your GitHub username and password at this point.

You should then see some output like the following, showing that git is sending packets of data to your GitHub repo. With this push, all of your commits since the last time you pushed are now backed up online. FOR FREE!

Counting objects: 3, done.
Writing objects: 100% (3/3), 217 bytes | 217.00 KiB/s, done.
Total 3 (delta 0), reused 0 (delta 0)
To https://github.com/sateeshbio5/angus_test.git
 * [new branch]      master -> master

Note: To avoid having to type your username and password each time you push/pull from your GitHub repos, read about Secure Login here.
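
As just one example of what is possible there (this is not necessarily what that page describes), Git has a built-in credential helper that can temporarily cache your HTTPS credentials:

# remember HTTPS credentials in memory for an hour (3600 seconds)
git config --global credential.helper 'cache --timeout=3600'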


Collaborating via GitHub

  • GitHub Issues: Issues are a great way to keep track of tasks, enhancements, and bugs for your projects. They’re kind of like email—except they can be shared and discussed with all. Read more about Mastering Issues on Github here


  • GitHub Pull-Requests: Pull requests let you tell others about changes you’ve pushed to a branch in a repository on GitHub. Once a pull request is opened, you can discuss and review the potential changes with collaborators and add follow-up commits before your changes are merged into the base branch.


Look at others’ repositories:

Host Websites & Blogs on GitHub

  • GitHub Pages is an awesome feature that lets you host websites/blogs for you and your projects.

  • Hosted directly from your GitHub repository. Just edit, push, and your changes are live.

  • Read more about GitHub Pages here

Sources for this tutorial & Additional Git Resources

Microbial Ecology - a discussion and overview of amplicon sequencing and metagenomics

ATTENTION!!

This is meant to be a guided, but very open discussion. Please feel free to jump in at any time with questions or thoughts on things!

This page presents a broad-level overview of amplicon sequencing and metagenomics as applied to microbial ecology. Both of these methods are most often applied for exploration and hypothesis generation and should be thought of as steps in the process of science rather than end-points – like all tools of science 🙂

Amplicon sequencing

Amplicon sequencing of marker-genes (e.g. 16S, 18S, ITS) involves using specific primers that target a specific gene or gene fragment. It is one of the first tools in the microbial ecologist’s toolkit. It is most often used as a broad-level survey of community composition used to generate hypotheses based on differences between recovered gene-copy numbers between samples.

Metagenomics

Shotgun metagenomic sequencing aims to amplify all the accessible DNA of a mixed community. It uses random primers and therefore suffers much less from pcr bias (discussed below). Metagenomics enables profiling of taxonomy and functional potential. Recently, the recovery of representative genomes from metagenomes has become a very powerful approach in microbial ecology, drastically expanding the known Tree of Life by granting us genomic access to as-yet unculturable microbial populations (e.g. Hug et al. 2016; Parks et al. 2017).

Here we’ll discuss some of the things each is useful and not useful for, and then look at some general workflows for each.

Amplicon sequencing utility

  • Useful for:

    • one metric of community composition

      • can say something about relative abundance of gene copies recovered

    • can track changes in community structure (as interpreted by recovered gene copy numbers) in response to a treatment and/or across environmental gradients/time

    • can provide strong support for further investigation of things

  • Not useful for:

    • abundance of organisms (or relative abundance of organisms)

      • recovered gene copies ≠ counts of organisms

        • gene-copy number varies per genome/organism (16S sequence can vary per genome)

        • pcr bias (small scale) -> under/over representation based on primer-binding efficiency

        • pcr bias (large scale) -> “universal” primers, only looking for what we know and they don’t even catch all of that

        • cell-lysis efficiencies

    • function

      • even if we can highly resolve the taxonomy of something from an amplicon sequence, it is still only one fragment of one gene

      • hypothesis generation is okay, e.g. speculating about observed shifts in gene-copy numbers based on what’s known about nearest relatives in order to guide further work

      • but, for example, writing a metagenomics paper based on 16S data would likely not be a good idea

      • There is no strain-level resolving capability for a single gene; all that tells you is how similar those genes are.

As noted above, amplicon data can still be very useful. Most often when people claim it isn’t, they are assessing that based on things it’s not supposed to do anyway, e.g.:

“Why are you doing 16S sequencing? That doesn’t tell you anything about function.”

“Why are you measuring nitrogen-fixation rates? That doesn’t tell you anything about the proteins that are doing it.”

We shouldn’t assess the utility of a tool based on something it’s not supposed to do anyway 🙂

Metagenomics utility

  • Useful for:

    • functional potential

    • insights into the genomes of as-yet unculturable microbes

    • much better for “relative” abundance due to no confounding copy-number problem and no drastic PCR bias (still not true abundance)

      • still some caveats, like cell-lysis efficiencies

  • Not useful for:

    • abundance of organisms

    • “activity”

      • neither is transcriptomics or proteomics for that matter – each gives you insight into cellular regulation at different levels

QUICK QUESTION!

With all that said, do you think we should expect relative abundance information from amplicon sequencing to match up with relative abundance from metagenomic sequencing?

Solution

No, and that’s not a problem if we understand that neither are meant to tell us a true abundance anyway. They are providing different information is all. And the relative abundance metrics they do provide can still be informative when comparing multiple samples generated the same way 🙂

General workflows

Amplicon overview

PDF download


A Note on OTUs vs ASVs

All sequencing technologies make mistakes, and (to a much lesser extent), polymerases make mistakes as well during the amplification process. These mistakes artificially increase the number of unique sequences in a sample, a lot. Clustering similar sequences together (generating OTUs) emerged as one way to mitigate these errors and to summarize data – though at the cost of resolution. The field as a whole is moving towards using solely ASVs, and there is pretty good reasoning for this. This Callahan et al. 2017 paper nicely lays out the case for that, summarized in the following points:

  • OTUs (operational taxonomic units)

    1. cluster sequences into groups based on percent similarity

    2. choose a representative sequence for that group

      • closed reference

        • + can compare across studies

        • - reference biased and constrained

      • de novo

        • + can capture novel diversity

        • - not comparable across studies

        • - diversity of sample affects what OTUs are generated

  • ASVs (amplicon sequence variants)

    1. attempt to identify the original biological sequences by taking into account error

      • + enables single-nucleotide resolution

      • + can compare across studies

      • + can capture novel diversity

If you happen to work with amplicon data, I highly recommend digging into the Callahan et al. 2017 paper sometime 🙂

Metagenomics overview

PDF download

De novo transcriptome assembly

Learning objectives:

* Learn what transcriptome assembly is
* Learn to differentiate different types of assemblies
* Discuss how do assemblers work
* Learn to check the quality of a transcriptome assembly

What if you are working with a species with no existing reference genome or transcriptome…how do you construct a reference?

The answer is de novo assembly. The basic idea with de novo transcriptome assembly is you feed in your reads and you get out a bunch of contigs that represent transcripts, or stretches of RNA present in the reads that don’t have any long repeats or much significant polymorphism. You run a de novo transcriptome assembly program using the trimmed reads as input and get out a pile of assembled RNA.

Trinity, one of the leading de novo transcriptome assemblers, was developed at the Broad Institute and the Hebrew University of Jerusalem. We will be loosely following steps from the Eel pond protocol for our guide to doing RNA-seq assembly.

Download and trim the data

We will be using a set of Nematostella vectensis mRNAseq reads from Tulin et al., 2013.

To avoid needing to go through the trimming steps as before, we’ll download a snakefile to take care of these steps for us. If you look at this file, you may notice it is very similar to the one you generated during the snakemake challenge.

Make sure you have snakemake installed in your base environment, where we will be running the snakefile:

conda install -c conda-forge snakemake-minimal

Download the snakefile and a corresponding conda environment file:

cd ~ # cd to home directory
curl -L https://osf.io/nqh6p/download -o nema_trim.snakefile
curl -L https://osf.io/xs6k7/download -o trim-env.yml

Let’s take a look at the environment file to see what software snakemake will be putting in the environment it creates:

less trim-env.yml

Run the snakefile to download and trim the Nematostella reads:

snakemake -s nema_trim.snakefile --use-conda --cores 6

Here, we run snakemake. We use the -s command to tell snakemake where it can find our snakefile, tell it to build and use the environment above using the --use-conda flag, and have it execute the workflow over 6 cores using --cores 6.

The trimmed data should now be within the nema_trimmed folder.

Generate a de novo assembly

The following commands will create a new folder assembly and link the trimmed data we prepared in our snakemake workflow into the assembly folder:

cd 
mkdir assembly
cd assembly

ln -fs ../nema_trimmed/*.qc.fq.gz .
ls

Next we will combine all of our forward reads into a single file and all of our reverse reads into a single file. Usually, you want to combine many samples from a single individual, which minimizes polymorphism and improves the assembly. See this preprint for best practices on the care and feeding of your transcriptome.

Use the following command to combine all fq into 2 files (left.fq and right.fq).

cat *_1.pe.qc.fq.gz *se.qc.fq.gz > left.fq.gz
cat *_2.pe.qc.fq.gz > right.fq.gz

Note: this step just makes it easier for us to type out the trinity command. Trinity can accept comma-separated lists of files as inputs - check the trinity documentation for details.
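
Just to illustrate that syntax (the sample file names here are made up and we are not running this), a Trinity command with comma-separated inputs might look like:

Trinity --seqType fq --max_memory 16G --CPU 6 \
        --left sampleA_1.qc.fq.gz,sampleB_1.qc.fq.gz \
        --right sampleA_2.qc.fq.gz,sampleB_2.qc.fq.gz \
        --output nema_trinity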

Run the assembler

First, we’ll create and activate an environment where Trinity is installed:

conda create -y -n trinity-env trinity
conda activate trinity-env

Trinity works with both paired-end reads and single-end reads (including both types at the same time). In the general case, the paired-end files are given as --left left.fq and --right right.fq, respectively. Our single-end reads (orphans) have been concatenated onto the left.fq file.

So let’s run the assembler as follows:

time Trinity --seqType fq --max_memory 16G --CPU 6 --left left.fq.gz --right right.fq.gz --output nema_trinity

(This will take a few minutes)

The time command allows us to see how long Trinity takes to run, but is not a part of the Trinity command.

You should see something like:

** Harvesting all assembled transcripts into a single multi-fasta file...

Thursday, July 11, 2019: 18:21:18       CMD: find /home/dibada/assembly/nema_trinity/read_partitions/ -name '*inity.fasta'  | /opt/miniconda/envs/trinity-env/opt/trinity-2.8.5/util/support_scripts/partitioned_trinity_aggregator.pl --token_prefix TRINITY_DN --output_prefix /home/dibada/assembly/nema_trinity/Trinity.tmp
-relocating Trinity.tmp.fasta to /home/dibada/assembly/nema_trinity/Trinity.fasta
Thursday, July 11, 2019: 18:21:18       CMD: mv Trinity.tmp.fasta /home/dibada/assembly/nema_trinity/Trinity.fasta


###################################################################
Trinity assemblies are written to /home/dibada/assembly/nema_trinity/Trinity.fasta
###################################################################


Thursday, July 11, 2019: 18:21:18       CMD: /opt/miniconda/envs/trinity-env/opt/trinity-2.8.5/util/support_scripts/get_Trinity_gene_to_trans_map.pl /home/dibada/assembly/nema_trinity/Trinity.fasta > /home/dibada/assembly/nema_trinity/Trinity.fasta.gene_trans_map

at the end.

Looking at the assembly

First, save the assembly:

cp nema_trinity/Trinity.fasta nema-trinity.fa

Now, look at the beginning:

head nema-trinity.fa

These are the transcripts! Yay!

Let’s also capture some statistics of the Trinity assembly. Trinity provides a handy tool to do exactly that:

TrinityStats.pl nema-trinity.fa

The output should look something like the following:

################################
## Counts of transcripts, etc.
################################
Total trinity 'genes':  18
Total trinity transcripts:      46
Percent GC: 46.68

########################################
Stats based on ALL transcript contigs:
########################################

        Contig N10: 3631
        Contig N20: 2497
        Contig N30: 2145
        Contig N40: 2097
        Contig N50: 1977

        Median contig length: 1593
        Average contig: 1459.98
        Total assembled bases: 67159


#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

        Contig N10: 4447
        Contig N20: 3631
        Contig N30: 3631
        Contig N40: 2497
        Contig N50: 2151

        Median contig length: 1107
        Average contig: 1422.83
        Total assembled bases: 25611

This is a set of summary stats about your assembly. Are they good? Bad? How would you know?
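
If you are wondering what a statistic like "Contig N50" actually measures, here is a minimal Python sketch: the N50 is the contig length L such that contigs of length >= L together cover at least half of the total assembled bases. The list of contig lengths below is made up purely for illustration (TrinityStats.pl computes this for you from the real assembly).

def n50(lengths):
    """Return the length L such that contigs >= L cover at least half the assembly."""
    lengths = sorted(lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for contig_len in lengths:
        running += contig_len
        if running >= half:
            return contig_len

print(n50([3631, 2497, 2145, 2097, 1977]))   # toy contig lengths; prints 2145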

Lastly, we’ll deactivate the environment where Trinity is installed.

conda deactivate

Suggestions for next steps

After generating a de novo transcriptome assembly:

Annotating and evaluating a de novo transcriptome assembly

At the end of this lesson, you will be familiar with:

  • how to annotate a de novo transcriptome assembly

  • how to parse the GFF3 output from the annotation for use in DE analysis

  • several methods for evaluating the completeness of a de novo transcriptome assembly

  • what a Jupyter notebook is and how to execute a few commands in Python

Annotation with dammit

dammit!

dammit is an annotation pipeline written by Camille Scott. The dammit pipeline runs a relatively standard annotation protocol for transcriptomes: it begins by building gene models with TransDecoder, then uses the following protein databases as evidence for annotation: Pfam-A, Rfam, OrthoDB, and uniref90 (uniref90 is optional, enabled with --full).

If a protein dataset for your organism (or a closely-related species) is available, it can also be supplied to the dammit pipeline with the --user-databases flag as optional evidence for the annotation.

In addition, BUSCO v3 is run, which compares the gene content in your transcriptome with a lineage-specific data set. The output is the proportion of your transcriptome that matches the data set, which can be used as an estimate of the completeness of your transcriptome based on evolutionary expectation (Simão et al. 2015).

There are several lineage-specific datasets available from the authors of BUSCO. We will use the metazoa dataset for this transcriptome.

Installation

Annotation necessarily requires a lot of software! dammit attempts to simplify this and make it as reliable as possible, but we still have some dependencies.

dammit can be installed via bioconda, but the latest version is not up there yet. Let’s download a working conda environment.yml file that will help us install the latest version of dammit.

cd 
curl -L https://raw.githubusercontent.com/dib-lab/elvers/master/elvers/rules/dammit/environment.yml -o dammit-env.yml

Now let’s build an environment from that yaml file:

conda env create -f dammit-env.yml -n dammit-env 
conda activate dammit-env

To make sure your installation was successful, run

dammit -h

This will give a list of dammit’s commands and options.

The version (dammit --version) should be:

dammit 1.1

Database Preparation

dammit has two major subcommands: dammit databases and dammit annotate. The databases command checks that databases are installed and prepared, and if run with the --install flag, it will perform that installation and preparation. If you just run dammit databases on its own, you should get a notification that some database tasks are not up-to-date. So, we need to install them!

Install databases (this will take a long time, usually >10 min):

dammit databases --install --busco-group metazoa

Note: dammit installs databases in your home directory by default. If you have limited space in your home directory or on your instance, you can install these databases in a different location (e.g. on an external volume) by running dammit databases --database-dir /path/to/install/databases before running the installation command.

We used the “metazoa” BUSCO group. We can use any of the BUSCO databases, so long as we install them with the dammit databases subcommand. You can see the whole list by running dammit databases -h. You should try to match your species as closely as possible for the best results. If we want to install another, for example:

dammit databases --install --busco-group protists

Phew, now we have everything installed!

Now, let’s take a minute and thank Camille for making this process easy for us by maintaining a recipe on bioconda. This saves us a lot of hassle with having to install the individual parts required for the pipeline. AND on top of the easy installation, there is this slick pipeline! Historically, transcriptome annotation involved many tedious steps, requiring bioinformaticians to keep track of parsing database alignment output and summarizing results across multiple databases. All of these steps have been standardized in the dammit pipeline, which uses the pydoit automation tool. Now, we can input our assembly fasta file -> query databases -> and get output annotations with gene names for each contig - all in one step. Thank you, Camille!

Annotation

Keep things organized! Let’s make a project directory:

cd ~/
mkdir -p ~/annotation
cd ~/annotation

Let’s copy in the trinity assembly file we made earlier:

cp ../assembly/nema-trinity.fa ./

Now we’ll download a custom Nematostella vectensis protein database. Somebody has already created a proper database for us (Putnam et al. 2007; the reference proteome is available through UniProt). If your critter is a non-model organism, you will likely need to grab proteins from a closely-related species. This will rely on your knowledge of your system!

curl -LO ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/UP000001593_45351.fasta.gz
gunzip -c UP000001593_45351.fasta.gz > nema.reference.prot.faa
rm UP000001593_45351.fasta.gz

Run the command:

dammit annotate nema-trinity.fa --busco-group metazoa --user-databases nema.reference.prot.faa --n_threads 6

While dammit runs, it will print out which task it is running to the terminal. dammit is written with a library called pydoit, which is a python workflow library similar to GNU Make. This not only helps organize the underlying workflow, but also means that if we interrupt it, it will properly resume!

After a successful run, you’ll have a new directory called nema-trinity.fa.dammit. If you look inside, you’ll see a lot of files:

ls nema-trinity.fa.dammit/

Expected output:

annotate.doit.db                    nema-trinity.fa.transdecoder.cds                    nema-trinity.fa.x.nema.reference.prot.faa.crbl.gff3            nema-trinity.fa.x.sprot.best.csv
dammit.log                          nema-trinity.fa.transdecoder.gff3                   nema-trinity.fa.x.nema.reference.prot.faa.crbl.model.csv       nema-trinity.fa.x.sprot.best.gff3
nema-trinity.fa                     nema-trinity.fa.transdecoder.pep                    nema-trinity.fa.x.nema.reference.prot.faa.crbl.model.plot.pdf  nema-trinity.fa.x.sprot.maf
nema-trinity.fa.dammit.fasta        nema-trinity.fa.transdecoder_dir                    nema-trinity.fa.x.pfam-A.csv                                   run_nema-trinity.fa.metazoa.busco.results
nema-trinity.fa.dammit.gff3         nema-trinity.fa.x.OrthoDB.best.csv                  nema-trinity.fa.x.pfam-A.gff3                                  tmp
nema-trinity.fa.dammit.namemap.csv  nema-trinity.fa.x.OrthoDB.best.gff3                 nema-trinity.fa.x.rfam.gff3
nema-trinity.fa.dammit.stats.json   nema-trinity.fa.x.OrthoDB.maf                       nema-trinity.fa.x.rfam.tbl
nema-trinity.fa.transdecoder.bed    nema-trinity.fa.x.nema.reference.prot.faa.crbl.csv  nema-trinity.fa.x.rfam.tbl.cmscan.out

The most important files for you are nema-trinity.fa.dammit.fasta, nema-trinity.fa.dammit.gff3, and nema-trinity.fa.dammit.stats.json.

If the above dammit command is run again, you will see the message: "Pipeline is already completed!"

Parse dammit output

Camille wrote dammit in Python, and it includes a library for parsing dammit’s GFF3 output. To turn this output into a useful table, we will need to open a Python environment.

We could do this in a Jupyter notebook. In addition to executing Python commands, Jupyter notebooks can also run R (as well as many other languages). Similar to R Markdown (Rmd) files, Jupyter notebooks keep track of code and output. The output file format for Jupyter notebooks is .ipynb, which GitHub can render; see this gallery of interesting Jupyter notebooks. Here, though, we will run the same Python commands directly in the terminal.

Let’s open python in our terminal!

python

This opens python in your terminal, allowing you to run commands in the python language.

Let’s import the libraries we need.

import pandas as pd
from dammit.fileio.gff3 import GFF3Parser

Now enter these commands:

gff_file = "nema-trinity.fa.dammit/nema-trinity.fa.dammit.gff3"
# read the dammit GFF3 annotations into a pandas DataFrame
annotations = GFF3Parser(filename=gff_file).read()
# sort by transcript ID and score, keep hits with score < 1e-05, take the first (best) hit per transcript, and keep only the ID and Name columns
names = annotations.sort_values(by=['seqid', 'score'], ascending=True).query('score < 1e-05').drop_duplicates(subset='seqid')[['seqid', 'Name']]
# drop rows that have no values at all
new_file = names.dropna(axis=0,how='all')
new_file.head()

This will print the first few rows of a table of sequence IDs (seqid) and annotation names (Name).

Try commands like,

annotations.columns

and

annotations.head()

or

annotations.head(50)

Questions:

  • What do these commands help you to see?

  • How might you use this information to modify the names line in the code above?

To save the table to a file, enter:

new_file.to_csv("nema_gene_name_id.csv")

To exit Python and return to the bash terminal, type:

quit()

We can look at the output we just made, which is a table of genes with ‘seqid’ and ‘Name’ in a .csv file: nema_gene_name_id.csv.

less nema_gene_name_id.csv

Notice there are multiple transcripts per gene model prediction. This .csv file can be used in tximport as a tx2gene file in downstream DE analysis.
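
If you want to check this for yourself, here is a small pandas sketch that counts how many transcripts share each annotation name. It assumes the CSV was written exactly as above, i.e. with an unnamed index column plus 'seqid' and 'Name' columns.

import pandas as pd

# load the table we wrote out above
tx2gene = pd.read_csv("nema_gene_name_id.csv", index_col=0)

# count how many distinct transcript IDs ('seqid') carry each annotation name
per_name = tx2gene.groupby("Name")["seqid"].nunique().sort_values(ascending=False)
print(per_name.head())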

Evaluation with BUSCO

BUSCO aims to provide a quantitative measure of transcriptome (or genome/gene set) completeness by searching for near-universal single-copy orthologs. These ortholog lists are curated into different groups (e.g. genes found universally across all metazoa, or all fungi, etc), and are currently based off of OrthoDB v9.

We’ve already installed and run BUSCO as part of the dammit pipeline. Let’s take a look at the results.

Check the output:

cat nema-trinity.fa.dammit/run_nema-trinity.fa.metazoa.busco.results/short_summary_nema-trinity.fa.metazoa.busco.results.txt

  • Challenge: How do the BUSCO results of the full transcriptome compare?

To run the BUSCO command by itself (outside of the dammit pipeline):

run_BUSCO.py \
-i nema-trinity.fa \
-o nema_busco_metazoa -l ~/.dammit/databases/busco2db/metazoa_odb9 \
-m transcriptome --cpu 4

When you’re finished, exit out of the conda environment:

conda deactivate

Quick Insights from Sequencing Data with sourmash

Getting started

Create / log into an m1.medium Jetstream instance.

Objectives

  1. Discuss k-mers and their utility

  2. Compare RNA-seq samples quickly

  3. Detect eukaryotic contamination in raw RNA-seq reads

  4. Compare reads to an assembly

  5. Build your own database for searching

  6. Other sourmash databases

Introduction to k-mers

A “k-mer” is a word of DNA that is k long:

ATTG - a 4-mer
ATGGAC - a 6-mer

Typically we extract k-mers from genomic assemblies or read data sets by running a k-length window across all of the reads and sequences – e.g. given a sequence of length 16, you could extract 11 k-mers of length six from it like so:

AGGATGAGACAGATAG

becomes the following set of 6-mers:

AGGATG
 GGATGA
  GATGAG
   ATGAGA
    TGAGAC
     GAGACA
      AGACAG
       GACAGA
        ACAGAT
         CAGATA
          AGATAG

Today we will be using a tool called sourmash to explore k-mers!

Why k-mers, though? Why not just work with the full read sequences?

Computers love k-mers because there’s no ambiguity in matching them. You either have an exact match, or you don’t. And computers love that sort of thing!

Basically, it’s really easy for a computer to tell if two reads share a k-mer, and it’s pretty easy for a computer to store all the k-mers that it sees in a pile of reads or in a genome.
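
For example, here is a minimal Python sketch of that sliding-window idea; the second sequence is made up just to show a comparison.

def kmers(seq, k):
    """Return the set of all k-length substrings (k-mers) in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

seq1 = "AGGATGAGACAGATAG"              # the 16 bp example from above
seq2 = "TTGATGAGACAGACCT"              # a made-up second sequence

print(len(kmers(seq1, 6)))             # 11 six-mers, as shown above
print(kmers(seq1, 6) & kmers(seq2, 6)) # 6-mers shared by both sequences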

Long k-mers are species specific

k-mers are most useful when they’re long, because then they’re specific. That is, if you have a 31-mer taken from a human genome, it’s pretty unlikely that another genome has that exact 31-mer in it. (You can calculate the probability if you assume genomes are random: there are 4^31 possible 31-mers, and 4^31 = 4,611,686,018,427,387,904. So, you know, a lot.)

Essentially, long k-mers are species specific. Check out this figure from the MetaPalette paper:


Here, Koslicki and Falush show that k-mer similarity works to group microbes by genus at k=40. If you go longer (say k=50), then you see very little similarity between different species.

Using k-mers to compare samples

So, one thing you can do is use k-mers to compare read data sets to read data sets, or genomes to genomes: data sets that have a lot of similarity probably are similar or even the same genome.

One metric you can use for these comparisons is the Jaccard similarity (also called the Jaccard index), which is calculated by asking how many k-mers are shared between two samples versus how many k-mers are in the two samples combined:

only k-mers in both samples
----------------------------
all k-mers in either or both samples

A Jaccard similarity of 1 means the samples contain identical k-mers; a Jaccard similarity of 0 means the samples share no k-mers at all.

Jaccard similarity works really well when we don’t care how many times we see a k-mer. When the abundance of a k-mer matters, for example in RNA-seq samples where read counts carry information, we use cosine distance instead.
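
Here is a tiny, self-contained sketch of that calculation on two made-up k-mer sets:

def jaccard(kmers_a, kmers_b):
    """Shared k-mers divided by all k-mers seen in either sample."""
    return len(kmers_a & kmers_b) / len(kmers_a | kmers_b)

sample_a = {"AGGATG", "GGATGA", "GATGAG", "ATGAGA"}
sample_b = {"AGGATG", "GGATGA", "TTTTTT"}

print(jaccard(sample_a, sample_b))   # 0.4: two shared k-mers out of five total
print(jaccard(sample_a, sample_a))   # 1.0: identical k-mer content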

These two measures can be used to search databases, compare RNA-seq samples, and all sorts of other things! The only real problem is that there are a lot of k-mers in a genome – a 5 Mbp genome (like E. coli) has roughly 5 million k-mers!

A few years ago, Ondov et al. (2016) showed that MinHash approaches could be used to estimate Jaccard similarity using only a small fraction (1 in 10,000 or so) of all the k-mers.

The basic idea behind MinHash is that you pick a small subset of k-mers to look at, and you use those as a proxy for all the k-mers. The trick is that you pick the k-mers randomly but consistently: so if a chosen k-mer is present in two data sets of interest, it will be picked in both. This is done using a clever trick that we can try to explain to you in class - but either way, trust us, it works!
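
As a toy illustration of "random but consistent" subsampling, you can hash every k-mer with a fixed hash function and keep only those whose hash falls below a threshold, similar in spirit to the --scaled parameter we use below. This is not sourmash’s actual hashing scheme, just a Python stand-in to show the idea.

import hashlib

def kmer_hash(kmer):
    """Deterministically map a k-mer to an integer in [0, 2**64)."""
    return int.from_bytes(hashlib.md5(kmer.encode()).digest()[:8], "big")

def sketch(kmer_set, scaled=2000):
    """Keep roughly 1/scaled of the k-mers: those whose hash is below a fixed threshold."""
    max_hash = 2**64 // scaled
    return {k for k in kmer_set if kmer_hash(k) < max_hash}

# because the hash function is fixed, a given k-mer is either kept in every
# dataset it appears in or in none of them - so sketches stay comparable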

We have implemented a MinHash approach in our sourmash software, which can do some nice things with samples. We’ll show you some of these things next!

Installing sourmash

To install sourmash, run:

conda install -y -c conda-forge -c bioconda sourmash

Creating signatures

A signature is a compressed representation of the k-mers in the sequence.

Depending on your application, we recommend different ways of preparing sequencing data to create a signature.

In a genome or transcriptome, we expect that the k-mers we see are accurate. We can create signatures from these types of sequencing data without any preparation. We demonstrate how to create a signature from high-quality sequences below.

First, download a genome assembly:

cd ~
mkdir sourmash_data
cd sourmash_data

curl -L https://osf.io/963dg/download -o ecoliMG1655.fa.gz 
gunzip -c ecoliMG1655.fa.gz | head

Compute a scaled MinHash from the assembly:

sourmash compute -k 21,31,51 --scaled 2000 --track-abundance -o ecoliMG1655.sig ecoliMG1655.fa.gz

For raw sequencing reads, we expect that many of the unique k-mers we observe will be due to errors in sequencing. Unlike with high-quality sequences like transcriptomes and genomes, we need to think carefully about how we want to create each signature, as it will depend on the downstream application.

  • Comparing reads against high quality sequences: Because the references we are comparing or searching against only contain k-mers that are likely real, erroneous k-mers in our reads simply won’t match anything, so we don’t need to trim them. Moreover, although most low-abundance k-mers are errors, trimming them risks accidentally removing real biological variation that is present at low abundance. Instead, we only want to trim adapters.

  • Comparing reads against other reads: Because both datasets likely have many erroneous k-mers, we want to remove the majority of these so as not to falsely deflate similarity between samples. Therefore, we want to trim what are likely erroneous k-mers from sequencing errors, as well as adapters.

Let’s download some raw sequencing reads and demonstrate what k-mer trimming looks like.

First, download a read file:

curl -L https://osf.io/pfxth/download -o ERR458584.fq.gz
gunzip -c ERR458584.fq.gz | head

Next, perform k-mer trimming using a tool from the khmer library. K-mer trimming removes low-abundance k-mers from the sample.

trim-low-abund.py ERR458584.fq.gz -V -Z 10 -C 3 --gzip -M 3e9 -o ERR458584.khmer.fq.gz

Finally, calculate a signature from the trimmed reads.

sourmash compute -k 21,31,51 --scaled 2000 --track-abundance -o ERR458584.khmer.sig ERR458584.khmer.fq.gz


We can prepare signatures like this for any sequencing data file! For the rest of the tutorial, we have prepared signatures for each sequencing data set we will be working with.

Compare many RNA-seq samples quickly

Use case: how similar are my samples to one another?

Traditionally in RNA-seq workflows, we use MDS plots to determine how similar our samples are. Samples that are closer together on the MDS plot are more similar. However, to get to this point, we have to trim our reads, download or build a reference transcriptome, quantify our reads using a tool like Salmon, and then read the counts into R and make an MDS plot. This is a lot of steps to go through just to figure out how similar your samples are!

Luckily, we can use sourmash to quickly compare how similar our samples are.

We generated signatures for the majority of the rest of the Schurch et al. experiment we have been working with this week. Below we download and compare the 647 signatures, and then produce a plot that shows how similar they are to one another.

First, download and uncompress the signatures.

curl -o schurch_sigs.tar.gz -L https://osf.io/p3ryg/download
tar xf schurch_sigs.tar.gz

Next, compare the signatures using sourmash.

sourmash compare -k 31 -o schurch_compare_matrix schurch_sigs/*sig

This outputs a comparison matrix and a set of labels. The matrix is symmetrical and contains values between 0 and 1 that capture the similarity between samples: 0 means two samples share no k-mers, while 1 means all k-mers are shared.
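
To build some intuition for what this matrix contains, here is a toy pure-Python version with three made-up k-mer "samples" (real samples have millions of k-mers, of course):

samples = {
    "A": {"AGGATG", "GGATGA", "GATGAG"},
    "B": {"AGGATG", "GGATGA", "TTTTTT"},
    "C": {"CCCCCC"},
}

names = sorted(samples)
for x in names:
    # Jaccard similarity of sample x against every sample (including itself)
    row = [len(samples[x] & samples[y]) / len(samples[x] | samples[y]) for y in names]
    print(x, ["%.2f" % v for v in row])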

Lastly, we plot the comparison matrix.

sourmash plot --labels schurch_compare_matrix

We see there are two major blocks of similar samples, which makes sense given that we have WT and SNF2 knockout samples. However, we also see that some of our samples are outliers! If this were our experiment, we would want to investigate the outliers further to see what caused them to be so dissimilar.

Detect Eukaryotic Contamination in Raw RNA Sequencing data

Use case: Search for the presence of unexpected organisms in raw RNA-seq reads

For most analysis pipelines, there are many steps that need to be executed before we get to the analysis and interpretation of what is in our sample. This often means we are 10-15 steps into our analysis before we find any problems. However, if our reads contain contamination, we want to know that as quickly as possible so we can remove the contamination and solve any issues that led to the contamination.

Using sourmash, we can quickly check if we have any unexpected organisms in our sequencing samples. We do this by comparing a signature from our reads against a database of known signatures from publicly available reference sequences.

We have generated sourmash databases for all publicly available eukaryotic RNA samples (we used the *rna_from_genomic* files from RefSeq and GenBank…however, keep in mind that not all sequenced genomes have these files!). This database includes fungi, plants, vertebrates, invertebrates, and protozoa. It does not include human, so we incorporate that separately. We also built another database of the ~700 recently re-assembled marine transcriptomes from the MMETSP project. These databases allow us to detect common organisms that might be unexpectedly present in our sequencing data.

First, let’s download and uncompress our three databases: human, MMETSP, and everything else!

wget -O sourmash_euk_rna_db.tar.gz https://osf.io/vpk8s/download
tar xf sourmash_euk_rna_db.tar.gz

Next, let’s download a signature from some sequencing reads. We’ll work with some sequencing reads from a wine fermentation.

wget -O wine.sig https://osf.io/5vsjq/download

We expected fungus and grape to be metabolically active in these samples. Let’s check which organisms we detect.

sourmash gather -k 31 --scaled 2000 -o wine.csv wine.sig sourmash_euk_rna_db/*sbt.json sourmash_euk_rna_db/*sig  

If we take a look at the output, we see:

== This is sourmash version 2.0.1. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loaded query: wine_fermentation... (k=31, DNA)
downsampling query from scaled=2000 to 2000
loaded 1 signatures and 2 databases total.


overlap     p_query p_match avg_abund
---------   ------- ------- ---------
2.2 Mbp       79.0%   28.5%     122.2     Sc_YJM1477_v1  Saccharomyces cerevis...
0.8 Mbp        1.4%    2.2%       6.2     12X  Vitis vinifera (wine grape)
2.1 Mbp        0.4%    1.6%       9.3     GLBRCY22-3  Saccharomyces cerevisiae...
124.0 kbp      0.1%    0.7%       3.1     Aureobasidium pullulans var. pullula...
72.0 kbp       0.0%    0.1%       1.9     Mm_Celera  Mus musculus (house mouse)
1.9 Mbp        0.0%    0.5%       3.7     Sc_YJM1460_v1  Saccharomyces cerevis...
1.8 Mbp        0.1%    0.5%      14.1     ASM18217v1  Saccharomyces cerevisiae...
2.1 Mbp        0.1%    0.3%      17.7     R008  Saccharomyces cerevisiae R008 ...
1.9 Mbp        0.0%    0.1%       3.1     ASM32610v1  Saccharomyces cerevisiae...
found less than 18.0 kbp in common. => exiting

found 9 matches total;
the recovered matches hit 81.2% of the query

…which is almost exactly what we expect, except we see some house mouse! And I promise Ratatouille was not making this wine.

Using this method, we have now identified contamination in our reads. We could align our reads to the mouse genome to remove the contaminating reads; however, the best removal strategy may vary on a case-by-case basis.

Compare reads to assemblies

Use case: how much of the read content is contained in the reference genome?

First, we’ll download some reads from an E. coli genome and generate a signature from them:

curl -L https://osf.io/frdz5/download -o ecoli_ref-5m.fastq.gz
sourmash compute  -k 31 --scaled 2000 ~/sourmash_data/ecoli_ref-5m.fastq.gz -o ecoli-reads.sig


Build a signature for an E. coli genome:

sourmash compute --scaled 2000 -k 31 ~/sourmash_data/ecoliMG1655.fa.gz -o ecoli-genome.sig

and now evaluate containment, that is, what fraction of the read content is contained in the genome:

sourmash search -k 31 ecoli-reads.sig ecoli-genome.sig --containment

and you should see:

loaded query: /home/diblions/data/ecoli_ref-... (k=31, DNA)
loaded 1 signatures.

1 matches:
similarity   match
----------   -----
  9.7%       /home/diblions/data/ecoliMG1655.fa.gz

Why are only 10% or so of our k-mers from the reads in the genome!? Any ideas?

Try the reverse - why is it bigger?

sourmash search -k 31 ecoli-genome.sig ecoli-reads.sig --containment

(…but 100% of our k-mers from the genome are in the reads!?)
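
To see concretely why the two containment searches give different answers, here is a tiny sketch with made-up k-mer sets; the extra k-mer stands in for read k-mers that don’t appear in the genome.

def containment(query, subject):
    """Fraction of the query's k-mers that also appear in the subject."""
    return len(query & subject) / len(query)

read_kmers   = {"AGGATG", "GGATGA", "GATGAG", "TTTTTT"}   # one k-mer found only in the reads
genome_kmers = {"AGGATG", "GGATGA", "GATGAG"}

print(containment(read_kmers, genome_kmers))   # 0.75: not all read k-mers are in the genome
print(containment(genome_kmers, read_kmers))   # 1.0: every genome k-mer is in the reads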

Make and search a database quickly.

Suppose that we have a collection of signatures (made with sourmash compute as above) and we want to search it with our newly assembled genome (or the reads, even!). How would we do that?

Let’s grab a sample collection of 50 E. coli genomes and unpack it –

mkdir ecoli_many_sigs
cd ecoli_many_sigs

curl -O -L https://github.com/dib-lab/sourmash/raw/master/data/eschericia-sigs.tar.gz

tar xzf eschericia-sigs.tar.gz
rm eschericia-sigs.tar.gz

cd ../

This will produce 50 files named ecoli-N.sig in the ecoli_many_sigs directory:

ls ecoli_many_sigs

Let’s turn this into an easily-searchable database with sourmash index:

sourmash index -k 31 ecolidb ecoli_many_sigs/*.sig

What does the database look like and how does the search work?

One point to make with this is that the search can quickly narrow down which signatures match your query, without losing any matches. It’s a clever example of how computer scientists can actually make life better :).


And now we can search!

sourmash search ecoli-genome.sig ecolidb.sbt.json -n 20

You should see output like this:

# running sourmash subcommand: search
select query k=31 automatically.
loaded query: /home/tx160085/data/ecoliMG165... (k=31, DNA)
loaded SBT ecolidb.sbt.json
Searching SBT ecolidb.sbt.json
49 matches; showing first 20:
similarity   match
----------   -----
 75.9%       NZ_JMGW01000001.1 Escherichia coli 1-176-05_S4_C2 e117605...
 73.0%       NZ_JHRU01000001.1 Escherichia coli strain 100854 100854_1...
 71.9%       NZ_GG774190.1 Escherichia coli MS 196-1 Scfld2538, whole ...
 70.5%       NZ_JMGU01000001.1 Escherichia coli 2-011-08_S3_C2 e201108...
 69.8%       NZ_JH659569.1 Escherichia coli M919 supercont2.1, whole g...
 59.9%       NZ_JNLZ01000001.1 Escherichia coli 3-105-05_S1_C1 e310505...
 58.3%       NZ_JHDG01000001.1 Escherichia coli 1-176-05_S3_C1 e117605...
 56.5%       NZ_MIWF01000001.1 Escherichia coli strain AF7759-1 contig...
 56.1%       NZ_MOJK01000001.1 Escherichia coli strain 469 Cleandata-B...
 56.1%       NZ_MOGK01000001.1 Escherichia coli strain 676 BN4_676_1_(...
 50.5%       NZ_KE700241.1 Escherichia coli HVH 147 (4-5893887) acYxy-...
 50.3%       NZ_APWY01000001.1 Escherichia coli 178200 gec178200.conti...
 48.8%       NZ_LVOV01000001.1 Escherichia coli strain swine72 swine72...
 48.8%       NZ_MIWP01000001.1 Escherichia coli strain K6412 contig_00...
 48.7%       NZ_AIGC01000068.1 Escherichia coli DEC7C gecDEC7C.contig....
 48.2%       NZ_LQWB01000001.1 Escherichia coli strain GN03624 GCID_EC...
 48.0%       NZ_CCQJ01000001.1 Escherichia coli strain E. coli, whole ...
 47.3%       NZ_JHMG01000001.1 Escherichia coli O121:H19 str. 2010EL10...
 47.2%       NZ_JHGJ01000001.1 Escherichia coli O45:H2 str. 2009C-4780...
 46.5%       NZ_JHHE01000001.1 Escherichia coli O103:H2 str. 2009C-327...

This search identifies which genomes in the database are similar to the one in our query signature. There are some pretty good matches, but nothing above 75%. Why? What are some things we should think about when we’re doing taxonomic classification?

What’s in my metagenome?

First, let’s download and unpack the database we’ll use for classification:

cd ~/sourmash_data
curl -L https://osf.io/4f8n3/download -o genbank-k31.lca.json.gz
gunzip genbank-k31.lca.json.gz

This database is a GenBank index of all the microbial genomes – this one contains sketches of all 87,000 microbial genomes (including viral and fungal). See available sourmash databases for more information.

After this database is unpacked, it produces a file genbank-k31.lca.json.

Next, run the lca gather command to see what’s in your E. coli genome:

sourmash lca gather ecoli-genome.sig genbank-k31.lca.json

and you should get:

loaded 1 LCA databases. ksize=31, scaled=10000
loaded query: /home/diblions/data/ecoliMG165... (k=31)

overlap     p_query p_match 
---------   ------- --------
4.9 Mbp     100.0%    2.3%      Escherichia coli

Query is completely assigned.

In this case, the output is kind of boring because this is a single genome. But! You can use this on metagenomes (assembled and unassembled) as well; you’ve just got to make the signature files.

To see this in action, here is gather running on a signature generated from some sequences that assemble (but don’t align to known genomes) from the Shakya et al. 2013 mock metagenome paper.

wget https://github.com/dib-lab/sourmash/raw/master/doc/_static/shakya-unaligned-contigs.sig
sourmash lca gather shakya-unaligned-contigs.sig genbank-k31.lca.json

This should yield:

loaded 1 LCA databases. ksize=31, scaled=10000
loaded query: mqc500.QC.AMBIGUOUS.99.unalign... (k=31)

overlap     p_query p_match 
---------   ------- --------
1.8 Mbp      14.6%    9.1%      Fusobacterium nucleatum
1.0 Mbp       7.8%   16.3%      Proteiniclasticum ruminis
1.0 Mbp       7.7%   25.9%      Haloferax volcanii
0.9 Mbp       7.4%   11.8%      Nostoc sp. PCC 7120
0.9 Mbp       7.0%    5.8%      Shewanella baltica
0.8 Mbp       6.0%    8.6%      Desulfovibrio vulgaris
0.6 Mbp       4.9%   12.6%      Thermus thermophilus
0.6 Mbp       4.4%   11.2%      Ruegeria pomeroyi
480.0 kbp     3.8%    7.6%      Herpetosiphon aurantiacus
410.0 kbp     3.3%   10.5%      Sulfitobacter sp. NAS-14.1
150.0 kbp     1.2%    4.5%      Deinococcus radiodurans (** 1 equal matches)
150.0 kbp     1.2%    8.2%      Thermotoga sp. RQ2
140.0 kbp     1.1%    4.1%      Sulfitobacter sp. EE-36
130.0 kbp     1.0%    0.7%      Streptococcus agalactiae (** 1 equal matches)
100.0 kbp     0.8%    0.3%      Salinispora arenicola (** 1 equal matches)
100.0 kbp     0.8%    4.2%      Fusobacterium sp. OBRC1
60.0 kbp      0.5%    0.7%      Paraburkholderia xenovorans
50.0 kbp      0.4%    3.2%      Methanocaldococcus jannaschii (** 2 equal matches)
50.0 kbp      0.4%    0.3%      Bacteroides vulgatus (** 1 equal matches)
50.0 kbp      0.4%    2.6%      Sulfurihydrogenibium sp. YO3AOP1
30.0 kbp      0.2%    0.7%      Fusobacterium hwasookii (** 3 equal matches)
30.0 kbp      0.2%    0.0%      Pseudomonas aeruginosa (** 2 equal matches)
30.0 kbp      0.2%    1.6%      Persephonella marina (** 1 equal matches)
30.0 kbp      0.2%    0.4%      Zymomonas mobilis
20.0 kbp      0.2%    1.1%      Sulfurihydrogenibium yellowstonense (** 6 equal matches)
20.0 kbp      0.2%    0.5%      Ruminiclostridium thermocellum (** 5 equal matches)
20.0 kbp      0.2%    0.1%      Streptococcus parasanguinis (** 4 equal matches)
20.0 kbp      0.2%    0.8%      Fusobacterium sp. HMSC064B11 (** 2 equal matches)
20.0 kbp      0.2%    0.4%      Chlorobium phaeobacteroides (** 1 equal matches)
20.0 kbp      0.2%    0.7%      Caldicellulosiruptor bescii
10.0 kbp      0.1%    0.0%      Achromobacter xylosoxidans (** 53 equal matches)
10.0 kbp      0.1%    0.2%      Geobacter sulfurreducens (** 17 equal matches)
10.0 kbp      0.1%    0.5%      Fusobacterium sp. HMSC065F01 (** 15 equal matches)
10.0 kbp      0.1%    0.3%      Nitrosomonas europaea (** 14 equal matches)
10.0 kbp      0.1%    0.5%      Wolinella succinogenes (** 13 equal matches)
10.0 kbp      0.1%    0.5%      Thermotoga neapolitana (** 12 equal matches)
10.0 kbp      0.1%    0.5%      Thermus amyloliquefaciens (** 10 equal matches)
10.0 kbp      0.1%    0.1%      Desulfovibrio desulfuricans (** 9 equal matches)
10.0 kbp      0.1%    0.4%      Fusobacterium sp. CM22 (** 8 equal matches)
10.0 kbp      0.1%    0.2%      Desulfovibrio piger (** 7 equal matches)
10.0 kbp      0.1%    0.5%      Thermus kawarayensis (** 6 equal matches)
10.0 kbp      0.1%    0.5%      Pyrococcus furiosus (** 5 equal matches)
10.0 kbp      0.1%    0.5%      Aciduliprofundum boonei (** 4 equal matches)
10.0 kbp      0.1%    0.2%      Desulfovibrio sp. A2 (** 3 equal matches)
10.0 kbp      0.1%    0.3%      Desulfocurvus vexinensis (** 2 equal matches)
10.0 kbp      0.1%    0.0%      Enterococcus faecalis

22.1% (2.8 Mbp) of hashes have no assignment.

What do the columns here mean?

Why might some of things in a metagenome be unassigned?


It is straightforward to build your own databases for use with search and lca gather; this is of interest if you have dozens or hundreds of sequencing data sets in your group. Ping us if you want us to write that up.

Final thoughts on sourmash

There are many tools like Kraken and Kaiju that can do taxonomic classification of individual reads from metagenomes; these seem to perform well (albeit with high false positive rates) in situations where you don’t necessarily have the genome sequences that are in the metagenome. Sourmash, by contrast, can estimate which known genomes are actually present, so that you can extract them and map/align to them. It seems to have a very low false positive rate and is quite sensitive to strains.

Above, we’ve shown you a few things that you can use sourmash for. Here is a (non-exclusive) list of other uses that we’ve been thinking about –

  • detect contamination in sequencing data;

  • index and search private sequencing collections;

  • search all of SRA for overlaps in metagenomes.

RNA-seq Analysis

Some basics

We are going to first talk about different needs in RNA-seq, from basics in library prep to the last steps of analyses.

  • How do you choose your library pipeline?

Choice 1 | Choice 2
More replicates | Deeper sequencing
RIN values < 7 | RIN values > 7
ERCC spike-ins | No ERCC spike-ins
Poly-A selection | Ribosomal RNA depletion
Fragment size 1 | Fragment size 2
Stranded | Non-stranded
Single end | Paired end
Short reads (50 bp) | Medium reads (150 bp)
Depth 1 | Depth 2
RNA-seq | Targeted RNA-seq
  • Biological vs Technical replicates

Schurch et al., 2016 “How many biological replicates are needed in an RNA-seq experiment and which differential expression tool should you use?”

  • How do you choose your bioinformatics pipeline?

Choice 1 | Choice 2
trimmomatic | cutadapt
Salmon, kallisto... | STAR, HISAT2, bowtie2
DESeq2 | edgeR
clustering 1 (corr) | clustering 2 (PCA)
Pathway analysis | GSET
Visualization 1 | Visualization 100!
Upload them before paper submission! | Upload them before paper submission!

Installations

We have installed the ‘heaviest’ packages that you need in this lesson in our Jetstream image. For future work, these are the installations that you will need, so that you have all of them together:

source("https://bioconductor.org/biocLite.R") # loads the biocLite() installer function
biocLite('Biobase') # installs Bioconductor packages (newer Bioconductor versions use BiocManager::install() instead of biocLite())
install.packages("tidyverse") # we have this in our Image
library(tidyverse)
library(tximport)
library(DESeq2)
install.packages("devtools")
library(devtools)
install.packages('pheatmap')
library(RColorBrewer)
library(pheatmap)
biocLite("GOstats")
biocLite("org.Sc.sgd.db")
library(GOstats)
library(org.Sc.sgd.db) 

Bioconductor

Bioconductor is a project to develop innovative software tools for use in computational biology. Bioconductor is an ‘umbrella package’ that includes many packages employed in biological analyses. Biobase is part of the Bioconductor project and is used by many other packages; it contains standardized data structures to represent genomic data. You can find help for Biobase by typing:

browseVignettes("Biobase")

A simple RNA-seq flow

DESeq2

We are going to follow the lesson by Mike Love at DESeq2

This code chunk assumes that you have a count matrix called cts and a table of sample information called coldata. The design indicates how to model the samples; here, we want to measure the effect of the condition while controlling for batch differences. The two factor variables batch and condition should be columns of coldata.

dds <- DESeqDataSetFromMatrix(countData = cts,
                              colData = coldata,
                              design= ~ batch + condition) # creates a DESeqDataSet from the count matrix
dds <- DESeq(dds) # runs the differential expression analysis
resultsNames(dds) # lists the coefficients
res <- results(dds, name="condition_trt_vs_untrt")
# or, to shrink log fold changes associated with condition:
res <- lfcShrink(dds, coef="condition_trt_vs_untrt", type="apeglm")

DESeqDataSet

The DESeqDataSet class extends the RangedSummarizedExperiment class of the SummarizedExperiment package. "Ranged" refers here to counts associated with genomic ranges (exons) - we can then make use of other Bioconductor packages that explore range-based functionality (e.g. ChIP-seq peaks).

A DESeqDataSet must have a design formula: the variables that we will fit in the model to create our differential expression. The formula should be a tilde (~) followed by the variables with plus (+) signs between them (it will be coerced into a formula if it is not already).

Note: To get the most out of the package with default settings, put the variable of interest at the end of the formula and make sure the control level is the first level. The starting functions are:

  • If you have transcript quantification files, as produced by Salmon, Sailfish, or kallisto, you would use DESeqDataSetFromTximport.

  • If you have htseq-count files, the first line would use DESeqDataSetFromHTSeq.

  • If you have a RangedSummarizedExperiment, the first line would use DESeqDataSet.

DESeq2 internally corrects for library size, so transformed or normalized values such as counts scaled by library size should not be used as input.

Depending on the upstream pipeline, a DESeqDataSet can be constructed using:

  • From transcript abundance files and tximport - this is the pipeline that we will use in this tutorial

  • From a count matrix

  • From htseq-count files

  • From a SummarizedExperiment object

Transcript abundance files: tximport

You can import transcript abundance files from salmon, sailfish, kallisto and RSEM using tximport, which will create gene-level count matrices for use with DESeq2.

Advantages:

  • correction for potential changes in gene length across samples (e.g. from differential isoform usage) (Trapnell et al. 2013);

  • Salmon, Sailfish and kallisto are substantially faster and require less memory and disk usage compared to alignment-based methods that require creation and storage of BAM files;

  • it is possible to avoid discarding fragments that can align to multiple genes with homologous sequence, thus increasing sensitivity (Robert and Watson 2015).

Note: the tximport-to-DESeq2 approach uses estimated gene counts from the transcript abundance quantifiers, but not normalized counts.

Organise your directory to work on it!

Log into your Jetstream instance and launch RStudio. We will use both the Console (for R) and the Terminal (for bash) there.

Using the Terminal, we are first going to make a new directory called rna_seq_r, where we will be doing all the analysis, and inside it create two other directories called figures and r_script:

mkdir rna_seq_r
cd rna_seq_r
mkdir figures
mkdir r_script

We are going to set our working directory in the R Console so that we can find and store all our analysis there:

setwd("~/rna_seq_r")

This is how you start any project in R: set your working directory, where you will find your input files (unless you download them directly as in this lesson) and where you will output all your data (and your RScript!).

Continue the lesson in RMarkdown

We are now going to download the rest of the lesson and follow it in R Markdown, which is a great way of making and sharing documents. Type in your Terminal:

wget https://raw.githubusercontent.com/maggimars/YeastTutorial/master/R_RNA.Rmd

and let’s get started!

More resources

ExpressionSet

salmon

DESeq2

Data Carpentry R

Software Carpentry R

tags: RNA-seq R Bioconductor DESeq2