These are the schedule and classroom materials for the ANGUS workshop at UC Davis, which will run from July 1st to July 13th, 2019. ANGUS is part of the 2019 Data Intensive Biology Summer Institute.
This workshop runs under a Code of Conduct. Please respect it and be excellent to each other!
We hope that we can respect each other by following these guidelines & expectations.
Please submit your feedback here on any aspect of the workshop. We encourage feedback from everyone involved in the workshop in any capacity
Twitter hash tag: #dibsi2019
Room Mexía will be led by Rocio Martinez-Nunez and Taylor Reiter, in 1043 Valley.
Room Lovelace will enjoy the leadership of Chissa Rivaldi and Mike Lee, in 1041 Valley.
Room Conway will be led by Sateesh Peri & Marian Schmidt, in 2020 Valley.
Monday, July 1st, 2019:
1:30-2:20pm: Welcome to ANGUS 2019 slides
2:20-2:30pm: BREAK
2:30-3:30pm: Room Welcomes
3:30-3:45pm: BREAK
3:45pm-5:00pm: Introduction to cloud computing and a motivating example
5:00pm: DONE
Tuesday, July 2nd, 2019:
9:00-10:20am: Introduction to the Shell
10:20-10:40am: BREAK
10:40am-12:00pm: Introduction to the Shell
12:00-1:00pm: LUNCH
1:00-2:00pm: Lisa K Johnson: Intro to Modern Sequencing Practices - view recording!
2:00-2:10pm: BREAK
2:15-5:00pm: Poster session & Ice Cream Social (with dairy-free options!)
5:00pm: DONE
Wednesday, July 3rd, 2019:
9:00-9:15am: Meet in Lecture Room (1030) for quick meeting
9:20-10:20am: Wrapping up Introduction to the Shell
10:20-10:40am: BREAK
10:40am-12:00pm: Wrapping up Introduction to the Shell
12:00-1:00pm: LUNCH
1:00-2:00pm: Amanda Charbonneau: How to statistics Slides - view recording
2:00-2:10pm: BREAK
2:15-3:30pm: Logging onto the Cloud
3:30-3:50pm: BREAK
3:50-5:00pm: Conda
5:00pm: DONE
Thursday, July 4th, 2019
9:00-10:20am: Quality Control & Adapter Trimming of Sequences
10:20-10:40am: BREAK
10:40am-12:00pm: Quality Control & Adapter Trimming of Sequences
BREAK FOR THE DAY - 4th of July Activities
Friday, July 5th, 2019
9:00-10:20am: Finding resources
10:20-10:40am: BREAK
10:50am-12:00pm: Rayna Harris - RNAseq: a five course meal - video recording
12:00-1:00pm: LUNCH
1:00-2:00pm: Read Quantification
2:00-2:10pm: BREAK
2:15-3:30pm: Introduction to R & RStudio
3:30-3:50pm: BREAK
3:50-5:00pm: Introduction to R & RStudio
5:00pm: DONE
Monday, July 8th, 2019
9:00-10:20am: Introduction to R & RStudio cont.
10:20-10:40am: BREAK
10:40am-12:00pm: Introduction to R & RStudio cont.
12:00-1:00pm: LUNCH
1:00-2:00pm: Olga Botvinnik - Single cell RNAseq (PDF) - video recording
2:00-2:10pm: BREAK
2:15-3:30pm: RNAseq - Differential expression and visualization with R
3:30-3:50pm: BREAK
3:50-5:00pm: Mapping & Variant Calling
5:00pm: DONE
Tuesday, July 9th, 2019
9:00-10:20am: Mapping & Variant Calling and/or Automation with bash
10:20-10:40am: BREAK
10:40am-12:00pm: Automation with bash and/or Workflow management with Snakemake
12:00-1:15pm: LUNCH
1:15-2:15pm: Shannon Joslin - Genome Assembly - video
2:15-2:25pm: BREAK
2:25-3:30pm: Workflow management with Snakemake (continued)
3:30-3:50pm: BREAK
3:50-5:00pm: Adding onto our Snakemake workflow
5:00pm: DONE
Wednesday, July 10th, 2019
9:00-10:20am: Reviewing adding onto our Snakemake workflow
10:20-10:40am: BREAK
10:40am-12:00pm: Genome Assembly
12:00-1:00pm: LUNCH
1:00-2:00pm: Erich Schwarz - Doing science with big genomes - video
2:00-2:10pm: BREAK
2:15-3:30pm: Genome Assembly (continued)
3:30-3:50pm: BREAK
3:50-5:00pm: Elective Lab: (1) More Practical Use of Unix & (2) BYOD
5:00pm: DONE
Thursday, July 11th, 2019
9:00-10:20am: Elective Lab: (1) Intro to Git & (2) BYOD
10:20-10:40am: BREAK
10:40am-12:00pm: Elective Lab: (1) Intro to Git (continued) & (2) BYOD
12:00-1:00pm: LUNCH
1:00-2:00pm: Rocio Martinez-Nunez - What comes after ANGUS?
2:00-2:10pm: BREAK
2:15-3:30pm: Elective Lab: (1) Microbial Ecology (a discussion on amplicon sequencing and metagenomics) & (2) Transcriptome Assembly and Annotation & (3) BYOD
3:30-3:50pm: BREAK
3:50-5:00pm: Elective Lab: Quick Insights from Sequencing Data with Sourmash & (2) BYOD
5:00pm: DONE
Friday, July 12th, 2019
9:00-10:20am: Elective Lab: (1) Recovering “genomes” from metagenomes & (2) RNAseq in R & (3) BYOD
10:20-10:40am: BREAK
10:40am-12:00pm: Elective Lab: (1) Recovering “genomes” from metagenomes (continued) & (2) RNAseq in R (continued) & (3) BYOD
12:00-1:00pm: LUNCH
1:00-2:00pm: Phillip Brooks - Cloud platforms & Seven Bridges
2:00-2:10pm: BREAK
2:15-3:30pm: Elective Lab: (1) Seven Bridges & (2) Overview of Cluster Computing & (3) BYOD
3:30-3:50pm: BREAK
3:50-5:00pm: Elective Lab: (1) Seven Bridges & (2) BYOC (Bring Your Own Cluster questions, for SLURM and UGE) & (3) BYOD
5:00pm: DONE
Saturday, July 13th, 2019
(check out of your rooms and bring luggage over. Please be sure to bring your computer!)
9:00am-9:30am: Cloud Resources with Sateesh Peri
9:30am-10:00am: 5 things I wish I knew by Mike Lee
10:00am-10:20am: What to Slack after DIBSI? by Shannon Joslin
10:20am-10:40am: BREAK
10:40am-11:00am: Survey in Room 1030
11:00am-11:30am: Classroom feedback–1up/1down (in classrooms) & TA / Instructor meeting (in Rm1030)
11:30am-12:00pm: Titus Wrap up & Outro
12:00pm: THANK YOU & HAPPY TRAVELS!
All DIBSI attendees are expected to agree with the following code of conduct. We will enforce this code as needed. We expect cooperation from all attendees to help ensure a safe environment for everybody.
Please see the Event Guidelines and Expectations for more information.
Enforcement may result in one or more behavioral modifications intended to ensure that this workshop continues to be a safe and welcoming teaching and learning environment.
DIBSI events are neither a dating scene nor an intellectual contest.
DIBSI is dedicated to providing a harassment-free experience for everyone, regardless of gender, gender identity and expression, age, sexual orientation, disability, physical appearance, body size, race, or religion (or lack thereof). We do not tolerate harassment of participants in any form. Sexual language and imagery is generally not appropriate for any DIBSI venue.
Harassment includes offensive verbal comments related to gender, gender identity and expression, age, sexual orientation, disability, physical appearance, body size, race, religion, sexual images in public spaces, deliberate intimidation, stalking, following, harassing photography or recording, sustained disruption of talks or other events, inappropriate physical contact, and unwelcome sexual attention.
Attendees asked to stop any behavior are expected to comply immediately.
If you are being harassed, notice that someone else is being harassed, or if you are uncertain whether someone’s behavior might constitute harassment or have any other concerns, please contact any of the helpers, lead instructors, or organizers immediately. For more information, see the Attendee Procedure for Reporting Code of Conduct Incidents.
We expect participants to follow these guidelines at any DIBSI-organized event.
Original source and credit: http://2012.jsconf.us/#/about & The Ada Initiative. Please help by translating or improving: http://github.com/leftlogic/confcodeofconduct.com. This work is licensed under a Creative Commons Attribution 3.0 Unported License.
Everyone who participates in our ANGUS 2019 workshop shares the responsibility of creating a welcoming environment for learning and networking. As such, we would like to outline Workshop Guidelines for everyone and expectations for people based on their participation role below.
Note: All DIBSI participants must follow the event Code of Conduct. Our first priority is to provide and sustain a safe and welcoming environment for learning and teaching!
Our primary commitment is to learn from each other. We will listen to each other and not talk at each other. We acknowledge differences amongst us in backgrounds, skills, interests, and values. We realize that it is these very differences that will increase our awareness and understanding through this process.
Share responsibility for including everyone in the discussion. Be mindful of taking up much more space than others. Empower yourself to speak up when others are dominating the conversation.
Confidentiality. We want to create an atmosphere for open, honest exchange. Unless it is an actionable item or specifically indicated on notecards, we will not share beyond this room without permission.
Do not demean, devalue, or “put down” people for their experiences, lack of experiences, difference in interpretation of those experiences, or choice of computer or software.
Expect a range of emotions to be present, though not always visible. The workshop has long days and can be very intense for those around us.
Trust that people are always doing the best they can. We will give each other the benefit of the doubt. We will assume we are all trying our hardest and that our intentions are good even when the impact is not. Consider being wrong and accept responsibility for the consequences of our actions, regardless of their intentions.
Challenge the idea and not the person and expect to have your perspectives challenged by others. If we wish to challenge something that has been said, we will challenge the idea or the practice referred to, not the individual sharing this idea or practice.
Speak your discomfort. If something is bothering you, please share this with the group, your helper/instructor/coordinators, or submit a response through our online form. Often our emotional reactions to this process offer the most valuable learning opportunities.
Expect and accept a lack of closure. We will cover a myriad of topics throughout this workshop and not all questions will be answered or clear.
We are all a work in progress. We will not assume that one comment or one opinion made at one time captures the whole of a person’s character. We are all in the process of learning, including the facilitators.
Attendees are an active part of the workshop and help set the event atmosphere!
Ask questions if they have them & be aware of how their questions may impact others
Raise their hand
Ask a helper
Type it in hackMD
Write it on a sticky at the end of each lesson
Submit your question to the feedback form
Have an open mind and treat others as if they have something to offer.
Be ready and willing to learn from instructors, helpers, and each other.
If you have relevant expertise, we invite you to contribute to the conversation, especially when it relates directly to your own experiences. This can be done through:
Stickies
Group notes on the hackMD
Telling an instructor/helper
The feedback form
Actively participate in events they attend
Use stickies
Contribute to group notes
If you have the capacity, help the people around you
Respect others at the event
When receiving help from others during a lesson:
Use a whisper voice to lessen the impact on others
Be open to “driving” the computer
If more extensive help is necessary, be open to the helper/instructor “driving” your computer
Communicate personal space boundaries when necessary
Avoid disruptive behavior including:
Speaking to those around them during the lesson
Asking project-specific questions during the lesson (save these for outside of formal instruction or for BYOD in week 2)
Be an active and responsive bystander by either:
Speaking up in the moment of an event.
Submitting a climate response form.
Talking to instructors and course coordinators regarding events that you may witness.
Have an open mind and treat others as if they have something to offer.
Help event attendees learn and grow in a safe/brave and welcoming environment.
Welcome and encourage questions
When helping someone at their computer:
Be kind when approaching a learner at their computer and give them the benefit of the doubt that they are doing the best they can.
Be aware of the physical space between yourself and those around you.
Praise errors as opportunities for learners
Prioritize the opportunity for the learner to “drive” the computer.
If it makes more sense for the instructor/helper to “drive” the learner’s computer, always ask for permission to touch someone else’s computer.
Be an active and responsive by-stander by either:
Speaking up in the moment.
Submitting a climate response form.
Talking to instructors and course coordinators regarding events that you may witness.
Teach attendees
Ask leading questions and help learners get to answers on their own
When actively teaching, use a microphone.
Instructors will repeat questions from learners
Provide individualized or group help as needed
Avoid disruptive behavior including:
Intellectually aggressive behavior, such as challenging other people (including instructors/helpers/attendees) on their content
Be mindful of the language they use:
Avoid minimizing language including “just”, “only”, “simply”, “easy”, “obviously”, etc.
Instead consider motivating language such as “This gets easier with practice” & “I remember when it was hard for me too”
Comfort & empathize with others who may be stressed
Some other potential roles:
Have breaks when they are scheduled
Give the instructor who is teaching a few minutes’ warning before the lesson is scheduled to finish
Note takers/supporters
slack monitoring
Room coordinating
Actively watching for sticky notes - and promoting sticky awareness for other instructors & helpers
Writing challenge questions
Assisting with curriculum tweaks
Provide the infrastructure needed for a safe and welcoming environment
Coordinate the overall event
Address individual attendee, instructor, and helper needs
Lay out clear and unambiguous guidelines for participant conduct, and provide incident response guidelines and training.
Be an active and responsive by-stander by either:
Speaking up in the moment
Submitting a climate response form
Talking to the lead-instructors and other coordinators regarding events that you may witness
Always use the microphone.
Repeat questions from the audience before answering them.
Understand what cloud computing is
Get to know some popular cloud compute resources
Perform a command-line BLAST
Researchers worldwide connect to cloud computing services every day to perform their bioinformatics analyses. Biology - like lots of areas - is dealing with a deluge of data due to the rapid advancement of data collection methods. It is now common that data collected for an experiment doesn’t fit on a researcher’s laptop and that the resources needed for running an analysis can far exceed a desktop computer’s computing power. Cloud computing is a powerful avenue for dealing with this.
The cloud is basically lots of servers (think big computers) stacked together in a giant, powerful infrastructure. You can borrow part of this infrastructure for your computing needs.
There are two main approaches to accessing computing time and power: 1) we can borrow computing time and resources from a commercial provider; or 2) you may have access to a computing infrastructure through your institution or somewhere else. Some countries have built national infrastructures where you can apply for computing time for your research projects. Most academic institutions or departments also have their own computing resources.
Academic
Commercial
Each of us has an IP address for connecting to our cloud computer; these are stored in this spreadsheet here.
That will open a screen where you need to enter a password. (The password will be told to you in the room or in slack.) Then a screen like this will open (minus the blue arrows):
Now click the files tab at the top-left (that the smaller blue arrow points to above) and then click the “Terminal” icon at the bottom, and we’ll be in our appropriate command-line environment:
Installing blast and creating a working directory:
cd ~/
conda install -y -c bioconda blast
mkdir opening_example
cd opening_example/
Note: takes 5-10 minutes
Given one or more query sequences (usually in FASTA format), BLAST looks for matching sequence regions between them and a subject set.
A sufficiently close match between subsequences (denoted by arrows in the figure above, though matches are usually longer than illustrated here) is called a high-scoring pair (HSP), while a query sequence is said to hit a target sequence if they share one or more HSPs.
The goal of this tutorial is to run you through a demonstration of the command line, which you may not have seen or used much before.
First! We need some data and an Rscript to later visualize blast results. Let’s grab the mouse and zebrafish RefSeq
protein data sets from NCBI, and put them in our home directory. We’ll use curl
to download the files; these originally came from the NCBI Web site: ftp://ftp.ncbi.nih.gov/refseq/M_musculus/mRNA_Prot.
curl -o mouse.1.protein.faa.gz -L https://osf.io/v6j9x/download
curl -o mouse.2.protein.faa.gz -L https://osf.io/j2qxk/download
curl -o zebrafish.1.protein.faa.gz -L https://osf.io/68mgf/download
curl -o blastviz.R -L https://osf.io/e548g/download
To look at the files in your current directory:
ls -l
All three of the sequence files are FASTA protein files (that’s what the .faa
suggests) that are compressed with gzip
(that’s what the .gz means).
Uncompress them:
gunzip *.faa.gz
And let’s look at the first few sequences in the file:
head mouse.1.protein.faa
These are protein sequences in FASTA format. FASTA format is something many of you have probably seen in one form or another – it’s pretty ubiquitous. It’s a text file, containing records; each record starts with a line beginning with a ‘>’, and then contains one or more lines of sequence text.
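As an optional quick check (not part of the original walkthrough), you can count how many sequences each file holds by counting the header lines:

# every FASTA record starts with ">", so counting those lines counts sequences
grep -c '^>' mouse.1.protein.faa mouse.2.protein.faa zebrafish.1.protein.faa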
Let’s take those first two sequences and save them to a file. We’ll do this using output redirection with ‘>’, which says “take all the output and put it into this file here.”
head -11 mouse.1.protein.faa > mm-first.fa
Now let’s BLAST these two sequences against the entire zebrafish protein data set. First we need to create a database. That’s done by calling ‘makeblastdb’:
makeblastdb -in zebrafish.1.protein.faa -dbtype prot
Next, we call BLAST to do the search:
blastp -query mm-first.fa -db zebrafish.1.protein.faa
This should run pretty quickly, but you’re going to get a lot of output!!
To save it to a file instead of watching it go past on the screen,
ask BLAST to save the output to a file that we’ll name mm-first.x.zebrafish.txt
:
blastp -query mm-first.fa -db zebrafish.1.protein.faa -out mm-first.x.zebrafish.txt
and then you can ‘page’ through this file at your leisure by typing:
less mm-first.x.zebrafish.txt
(Type spacebar to move down, and ‘q’ to get out of paging mode.)
Let’s do some more sequences (this one will take a little longer to run):
head -498 mouse.1.protein.faa > mm-second.fa
blastp -query mm-second.fa -db zebrafish.1.protein.faa -out mm-second.x.zebrafish.txt
This will compare the first 96 sequences. You can look at the output file with:
less mm-second.x.zebrafish.txt
(and again, type ‘q’ to get out of paging mode.)
Notes:
you can copy/paste multiple commands at a time, and they will execute in order;
why did it take longer to BLAST mm-second.fa
than mm-first.fa
?
Last, but not least, let’s generate a more machine-readable version of that last file –
blastp -query mm-second.fa -db zebrafish.1.protein.faa -out mm-second.x.zebrafish.tsv -outfmt 6
See this link for a description of the possible BLAST output table formats.
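As an optional aside, you can peek at the first few rows of the default 12-column table, or ask blastp for only the columns you want by listing format specifiers after the 6 (the output filename below is just an example):

# look at the first few hits in the default tabular output
head -5 mm-second.x.zebrafish.tsv

# request only query ID, subject ID, percent identity, and e-value
blastp -query mm-first.fa -db zebrafish.1.protein.faa \
    -outfmt "6 qseqid sseqid pident evalue" \
    -out mm-first.x.zebrafish.custom.tsv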
Now we’ll run an R script to visualize the blast results:
Rscript blastviz.R
A pdf will be generated with the results. We can view this by clicking on the Folder icon at the left of our screen, and then double clicking on the file at the top to open the pdf:
Things to mention and discuss:
blastp
options and -help.
command line options, more generally - why so many?
automation rocks!
Maybe the most important thing to keep in mind while going through these pages is that this is all about exposure, not memorization or mastering anything. Don't worry about the details 🙂
First, here are some terms that are often used interchangeably when referring to the text-based environment we were working in yesterday. It’s not important to remember the differences, but they’re here just so we see them at least once 🙂
Term | What it is
---|---
shell | what we use to talk to the computer; anything where you are pointing and clicking with a mouse is a Graphical User Interface (GUI) shell; something with text only is a Command Line Interface (CLI) shell
command line | a text-based environment capable of taking input and providing output (a "terminal" is the same idea)
Unix | a family of operating systems
bash | the most common programming language used at a Unix command-line
(modified from the carpentries)
Many bioinformatics tools can only be used through a command-line interface, and/or have extra capabilities in the command-line version that are not available in the GUI (e.g. BLAST).
It makes our work less error-prone. When humans do the same thing a hundred different times (or even ten times), we’re likely to make a mistake. Your computer can do the same thing a thousand times with no mistakes. This also frees us up to do other things we can’t automate, like the science part.
The shell makes our work more reproducible. When we carry out our work in a command-line interface (rather than a GUI), we can keep an exact record of all steps, which we can use to re-do our work when we need to. It also gives us a way to unambiguously communicate what we’ve done to others.
Many bioinformatic tasks require large amounts of computing power and can’t realistically be run on our own machines. These tasks are best performed using remote computers or cloud computing, which can only be accessed through a shell.
Getting started
Working with files and directories
Redirectors and wildcards
Six glorious commands
Variables and For loops
Launching Jetstream instances
SSH Secure Shell login to instances
Jetstream instance maintenance
What we’re going to do here is walk through starting up a running computer (an “instance”) on the Jetstream service.
Jetstream is an NSF/XSEDE resource designed to promote and provide configurable cyberinfrastructure in the form of cloud computing to both novice and experienced users.
If you would like to read more about cloud computing, see this Carpentry Cloud Computing lesson.
NOTE: WINDOWS users will need to install a UNIX-ready terminal to SSH-login (if you haven’t already).
Options:
Login to Jetstream by clicking the “login” button towards the right-upper corner of the screen.
Fill in your Jetstream username and password and click “LOGIN”
Select the “Projects” tab and then click the “CREATE NEW PROJECT” button
Give your Project folder a name, first and last name would be best while here (Description is optional). Then click “CREATE”.
Click on your newly created project and then Click on “NEW” and then “Instance” from the drop-down menu to start up a new virtual machine.
To select an image click on “Show All” tab and Search for “ANGUS 2019” and choose the “ANGUS 2019” image created by ‘titus’.
NOTE: Here “Image” refers to the template of a virtual machine containing an installed operating system, software, and configuration stored as a file on disk. Think of it like apps that come with your phone before you add new ones on your own.
You will be presented with options to choose and configure your virtual machine here:
Instance Name: e.g., “ANGUS 2019 base image”, or you can leave the default, which is the image name.
Base Image Version: choose the latest version (not what’s in the below image)
Project: select your project folder
Allocation Source: TG-MCB190088
Provider: “Jetstream Indiana University”
Instance size: We recommend “m1.medium” (CPU: 6, Mem: 16GB, Disk: 60GB) for this tutorial; though depending on your allocations, choose the most suitable one.
NOTE: Choose the latest “Base Image Version” (not what’s in this below image), it should be the default.
Launch instance and wait for the build to be deployed (~ 5-10 minutes).
Note: During the build process:
scheduling-->building-->spawning-->deploying-->Networking-->N/A
. Once the virtual machine is ready, the “Activity” column will show “N/A” and the “Status” column will turn green and “Active”.
Navigate back to ‘Projects’ and click on your new instance’s name to see more information about the instance you just created, and copy its IP address.
Great! We have now built our very own remote virtual machine with all the software pre-installed. Next we will use SSH secure login to access these remote instances from our laptops.
macOS & LINUX users can open a Terminal window now.
Windows 10 users with the Ubuntu subsystem can open a terminal.
If your Windows installation doesn’t include an Ubuntu distribution, start a new session in Git Bash.
Start a new session; Fill in your “remote host” the IP address of your virtual machine; select “specify username” and enter your Jetstream username; Click OK.
Establish a secure-login to the instance by typing the following:
$ ssh your_Jetstreamusername@ip_address
This should log you into Jetstream and you should see a screen like this; Enter ‘yes’ and then enter your Jetstream password.
Your cursor will not move or indicate you are typing as you enter your password. If you make a mistake, hit enter and you will be prompted again.
Success !!! We have established connections with our instances. Proceed to the Tutorial section.
To end your current session on an Instance and close SSH connection, type ‘exit’
Action | Description
---|---
Report | Instance exhibiting unexpected behavior? Report it here
Image | Request an image (a type of template for a virtual machine) of a running instance
Suspend | Suspending an instance frees up resources for other users and allows you to safely preserve the state of your instance without imaging. Your time allocation no longer counts against you while the instance is suspended
Shelve | Shelving safely preserves the state of your instance for later use and frees up resources for other users. It is the best way to reduce resource usage compared with other actions like "suspend" and "stop". Your time allocation no longer counts against you while the instance is shelved
Stop | Stop an instance
Reboot | Reboot an instance
Redeploy | Redeploying an instance allows you to fix instances that show up as 'active - deploy_error'
Delete | The instance will be shut down and all data will be permanently lost
Why should I use a package and environment management system as part of my research workflow?
What is Conda? Why use Conda?
Manage Conda environments
Conda channels & packages
Installing software is hard. Installing scientific software (including all required dependencies of said software!) is often even more challenging.
Installing software system-wide creates complex dependencies between your research projects that shouldn’t really exist!
Rather than installing software system-wide, wouldn’t it be great if we could install software separately for each research project?
An environment is a directory that contains a specific collection of packages/tools that you have installed. For example, you may have one environment with Python 2.7 and its dependencies, and another environment with Python 3.4 for legacy testing. If you change one environment, your other environments are not affected. You can easily activate or deactivate environments, which is how you switch between them.
Conda is an open source package and environment management system that runs on Windows, macOS, and Linux.
Conda can quickly install, run, and update packages and their dependencies.
Conda can create, save, load, and switch between project specific software environments on your local computer.
Conda as a package manager helps you find and install packages. If you need a package that requires a different version of Python, you do not need to switch to a different environment manager, because Conda is also an environment manager. With just a few commands, you can set up a totally separate environment to run that different version of Python, while continuing to run your usual version of Python in your normal environment.
We have already installed conda on these instances, but please see the stellar conda documentation after the class if you’d like to install it on your personal computer or elsewhere.
If you need to re-access your terminal environment, see here.
Check your current version of python by executing python --version
To create a new environment named, for instance, mynewenv (you can name it whatever you like), run:
conda create --name mynewenv
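The command above creates an empty environment. If you want the new environment to start out with a specific Python version (say, Python 3.4), you can request one at creation time; the environment name here is just an example:

# create an environment that comes with Python 3.4 pre-installed
conda create --name py34env python=3.4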
Inside a new Conda installation, the root environment is activated by default, so you can use it without activation.
In other cases, if you want to use an environment (for instance manage packages, or run Python scripts inside it) you need to first activate it.
conda activate mynewenv
The command prompt changes upon the environment’s activation; it now contains the active environment’s name.
The directories of the active environment’s executable files are added to the system path (this means that you can now access them more easily). You can leave an environment with this command:
conda deactivate
It needs to be mentioned that upon deactivating an environment, the base environment becomes active automatically.
Channels are the locations of the repositories (directories) online containing Conda packages. Upon Conda’s installation, Continuum’s (Conda’s developer) channels are set by default, so without any further modification, these are the locations where your Conda will start searching for packages.
Channels in Conda are ordered. The channel with the highest priority is the first one that Conda checks, looking for the package you asked for. You can change this order, and also add channels to it (and set their priority as well).
If multiple channels contain a package, and one channel contains a newer version than the other, the order of the channels determines which of the two versions is going to be installed, even if the higher priority channel contains the older version.
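You can check the channel order Conda is currently using at any time (this is the same command that appears in the summary table further down):

# list the configured channels; the most recently added channel has the highest priority
conda config --get channels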
Bioconda Channel
See the bioconda paper and the bioconda web site
Bioconda is a community-enabled repository of 3,000+ bioinformatics packages, installable via the conda package manager. Note: bioconda is not available for windows systems
Add channels
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
Searching, installing and removing packages
To list out all the installed packages in the currently active environment, run:
conda list
Try running a program pre-installed on this instance:
fastqc
To search for all the available versions of a certain package, you can use the search command. For instance, to list out all the available versions of the samtools package, run:
conda search samtools
Similarly to the conda list command, this one results in a list of the matching package names, versions, and channels:
Loading channels: done
# Name Version Build Channel
samtools 0.1.12 0 bioconda
samtools 0.1.12 1 bioconda
samtools 0.1.12 2 bioconda
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
samtools 1.9 h91753b0_3 bioconda
samtools 1.9 h91753b0_4 bioconda
samtools 1.9 h91753b0_5 bioconda
samtools 1.9 h91753b0_8 bioconda
To install a package (for example, samtools) that is inside a channel that is on your channel list, run this command (if you don’t specify which version you want, it’ll automatically install the latest available version from the highest priority channel):
conda install samtools
You can also specify the package’s version:
conda install samtools=1.9
This will save the list of conda-installed software you have in a particular
environment to the file packages.txt
:
conda list --export > packages.txt
(it will not record the software versions for software not installed by conda.)
conda install --file=packages.txt
will install those packages in your local environment.
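You can also use the same file to build a fresh environment elsewhere instead of installing into your current one (the environment name here is just an example):

# create a brand-new environment containing the packages recorded in packages.txt
conda create --name mynewenv-copy --file packages.txt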
Conda command | Action
---|---
conda install | install a package
conda list | list the installed packages in the currently active environment
conda search | search for a package
conda info | list information about the environment
conda remove | remove a conda package
conda config --get channels | list the active channels and their priorities
conda update | update all the installed packages
conda config --remove channels unwanted_channel | remove unwanted_channel
conda env list | list the different environments you have set up
conda activate myNewEnvironment | activate the myNewEnvironment Conda environment (this also works for activating our base environment)
conda info --envs | list the locations of Conda directories
Conda environments are exceptionally useful! However, they can become quite large depending on how many packages we load into them. We can check how large any of our Conda environments are by finding the path to the environment directory and then estimating the file space usage of that directory.
First, let’s find where we put our mynewenv directory:
conda info --envs
This will print out a list of the locations of our Conda environments.
# conda environments:
#
base * /opt/miniconda
mynewenv /opt/miniconda/envs/mynewenv
Next, let’s use the command du
to estimate the space our mynewenv
directory is taking up!
du -sh /opt/miniconda/envs/mynewenv/
We can see our mynewenv
environment is taking up about 12K of space.
12K /opt/miniconda/envs/mynewenv/
QUESTION: How much space is your base
environment taking up?
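One way to check (a sketch; the path comes from the conda info --envs output above, and --exclude leaves out the envs/ subdirectory that holds the other environments):

# estimate the size of the base environment's directory
du -sh --exclude=envs /opt/miniconda/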
Conda Documentation
Image credits: Gergely Szerovay. Read original article here
Learning objectives:
Install software (fastqc, multiqc, trimmomatic) via conda
download data
visualize read quality
quality filter and trim reads
If you don’t have one running currently, start up a Jetstream m1.medium or larger instance (as detailed here), and then ssh
login through your terminal (as shown here).
You should now be logged into your Jetstream computer! You should see something like this:
(base) dibada@js-171-9:~$
Change to your home directory:
cd ~/
Let’s make sure our conda channels are loaded, just in case.
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
and install FastQC, MultiQC, and trimmomatic:
conda install -y -c bioconda fastqc multiqc trimmomatic
Make a “data” directory:
cd ~/
mkdir data
cd data
and download some data from the Schurch et al, 2016 yeast RNAseq study. We will download three wildtype yeast datasets, and three mutant datasets (SNF2 knockout; SNF2 is a global transcription regulator).
curl -L https://osf.io/5daup/download -o ERR458493.fastq.gz
curl -L https://osf.io/8rvh5/download -o ERR458494.fastq.gz
curl -L https://osf.io/2wvn3/download -o ERR458495.fastq.gz
curl -L https://osf.io/xju4a/download -o ERR458500.fastq.gz
curl -L https://osf.io/nmqe6/download -o ERR458501.fastq.gz
curl -L https://osf.io/qfsze/download -o ERR458502.fastq.gz
curl -L https://osf.io/9c5jz/download -o err_md5sum.txt
We also downloaded a file holding some md5sum values for our sequence files (“err_md5sum.txt”). An md5sum hash is like a unique fingerprint for a file. If we compute the md5sum for the files we downloaded, and they don’t match what is specified in the “err_md5sum.txt” file that came with them, then we would know if any of the files were corrupted or cut short while we were downloading them.
If we look at what’s in that file:
head err_md5sum.txt
We can see a value for each of our sequence files:
2b8c708cce1fd88e7ddecd51e5ae2154 ERR458493.fastq.gz
36072a519edad4fdc0aeaa67e9afc73b ERR458494.fastq.gz
7a06e938a99d527f95bafee77c498549 ERR458495.fastq.gz
107aad97e33ef1370cb03e2b4bab9a52 ERR458500.fastq.gz
fe39ff194822b023c488884dbf99a236 ERR458501.fastq.gz
db614de9ed03a035d3d82e5fe2c9c5dc ERR458502.fastq.gz
Those are the md5sums they are supposed to have if they transferred properly. We can have the computer calculate md5sums for the files we have now and check that they match up like this:
md5sum -c err_md5sum.txt
And if all is well, it will print out “OK” for each of the files:
ERR458493.fastq.gz: OK
ERR458494.fastq.gz: OK
ERR458495.fastq.gz: OK
ERR458500.fastq.gz: OK
ERR458501.fastq.gz: OK
ERR458502.fastq.gz: OK
Often the sequencing facility will provide an md5sum file with your data, or you may see them with large datasets in public repositories.
If we wanted to create our own md5sum file to keep with our data, we could do so like this for one file:
md5sum ERR458493.fastq.gz > my_md5sum.txt
Or we could use the *
wildcard to help specify all of our .fastq.gz files here (overwriting the previous file):
md5sum *.fastq.gz > my_md5sum.txt
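Later (for example after moving the files to another machine), we could verify our data against our own record just as we checked the downloaded files:

# recompute the checksums and compare them against the ones we saved
md5sum -c my_md5sum.txt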
Back to our data, if we type:
ls -l *.fastq.gz
We should see something like:
-rw-rw-r-- 1 dibada dibada 59532325 Jul 4 03:23 ERR458493.fastq.gz
-rw-rw-r-- 1 dibada dibada 58566854 Jul 4 03:24 ERR458494.fastq.gz
-rw-rw-r-- 1 dibada dibada 58114810 Jul 4 03:24 ERR458495.fastq.gz
-rw-rw-r-- 1 dibada dibada 102201086 Jul 4 03:24 ERR458500.fastq.gz
-rw-rw-r-- 1 dibada dibada 101222099 Jul 4 03:24 ERR458501.fastq.gz
-rw-rw-r-- 1 dibada dibada 100585843 Jul 4 03:24 ERR458502.fastq.gz
These are six data files from the yeast study.
One problem with these files is that they are writeable, which means we could accidentally edit or delete them. By default, UNIX makes things writeable by the file owner, which makes it easy to introduce typos or errors into raw data. Let’s fix that before we go on any further:
chmod a-w *
Here we are using the chmod
command (for change mode, which handles permissions of files/directories), and providing the argument a-w
, which means for all users (a
), take away (-
) the permission to write (w
).
If we take a look at their permissions now:
ls -l
We can see that the ‘w’ in the original permission string
(-rw-rw-r--
) has been removed from each file and now it should look like
-r--r--r--
.
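If you ever do need to edit or delete one of these files later, you (as the file owner) can give yourself write permission back; a minimal sketch on one of the files:

# u+w adds (+) the write (w) permission back for the file's owner (u)
chmod u+w ERR458493.fastq.gz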
We’ll talk about what these files are below.
First, make a new working directory:
mkdir ~/quality
cd ~/quality
Now, we’re going to make a “link” to our raw data files in our current working directory:
ln -fs ~/data/* .
and you will see that they are now linked in the current directory when you do an
ls
. These links save us from having to specify the full path (address) to their location on the computer, without us needing to actually move or copy the files. But note that changing these files here still changes the original files!
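If you want to confirm these are links rather than copies, a long listing shows each link with an arrow pointing at its target in ~/data/ (a quick optional check):

# symbolic links start with "l" in the permission string and show "-> target"
ls -l ERR458493.fastq.gz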
These are FASTQ files – let’s take a look at them:
gunzip -c ERR458493.fastq.gz | head
FASTQ files are sequencing files that contain reads and quality information about each base-call within each read. A score of 0 is the lowest quality score, and a score of 40 is the highest. Instead of typing the scores as numbers, they are each encoded as a character.
Quality encoding: !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHI
| | | | |
Quality score: 0........10........20........30........40
A score of 10 indicates a 1 in 10 chance that the base is inaccurate. A score of 20 is a 1 in 100 chance that the base is inaccurate. 30 is 1 in 1,000. And 40 is 1 in 10,000.
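In general, the error probability implied by a quality score Q is 10^(-Q/10). As a quick optional illustration (not part of the original lesson), a one-line awk command can print that relationship for a few scores:

# print "1 in N" error odds for quality scores 10, 20, 30, and 40
awk 'BEGIN { for (q = 10; q <= 40; q += 10) printf "Q%d -> 1 in %d\n", q, 10^(q/10) }'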
Links:
We’re going to use FastQC to summarize the data. We already installed ‘fastqc’ above, with the conda command.
Now, run FastQC on two files:
fastqc ERR458493.fastq.gz
fastqc ERR458500.fastq.gz
Now let’s use ls
:
ls *fastqc.zip
to list the files, and you should see:
ERR458493_fastqc.zip
ERR458500_fastqc.zip
Inside each of those zip files are directories holding reports from the fastqc program. But these are also nicely summarized in HTML files:
ls *.html
Let’s transfer the HTML files to our local computers so we can open and view
them in our browsers. We’ll use the command scp
to do this.
scp
stands for secure copy. We can use this command to transfer files between
two computers. We need to run this command on our local computers (i.e. from
your laptop).
The scp
command looks like this:
scp <file I want to move> <where I want to move it>
First we will make a new directory on our computer to store the HTML files we’re transferring. Let’s put it on our desktop for now. Make sure the terminal program you used this morning to ssh onto your instance is open. If you’re in it now, you can open a new tab (using the pull-down menu at the top of your screen, or the cmd + t or ctrl + t keyboard shortcut). Type:
mkdir ~/Desktop/fastqc_html
Now we can transfer our HTML files to our local computer using scp
, which will look something like this:
scp username@ipaddress:~/quality/*.html ~/Desktop/fastqc_html/
The first part of the command is the address for your remote
computer. Make sure you replace everything after username@ with your IP address (the one you used to log in with the ssh
command).
The second part starts with a : and then gives the absolute path of the files
you want to transfer from your remote computer. Don’t forget the :
. We use a
wildcard (*.html) to indicate that we want all of the HTML files.
The third part of the command gives the absolute path of the location you want
to put the files. This is on your local computer and is the directory we just
created ~/Desktop/fastqc_html
.
You will be prompted for your password. Enter the password we set for your instance during the morning session.
You should see a status output like this:
ERR458493_fastqc.html 100% 642KB 370.1KB/s 00:01
ERR458500_fastqc.html 100% 641KB 592.2KB/s 00:01
Now we can go to our new directory and open the HTML files.
You can also download and look at these copies of them:
These files contain a lot of information! Depending on your application, some of these modules are more helpful than others. For example, for RNA-seq data, failing duplication levels aren’t necessarily a problem – we expect to have duplicated sequences given that some transcripts are very abundant.
Questions:
What should you pay attention to in the FastQC report?
Is one sample better than another?
Links:
There are some important things to know about FastQC - the main one is that it only calculates certain statistics (like duplicated sequences) for subsets of the data (e.g. duplicate sequences are only analyzed within the first 100,000 sequences in each file). You can read more about each module at their docs site.
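If you prefer to stay at the command line, each FastQC zip archive typically contains a plain-text summary.txt listing the PASS/WARN/FAIL status of every module (this assumes FastQC’s usual archive layout):

# print the per-module PASS/WARN/FAIL summary without unpacking the whole archive
unzip -p ERR458493_fastqc.zip '*/summary.txt'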
Now we’re going to do some trimming! Let’s switch back to our instance terminal, as we will be running these commands on our remote computers. We’ll be using Trimmomatic, which (as with fastqc) we’ve already installed via conda.
The first thing we’ll need is a file holding the adapters we need to trim off. These are artificial and added as part of the sequencing process. The trimmomatic program comes with standard adapter files, so here we are going to copy them over to our current working directory.
cp /opt/miniconda/pkgs/trimmomatic-*/share/trimmomatic-*/adapters/TruSeq2-SE.fa .
(you can look at the contents of this file with cat TruSeq2-SE.fa
)
Now, let’s run Trimmomatic on a couple samples:
trimmomatic SE ERR458493.fastq.gz \
ERR458493.qc.fq.gz \
ILLUMINACLIP:TruSeq2-SE.fa:2:0:15 \
LEADING:2 TRAILING:2 \
SLIDINGWINDOW:4:2 \
MINLEN:25
trimmomatic SE ERR458500.fastq.gz \
ERR458500.qc.fq.gz \
ILLUMINACLIP:TruSeq2-SE.fa:2:0:15 \
LEADING:2 TRAILING:2 \
SLIDINGWINDOW:4:2 \
MINLEN:25
CODE BREAKDOWN
Note that trimmomatic is actually a java program that we call from the command line, so its syntax is a little different from what we’ve been using. The arguments are followed by a colon, and then what you are specifying to them.
SE
- this is specified because we are working with single-ended data, rather than paired-end data where there are forward and reverse reads
ERR458493.fastq.gz
- the first positional argument we are providing is the input fastq file (note it is okay to provide compressed .gz files to trimmomatic)
ERR458493.qc.fq.gz
- the second positional argument specifies the name of the output files the program will generate
ILLUMINACLIP:TruSeq2-SE.fa:2:0:15
- here we are using the ILLUMINACLIP option: first we specify the file holding the adapter sequences, then 3 numbers: “2”, which states how many mismatches between the adapter sequence and what’s found are allowed; “0”, which is only relevant when there are forward and reverse reads; and “15”, which specifies how accurate the match must be
LEADING:2 TRAILING:2
- both of these are stating the minimum quality score at the start or end of the read, if lower, it will be trimmed off
SLIDINGWINDOW:4:2
- this looks at 4 basepairs at a time, and if the average quality of that window of 4 drops below 2, the read is trimmed there
MINLEN:25
- after all the above trimming steps, if the read is shorter than 25 bps, it will be discarded
You should see output that looks like this:
...
Input Reads: 1093957 Surviving: 1092715 (99.89%) Dropped: 1242 (0.11%)
TrimmomaticSE: Completed successfully
We can also run the same process for all 6 samples more efficiently using a for
loop, as follows. We’re going to use a
new command called basename
in the for loop, which we introduce below.
basename
is a function in UNIX that is helpful for removing a uniform part of a name from a list of files.
In this case, we will use basename to remove the .fastq.gz
extension from the files that we’ve been working with.
basename ERR458493.fastq.gz .fastq.gz
We see that this returns just the ERR accession, and no longer has the .fastq.gz file extension on it.
ERR458493
If we try the same thing but use .fasta
as the file extension instead, nothing happens.
This is because basename only removes the suffix when it exactly matches the end of the file name.
basename ERR458493.fastq.gz .fasta
ERR458493.fastq.gz
Basename is really powerful when used in a for loop. It allows us to access just the file prefix, which you can use to name things. Let’s try this.
Inside our for loop, we create a new name variable. We call the basename function inside the parentheses, then give our variable name from the for loop, in this case ${filename}, and finally state that .fastq.gz should be removed from the file name. It’s important to note that we’re not changing the actual files, we’re creating a new variable called name. The line echo ${name} will print the value of name to the terminal each time the for loop runs. Because we are iterating over six files, we expect to see six lines of output.
for filename in *.fastq.gz
do
name=$(basename ${filename} .fastq.gz)
echo ${name}
done
A note on variables: We can refer to a variable as either $filename or ${filename}. They mean the same thing, but using {} explicitly tells bash where the variable name ends. This makes it safer to refer to variables, especially when we’re combining them with other text. This is different from $( ), which tells bash that we want to execute the command inside and substitute its output.
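As a small, self-contained illustration of the difference (the sample name below is just one of the files we already have):

sample=ERR458493
# the braces mark exactly where the variable name ends, so we can glue text onto it
echo "${sample}.fastq.gz"
# $( ) runs the command inside and substitutes its output into the line
echo "${sample}.fastq.gz has $(gunzip -c ${sample}.fastq.gz | wc -l) lines"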
Now we will use basename
inside of our for loop to run trimmomatic on all 6 of our files.
for filename in *.fastq.gz
do
# first, make the base by removing fastq.gz
base=$(basename $filename .fastq.gz)
echo $base
trimmomatic SE ${base}.fastq.gz \
${base}.qc.fq.gz \
ILLUMINACLIP:TruSeq2-SE.fa:2:0:15 \
LEADING:2 TRAILING:2 \
SLIDINGWINDOW:4:2 \
MINLEN:25
done
The for loop above will go through each of the filenames that end with fastq.gz and run Trimmomatic on it.
Questions:
How do you figure out what the parameters mean?
How do you figure out what parameters to use?
What adapters do you use?
What version of Trimmomatic are we using here? (And FastQC?)
Do you think parameters are different for RNAseq and genomic data sets?
What’s with these annoyingly long and complicated filenames?
For a discussion of optimal trimming strategies, see MacManes, 2014 – it’s about RNAseq but similar arguments should apply to metagenome assembly.
Links:
Let’s take a look at the output files:
gunzip -c ERR458493.qc.fq.gz | head
It’s hard to get a feel for how different these files are by just looking at them, so let’s run FastQC again on the trimmed files:
fastqc ERR458493.qc.fq.gz
fastqc ERR458500.qc.fq.gz
And now view my copies of these files:
Alternatively, if there’s time, you can use scp
to view the files from your
local computer.
MultiQC is a great tool that aggregates multiple fastqc results into a single report for easy comparison.
Run MultiQC on both the untrimmed and trimmed files:
multiqc .
And now you should see output that looks like this:
[INFO ] multiqc : This is MultiQC v1.0
[INFO ] multiqc : Template : default
[INFO ] multiqc : Searching '.'
Searching 15 files.. [####################################] 100%
[INFO ] fastqc : Found 4 reports
[INFO ] multiqc : Compressing plot data
[INFO ] multiqc : Report : multiqc_report.html
[INFO ] multiqc : Data : multiqc_data
[INFO ] multiqc : MultiQC complete
You can view the output html file multiqc_report.html by going to RStudio, selecting the file, and saying “view in Web browser.”
Questions:
is the quality trimmed data “better” than before?
Does it matter that you still have adapters!?
Learning objectives:
Install read quantification program Salmon
Learn quantification of RNA-seq data
Boot an m1.medium Jetstream instance and log in.
You should have the quality trimmed data on your instance in the ~/quality/
folder. If not, you can get this data by running:
cd ~
mkdir -p quality
cd quality
curl -L https://osf.io/wfz34/download -o ERR458493.qc.fq.gz
curl -L https://osf.io/jxh4d/download -o ERR458494.qc.fq.gz
curl -L https://osf.io/zx7n3/download -o ERR458495.qc.fq.gz
curl -L https://osf.io/96mrj/download -o ERR458500.qc.fq.gz
curl -L https://osf.io/wc8yn/download -o ERR458501.qc.fq.gz
curl -L https://osf.io/sdtz3/download -o ERR458502.qc.fq.gz
Salmon is a tool for fast transcript quantification from RNA-seq data. It requires a set of target transcripts (either from a reference or de-novo assembly) to quantify and FASTA/FASTQ file(s) containing your reads.
Salmon runs in two phases, indexing and quantification. The indexing step is independent of the reads, and only needs to be run once for a particular set of reference transcripts. The quantification step is specific to the set of RNA-seq reads and is thus run more frequently.
Salmon is installed through conda.
Let’s make sure our conda channels are loaded.
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda install -y -c bioconda salmon
We will be using the same data as before
(Schurch et al, 2016),
so the following commands will create a new folder quant
and link the data in:
cd ~
mkdir -p quant
cd quant
ln -fs ~/quality/*qc.fq.gz .
ls
Next, we need a set of reference transcripts to quantify against. Download the yeast (S. cerevisiae R64) RNA sequences from NCBI:
curl -O ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/146/045/GCA_000146045.2_R64/GCA_000146045.2_R64_rna_from_genomic.fna.gz
Build the salmon index from the reference transcripts:
salmon index --index sc_index --type quasi --transcripts GCA_000146045.2_R64_rna_from_genomic.fna.gz
Then quantify each of our quality-trimmed samples against that index with a for loop:
for i in *.qc.fq.gz
do
salmon quant -i sc_index --libType A -r ${i} -o ${i}_quant --seqBias --gcBias --validateMappings
done
What do all of these flags do?
Flag | Meaning
---|---
-i | path to the index folder
--libType | the library type of the reads you are quantifying; A allows salmon to automatically detect the library type
-r | input file (for single-end reads)
-o | output folder
--seqBias | learn and correct for sequence-specific biases in the input data
--gcBias | learn and correct for fragment-level GC biases in the input data
--validateMappings | enables selective alignment, which improves salmon's sensitivity
As Salmon is running, a lot of information is printed to the screen. For example, we can see the mapping rate as the sample finishes:
[2019-06-29 18:39:18.367] [jointLog] [info] Mapping rate = 86.2687%
When it finished, Salmon outputs a folder for each input RNA-seq sample. Let’s take a look at one of the output folders.
ls ERR458493.qc.fq.gz_quant
You should see output like this:
aux_info/ lib_format_counts.json
cmd_info.json logs/
libParams/ quant.sf
The information we saw scroll through our screens is captured in a log file in
aux_info/
.
less ERR458493.qc.fq.gz_quant/aux_info/meta_info.json
We see information about our run parameters and performance. To exit out of the
file, press q
.
We can also look at the count file:
less -S ERR458493.qc.fq.gz_quant/quant.sf
We see our transcript names, as well as the number of reads that aligned to each transcript.
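As an optional exploration (assuming salmon’s standard quant.sf layout, where column 5 is NumReads), we can sort the table to see which transcripts received the most reads:

# skip the header line, then sort numerically (descending) by column 5 (NumReads)
tail -n +2 ERR458493.qc.fq.gz_quant/quant.sf | sort -k5,5nr | head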
PRACTICE! How can we use the grep command to find the percent of reads mapped in ALL our json files? Make sure to print out the name of each file before you print the percent of reads mapped!
Solution:
cd ~/quant/
for infile in *_quant/aux_info/meta_info.json
do
echo ${infile}
grep "percent_mapped" ${infile}
done
First we use echo to print the name of the file. Then, we use grep to find and print the line containing the percent of reads mapped.
In our next lesson, we will be reading these quant files into R and performing differential expression with them. “Salmon provides accurate, fast, and bias-aware transcript expression estimates using dual-phase inference” Patro et al., 2016.
Also see seqanswers and biostars.
We will be using R and RStudio on our cloud instances. We have pre-installed RStudio on the workshop image. Start up a Jetstream m1.medium or larger using the ANGUS 2019 image as per Jetstream startup instructions. Connect to your instance.
We will need our instance username to connect to RStudio. To determine what your username is, run:
echo My username is $USER
The link for the RStudio server interface is built with our instance’s IP address (by default in port 8787), which we can get from the terminal by running the following command:
echo http://$(hostname -i):8787/
hostname
is a command in the terminal that will get various information about our computers. Using it with the-i
option returns our computer’s IP address. So surrounding it within parentheses preceded by the$
means the command line will execute that command first, put the output of that command in its place (our IP address), and then run the
command which then prints out the proper link to our RStudio server 😊
Now go to the web address printed to your terminal in your Web browser, which should look something like this:
and log in with the username and password we will provide.
We will be using R and RStudio in the cloud. To use RStudio on your computer, you can download R and RStudio. If you do not have R and RStudio installed:
Download R from CRAN:https://cran.r-project.org/
Go to the RStudio download page and install the version that is compatible with your operating system: https://www.rstudio.com/products/rstudio/download/#download. This link provides further instructions on installing and setting up R and RStudio on your computer.
R is a programming language and free software environment for statistical computing and graphics. Many bioinformatics programs are implemented in R. RStudio is a graphical integrated development environment (IDE) that displays and organizes more information about R. R is the underlying programming language. Once you start RStudio, you see this layout (it will be similar either on your instance or on your computer):
Scripts are like a digital lab book. They record everything you run, as well as notes to yourself about how and why you ran it. To start a new script you can click on the icon that looks like a page with a ‘+’ underneath File and select R Script. This generates a .R
file.
.R
files are plain text files that use the file extension .R
so that
humans can remember that the file contains R code.
Although scripts are a record of our code, they also include comments about our code. These comments are very important as they remind your future self (and whoever you pass your script) why you did what you did. Think of comments like the notes you take in your lab book when you do experiments.
When you execute a line of your script, it is sent to the console. While the script is a permanent record of everything you ran, the console is the engine that runs those things. You can run code from a script or the console, but it is only recorded in scripts.
There are two additional panes in RStudio:
Environment/History/Connections pane
Environment Pane: This pane records the objects that are currently loaded in our RStudio environment. This includes any data object that we created and the basic information about them.
History Pane: This has output similar to the history
command that we used in the shell. It records all of the
commands that we have executed in the R console.
Files/Plots/Packages/Help/Viewer pane provides a space for us to interact with plots and files we create and with built-in help files in R.
Packages Pane: A place to organize the packages within
RStudio. Allows easy installation or loading (library()
) of
the packages we have.
And now we’re ready to start the fun part!
An R object is something that you create and assign a value or function to. You do this using the assignment operator, which is <- (NOTE: keyboard shortcut Alt + -). For example, you can assign a numeric value to an object by typing in your script (remember to comment what you are doing):
a <- 34 # assign the number 34 to the object a
You can also assign a character value to an object:
b <- 'mouse' # assign the word 'mouse' to the object b
To run a line of code, put your cursor on the line that you want to run and press Ctrl + Enter (Cmd + Enter on a Mac), or select the line and click the Run button at the top right corner of your script.
Now check your Console and Environment tabs, you will see that the commands have run and that you have created new objects!
You can do all sorts of things with your objects: operations, combinations, using them in functions. Below we show how to perform multiplication using an object. Because a stores the value 34, we can multiply it by 2:
a2 <- a*2 # multiply a by 2
Functions are "canned scripts" that automate more complicated sets of commands, including operations, assignments, etc. Many functions are predefined, or can be made available by importing R packages (more on that later). A function usually takes one or more inputs called arguments. Functions often (but not always) return a value. A typical example would be the function sqrt(). The input (the argument) must be a number, and the return value (in fact, the output) is the square root of that number. Executing a function ('running it') is called calling the function. An example of a function call is:
sqrt(a) ## calculate the square root of a
Here, the value of a is given to the sqrt() function, the sqrt() function calculates the square root, and returns the value, which is printed to the console (we could also have assigned the result to a new object). This function is very simple, because it takes just one argument.
The return 'value' of a function need not be numerical (like that of sqrt()), and it also does not need to be a single item: it can be a set of things, or even a dataset. We'll see that when we read data files into R.
Similar to the shell, arguments can be anything, not only numbers or filenames, but also other objects. Exactly what each argument means differs per function, and must be looked up in the documentation (see below). Some functions take arguments which may either be specified by the user, or, if left out, take on a default value: these are called options. Options are typically used to alter the way the function operates, such as whether it ignores ‘bad values’, or what symbol to use in a plot. However, if you want something specific, you can specify a value of your choice which will be used instead of the default.
Let's try a function that can take multiple arguments: round().
round(3.141592) ## round number
Arguments are the 'conditions' of a function. For example, the function round() has an argument to determine the number of decimals, and its default in R is 0. Here, we've called round() with just one argument, 3.141592, and it has returned the value 3. That's because the default is to round to the nearest whole number. If we want more digits we can see how to do that by getting information about the round function. We can use args(round) to find what arguments it takes, or look at the help for this function using ?round.
args(round) ## show the arguments of the function 'round'
?round ## shows the information about 'round'
We can change this to two decimals by changing the digits argument:
round(3.141592, digits = 2)
Note that we do not need to specify the name of the argument as R detects this by the position within the function:
round(3.141592, 2)
However, it is preferable to include the argument names. Plus, we get the added benefit of being able to change the order of the arguments when they are named:
round(digits = 2, x= 3.141592)
It's best practice to put the non-optional arguments (the data that you want the function to work on) first, and then specify the names of all optional arguments. This helps others interpret your code more easily.
A vector is the most common and basic data type in R, and is pretty much the workhorse of R. A vector is composed by a series of values, which can be either numbers or characters. We can assign a series of values to a vector using the c() function.
yeast_strain <- c('WT','snf2')
concentration <- c(43.903, 47.871)
There are many functions to inspect a vector and get information about its contents:
length(yeast_strain) # number of elements in a vector
class(concentration) # type of object concentration
str(yeast_strain) # inspect the structure of yeast_strain
An atomic vector is the simplest R data type, and it is a vector of a single type. The 6 main atomic vector types that R uses are:
'character' for letters or words
'numeric' (or double) for numbers
'logical' for TRUE and FALSE (booleans)
'integer' for integer numbers (e.g., 2L; the L indicates to R that it's an integer)
'complex' to represent complex numbers with real and imaginary parts, for example 2 + 4i
'raw' for bitstreams that we will not use or discuss further
factor represents categorical data (more below!)
You can check the type of vector you are using with the typeof() function.
typeof(concentration)
Challenge: Atomic vectors can be a character, numeric (or double), integer, and logical. But what happens if we try to mix these types in a single vector?
yeast_strain_concentration <- c(yeast_strain, concentration) # combine two vectors
yeast_strain_concentration
## [1] "WT" "snf2" "43.903" "47.871"
# Solution: all elements are "coerced" to the same "type", here character.
Coercion follows a hierarchy: logical → numeric → character (and logical can also be coerced directly to character).
R uses many data structures; the main ones are:
vector : contains elements of the same type in one dimension
list : can contain elements of different types and multiple data structures
data.frame : contains rows and columns; columns can contain different types from other columns
matrix : rows and columns that are all of the same type
array : stores data in more than 2 dimensions
We will use vectors, matrices, data.frames, and factors today (see the quick sketch below).
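Here is a quick sketch of what some of these structures look like when you build them by hand (the values are made up purely for illustration):
my_vector <- c(1.2, 3.4, 5.6)                        # one dimension, all elements the same type
my_list   <- list(strain = "WT", reps = c(1, 2, 3))  # mixed types and structures are allowed
my_matrix <- matrix(1:6, nrow = 2)                   # rows and columns, all the same type
my_df     <- data.frame(strain = c("WT", "snf2"),    # columns can differ in type from each other
                        conc   = c(43.9, 47.9))
str(my_list)   # inspect the structures we just made
str(my_df)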
We can extract values from a vector based on position or conditions: this is called subsetting. One way to subset in R is to use [].
Indices start at 1.
strains <- c('WT', 'snf2', 'snf1', 'snf1_snf2')
strains[2] # subsets the second element
strains[c(2,3)] # subsets the second and third elements
more_strains <- strains[c(1, 2, 3, 1, 4, 3)] # create an object with more elements than the original one
more_strains
We can also subset based on conditions, using a logical vector with TRUE and FALSE, or with operators:
strains[c(TRUE, FALSE, FALSE, TRUE)]
concentration >= 43
concentration > 40
concentration < 47
concentration <= 47
concentration[ concentration < 45 | concentration > 47] # | is OR
concentration[ concentration < 45 & concentration == 47] # & is AND , == is mathematical 'equal'
more_strains[more_strains %in% strains]
As R was designed to analyze datasets, it includes the concept of missing data (which is uncommon in other programming languages). Missing data are represented in vectors as NA.
There are in-built functions and arguments to deal with missing data. For example, this vector contains missing data:
concentration <- c(43.903, 47.871, NA, 48.456, 53.435)
We can use the following functions to remove or ignore it:
mean(concentration, na.rm = TRUE) # mean of concentration removing your NA
concentration[!is.na(concentration)] # extract all elements that are NOT NA
na.omit(concentration) # returns a 'numeric' atomic vector with incomplete cases removed
concentration[complete.cases(concentration)] # returns a 'numeric' atomic vector with elements that are complete cases
Note: if your data includes missing values, you can code them as blanks ("") or as "NA" before reading them into R.
Factors represent categorical data. They are stored as integers associated with labels and they can be ordered or unordered. While factors look (and often behave) like character vectors, they are actually treated as integer vectors by R. So you need to be very careful when treating them as strings.
strains <- factor(c('WT', 'snf2', 'snf1', 'snf1_snf2')) # We make our strains object a factor
class(strains)
nlevels(strains)
strains # R orders alphabetically
Once created, factors can only contain a pre-defined set of values, known as levels. By default, R always sorts levels in alphabetical order. We can reorder them using the levels argument.
strains <- factor(strains, levels = c('WT', 'snf1', 'snf2', 'snf1_snf2')) # we reorder them as we want them
strains
We convert objects from one type to another using the as.XXX family of functions:
strains <- as.character(strains)
class(strains)
strains <- as.factor(strains)
class(strains)
And we can also rename factors based on position:
levels(strains)[4] <- 'wild_type'
This tutorial is a continuation of our Salmon lesson – we are using sequencing data from European Nucleotide Archive dataset PRJEB5348. This is an RNA-seq experiment comparing two yeast strains, SNF2 and WT. Although we're only working with 6 samples, this was a much larger study: they sequenced 96 samples in 7 different lanes, with 45 wt and 45 mut strains. Thus, there are 45 biological replicates and 7 technical replicates present, with a total of 672 samples! The datasets were compared to test the underlying statistical distribution of read counts in RNA-seq data, and to test which differential expression programs work best.
We will get the information about the RNA quality and quantity that the authors measured for all of their samples, and we will explore how the authors made their data available and 'linkable' between technical and biological replicates.
We are going to explore those datasets using RStudio, combining base R with an umbrella package called tidyverse.
library('tidyverse')
First we are going to read the file from a url and assign it to experiment_info, using read_tsv():
experiment_info <- read_tsv(file = 'https://osf.io/pzs7g/download/')
Let’s investigate your new data using some exploratory functions:
class(experiment_info) # type of object that experiment_info is
dim(experiment_info) # dimensions
summary(experiment_info) # summary stats
head(experiment_info) # shows the first 6 rows and all columns
tail(experiment_info) # last 6 rows and all columns
str(experiment_info) # structure
What do you notice? The last 4 columns are empty! We can get rid of them by subsetting, and having a look that we have done things correctly:
sum(is.na(experiment_info$X10)) # checks how many NAs the column X10 has so that you know all your rows are NAs (or not!)
experiment_info <- experiment_info[ , 1:9] # subset all rows and columns 1 to 9
Challenge: How many columns are in the experiment_info data frame now?
dim(experiment_info) # dimensions
experiment_info <- rename(experiment_info, units = X9)
We can now explore this table and reshape it using dplyr. In this lesson we will cover:
%>% : pipes, which allow 'chaining' functions together
select() : subsets columns
filter() : subsets rows on conditions
mutate() : creates new columns using information from other columns
group_by() and summarize() : create summary statistics on grouped data
arrange() : sorts results
count() : counts discrete values
We now have a table with many columns, but what if we don't need all of them? We can select columns using select, which requires the names of the columns that you want to keep as arguments:
# Let's confirm the arguments by looking at the help page
?select # Notice how the help page is different from the shell manual pages?
# Now lets select specific columns
select(experiment_info, Sample, Yeast Strain, Nucleic Acid Conc., Sample, 260/280, Total RNA)
What happened? R cannot parse the spaces in the column names properly. We can add backticks around the names of those columns; the backticks tell R that the words between them belong together:
select(experiment_info, Sample, `Yeast Strain`, `Nucleic Acid Conc.`, 260/280, `Total RNA`)
Also, R gets confused when column names start with a number, so the 260/280 column needs backticks too:
select(experiment_info, Sample, `Yeast Strain`, `Nucleic Acid Conc.`, `260/280`, `Total RNA`)
So let's store that in an object named experiment_info:
experiment_info <- select(experiment_info, Sample, `Yeast Strain`, `Nucleic Acid Conc.`, A260, A280, `260/280`, `Total RNA`)
NOTE: When choosing names for columns, make sure that you are consistent and choose something simple: all lowercase, all uppercase, or CamelCase. Having spaces in column names makes them hard for R to deal with, and can lead to unexpected results or silent errors.
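As a sketch of that advice (we keep the original column names in the rest of the lesson so the code below still matches, and experiment_info_clean is just an illustrative name), you could rename the awkward columns once, up front:
experiment_info_clean <- rename(experiment_info,
                                sample            = Sample,
                                yeast_strain      = `Yeast Strain`,
                                nucleic_acid_conc = `Nucleic Acid Conc.`,
                                ratio_260_280     = `260/280`,
                                total_rna         = `Total RNA`)
colnames(experiment_info_clean)   # check that the new names took effect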
We can also select all columns except certain ones by putting - before the names of the columns we do not want to include, for example all columns except A260 and A280:
# Remove two columns: A260 and A280
select(experiment_info, -A260, -A280)
The filter function works on rows, so we can filter the data based on conditions in a column:
filter(experiment_info, `Nucleic Acid Conc.` > 1500)
Exercise 1: Select the columns Sample, Yeast Strain, A260 and A280 and assign that to a new object called experiment_data.
experiment_data <- select(experiment_info, Sample, `Yeast Strain`, A260, A280)
head(experiment_data)
## # A tibble: 6 x 4
## Sample `Yeast Strain` A260 A280
## <dbl> <chr> <dbl> <dbl>
## 1 1 snf2 43.9 19.9
## 2 2 snf2 47.9 22.1
## 3 3 snf2 34.5 15.6
## 4 4 snf2 34.0 15.4
## 5 5 snf2 28.1 12.7
## 6 6 snf2 48.2 22.1
Exercise 2: Filter rows for only WT strains that had more than 1500 ng/uL concentration and make a new tibble called wild_type.
wild_type <- filter(experiment_info, `Yeast Strain` %in% 'WT' & `Nucleic Acid Conc.` > 1500)
head(wild_type)
## # A tibble: 6 x 7
## Sample `Yeast Strain` `Nucleic Acid Co… A260 A280 `260/280` `Total RNA`
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 49 WT 2052. 51.3 23.6 2.17 61.6
## 2 50 WT 1884. 47.1 21.6 2.18 56.5
## 3 51 WT 2692. 67.3 30.6 2.2 80.8
## 4 52 WT 3022 75.5 34.3 2.2 90.7
## 5 53 WT 3235 80.9 36.9 2.19 97.0
## 6 54 WT 3735. 93.4 43.2 2.16 112.
What if we want to select and filter at the same time? We use a "pipe" in R, which is represented by %>%. Note: this is the R version of the shell redirector |!
Pipes are available via the magrittr package, which is installed automatically with dplyr. In RStudio, you can type a pipe with the keyboard shortcut Ctrl + Shift + M on a PC or Cmd + Shift + M on a Mac.
We can combine the two subsetting activities from Exercises 1 and 2 using pipes %>%:
experiment_info_wt <- experiment_info %>%
filter(`Yeast Strain` == 'WT' & `Nucleic Acid Conc.` > 1500) %>%
select(Sample, `Yeast Strain`, A260, A280)
Using pipes, subset experiment_info to include all the samples that had a concentration less than 1500 and total RNA greater than or equal to 40 ug, and retain only the columns that tell you the sample, yeast strain, nucleic acid concentration and total RNA. Assign that to a new table called samples_sequenced.
samples_sequenced <- experiment_info %>%
filter(`Nucleic Acid Conc.` < 1500 & `Total RNA` >= 40) %>%
select(Sample, `Yeast Strain`,`Nucleic Acid Conc.`,`Total RNA`)
samples_sequenced
## # A tibble: 5 x 4
## Sample `Yeast Strain` `Nucleic Acid Conc.` `Total RNA`
## <dbl> <chr> <dbl> <dbl>
## 1 3 snf2 1379. 41.4
## 2 4 snf2 1359. 40.8
## 3 22 snf2 1428. 42.8
## 4 24 snf2 1386. 41.6
## 5 90 WT 1422. 42.7
What if we want to create a new column? We use the function mutate for this. Let's convert our units into micrograms/microliter:
experiment_info %>%
mutate(concentration_ug_uL = `Nucleic Acid Conc.` / 1000)
Or create more than one column! And pipe that into head so that we can take a peek:
experiment_info %>%
mutate(concentration_ug_uL = `Nucleic Acid Conc.` / 1000,
half_concentration = concentration_ug_uL / 1000) %>%
head()
Create a new table called library_start that includes the columns sample and yeast strain, plus two new columns: RNA_100, with the number of microliters needed to get 100 ng of RNA, and water, with how many microliters of water we need to add to that to reach 50 uL.
library_start <- experiment_info %>%
mutate(RNA_100 = 100/ `Nucleic Acid Conc.`,
water = 50 - RNA_100) %>%
select(Sample, `Yeast Strain`, RNA_100, water)
head(library_start)
## # A tibble: 6 x 4
## Sample `Yeast Strain` RNA_100 water
## <dbl> <chr> <dbl> <dbl>
## 1 1 snf2 0.0569 49.9
## 2 2 snf2 0.0522 49.9
## 3 3 snf2 0.0725 49.9
## 4 4 snf2 0.0736 49.9
## 5 5 snf2 0.0891 49.9
## 6 6 snf2 0.0519 49.9
Pretty difficult to pipette! Can you redo the table assuming that you have done a 1:10 dilution of your samples? Include the A260 and A280 values in addition to sample, yeast strain, the diluted RNA volume and water.
library_start_diluted <- experiment_info %>%
mutate(diluted_RNA = 1000/ `Nucleic Acid Conc.`,
water = 50 - diluted_RNA) %>%
select(Sample, `Yeast Strain`, `A260`, `A280`, diluted_RNA, water)
head(library_start_diluted)
## # A tibble: 6 x 6
## Sample `Yeast Strain` A260 A280 diluted_RNA water
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 snf2 43.9 19.9 0.00569 50.0
## 2 2 snf2 47.9 22.1 0.00522 50.0
## 3 3 snf2 34.5 15.6 0.00725 50.0
## 4 4 snf2 34.0 15.4 0.00736 50.0
## 5 5 snf2 28.1 12.7 0.00891 50.0
## 6 6 snf2 48.2 22.1 0.00519 50.0
Based on the tibble library_start_diluted, create a new tibble called seq_samples that includes only samples with an A260/A280 ratio of 2.2 or more and that only contains the columns for sample, yeast strain, A260/A280 ratio, diluted RNA and water.
seq_samples <- library_start_diluted %>%
mutate(A260_A280 = `A260`/`A280`) %>%
filter(A260_A280 >= 2.2) %>%
select(Sample, `Yeast Strain`, A260_A280, diluted_RNA, water)
head(seq_samples)
## # A tibble: 6 x 5
## Sample `Yeast Strain` A260_A280 diluted_RNA water
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1 snf2 2.21 0.00569 50.0
## 2 3 snf2 2.21 0.00725 50.0
## 3 4 snf2 2.21 0.00736 50.0
## 4 5 snf2 2.22 0.00891 50.0
## 5 7 snf2 2.21 0.00991 50.0
## 6 11 snf2 2.20 0.00963 50.0
The split-apply-combine approach allows you to split the data into groups, apply a function/analysis, and then combine the results. We can do this using group_by() and summarize().
group_by() takes as arguments the column names that contain categorical variables that we want to calculate summary statistics for. For example, to determine the average RNA concentration per strain:
experiment_info %>%
group_by(`Yeast Strain`) %>%
summarize(mean_concentration = mean(`Nucleic Acid Conc.`))
or summarize using more than one column:
experiment_info %>%
group_by(`Yeast Strain`) %>%
summarize(mean_concentration = mean(`Nucleic Acid Conc.`),
mean_total_RNA = mean(`Total RNA`))
NOTE: Our table does not have missing values. However, we have built-in functions in R that can deal with this. We can filter values that are not NA:
experiment_info %>%
group_by(`Yeast Strain`) %>%
summarize(mean_concentration = mean(`Nucleic Acid Conc.`, na.rm = TRUE), # remove NA values
mean_total_RNA = mean(`Total RNA`))
experiment_info %>%
filter(!is.na(`Nucleic Acid Conc.`)) %>% # filter to keep only the values that are not NAs
group_by(`Yeast Strain`) %>%
summarize(mean_concentration = mean(`Nucleic Acid Conc.`),
mean_total_RNA = mean(`Total RNA`))
We can now arrange the data in ascending order (the default of arrange()) or in descending order using desc():
experiment_info %>%
filter(!is.na(`Nucleic Acid Conc.`)) %>% # filter to keep only the values that are not NAs
group_by(`Yeast Strain`) %>%
summarize(mean_concentration = mean(`Nucleic Acid Conc.`),
mean_total_RNA = mean(`Total RNA`)) %>%
arrange(mean_concentration) # arrange new table in ascending mean concentrations
experiment_info %>%
filter(!is.na(`Nucleic Acid Conc.`)) %>% # filter to keep only the values that are not NAs
group_by(`Yeast Strain`) %>%
summarize(mean_concentration = mean(`Nucleic Acid Conc.`),
mean_total_RNA = mean(`Total RNA`)) %>%
arrange(desc(mean_concentration)) # arrange new table in descending mean concentrations
Another useful function is count(), which counts categorical values:
experiment_info %>%
count(`Yeast Strain`)
On many occasions we will have more than one table that we need to extract information from. We can easily do this in R using the family of join functions.
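Before using them on our real tables, here is a toy sketch of what the main joins do, using two tiny made-up tibbles (band and plays are invented just for this illustration):
band  <- tibble(name = c("Mick", "John", "Paul"),  band  = c("Stones", "Beatles", "Beatles"))
plays <- tibble(name = c("John", "Paul", "Keith"), plays = c("guitar", "bass", "guitar"))
inner_join(band, plays, by = "name")  # only names present in both tables
left_join(band, plays, by = "name")   # every name in band; NA where plays has no match
right_join(band, plays, by = "name")  # every name in plays; NA where band has no match
full_join(band, plays, by = "name")   # every name from either table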
We are going to first download extra information about the samples that we are investigating and assign it to objects called ena_info and sample_mapping:
ena_info <- read_tsv(file = 'https://osf.io/6s4cv/download')
sample_mapping <- read_tsv(file = 'https://osf.io/uva3r/download')
Explore your new dataset ena_info and sample_mapping. What commands can you use?
head(ena_info)
## # A tibble: 6 x 14
## study_accession run_accession tax_id scientific_name instrument_model
## <chr> <chr> <dbl> <chr> <chr>
## 1 PRJEB5348 ERR458493 4932 Saccharomyces … Illumina HiSeq …
## 2 PRJEB5348 ERR458494 4932 Saccharomyces … Illumina HiSeq …
## 3 PRJEB5348 ERR458495 4932 Saccharomyces … Illumina HiSeq …
## 4 PRJEB5348 ERR458496 4932 Saccharomyces … Illumina HiSeq …
## 5 PRJEB5348 ERR458497 4932 Saccharomyces … Illumina HiSeq …
## 6 PRJEB5348 ERR458498 4932 Saccharomyces … Illumina HiSeq …
## # … with 9 more variables: library_layout <chr>, read_count <dbl>,
## # experiment_alias <chr>, fastq_bytes <dbl>, fastq_md5 <chr>,
## # fastq_ftp <chr>, submitted_ftp <chr>, sample_alias <chr>,
## # sample_title <chr>
tail(ena_info)
## # A tibble: 6 x 14
## study_accession run_accession tax_id scientific_name instrument_model
## <chr> <chr> <dbl> <chr> <chr>
## 1 PRJEB5348 ERR459201 4932 Saccharomyces … Illumina HiSeq …
## 2 PRJEB5348 ERR459202 4932 Saccharomyces … Illumina HiSeq …
## 3 PRJEB5348 ERR459203 4932 Saccharomyces … Illumina HiSeq …
## 4 PRJEB5348 ERR459204 4932 Saccharomyces … Illumina HiSeq …
## 5 PRJEB5348 ERR459205 4932 Saccharomyces … Illumina HiSeq …
## 6 PRJEB5348 ERR459206 4932 Saccharomyces … Illumina HiSeq …
## # … with 9 more variables: library_layout <chr>, read_count <dbl>,
## # experiment_alias <chr>, fastq_bytes <dbl>, fastq_md5 <chr>,
## # fastq_ftp <chr>, submitted_ftp <chr>, sample_alias <chr>,
## # sample_title <chr>
str(ena_info)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 672 obs. of 14 variables:
## $ study_accession : chr "PRJEB5348" "PRJEB5348" "PRJEB5348" "PRJEB5348" ...
## $ run_accession : chr "ERR458493" "ERR458494" "ERR458495" "ERR458496" ...
## $ tax_id : num 4932 4932 4932 4932 4932 ...
## $ scientific_name : chr "Saccharomyces cerevisiae" "Saccharomyces cerevisiae" "Saccharomyces cerevisiae" "Saccharomyces cerevisiae" ...
## $ instrument_model: chr "Illumina HiSeq 2000" "Illumina HiSeq 2000" "Illumina HiSeq 2000" "Illumina HiSeq 2000" ...
## $ library_layout : chr "SINGLE" "SINGLE" "SINGLE" "SINGLE" ...
## $ read_count : num 1093957 1078049 1066501 985619 846040 ...
## $ experiment_alias: chr "ena-EXPERIMENT-DUNDEE-14-03-2014-09:53:55:631-1" "ena-EXPERIMENT-DUNDEE-14-03-2014-09:53:55:631-2" "ena-EXPERIMENT-DUNDEE-14-03-2014-09:53:55:631-3" "ena-EXPERIMENT-DUNDEE-14-03-2014-09:53:55:631-4" ...
## $ fastq_bytes : num 59532325 58566854 58114810 53510113 46181127 ...
## $ fastq_md5 : chr "2b8c708cce1fd88e7ddecd51e5ae2154" "36072a519edad4fdc0aeaa67e9afc73b" "7a06e938a99d527f95bafee77c498549" "fa059eab6996118828878c30bd9a2b9d" ...
## $ fastq_ftp : chr "ftp.sra.ebi.ac.uk/vol1/fastq/ERR458/ERR458493/ERR458493.fastq.gz" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR458/ERR458494/ERR458494.fastq.gz" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR458/ERR458495/ERR458495.fastq.gz" "ftp.sra.ebi.ac.uk/vol1/fastq/ERR458/ERR458496/ERR458496.fastq.gz" ...
## $ submitted_ftp : chr "ftp.sra.ebi.ac.uk/vol1/run/ERR458/ERR458493/120903_0219_D0PT7ACXX_1_SA-PE-039.sanfastq.gz" "ftp.sra.ebi.ac.uk/vol1/run/ERR458/ERR458494/120903_0219_D0PT7ACXX_2_SA-PE-039.sanfastq.gz" "ftp.sra.ebi.ac.uk/vol1/run/ERR458/ERR458495/120903_0219_D0PT7ACXX_3_SA-PE-039.sanfastq.gz" "ftp.sra.ebi.ac.uk/vol1/run/ERR458/ERR458496/120903_0219_D0PT7ACXX_4_SA-PE-039.sanfastq.gz" ...
## $ sample_alias : chr "wt_sample1" "wt_sample2" "wt_sample3" "wt_sample3" ...
## $ sample_title : chr "great RNA experiment data" "great RNA experiment data" "great RNA experiment data" "great RNA experiment data" ...
## - attr(*, "spec")=
## .. cols(
## .. study_accession = col_character(),
## .. run_accession = col_character(),
## .. tax_id = col_double(),
## .. scientific_name = col_character(),
## .. instrument_model = col_character(),
## .. library_layout = col_character(),
## .. read_count = col_double(),
## .. experiment_alias = col_character(),
## .. fastq_bytes = col_double(),
## .. fastq_md5 = col_character(),
## .. fastq_ftp = col_character(),
## .. submitted_ftp = col_character(),
## .. sample_alias = col_character(),
## .. sample_title = col_character()
## .. )
class(ena_info)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
dim(ena_info)
## [1] 672 14
summary(ena_info)
## study_accession run_accession tax_id scientific_name
## Length:672 Length:672 Min. :4932 Length:672
## Class :character Class :character 1st Qu.:4932 Class :character
## Mode :character Mode :character Median :4932 Mode :character
## Mean :4932
## 3rd Qu.:4932
## Max. :4932
## instrument_model library_layout read_count
## Length:672 Length:672 Min. : 821888
## Class :character Class :character 1st Qu.:1252120
## Mode :character Mode :character Median :1457518
## Mean :1499348
## 3rd Qu.:1656280
## Max. :2804885
## experiment_alias fastq_bytes fastq_md5
## Length:672 Min. : 44752808 Length:672
## Class :character 1st Qu.: 68068787 Class :character
## Mode :character Median : 78775952 Mode :character
## Mean : 81343477
## 3rd Qu.: 89792915
## Max. :151800385
## fastq_ftp submitted_ftp sample_alias
## Length:672 Length:672 Length:672
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## sample_title
## Length:672
## Class :character
## Mode :character
##
##
##
# and the same with sample_mapping!
We have run accession numbers in both our tibbles, sample_mapping and ena_info. We want to merge both datasets to link the information present in both tables into one big table that contains everything. We need a column that has the same name in both tables.
Challenge: Which function would you use if the column names were not the same?
sample_mapping <- rename(sample_mapping, run_accession = RunAccession) # rename the column RunAccession to run_accession so that it matches the column name in ena_info
yeast_metadata_right <- right_join(ena_info, sample_mapping, by = "run_accession") # joins the tables, keeping every row of sample_mapping and adding the matching columns from ena_info
That is a big table!
How would you merge both tables so that only the rows that are common between both tables are preserved?
yeast_metadata_inner <- inner_join(ena_info, sample_mapping, by = "run_accession")
head(yeast_metadata_inner)
## # A tibble: 6 x 17
## study_accession run_accession tax_id scientific_name instrument_model
## <chr> <chr> <dbl> <chr> <chr>
## 1 PRJEB5348 ERR458493 4932 Saccharomyces … Illumina HiSeq …
## 2 PRJEB5348 ERR458494 4932 Saccharomyces … Illumina HiSeq …
## 3 PRJEB5348 ERR458495 4932 Saccharomyces … Illumina HiSeq …
## 4 PRJEB5348 ERR458496 4932 Saccharomyces … Illumina HiSeq …
## 5 PRJEB5348 ERR458497 4932 Saccharomyces … Illumina HiSeq …
## 6 PRJEB5348 ERR458498 4932 Saccharomyces … Illumina HiSeq …
## # … with 12 more variables: library_layout <chr>, read_count <dbl>,
## # experiment_alias <chr>, fastq_bytes <dbl>, fastq_md5 <chr>,
## # fastq_ftp <chr>, submitted_ftp <chr>, sample_alias <chr>,
## # sample_title <chr>, Lane <dbl>, Sample <chr>, BiolRep <dbl>
yeast_metadata_left <- left_join(ena_info, sample_mapping, by = "run_accession")
head(yeast_metadata_left)
## # A tibble: 6 x 17
## study_accession run_accession tax_id scientific_name instrument_model
## <chr> <chr> <dbl> <chr> <chr>
## 1 PRJEB5348 ERR458493 4932 Saccharomyces … Illumina HiSeq …
## 2 PRJEB5348 ERR458494 4932 Saccharomyces … Illumina HiSeq …
## 3 PRJEB5348 ERR458495 4932 Saccharomyces … Illumina HiSeq …
## 4 PRJEB5348 ERR458496 4932 Saccharomyces … Illumina HiSeq …
## 5 PRJEB5348 ERR458497 4932 Saccharomyces … Illumina HiSeq …
## 6 PRJEB5348 ERR458498 4932 Saccharomyces … Illumina HiSeq …
## # … with 12 more variables: library_layout <chr>, read_count <dbl>,
## # experiment_alias <chr>, fastq_bytes <dbl>, fastq_md5 <chr>,
## # fastq_ftp <chr>, submitted_ftp <chr>, sample_alias <chr>,
## # sample_title <chr>, Lane <dbl>, Sample <chr>, BiolRep <dbl>
We do not want all the columns; we want to create a new tibble called yeast_metadata that contains only run accession, experiment alias, read count, fastq bytes and md5, lane, yeast strain and biological replicate. Rename the columns so that all of the column names are lower_case (lowercase followed by underscores), and include only data from lane number 1. Use pipes!
yeast_metadata <- yeast_metadata_inner %>%
rename(yeast_strain = Sample, lane = Lane, biol_rep = BiolRep) %>%
filter(lane == 1) %>%
select(run_accession, experiment_alias, read_count, fastq_bytes, fastq_md5, lane, yeast_strain, biol_rep)
head(yeast_metadata)
## # A tibble: 6 x 8
## run_accession experiment_alias read_count fastq_bytes fastq_md5 lane
## <chr> <chr> <dbl> <dbl> <chr> <dbl>
## 1 ERR458493 ena-EXPERIMENT-… 1093957 59532325 2b8c708c… 1
## 2 ERR458500 ena-EXPERIMENT-… 1885330 102201086 107aad97… 1
## 3 ERR458507 ena-EXPERIMENT-… 1594160 86382961 93bca10a… 1
## 4 ERR458514 ena-EXPERIMENT-… 1594695 86462515 48fa134b… 1
## 5 ERR458521 ena-EXPERIMENT-… 1977253 106992945 98a9d336… 1
## 6 ERR458528 ena-EXPERIMENT-… 1371825 74704461 e4bf39d6… 1
## # … with 2 more variables: yeast_strain <chr>, biol_rep <dbl>
samples <- c('ERR458493', 'ERR458494', 'ERR458495', 'ERR458500', 'ERR458501', 'ERR458502') # create a vector that includes the samples that we want to subset
salmon_samples <- yeast_metadata_inner %>%
filter(run_accession %in% samples) # filter rows based on these sample names
We have learned how to manipulate datasets and combine them. We are now going to cover the basics of constructing a plot in ggplot2, which is part of the tidyverse.
The basic syntax of ggplot2 is:
ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) +
<GEOM_FUNCTION>()
NOTE: The "<" and ">" used here don't mean anything special; they just signify that you should put something of your own there.
We are going to explore our yeast_metadata tibble.
We build plots in ‘layers’:
ggplot(data = yeast_metadata)
Specifying data binds the plot to a specific data frame…
ggplot(data = yeast_metadata, mapping = aes(x = read_count, y = fastq_bytes))
defines the mapping using aes (the aesthetics of the plot) by selecting the variables to be plotted and how they will be plotted (e.g. x/y, size, shape, color…)…
ggplot(data = yeast_metadata, mapping = aes(x = read_count, y = fastq_bytes)) +
geom_point()
sets what type of plot we want to have (boxplot, lines, bars):
geom_point() for scatter plots, dot plots, etc.
geom_boxplot() for boxplots
geom_line() for trend lines, time series, etc. (a quick made-up example is sketched below)
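geom_line() is not used in the rest of this lesson, but here is a quick made-up sketch of a trend line (the growth tibble below is invented just for illustration):
growth <- tibble(hour = c(0, 2, 4, 6, 8), od600 = c(0.1, 0.2, 0.4, 0.8, 1.3))
ggplot(data = growth, mapping = aes(x = hour, y = od600)) +
  geom_line()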
We can modify plots to extract more information. We can add transparency:
ggplot(data = yeast_metadata, mapping = aes(x = read_count, y = fastq_bytes)) +
geom_point(alpha = 0.1)
or change the color of the points:
ggplot(data = yeast_metadata, mapping = aes(x = read_count, y = fastq_bytes)) +
geom_point(alpha = 0.1, color = "red")
Or color each strain differently:
ggplot(data = yeast_metadata, mapping = aes(x = read_count, y = fastq_bytes)) +
geom_point(alpha = 0.5, aes(color = yeast_strain))
We can also specify the color inside the mapping:
ggplot(data = yeast_metadata, mapping = aes(x = read_count, y = fastq_bytes, color = yeast_strain)) +
geom_point(alpha = 0.5)
We can use boxplots to see the distribution of reads within strains:
ggplot(data = yeast_metadata, mapping = aes(x = yeast_strain, y = read_count)) +
geom_boxplot()
This is useful, but if we add dots to the boxplots we will have a better idea of the number of measurements and their distribution:
ggplot(data = yeast_metadata, mapping = aes(x = yeast_strain, y = read_count)) +
geom_boxplot(alpha = 0.25) + geom_point(alpha = 0.25)
With categorical bins on the x-axis, we can also use geom_jitter() to spread out the points so they are easier to see:
ggplot(data = yeast_metadata, mapping = aes(x = yeast_strain, y = read_count)) +
geom_boxplot(alpha = 0.1) + geom_jitter(alpha = 0.7, color = "tomato")
Replace the boxplot with a violin plot.
ggplot(data = yeast_metadata, mapping = aes(x = yeast_strain, y = read_count)) +
geom_violin(alpha = 0.1) +
geom_jitter(alpha = 1, color = "tomato")
Represent the read_count on a log10 scale. Check out scale_y_log10().
ggplot(data = yeast_metadata, mapping = aes(x = yeast_strain, y = read_count)) +
scale_y_log10() +
geom_boxplot(alpha = 0.1) +
geom_jitter(alpha = 1, color = "tomato")
Try to make a histogram plot for the read counts, coloring each yeast strain.
# Basic histogram
ggplot(data = yeast_metadata, aes(x=read_count)) +
geom_histogram()
# Change colors
p<-ggplot(data = yeast_metadata, aes(x=read_count)) +
geom_histogram(color="black", fill="white")
p
# Change colors based on yeast strain
ggplot(data = yeast_metadata, aes(x=read_count, fill = yeast_strain)) +
geom_histogram(color="black")
# Facet based on yeast strain
ggplot(data = yeast_metadata, aes(x=read_count, fill = yeast_strain)) +
geom_histogram(color="black") +
facet_grid(yeast_strain~.)
# Change to custom colors that are colorblind friendly
ggplot(data = yeast_metadata, aes(x=read_count, fill = yeast_strain)) +
geom_histogram(color="black") +
facet_grid(yeast_strain~.) +
# Add blue and yellow colors that are more colorblind friendly for plotting
scale_fill_manual(values = c("cornflowerblue", "goldenrod2"))
# Density based on yeast strain
ggplot(data = yeast_metadata, aes(x=read_count, fill = yeast_strain)) +
facet_grid(yeast_strain~.) +
geom_density(alpha = 0.5) +
scale_fill_manual(values = c("cornflowerblue", "goldenrod2"))
# A white background might be preferable for the yellow color
ggplot(data = yeast_metadata, aes(x=read_count, fill = yeast_strain)) +
facet_grid(yeast_strain~.) +
geom_density(alpha = 0.5) +
scale_fill_manual(values = c("cornflowerblue", "goldenrod2")) +
theme_bw()
What if we want to add the mean read counts in a vertical line?
# To do so, we need to calculate the mean_read_count in a new data frame
mean_yeast_data <- yeast_metadata %>%
group_by(yeast_strain) %>%
summarise(mean_read_count = mean(read_count))
head(mean_yeast_data)
## # A tibble: 2 x 2
## yeast_strain mean_read_count
## <chr> <dbl>
## 1 SNF2 1665717.
## 2 WT 1596525.
# Add the mean read_count with vline
ggplot(data = yeast_metadata, aes(x=read_count, fill = yeast_strain)) +
geom_density(alpha = 0.5) +
facet_grid(yeast_strain~.) +
geom_vline(data = mean_yeast_data, mapping = aes(xintercept = mean_read_count),
linetype="dashed", size=0.5) +
theme_bw() +
scale_fill_manual(values = c("cornflowerblue", "goldenrod2"))
This lesson was inspired by and adapted from the Data Carpentry Ecology R lessons
There are many amazing resources and cheat sheets available online to continue learning R.
Learning objectives:
Create a gene-level count matrix of Salmon quantification using tximport
Perform differential expression of a single factor experiment in DESeq2
Perform quality control and exploratory visualization of RNA-seq data in R
You should still have your jetstream instance running; you can follow the instructions here to log in to JetStream and find your instance, then ssh into it following the instructions here.
You should have the salmon counts on your instance in the ~/quant/ folder. If not, you can get this data by running:
cd ~
mkdir -p quant
cd quant
wget -O salmon_quant.tar.gz https://osf.io/5dbrh/download
tar -xvf salmon_quant.tar.gz
When we finished running salmon, we had a quant folder that contained transcript-level read counts. So far in our RNA-seq pipeline, we have been working from the command line in bash. For the rest of the RNA-seq tutorial, we will be working in R. As we saw in the R introduction lesson, R is really good at working with data in table format. When we produced counts for our reads, we essentially transformed our data into this format. Most of the software tools written to analyze RNA-seq data in this format are written in R.
We first need to read our data into R. To do that, we will use a package called tximport. This package takes transcript-level counts and summarizes them to the gene level. It is compatible with many count input formats, including salmon.
First, launch RStudio from your instance. Run the following command to produce a url that navigates to RStudio when entered in your browser.
echo http://$(hostname -i):8787/
From R, start a new script.
Load the libraries tximport, DESeq2, and tidyverse so we have access to their functions. We've pre-installed these packages for you, but we'll go over how to install them in the future.
library(tximport)
library(DESeq2)
library(tidyverse)
We need a couple of files to import our counts.
First, we need a file that tells tximport where our files are. We've added other information to this file about sample names and condition, which we will also use later for differential expression. We can use the read_csv() function to read this file into R as a dataframe directly from a url.
# read in the file from url
samples <- read_csv("https://osf.io/cxp2w/download")
# look at the first 6 lines
head(samples)
You should see output like this:
# A tibble: 6 x 3
sample quant_file condition
<chr> <chr> <chr>
1 ERR458493 ~/quant/ERR458493.qc.fq.gz_quant/quant.sf wt
2 ERR458494 ~/quant/ERR458494.qc.fq.gz_quant/quant.sf wt
3 ERR458495 ~/quant/ERR458495.qc.fq.gz_quant/quant.sf wt
4 ERR458500 ~/quant/ERR458500.qc.fq.gz_quant/quant.sf snf2
5 ERR458501 ~/quant/ERR458501.qc.fq.gz_quant/quant.sf snf2
6 ERR458502 ~/quant/ERR458502.qc.fq.gz_quant/quant.sf snf2
Second, we need a file that specifies which transcripts are associated with which genes. In this case, we have associated transcript names with yeast ORF names.
# read in the file from url
tx2gene_map <- read_tsv("https://osf.io/a75zm/download")
# look at the first 6 lines
head(tx2gene_map)
The output should look something like this:
# A tibble: 6 x 2
TXNAME GENEID
<chr> <chr>
1 lcl|BK006935.2_mrna_1 YAL068C
2 lcl|BK006935.2_mrna_2 YAL067W-A
3 lcl|BK006935.2_mrna_3 YAL067C
4 lcl|BK006935.2_mrna_4 YAL065C
5 lcl|BK006935.2_mrna_5 YAL064W-B
6 lcl|BK006935.2_mrna_6 YAL064C-A
We can now use tximport() to read in our count files:
txi <- tximport(files = samples$quant_file, type = "salmon", tx2gene = tx2gene_map)
Let’s look at this new object:
summary(txi)
Length Class Mode
abundance 36774 -none- numeric
counts 36774 -none- numeric
length 36774 -none- numeric
countsFromAbundance 1 -none- character
txi is a bit like a list, in that it has multiple objects in it. Let's take a look at the counts:
head(txi$counts)
We have all of our counts in one place! One thing they're missing is informative column names. We can set these using our samples data frame, which will also ensure that we assign the correct names to each column.
colnames(txi$counts) <- samples$sample
Given a uniform sampling of a diverse transcript pool, the number of sequenced reads mapped to a gene depends on:
its own expression level,
its length,
the sequencing depth,
the expression of all other genes within the sample.
In order to compare the gene expression between two conditions, we must therefore calculate the fraction of the reads assigned to each gene relative to the total number of reads and with respect to the entire RNA repertoire which may vary drastically from sample to sample. While the number of sequenced reads is known, the total RNA library and its complexity is unknown and variation between samples may be due to contamination as well as biological reasons. The purpose of normalization is to eliminate systematic effects that are not associated with the biological differences of interest.
Normalization aims at correcting systematic technical biases in the data, in order to make read counts comparable across samples. The normalization proposed by DESeq2 relies on the hypothesis that most features are not differentially expressed. It computes a scaling factor for each sample. Normalized read counts are obtained by dividing raw read counts by the scaling factor associated with the sample they belong to.
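DESeq2 computes these scaling factors for us, but a minimal sketch of the "median of ratios" idea on a made-up toy count matrix can help build intuition (none of the numbers below come from the workshop data):
counts <- matrix(c(  10,   20,    5,
                    100,  210,   48,
                   1000, 1900,  520),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 c("sample1", "sample2", "sample3")))
geo_means    <- exp(rowMeans(log(counts)))            # per-gene geometric mean (a 'pseudo-reference' sample)
size_factors <- apply(counts / geo_means, 2, median)  # each sample's median ratio to the pseudo-reference
size_factors
normalized   <- sweep(counts, 2, size_factors, "/")   # divide each sample's counts by its scaling factor
normalized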
DESeq2
Differential expression analysis with DESeq2 involves multiple steps. Briefly:
DESeq2 will model the raw counts, using normalization factors (size factors) to account for differences in library depth.
Then, it will estimate the gene-wise dispersions and shrink these estimates to generate more accurate estimates of dispersion to model the counts.
Finally, DESeq2 will fit the negative binomial model and perform hypothesis testing using the Wald test or Likelihood Ratio Test.
We're now ready to use DESeq2, the package that will perform differential expression.
We'll start with a function called DESeqDataSetFromTximport, which will transform our txi object into something that other functions in DESeq2 can work on. This is where we also give information about our samples, contained in the samples data.frame, and where we provide our experimental design. A design formula tells the statistical software the known sources of variation to control for, as well as the factor of interest to test during differential expression testing. Here our experimental design has one factor with two levels.
dds <- DESeqDataSetFromTximport(txi = txi,
colData = samples,
design = ~condition)
Note: DESeq2 stores virtually all information associated with your experiment in one specific R object, called a DESeqDataSet. This is, in fact, a specialized object of the class "SummarizedExperiment": a container where rows (rowRanges()) represent features of interest (e.g. genes, transcripts, exons) and columns represent samples (colData()). The actual count data is stored in the assay() slot.
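As a quick sketch of poking at that container once dds has been created (these accessors come from the SummarizedExperiment machinery mentioned in the note):
dim(dds)          # number of genes (rows) and samples (columns)
colData(dds)      # the sample information we supplied
head(assay(dds))  # the count matrix lives in the assay slot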
After running DESeqDataSetFromTximport(), you should see red output messages that look something like this:
using counts and average transcript lengths from tximport
Warning message:
In DESeqDataSet(se, design = design, ignoreRank) :
some variables in design formula are characters, converting to factors
The first thing we notice is that both our counts and average transcript lengths were used to construct the DESeq2 object. We also see a warning message that our condition was converted to a factor. Both of these messages are ok to see!
Now that we have a DESeq2 object, we can perform differential expression.
dds <- DESeq(dds)
Everything from normalization to linear modeling was carried out by the use of a single function! This function will print out a message for the various steps it performs:
estimating size factors
using 'avgTxLength' from assays(dds), correcting for library size
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
And look at the results! The results() function lets you extract the base means across samples, log2-fold changes, standard errors, test statistics etc. for every gene.
res <- results(dds)
head(res)
log2 fold change (MLE): condition wt vs .
Wald test p-value: condition wt vs .
DataFrame with 6 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
ETS1-1 150.411380876596 -0.212379164913443 0.226307045512225 -0.938455824177912 0.348010209041573 0.670913203685731
ETS2-1 0 NA NA NA NA NA
HRA1 1.08884236321432 3.97189996253477 3.31146245569429 1.19943982928292 0.230356968252411 NA
ICR1 28.8713313668145 -0.179654414034232 0.547534328578008 -0.328115343015677 0.742824453568867 0.907583917807416
IRT1 29.6053852845893 0.347498338858063 0.523744585645556 0.66348817416364 0.507017950920217 0.794457978516333
ITS1-1 23.8676180072013 -7.25640067170996 1.28178177600513 -5.66118258782368 1.50333365625958e-08 2.62112769016311e-07
The first thing that prints to the screen is information about the "contrasts" in the differential expression experiment. By default, DESeq2 selects the alphabetically first factor level to be the "reference" level. Here that doesn't make much of a difference. However, it does change how we interpret the log2FoldChange values. We can read these results as: "Compared to the SNF2 mutant, WT had a decrease of -0.2124 in log2 fold change of gene expression."
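If you wanted to make the reference level explicit rather than relying on alphabetical order, one sketch would be to relevel the condition factor (here we assume we want wt as the reference, so the reported contrast would become snf2 vs wt):
dds$condition <- relevel(dds$condition, ref = "wt")  # make 'wt' the reference level
# dds <- DESeq(dds)      # the model would need to be re-fit for the new reference to apply
# res <- results(dds)    # the contrast in the output would then read 'condition snf2 vs wt'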
Speaking of log2fold change, what do all of these columns mean?
column | description
---|---
baseMean | mean across all samples
log2FoldChange | log2 fold change of gene expression from one condition to another; reflects how different the expression of a gene in one condition is from the expression of the same gene in another condition
lfcSE | standard error (used to calculate the p value)
stat | test statistic (used to calculate the p value)
pvalue | p-value for the log fold change
padj | adjusted p-value
We see that the default differential expression output is sorted the same way as our input counts. Instead, it can be helpful to sort and filter by adjusted p value or log2 Fold Change:
res_sig <- subset(res, padj<.05)
res_lfc <- subset(res_sig, abs(log2FoldChange) > 1)
head(res_lfc)
Looking at our results is great, but visualizing them is even better!
The MA plot provides a global view of the relationship between the expression change between conditions (log ratios, M), the average expression strength of the genes (average mean, A), and the ability of the algorithm to detect differential gene expression: genes that pass the significance threshold are colored in red.
plotMA(res)
Question
Why are more genes grey on the left side of the axis than on the right side?
Although it’s helpful to plot many (or all) genes at once, sometimes we want to see how counts change for a specific gene. We can use the following code to produce such a plot:
plotCounts(dds, gene=which.min(res$padj), intgroup="condition")
Question
What gene is plotted here (i.e., what criteria did we use to select a single
gene to plot)?
DESeq2 automatically normalizes our count data when it runs differential expression. However, for certain plots, we need to normalize our raw count data ourselves. One way to do that is to use the vst() function. It performs a variance stabilizing transformation on the count data, while controlling for the library size of samples.
vsd <- vst(dds)
An MDS (multi-dimensional scaling) plot gives us an idea of how our samples relate to each other. The closer two samples are on a plot, the more similar all of their counts are. To generate this plot in DESeq2, we need to calculate “sample distances” and then plot them.
# calculate sample distances
sample_dists <- assay(vsd) %>%
t() %>%
dist() %>%
as.matrix()
head(sample_dists)
Next, let’s calculate the MDS values from the distance matrix.
mdsData <- data.frame(cmdscale(sample_dists))
mds <- cbind(mdsData, as.data.frame(colData(vsd))) # combine with sample data
head(mds)
And plot with ggplot2!
ggplot(mds, aes(X1, X2, shape = condition)) +
geom_point(size = 3) +
theme_minimal()
Question
How similar are the samples between conditions?
Heatmaps are a great way to look at gene counts. To do that, we can use a function in the pheatmap package. Here we demonstrate how to install a package from the CRAN repository and then load it into our environment.
install.packages("pheatmap")
library(pheatmap)
Next, we can select a subset of genes to plot. Although we could plot all ~6000 yeast genes, let’s choose the 20 genes with the largest positive log2fold change.
genes <- rownames(res_lfc)[order(res_lfc$log2FoldChange, decreasing = TRUE)[1:20]] # gene names of the 20 genes with the largest positive log2 fold change
We can also make a data.frame that contains information about our samples that will appear in the heatmap. We will use our samples data.frame from before to do this.
annot_col <- samples %>%
column_to_rownames('sample') %>%
select(condition) %>%
as.data.frame()
head(annot_col)
And now plot the heatmap!
pheatmap(assay(vsd)[genes, ], cluster_rows=TRUE, show_rownames=TRUE,
cluster_cols=FALSE, annotation_col=annot_col)
We see that our samples do cluster by condition, but that by looking at just the counts, the patterns aren’t very strong. How does this compare to our MDS plot?
Question
When are heatmaps useful?
What other types of heatmaps have you seen in the wild?
Here are some helpful notes or resources for anyone performing differential expression.
Introduction to differential gene expression analysis using RNA-seq (Written by Friederike Dündar, Luce Skrabanek, Paul Zumbo). Click here
Introduction to DGE - click here
Making a tx2gene file is often a little different for each organism. If your organism has a transcriptome (or *rna_from_genomic.fna.gz file) on RefSeq, you can often get the information from the *feature_table.txt.gz file. You might also be able to parse a gtf or gff file to produce the information you need. This information is sometimes also found in the fasta headers of the transcriptome file itself.
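As a rough sketch of the GTF-parsing route (annotation.gtf is a placeholder file name, and the transcript_id / gene_id attribute tags follow common GTF conventions but should be checked for your organism):
library(tidyverse)
gtf <- read_tsv("annotation.gtf", comment = "#",
                col_names = c("seqname", "source", "feature", "start", "end",
                              "score", "strand", "frame", "attribute"))
tx2gene <- gtf %>%
  filter(feature == "transcript") %>%
  mutate(TXNAME = str_match(attribute, 'transcript_id "([^"]+)"')[, 2],
         GENEID = str_match(attribute, 'gene_id "([^"]+)"')[, 2]) %>%
  select(TXNAME, GENEID) %>%
  distinct()
head(tx2gene)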
Learning objectives:
define and explore the concepts and implications of shotgun sequencing;
explore coverage;
understand the basics of mapping-based variant calling;
learn basics of actually calling variants & visualizing.
You should still have your jetstream instance running; you can follow the instructions here to log in to JetStream and find your instance, then ssh into it following the instructions here.
conda install -y bwa samtools bcftools
After installing the necessary software, we will create the working directory for the mapping as follows:
cd ~/
mkdir mapping
cd mapping
Next, we will create links from the previously downloaded and quality-trimmed yeast dataset:
ln -fs ~/quality/*.qc.fq.gz .
ls
Goal: perform read alignment (also known as mapping) to determine where our reads originated from.
Here we are going to align our transcripts to the reference’s open reading frames to look for single-nucleotide variants. It’s important to think about what reference is appropriate for your experiment. Many biologically important variants exist in non-coding regions, so if we were looking at genomic sequences, it would be important to use a different reference such as the whole genome.
curl -O https://downloads.yeastgenome.org/sequence/S288C_reference/orf_dna/orf_coding.fasta.gz
gunzip orf_coding.fasta.gz
Let’s take a look at our reference:
head orf_coding.fasta
To bring up the manual page for bwa we can type:
bwa
Notice that it brings up a description of the different commands within the bwa software. We will be using index and mem.
We can also have a look at the bwa options page.
Our first step is to index the reference sequences for use by BWA. Indexing allows the aligner to quickly find potential alignment sites for query sequences in a reference, which saves time during alignment. Indexing the reference only has to be run once. The only reason you would want to create a new index is if you are working with a different reference or you are using a different tool for alignment.
bwa index orf_coding.fasta
We use an algorithm called bwa mem to perform mapping. To find more information via the help page for the bwa mem function we can type the command below; without any input, the help page comes up by default:
bwa mem
Then, when we are ready to perform mapping with our sample ERR458493, we can type:
bwa mem -t 4 orf_coding.fasta ERR458493.qc.fq.gz > ERR458493.sam
While we are running bwa with the default parameters here, your use case might require a change of parameters. NOTE: Always read the manual page for any tool before using and make sure the options you use are appropriate for your data.
What is the difference between Salmon and bwa mem?
Standard alignment tools (Hisat2, STAR, BWA) try to find the read origin by FULL-ALIGNMENT of reads to a genome or transcriptome.
Ultra-fast alignment-free methods, such as Sailfish, Salmon and Kallisto, have been developed by exploiting the idea that precise alignments are not required to assign reads to their origins
Salmon's "quasi-mapping" approach uses exact matching of k-mers (sub-strings of reads) to approximate which transcript a read originated from. The idea behind it is that it may not be important to know exactly where within a transcript a certain read originated from; instead, it may be enough to simply know which transcript the read represents.
Salmon therefore does not generate a BAM file because it does not worry about finding the best possible alignment. Instead, it yields a (probabilistic) measure of how many reads originated from each transcript. This is enough information for read quantification, and is really fast.
However, bwa mem produces an alignment, where an entire read is mapped exactly against a reference sequence. This produces more information, which is important for things like variant calling.
We can peek at our “.sam” file:
head ERR458493.sam
tail ERR458493.sam
The SAM file is a tab-delimited text file that contains information for each individual read and its alignment to the reference. While we do not have time to go in detail of the features of the SAM format, the paper by Heng Li et al. provides a lot more detail on the specification.
The compressed binary version of SAM is called a BAM file. We use this version to reduce size and to allow for indexing, which enables efficient random access of the data contained within the file.
PRACTICE! Using the bwa command we ran above as the foundation, construct a for loop to generate ".sam" alignment files for all of our quality controlled samples!
Solution:
for filename in *.qc.fq.gz
do
  name=$(basename $filename .qc.fq.gz)
  echo Working on: $name
  bwa mem -t 4 orf_coding.fasta $filename > ${name}.sam
done
Goal: make it possible to visualize some of our mapping results.
Earlier we indexed the reference for bwa; now we need to index the reference for samtools. To see the manual for samtools we can type:
samtools
Although the two tools use different indexing methods, both indexes allow the tools to quickly find specific sequences within the reference. We can see that the indexing function is samtools faidx.
samtools faidx orf_coding.fasta
Next, we will convert our file format to a .bam file with the samtools view command. Let's see the different parameters for this function:
samtools view
We can see that:
-S : ignored (input format is auto-detected)
-b : output BAM
So let's convert the file format for our sample ERR458493:
samtools view -S -b ERR458493.sam > ERR458493.bam
You can sort on many different columns within a sam or bam file. After mapping, our files are sorted by read number. Most programs require mapping files to be sorted by position in the reference. You can sort a file using the samtools sort command.
samtools sort ERR458493.bam -o ERR458493.sorted.bam
samtools index ERR458493.sorted.bam
Calling variants with bcftools!
Goal: find places where the reads are systematically different from the reference.
A variant call is a conclusion that there is a nucleotide difference vs. some reference at a given
position in an individual genome or transcriptome, often referred to as a Single Nucleotide Polymorphism (SNP).
The call is usually accompanied by an estimate of variant frequency and some measure of confidence. Similar
to other steps in this workflow, there are number of tools available for variant calling. In this workshop
we will be using bcftools
, but there are a few things we need to do before actually calling the variants.
First, let's look at the manual page for bcftools, making note of the mpileup and call commands:
bcftools
Let's see the different parameters for the bcftools mpileup command by typing:
bcftools mpileup
Note: the parameters we will use for bcftools mpileup are:
-O : output file type, which can be:
'b' compressed BCF. We will use this output format!
'u' uncompressed BCF
'z' compressed VCF
'v' uncompressed VCF
-f : faidx-indexed reference sequence file
And to find help for the bcftools call function:
bcftools call
Note: the parameters for bcftools call that we will use are:
-m : alternative model for multiallelic and rare-variant calling (conflicts with -c)
-v : output variant sites only
-o : write output to a file
Now, let's go ahead and run our command on our sample ERR458493!
bcftools mpileup -O b -f orf_coding.fasta ERR458493.sorted.bam | \
bcftools call -m -v -o variants.vcf
Next, we will use a perl script from samtools called vcfutils.pl to filter our variants and write the output to a new file.
vcfutils.pl varFilter variants.vcf > variants_filtered.vcf
To look at the entire variants.vcf file you can run cat variants.vcf; all of the lines starting with # are comments. You can use tail variants.vcf to see the last ~10 lines, which should be all of the called variants.
The first few columns of the VCF file represent the information we have about a predicted variation:
column | info |
---|---|
CHROM | contig location where the variation occurs |
POS | position within the contig where the variation occurs |
ID | a . until we add annotation information |
REF | reference genotype (forward strand) |
ALT | sample genotype (forward strand) |
QUAL | Phred-scaled probability that the observed variant exists at this site (higher is better) |
FILTER | a . if no quality filters have been applied, PASS if a filter is passed, or the name of the filters this variant failed |
The Broad Institute’s VCF guide is an excellent place to learn more about VCF file format.
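If you want to peek at just those columns without the long header, one optional approach is to drop the comment lines and select fields with standard shell tools (using the filtered VCF we just made):
# show CHROM, POS, REF, ALT, and QUAL for the first few variant records
grep -v "^#" variants_filtered.vcf | cut -f 1,2,4,5,6 | head
# count how many variants were called in total
grep -vc "^#" variants_filtered.vcf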
tview:
Now that we know the locations of our variants, let’s view them with samtools!
Samtools implements a very simple text alignment viewer called tview. This alignment viewer works with short indels and shows MAQ consensus. It uses different colors to display mapping quality or base quality, subject to the user’s choice. The samtools viewer is known to work swiftly even with a 130 GB alignment. Due to its text interface, displaying alignments over a network is also very fast.
In order to visualize our mapped reads we use tview, giving it the sorted bam file and the reference file:
samtools tview ERR458493.sorted.bam orf_coding.fasta
tview
commands of relevance:
left and right arrows scroll
q
to quit
CTRL-h and CTRL-l do “big” scrolls
Typing g
allows you to go to a specific location, in this format chromosome:location. Here are some locations you can try out:
YLR162W:293
(impressive pileup, shows two clear variants and three other less clear)
YDR034C-A:98
(impressive pileup, shows two clear variants)
YDR366C:310
(impressive pileup, less clear variants)
YLR256W:4420
(impressive pileup, less clear variants)
YBL105C:2179
(less depth, shows two clear variants)
YDR471W:152
(impressive pileup, shows one clear variant)
Get some summary statistics as well:
samtools flagstat ERR458493.sorted.bam
Learn what a bash/shell script is and how to write one
Incorporate for loops within a shell script
We have now worked through two workflows, RNA-seq analysis and variant calling. In both cases, we pasted all of our commands into the terminal one command at a time, and used for loops to run each step of the workflow over our six samples. This is necessary at first, and a great way to understand how each tool works and establish a workflow we want to use.
Let’s imagine now that we decided to download all 672 samples from this dataset instead of only working with our six samples. Since we have already figured out how each tool in our workflow works, and we know how we want to run them, one way we can automate (and document) our workflow is with a bash script (remember that bash is just the most common programming language used in our terminal environment).
Here we are going to automate the quality assessment and trimming we did here and the read quantification we performed here in one script. But first, let’s see what a bash script is!
You should still have your jetstream instance running; you can follow the instructions here to log in to JetStream and find your instance. Then ssh
into it following the instructions here.
Next, we’re going to access RStudio again on our instances. RStudio will be convenient for editing a text file alongside an active terminal window (command-line window). We’ll see this in action in a second. As earlier, run the following command to produce a URL that navigates to RStudio when entered in your browser.
echo http://$(hostname -i):8787/
Previously, we’ve been writing R code in the RStudio text editor at the top left, but we can also write any kind of file there. So that will be where we build up our bash script. To start with a clean new document, select “File” from the top menu bar inside RStudio, then “New File”, then “Text File”.
At the bottom left window is our R “console” right now. But if we select “Tools” from the top menu bar within RStudio, then “Terminal”, and then “New Terminal”, we will open a command-line terminal in that bottom left window that looks something like this:
This is the same as our terminal that we used to ssh
into and then get the link to the RStudio environment, but we are just accessing it through RStudio here. Running commands like cd
, ls
, and others we’ve been using at the command line all work the same as if we were in a terminal outside of RStudio. If something isn’t working there, double-check it is on the “Terminal” tab, and not the “Console” tab (the “Console” tab is the R environment, while the “Terminal” tab is our command-line environment).
First, in the Terminal of RStudio, let’s run these commands:
ls
echo "I'm here to show something random being done"
pwd
Those 3 commands just execute as we are used to when running them at the command line one-by-one. Now let’s copy those 3 commands, and put them into our text editor window at the top-left:
And save the file in our home directory (which should be where it starts) as “my-first-bash-script.sh”:
Now if we check at the command line, (our terminal window at the bottom left), we can see this file is there:
ls *.sh
And we can run it by providing the command bash
, and then the file we just made as a positional argument like so:
bash my-first-bash-script.sh
And all three lines are executed as though we entered them one at a time! This is all a bash/shell script is 🙂
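As an optional aside (not needed for the rest of this lesson), a common convention is to add a "shebang" line at the top of a bash script and make the file executable, so it can be run directly without typing bash first:
# optional: make the script directly executable
# 1) add "#!/bin/bash" as the very first line of my-first-bash-script.sh
# 2) then, at the terminal, make it executable and run it:
chmod +x my-first-bash-script.sh
./my-first-bash-script.sh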
Next let’s build up a larger script!
For the sake of speed while running things in this tutorial, we are going to be working with only 2 samples. But it would be the same process even if we had 200.
In the Terminal window of RStudio (bottom left), let’s make a new directory to work in, download just two of our data files and the reference transcriptome we used for read quantification, and copy over the adapter file we used for Trimmomatic:
mkdir scripting/
cd scripting/
curl -L https://osf.io/5daup/download -o ERR458493.fastq.gz
curl -L https://osf.io/8rvh5/download -o ERR458494.fastq.gz
curl -O ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/146/045/GCA_000146045.2_R64/GCA_000146045.2_R64_rna_from_genomic.fna.gz
cp /opt/miniconda/pkgs/trimmomatic-*/share/trimmomatic-*/adapters/TruSeq2-SE.fa .
When we write a bash script, we need to add all the commands that we would run in our
workflow if we were doing it in real-time. This includes making and changing directories, moving files, and
running analysis programs like fastqc
.
Let’s start by just running fastqc
inside a script.
Don’t enter this command in your Terminal, instead copy and paste it into a new text document in the top left panel (File -> New File -> Text File).
fastqc *fastq.gz
And then save the file as “workflow.sh”, and be sure to save it in the “scripting/” directory we are working in:
Now we have a bash script that automates fastqc
! Let’s run the bash script from our terminal in RStudio:
bash workflow.sh
Let’s add more to our script. Next we’ll organize our output files, and add some echo
commands to tell us where we are up to if we happen to be watching. Add this text onto our file in the text-editor panel where our script is building, not in the Terminal window.
echo "Running FastQC..."
fastqc *.fastq.gz
echo "Moving FastQC results..."
mkdir fastqc_untrimmed/
mv *.zip fastqc_untrimmed/
mv *.html fastqc_untrimmed/
And don’t forget to save the file after making changes, cmd + s
or ctrl + s
should do the trick, and the file name at the top should change from red to black when saved:
Now, let’s run this again.
bash workflow.sh
We now see that our echo messages print to the terminal and tell us where we
are in the workflow. After it finishes, if we run ls
on the “fastqc_untrimmed” directory, we see our fastqc
output files are in there:
ls fastqc_untrimmed/
Now let’s add our Trimmomatic command in a for loop like we did here:
echo "Running FastQC..."
fastqc *.fastq.gz
echo "Moving FastQC results..."
mkdir fastqc_untrimmed/
mv *.zip fastqc_untrimmed/
mv *.html fastqc_untrimmed/
echo "Running Trimmomatic..."
for filename in *.fastq.gz
do
# first, getting the unique part of the file name alone by removing ".fastq.gz"
base=$(basename $filename .fastq.gz)
echo "On sample: $base"
trimmomatic SE ${base}.fastq.gz \
${base}.qc.fq.gz \
ILLUMINACLIP:TruSeq2-SE.fa:2:0:15 \
LEADING:15 TRAILING:15 \
SLIDINGWINDOW:10:20 \
MINLEN:25
done
And don’t forget to save the file again:
And now we are ready to run it in our terminal window:
bash workflow.sh
PRACTICE! Add some commands at the end of our script that will:
1) run FastQC on the output files from Trimmomatic (those that end with “.qc.fq.gz”)
2) create a new directory called “fastqc_trimmed”
3) move the output files from FastQC into “fastqc_trimmed”.
Solution
Our entire file should now look something like this:
echo "Running FastQC..."
fastqc *.fastq.gz
echo "Moving FastQC results..."
mkdir fastqc_untrimmed/
mv *.zip fastqc_untrimmed/
mv *.html fastqc_untrimmed/
echo "Running Trimmomatic..."
for filename in *.fastq.gz
do
  # first, getting the unique part of the file name by removing ".fastq.gz"
  base=$(basename $filename .fastq.gz)
  echo "On sample: $base"
  trimmomatic SE ${base}.fastq.gz \
    ${base}.qc.fq.gz \
    ILLUMINACLIP:TruSeq2-SE.fa:2:0:15 \
    LEADING:15 TRAILING:15 \
    SLIDINGWINDOW:10:20 \
    MINLEN:25
done
echo "Running FastQC on trimmed files..."
fastqc *.qc.fq.gz
echo "Moving trimmed FastQC results..."
mkdir fastqc_trimmed/
mv *.zip fastqc_trimmed/
mv *.html fastqc_trimmed/
Lastly, let’s add on read quantification with Salmon like we did here. We only need to run the indexing command once, and then we can use a for loop to run the quantification command on our quality-trimmed fastq files, just like the one used here. And then let’s add a message at the end for our future selves:
echo "Running FastQC..."
fastqc *.fastq.gz
echo "Moving FastQC results..."
mkdir fastqc_untrimmed/
mv *.zip fastqc_untrimmed/
mv *.html fastqc_untrimmed/
echo "Running Trimmomatic..."
for filename in *.fastq.gz
do
# first, getting the unique part of the file name alone by removing ".fastq.gz"
base=$(basename $filename .fastq.gz)
echo "On sample: $base"
trimmomatic SE ${base}.fastq.gz \
${base}.qc.fq.gz \
ILLUMINACLIP:TruSeq2-SE.fa:2:0:15 \
LEADING:15 TRAILING:15 \
SLIDINGWINDOW:10:20 \
MINLEN:25
done
echo "Running FastQC on trimmed files..."
fastqc *.qc.fq.gz
echo "Moving trimmed FastQC results..."
mkdir fastqc_trimmed/
mv *.zip fastqc_trimmed/
mv *.html fastqc_trimmed/
echo "Indexing ref. transcriptome for Salmon..."
salmon index --index sc_index --type quasi --transcripts GCA_000146045.2_R64_rna_from_genomic.fna.gz
echo "Quantifying with Salmon..."
for filename in *.qc.fq.gz
do
# getting the unique part of the file name alone by removing ".qc.fq.gz"
base=$(basename $filename .qc.fq.gz)
salmon quant -i sc_index --libType A \
-r ${filename} -o ${base}_quant --seqBias \
--gcBias --validateMappings
done
echo "Job Done!! (That's a nice-lookin' script!)"
Stellar! Don’t forget to save it, and now let’s run it!
bash workflow.sh
Bash scripting can be very powerful!!
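One optional addition worth knowing about (our suggestion, not part of the lesson’s script): putting set -e near the top of a bash script makes it stop at the first command that fails, instead of continuing on to later steps with incomplete inputs.
#!/bin/bash
# stop at the first error instead of continuing with later steps
set -e

echo "Running FastQC..."
fastqc *.fastq.gz
# ... the rest of workflow.sh continues as before ...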
Identify cases where workflow managers are helpful for automation
Understand the components of a Snakefile: rules, inputs, outputs, and actions.
Write and run a Snakefile.
You should still have your jetstream instance running; you can follow the instructions here to log in to JetStream and find your instance. Then ssh
into it following the instructions here.
You should now be logged into your Jetstream computer! You should see something like this:
diblynn@js-17-71:~$
Make sure you’re in your home directory:
cd ~
Install snakemake
using conda.
conda install -y -c conda-forge -c bioconda snakemake-minimal
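A quick optional check that the install worked is to ask snakemake for its version number:
snakemake --version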
Type the following in your terminal to display a link to the RStudio web-server for your instance (it fills in your instance’s IP address via $(hostname -i)):
echo http://$(hostname -i):8787/
Copy and paste the generated link into your browser to open Rstudio and login with your room’s jetstream username and password. We’re going to again work with the text editor and the terminal of Rstudio.
In both our RNA-seq workflow and our mapping and variant calling workflow, we
performed many steps. We performed steps like quality assessment and filtering using
fastqc
and trimmomatic
. We performed these steps on 6 files using for loops.
In our last lesson, we automated these steps using a bash script. We put all of our commands into one file so that we only had to run one command to orchestrate our quality control workflow. Bash scripting for automation is really powerful!
Let’s revisit the first part of our bash script for running and organizing our fastqc results:
cd ~/data/
echo "Running FastQC ..."
fastqc *.fastq.gz
mkdir -p ~/fastqc_untrimmed
echo "Saving FastQC results..."
mv *.zip ~/fastqc_untrimmed/
mv *.html ~/fastqc_untrimmed/
cd ~/fastqc_untrimmed/
echo "Running multiqc..."
multiqc .
We can run it using this command:
bash qc.sh
Oh crap! We realize just after we’ve finished Running FastQC
that we wanted
to move our summary file to a subdirectory! Quick, press ctrl - C
and cancel
the run!
Even if we made this change though, we’re in a bit of a pickle. We want to re-start our script that automates the runs and the file movements for us, but we already ran the first part of our file! Let’s add comment characters to the lines we know already ran and then re-run the script:
cd ~/data/
# echo "Running FastQC ..."
# fastqc *.fastq.gz
mkdir -p ~/data/fastqc_untrimmed
echo "Saving FastQC results..."
mv *.zip ~/data/fastqc_untrimmed/
mv *.html ~/data/fastqc_untrimmed/
cd ~/data/fastqc_untrimmed/
echo "Running multiqc..."
multiqc .
Now we can re-run the script:
bash qc.sh
This (maybe) worked out ok this time. However, it’s hard to be sure we know where we stopped our command. For this reason and many others, we use workflow managers to keep track of the things that have and haven’t run yet!
We’ll be using snakemake for automation.
The Snakemake workflow management system is a tool to create reproducible and scalable data analyses. It orchestrates and keeps track of all the different steps of workflows that have been run so you don’t have to! It has a lot of wonderful features that can be invoked for different applications, making it very flexible while maintaining human interpretability.
There are many different tools that researchers use to automate computational workflows. We selected snakemake for the following reasons:
It was written by a bioinformatician for bioinformatics workflows.
It’s free, open-source, and conda-installable
Snakemake works cross-platform (Windows, MacOS, Linux) and is compatible with all HPC schedulers. It works on laptops, the cloud, and clusters without modification to the main workflow (as long as you have enough compute resources!).
Snakemake is written using Python, but supports bash and R code as well.
Anything that you can do in Python, you can do with Snakemake (since you can pretty much execute arbitrary Python code anywhere).
Like other workflow management systems, Snakemake allows you to:
Keep a record of how your scripts are used and what their input dependencies are
Run multiple steps in sequence, parallelising where possible
Automatically detect if something changes and then reprocess data if needed
Our goal is to automate the first two steps (FastQC and MultiQC) of our example workflow using snakemake!
Snakemake workflows are built around rules. The diagram below shows the anatomy of a snakemake rule:
Let’s make a rule to run fastqc
on one of our samples below. We’ll put this
rule in a file called Snakefile
.
# This rule will run fastqc on the specified input file.
rule fastqc_raw:
input: "data/ERR458493.fastq.gz"
output:
"fastqc_raw/ERR458493_fastqc.html",
"fastqc_raw/ERR458493_fastqc.zip"
shell:'''
fastqc -o fastqc_raw {input}
'''
Let’s try and run our Snakefile! Return to the command line and run snakemake
.
snakemake
You should see output that starts like this:
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 fastqc_raw
1
[Tue Jul 2 19:10:26 2019]
rule fastqc_raw:
input: data/ERR458493.fastq.gz
output: fastqc_raw/ERR458493_fastqc.html, fastqc_raw/ERR458493_fastqc.zip
jobid: 0
Let’s check that the output file is there:
ls fastqc_raw/*fastqc*
Yay! Snakemake ran the thing!
We can also use better organization. Let’s specify a different output folder for our fastqc results:
# This rule will run fastqc on the specified input file
# (replace the prior fastqc_raw rule with this new rule)
rule fastqc_raw:
input: "data/ERR458493.fastq.gz"
output:
"fastqc_raw/ERR458493_fastqc.html",
"fastqc_raw/ERR458493_fastqc.zip"
shell:'''
fastqc -o fastqc_raw {input}
'''
If we look in our directory, we should now see a fastqc_raw
directory, even
though we didn’t create it:
ls
Snakemake created this directory for us. We can look inside it to see if it really ran our command:
ls fastqc_raw
We told snakemake to do something, and it did it. Let’s add another rule to our Snakefile telling snakemake to do something else. This time, we’ll run multiqc.
# Run fastqc on the specified input file
rule fastqc_raw:
input: "data/ERR458493.fastq.gz"
output:
"fastqc_raw/ERR458493_fastqc.html",
"fastqc_raw/ERR458493_fastqc.zip"
shell:'''
fastqc -o fastqc_raw {input}
'''
# Run multiqc on the results of the fastqc_raw rule
rule multiqc_raw:
input: "fastqc_raw/ERR458493_fastqc.zip"
output: "fastqc_raw/multiqc_report.html"
shell:'''
multiqc -o fastqc_raw fastqc_raw
'''
If we run snakemake again, we see output like this:
Building DAG of jobs...
Nothing to be done.
Complete log: /Users/tr/2019_angus/.snakemake/log/2019-07-02T191640.002744.snakemake.log
However, when we look at the output directory fastqc_raw
, we see that our
multiqc file does not exist! Bad Snakemake! Bad!!
Snakemake uses the input of a rule all
as the final file(s) it needs to
produce in a workflow. Once these files are defined, it will go back through all
other rules looking for which ordered sequence of rules will produce all of the
files necessary to get the final file(s) specified in rule all. For this point
in our workflow, that final file is our multiqc report. Let’s add a rule all.
rule all:
"""
The "all" rule is the first rule that Snakemake will run.
Snakemake will automatically figure out which rules need to
be run to generate the specified input file.
"""
input:
"fastqc_raw/multiqc_report.html"
rule fastqc_raw:
input: "data/ERR458493.fastq.gz"
output:
"fastqc_raw/ERR458493_fastqc.html",
"fastqc_raw/ERR458493_fastqc.zip"
shell:'''
fastqc -o fastqc_raw {input}
'''
rule multiqc_raw:
input: "fastqc_raw/ERR458493_fastqc.html"
output: "fastqc_raw/multiqc_report.html"
shell:'''
multiqc -o fastqc_raw fastqc_raw
'''
And it worked! Now we see output like this:
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 all
1 multiqc_raw
2
Snakemake now has two processes it’s keeping track of.
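This also demonstrates the change-detection behavior mentioned earlier: if we run snakemake once more without modifying anything, it notices the multiqc report already exists and reports that there is nothing to do.
snakemake
# expected output (the same message we saw before adding rule all):
# Building DAG of jobs...
# Nothing to be done.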
So far we’ve been using snakemake to process one sample. However, we have 6! Snakemake can be flexibly extended to more samples using wildcards.
We already saw something like wildcards previously. When we used {input}
in the shell command, {input}
acted as a placeholder that Snakemake filled in with the value we specified for input.
rule fastqc_raw:
input: "data/ERR458493.fastq.gz"
output:
"fastqc_raw/ERR458493_fastqc.html",
"fastqc_raw/ERR458493_fastqc.zip"
shell:'''
fastqc -o fastqc_raw {input}
'''
We can create our own wildcard too. This is really handy for running our workflow on all of our samples.
# Create a list of strings containing all of our sample names
SAMPLES=['ERR458493', 'ERR458494', 'ERR458495', 'ERR458500', 'ERR458501',
'ERR458502']
rule all:
input:
"fastqc_raw/multiqc_report.html"
rule fastqc_raw:
input: "data/{sample}.fastq.gz"
output:
"fastqc_raw/{sample}_fastqc.html",
"fastqc_raw/{sample}_fastqc.zip"
shell:'''
fastqc -o fastqc_raw {input}
'''
rule multiqc_raw:
input: expand("fastqc_raw/{sample}_fastqc.html", sample = SAMPLES)
output: "fastqc_raw/multiqc_report.html"
shell:'''
multiqc -o fastqc_raw fastqc_raw
'''
We can run this again at the terminal.
snakemake
And we have now run these rules for each of our samples!
Note that we added new syntax here as well. We define a variable at the top
of the snakefile called SAMPLES. Snakemake resolves the values for the wildcard
{sample} the last time it sees that wildcard. However, we need to expand
the wildcard using the expand function, and tell snakemake in which variable
to look for the values. For example, expand("fastqc_raw/{sample}_fastqc.html", sample = SAMPLES) evaluates to a list of six file paths, one per sample name in SAMPLES.
Indentation is important; use two or four spaces for each indentation.
Define your target (final output) files in rule all
Use unique extensions or directories for each rule to avoid wildcard collisions
Do a dry run that prints each shell command and the reason each rule will run:
snakemake -n -p -r
Run with multiple cores:
snakemake --cores 8
Or hand jobs off to a cluster scheduler:
snakemake --cluster-config cluster.yml --cluster \
"sbatch -A {cluster.account} -t {cluster.time}"
First we need to install a tool called Graphviz, which provides the dot command:
conda install -c conda-forge graphviz
snakemake --dag | dot -Tpng > dag.png
The resulting DAG png file should look something like what is shown above.
Snakemake can automatically generate detailed self-contained HTML reports that encompass runtime statistics, provenance information, workflow topology and results.
To create the report, run
snakemake --report report.html
View sample report here
You can specify software on a per-rule basis! This is really helpful when you have incompatible software requirements for different rules, or want to run on a cluster, or want to make your workflow repeatable.
For example, if you create a file env_fastqc.yml
with the following content:
channels:
- conda-forge
- bioconda
- defaults
dependencies:
- fastqc==0.11.8
and then change the fastqc rule to look like this:
rule fastqc_raw:
input: "data/{sample}.fastq.gz"
output:
"fastqc_raw/{sample}_fastqc.html",
"fastqc_raw/{sample}_fastqc.zip"
conda:
"env_fastqc.yml"
shell:'''
fastqc -o fastqc_raw {input}
'''
you can now run snakemake like so,
snakemake --use-conda
and for that rule, snakemake will install the specified software and its dependencies in an isolated environment, at the specified version.
This aids in reproducibility, in addition to the practical advantages of isolating software installs from each other.
Here are some great Snakemake Workflows. Check out the RNAseq-STAR-DESEq2 workflow here.
In our previous lesson, we used snakemake to automate the first two rules of our quality control/quality analysis. Given the things you learned during that lesson, we will now automate the rest of the quality control workflow using snakemake. For this lesson, we would like you to work with others in your room to make the snakefile. We have provided the basic structure for each of the rules, and would like you to fill in the remaining necessary details for the full workflow to run on all samples.
We’ve added in the rules we created from our previous lesson as well.
Let’s make sure we’re in a good working directory and we have the necessary software installed. If you’ve already run the installation command on your current instance, you don’t need to run it again.
cd ~
conda install -y fastqc multiqc trimmomatic
SAMPLES=['ERR458493', 'ERR458494', 'ERR458495', 'ERR458500', 'ERR458501',
'ERR458502']
rule all:
input:
"" # add the path to the final file
rule fastqc_raw:
input: "data/{sample}.fastq.gz"
output:
"fastqc_raw/{sample}_fastqc.html",
"fastqc_raw/{sample}_fastqc.zip"
shell:'''
fastqc -o fastqc_raw {input}
'''
rule multiqc_raw:
input: expand("fastqc_raw/{sample}_fastqc.html", sample = SAMPLES)
output: "fastqc_raw/multiqc_report.html"
shell:'''
multiqc -o fastqc_raw fastqc_raw
'''
# Add the trimmomatic commands from our trimming lesson.
rule trim:
input:
reads = "",
adapters = ""
output: "" # the output should be the trimmed file
shell: '''
trimmomatic SE
'''
# Use the commands above as a template to fill in these rules, this time
# running the analyses on the trimmed reads.
rule fastqc_trimmed:
rule multiqc_trimmed:
When complete, this snakemake workflow automates all of the bash steps of our quality control workflow!
You can generate a dag of your workflow using the following command:
snakemake --dag | dot -Tpng > dag.png
Remember, “dag” stands for Directed Acyclic Graph. It shows the steps of your workflow that are executed by Snakemake. This dag can be helpful to communicate your workflow to others, or to visualize the decisions snakemake is making as you are constructing your workflow.
You can open it to view using the “Files” tab in the RStudio pane and opening
the dag.png
file you created.
Here we’re going to run through some of the typical steps for taking a newly sequenced bacterial isolate genome from raw fastq files through to some assemblies and discussing how we might choose which assembly to move forward with 🙂
Before we get started, a public service announcement:
ATTENTION!
This is not an authoritative or exhaustive workflow for working with a newly sequenced genome. No such thing exists! All genomes, datasets, and goals are different, and new tools are constantly being developed. The point of this page is just to give examples of some of the things we can do. Don’t let anything here, or anywhere, constrain your science to doing only what others have done!
Now that that’s out of the way, let’s get to it!
We should still have our jetstream instances running. You can follow the instructions here to log in to the JetStream website and find your instance and IP address. Then ssh
into it from your laptop following the instructions here, so we are working in a terminal on our JetStream instance.
Throughout this process we’ll be using a variety of tools. They are going to be installed using conda with the specific versions used at the time this was put together.
First, we will create an environment, activate it, and then install all the needed programs:
conda create -y -n de_novo_example
conda activate de_novo_example
conda install -y -c bioconda -c conda-forge fastqc=0.11.5 \
trimmomatic=0.36 spades=3.11.1 megahit=1.1.1 \
quast=5.0.2 bowtie2=2.2.5 anvio=5.5.0 \
centrifuge=1.0.4
CODE BREAKDOWN
conda create
- creating our conda environment
-y
- say yes to all questions automatically (just do it and leave me alone)
-n de_novo_example
- names the environment we are creating “de_novo_example”
conda activate de_novo_example
- activates our new environment (puts us in it)
conda install
- installs the programs we need in our current conda environment
-y
- same as above, just do it, no questions
-c bioconda
and-c conda-forge
- specifies to search these channels for the following programs
fastqc=0.11.5
and the rest of the positional arguments - name the programs we want, each followed by an equals sign and the version we want (the versions used when this was put together)
The practice data we’re going to use here was provided by colleagues at the J. Craig Venter Institute. In working out the details for a rather large-scale project, which in part involves sequencing a bunch of Burkholderia isolates from the International Space Station and performing de novo genome assemblies, Aubrie O’Rourke and her team put an already sequenced isolate – Burkholderia cepacia (ATCC 25416) – through their pipeline in order to test things out and to have something to benchmark their expectations against. Here we are going to be assembling that genome. The sequencing was done on Illumina’s Nextseq platform as paired-end 2x150 bps, with about a 350-bp insert size.
The data files are rather large for this example, at 1.7 GB compressed, and 2.6 GB uncompressed. This is because the download comes with starting and output files, so that we can skip some of the more time-consuming steps and just copy over results.
Copying and pasting this code block will get us the data:
cd ~
curl -L https://ndownloader.figshare.com/files/16197626 -o genomics_de_novo_temp.tar.gz
tar -xzvf genomics_de_novo_temp.tar.gz
rm genomics_de_novo_temp.tar.gz
cd genomics_de_novo_temp/
CODE BREAKDOWN
cd ~
- this moves us into our home directory in case we aren’t already there, so we are all starting in the same place
curl
- this lets us download the data
-L
- this specifies to follow “soft links” if there are any (these are like redirects when we click to download a file and are sent somewhere else first)
https:...
- this is the link to download the data
-o
- lets us specify the name of the output file, which we are naming “genomics_de_novo_temp.tar.gz”
tar
- program that unpacks “tarballs” (.tar) files, and can also optionally decompress them if they are gzipped (.gz)
-x
- extract files from a .tar
-z
- decompress if gzipped (.gz)
-v
- print verbose output
-f genomics_de_novo_temp.tar.gz
- specifies the file we want to act on
Then we delete the downloaded file with rm
and change into our new directory with the last cd command.
There are two subdirectories here; we are going to change into the “working/” directory:
cd working/
Assessing the quality of our sequence data and filtering appropriately should pretty much always be one of the first things we do with our data. A great tool we’ve already used for this is FastQC. Here we’ll start with that on the raw reads.
Remember that FastQC scans the fastq files we give it to generate a broad overview of some summary statistics. And it has several screening modules that test for some commonly occurring problems. But as we talked about before, its modules are expecting random sequence data, and any warning or failure notices the program generates should be interpreted within the context of our experiment (e.g. high duplication levels are to be expected with RNAseq or amplicon data, but FastQC would still flag them as something possibly wrong). If a module “fails” or throws a warning, it’s a good idea to look into what might be expected to cause that by looking through their manual here. Here we’ll provide our forward and reverse reads as 2 positional arguments:
fastqc B_cepacia_raw_R1.fastq.gz B_cepacia_raw_R2.fastq.gz -t 6
Looking at our output from the forward reads (B_cepacia_raw_R1_fastqc.html), not too much stands out other than the quality scores are pretty mediocre from about 2/3 of the way through the read on:
FASTQC LEGEND
The read length is stretched across the x-axis, quality scores on the y-axis; the blue line is the mean quality score of all reads at the corresponding positions; red line is the median; the yellow boxplots represent the interquartile range; and the whiskers represent the 10th and 90th percentiles.
The reverse reads look pretty similar:
Also as we’ve seen, Trimmomatic is a pretty flexible tool that enables us to trim up our sequences based on several quality thresholds. The summary from FastQC didn’t look all that terrible other than semi-low quality scores towards the end of each read, so for a first pass let’s run Trimmomatic with pretty generic, but stringent settings (particularly the minimum length being our full read size), and see what we get (takes < 1 min.):
trimmomatic PE B_cepacia_raw_R1.fastq.gz B_cepacia_raw_R2.fastq.gz \
BCep_R1_paired.fastq.gz BCep_R1_unpaired.fastq.gz \
BCep_R2_paired.fastq.gz BCep_R2_unpaired.fastq.gz \
LEADING:20 TRAILING:20 SLIDINGWINDOW:5:20 MINLEN:151 \
-threads 6
CODE BREAKDOWN
PE
- specifies our data are paired-end
B_cepacia_raw_R1.fastq.gz
andB_cepacia_raw_R2.fastq.gz
- first and second positional arguments are our input forward and reverse reads
BCep_R1_paired.fastq.gz
andBCep_R1_unpaired.fastq.gz
- third and fourth positional arguments are our output files for our forward reads: the third is for those that remain paired (i.e. their partner in the reverse files also passed the filtering thresholds); the fourth is for those that passed filtering but their partner did not 🥺 (gah, that emoji really tugs at the heart strings!)
BCep_R2_paired.fastq.gz
andBCep_R2_unpaired.fastq.gz
- fifth and sixth positional arguments are the same as 3rd and 4th above, but now for the reverse reads
LEADING:20
- this states, starting from the left, if a base has a quality score less than 20, cut it off (will stop at first base above 20)
TRAILING:20
- same as above, except working from the end of the read in
SLIDINGWINDOW:5:20
- starting at base 1, look at a window of 5 bps, and if the average quality score of those 5 bases is below 20, cut the read at the first position of that window and only keep up to that point
MINLEN:151
- after performing all steps above, if the read is now shorter than 151, throw it away. For our purposes this means if any trimming happened, the read will be removed. - Note that not all programs apply options in the order we provide them at the command line, but Trimmomatic does work that way
-threads 6
- specifies how many threads will be used, here 6 (“threads” are complicated)
The output printed to our terminals from Trimmomatic states that only about 14% of the read pairs (both forward and reverse from the same fragment) passed, leaving us with ~350,000 read pairs. That sounds low, but since we know what we’re working with here (meaning it’s an isolate of a known genus and not a metagenome or something completely unknown), we can pretty quickly estimate if this could even possibly be enough coverage for us to assemble the genome we’re expecting.
Assuming those reads were perfect quality and perfectly evenly distributed (which they’re not), that would be 350,000 read pairs * 300 bps per paired read = 105 Mbps covered. Most Burkholderia are around 8.5 Mbps, meaning we’d roughly expect around 12X coverage right now, if all were perfect (105,000,000 / 8,500,000). This confirms that this is a little low and we should probably adjust our stringency on filtering – there aren’t solid rules on a minimum level of coverage for all situations, but a general target to shoot for might be around >50X coverage for de novo assembly of a typical prokaryotic genome.
So, let’s go back and alter how we are filtering a bit.
Since the worst part of the reads quality-wise is at the end, and this is likely getting chopped off in a lot of reads during the sliding window scan, when we had the highly stringent MINLEN:151
above, those would all be completely thrown out. So let’s try allowing a shorter minimum length of 140 in order to retain some of those reads (takes < 1 min.):
trimmomatic PE B_cepacia_raw_R1.fastq.gz B_cepacia_raw_R2.fastq.gz \
BCep_R1_paired.fastq.gz BCep_R1_unpaired.fastq.gz \
BCep_R2_paired.fastq.gz BCep_R2_unpaired.fastq.gz \
LEADING:20 TRAILING:20 SLIDINGWINDOW:5:20 \
MINLEN:140 -threads 6
These settings allowed ~49% of paired reads through (~1.2 million read pairs), which by the same quick estimation we did above suggests we have around an estimated 40X coverage (1,200,000 * 280 bps per read pair ≈ 336 Mbps covered; 336,000,000 / 8,500,000 ≈ 40X coverage). While “throwing away” half the data here may still feel a bit uncomfortable, it’s always important to think about what we’re doing with it when making these decisions. Here, we are trying to assemble a good genome. If we can successfully do that with less “good data”, that’s better than doing it with more “bad data” – especially when bad sequence data could severely inhibit our assembly efforts.
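If you want to double-check this kind of back-of-the-envelope math at the command line, bash integer arithmetic works fine (the read-pair counts below are the approximate numbers from the two Trimmomatic runs above):
# ~350,000 pairs * 300 bps per pair, divided by an ~8.5 Mbp genome
echo $(( 350000 * 300 / 8500000 ))    # prints 12, i.e. ~12X coverage
# ~1,200,000 pairs * 280 bps per pair, divided by an ~8.5 Mbp genome
echo $(( 1200000 * 280 / 8500000 ))   # prints 39, i.e. ~40X coverage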
And just for a peek at the FastQC output after our trimming:
fastqc BCep_R1_paired.fastq.gz BCep_R2_paired.fastq.gz -t 6
These are the forward reads, the reverse looked basically the same. Things still don’t look perfect, but they look much cleaner than before – now our interquartile boxes (yellow) are much more snuggly sitting up top telling us our distribution of qualities across the ends of the reads is much better.
Even though we’ve done our best to quality trim and filter our reads, there can still be errors in there that may hurt our assembly attempts. Conceptually, read-error correction tools try to find bases that have a lot of coverage where the base is the same, but maybe in a tiny fraction of reads that base is different, and they will change that base to be what the others are. This paper by Heydari et al. goes a bit into trying to systematically evaluate Illumina read-error correction tools, which might be a good place to start if you’d like to look into this more. In our experience employing a read-error correction step seems to consistently improve assembly, but as with most of these things, that isn’t necessarily always the case – which is discussed a bit in the Heydari et al. paper. Here we’ll be trying 4 different assemblies, all with error-corrected reads, but feel free to experiment with that too 🙂
The read-error correction we are going to use here is employed by the SPAdes assembler, and is called BayesHammer (initial publication here if you’d like to look into more sometime). So even though we want to try different assemblies, we are still going to run the error-correction step within SPAdes first, and take the error-corrected reads from that to be used with whatever assemblers we try. If we run the SPAdes program with the --error-correction-only
flag specified, it will stop after that point and not go into assembly.
NOTICE
This was the most computationally intensive step, taking about 45 minutes with the resources specified in the following command (which is more than are available on our instances). So for the sake of this tutorial, instead of running this command (which is commented out below), we are just going to copy over the result files in the next code block:
## DON'T RUN THIS CODE BLOCK; WE WILL COPY OVER THE RESULTS ##
# spades.py -1 BCep_R1_paired.fastq.gz -2 BCep_R2_paired.fastq.gz \
# -o spades_error_corrected_reads -t 50 -m 500 \
# --only-error-correction
CODE BREAKDOWN
-1
and-2
- specify the input forward and reverse reads
-o
- specifies the output directory where the results will be stored
-t
- specifies how many threads will be used (“threads” are complicated)
-m
- specifies the maximum amount of memory to possibly be used in gigabytes; program would quit if it needed more than that (our instances have 16 GB available to them)
--only-error-correction
- telling SPAdes to only run the error-correction step, and not perform an assembly
And here we are going to copy over the result files:
cp ../downloaded_results/BCep_R?_err_corr.fq.gz .
Now that we have our reads quality filtered and we have run an error-correction step, we’re ready to move on to assembling them! There are lots of assembly programs out there, and once again, there is no one-size-fits-all. The reason each new assembly paper shows it doing better than all the others is not because everyone is lying about things, it’s just that the data have a lot to say about which assembler is going to work the “best” – and what’s “best” is not really a straightforward criterion to shoot for either.
Two of the most commonly used prokaryotic assemblers today are SPAdes and MEGAHIT. SPAdes uses much more memory than MEGAHIT, so it is often more suitable for working with one or a few genomes (like from an isolate or enrichment culture). But if working with high-diversity metagenomic samples, sometimes the memory requirements for SPAdes get too high, and MEGAHIT (which uses much less memory) can handle things. This is not to say SPAdes is always better when both can be run, it’s not necessarily.
When working with a new dataset, it’s not uncommon to generate a few assemblies testing different programs with different parameters and then compare the results to try to feel a little confident we are doing the best that can currently be done with the data.
Here we’ll run a couple with SPAdes and a couple with MEGAHIT.
NOTICE
We’re going to be comparing a total of 4 different assembly variations here, with each taking about 5-10 minutes to run. So for the sake of time while doing this tutorial, we will only run the first one together, and for the rest we’ll copy over the results like we did with the error-correction step above.
As mentioned above, SPAdes tends to do well with isolate or enrichment cultures that aren’t too diverse. Since this is an isolate culture that was sequenced, it should run just fine. We’re going to try one SPAdes run with default settings and then one with an additional option set, for a total of 2 assemblies from SPAdes.
Let’s first run a SPAdes assembly with default settings. Note that we are providing the --only-assembler
flag because we already ran the read-error correction step above, and we are using the outputs from that here as the input (takes < 5 min.):
spades.py -1 BCep_R1_err_corr.fq.gz -2 BCep_R2_err_corr.fq.gz \
-o spades-default-assembly/ -t 6 --only-assembler
While that’s running, let’s talk about some of the others 🙂
In the SPAdes documentation, they have this note suggesting specific assembly settings when using 2x150 paired-end Illumina data as we are here. Most of the suggestions there are defaults in the version we are using, and were run with our command above, but one (running things in --careful
mode) was not. This tries to find and fix mismatches after the initial assembly is finished. Here’s how we would run one with careful mode on (though again we’re skipping it for time and will copy over the results once our other assembly finishes):
## DON'T RUN THIS CODE BLOCK; WE WILL COPY OVER THE RESULTS ##
# spades.py -1 BCep_R1_err_corr.fq.gz -2 BCep_R2_err_corr.fq.gz \
# -o spades-careful-assembly -t 6 --only-assembler \
# --careful
Next we’ll run some assemblies with MEGAHIT.
Here’s how we could run one with default MEGAHIT settings:
## DON'T RUN THIS CODE BLOCK; WE WILL COPY OVER THE RESULTS ##
# megahit -1 BCep_R1_err_corr.fq.gz -2 BCep_R2_err_corr.fq.gz \
# -o megahit-default-assembly -t 6
The MEGAHIT documentation has an assembly-tips page that notes here that the default settings are largely tuned for metagenomic assembly, and that for a generic assembly (like our case here with an isolate) when there is greater than 40X coverage it is suggested to set the --min-count
parameter (which deals with the frequency of kmers and filtering out reads with unique kmers) to 3
. The coverage we estimated was around 40X, so let’s also try a run with that parameter set:
## DON'T RUN THIS CODE BLOCK; WE WILL COPY OVER THE RESULTS ##
# megahit -1 BCep_R1_err_corr.fq.gz -2 BCep_R2_err_corr.fq.gz \
# -o megahit-min-count-3-assembly/ -t 6 --min-count 3
Once our first assembly finishes, let’s copy over the results for the other 3 with this code block:
cp -r ../downloaded_results/assemblies/spades-careful-assembly/ spades-careful-assembly/
cp -r ../downloaded_results/assemblies/megahit-default-assembly/ megahit-default-assembly/
cp -r ../downloaded_results/assemblies/megahit-min-count-3-assembly/ megahit-min-count-3-assembly/
Now that we have a handful of assemblies, let’s see how they compare 🙂
Let’s just get this out there right off the bat, there is no individual metric that exists to determine if we have a good assembly or not, especially if we have no reference, and especially especially if we’re working with a metagenomic assembly.
There are, however, some general things we can look at, like N50 (half of the total assembly size can be found in contigs >= this size) or largest contig, or fraction of reads that successfully recruit to your assembly, or how many genes we can identify, and more. But: 1) these don’t really have any context to tell us if they’re “good” or not unless we’re comparing multiple assemblies of the same data; and 2) it’s possible to have an assembly with “worse” overall summary statistics compared to some other assembly, but that may actually enable us to do whatever it is we want to do better than the assembly with more appealing summary metrics. So, as usual, we have to keep in mind what our goals are, and know that picking the “best” assembly we can get out of our data is not necessarily a trivial or straightforward task. Having a reference genome like we do in this case makes things a lot easier as we’ll see next, and then we’ll talk a little bit about what we could do if we didn’t have a reference 🙂
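QUAST (coming up next) calculates metrics like N50 for us, but as a sketch of what that number actually is, here is one rough way to compute N50 from a fasta of contigs with standard shell tools (the file path is just one of the assemblies we compare below):
# print each contig's length, sort longest first, then report the length at which the
# running total first reaches half of the total assembly size (i.e. the N50)
awk '/^>/{if(len) print len; len=0; next}{len+=length($0)}END{if(len) print len}' spades-careful-assembly/contigs.fasta \
  | sort -rn \
  | awk '{lens[NR]=$1; total+=$1} END{half=total/2; run=0; for(i=1;i<=NR;i++){run+=lens[i]; if(run>=half){print lens[i]; exit}}}'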
QUAST is a really nice tool for comparing multiple assemblies, and for metagenome assemblies there is a comparable MetaQUAST. We can provide QUAST with all of our assemblies, a fasta file of our reference genome, and a .gff (general feature format) file of our reference genome that contains information about its genes. Then QUAST will compare our assemblies to the reference, and also generate several reference-independent summary statistics. Our two reference files for our Burkholderia cepacia ATCC 25416 came from NCBI here. And here’s how we can run QUAST (takes < 1 min.):
quast -o quast-B-cep-out -r reference_genome/BCep_ref.fna \
-g reference_genome/BCep_ref.gff -t 6 -m 1000 \
-l "SPAdes-default, SPAdes-careful, MEGAHIT-default, MEGAHIT-min-count-3" \
spades-default-assembly/contigs.fasta \
spades-careful-assembly/contigs.fasta \
megahit-default-assembly/final.contigs.fa \
megahit-min-count-3-assembly/final.contigs.fa
CODE BREAKDOWN
Remember the backslashes (
\
) don’t have anything to do with the actual command, they are just there so this is not one long line and we can copy and paste (otherwise the “enter” character that comes after them would try to execute the command on each line when pasted into our terminal).
-o quast-B-cep-out
- setting the output directory
-r reference_genome/BCep_ref.fna
- providing a fasta file of our reference genome
-g reference_genome/BCep_ref.gff
- providing the gene coordinates of our reference (this is so QUAST can see if it finds them in our assemblies)
-l "..."
- within these quotes are labels for each of the input assemblies (so the output has something cleaner than file names); these need to be provided here in the same order in which their input fasta files are entered next as positional arguments
at the end are 4 positional arguments for our 4 input assemblies
There is more information about QUAST in its documentation here. The output directory contains lots of information, but it is also nicely summarized in an html file called “report.html” (we can access this from here for now). Here’s a portion of it:
The columns here hold information about each of our 4 assemblies, and each row presents a different metric. The majority of the rows starting from the top are in relation to the reference we provided, then the last few starting with “# contigs” are reference-independent. In the interactive html page, we can highlight the row names to get some help on what they mean, and there is more info in the QUAST manual. The cells are shaded across the assemblies for each row from red to blue, indicating “worst” to “best”, but this is only a loose guide to help your eye, as differences can be negligible or up to interpretation, and some rows are more important than others.
It seems all of them did pretty well. They each reconstructed over 98% of the reference genome, which is pretty stellar, but none of them got 100%. This is common when using solely short Illumina reads because they aren’t able to assemble repetitive regions that extend longer than the paired-read fragment length. We can also see that our assemblies aligned across about 8,180 genomic features (genes, transcripts, tRNAs, and such) out of the 8,418 that are annotated in the reference genome. And looking at mismatches per 100 kbp we can see we’re down around 2-4 mismatches per 100 kbp.
QUICK QUESTION! What might be causing the 2-4 mismatches per 100 kbps?
Solution
This could be a mix of sequencing and/or assembly error, but it is also very likely some actual biological variation too. There is no reason to expect the genome to be exactly identical to the reference even after just 1 culturing transfer.
Moving down to the reference-independent section in the table we can see which assemblies cover the most bps with the fewest contigs. This doesn’t mean everything either, but fewer contigs covering about the same amount of bps does provide more information on synteny, which can be very helpful for functional annotation.
Another metric to assess the quality of an assembly is to see how well the reads that went into the assembly recruit back to it. It’s sort of a way of checking to see how much of the data that went in actually ended up getting used. This tends to be more informative with mixed community metagenomes – e.g. if one assembly enables recruitment of 85% of our input reads, and another enables only 10% of the input reads to recruit, it is safe to say the first one does a better job of representing what was recovered with our metagenomic sequencing. But in this case – with an isolate genome when the few assemblies tested performed pretty similarly – it turns out that they all pretty much recruited reads just as efficiently (> 99% with the default settings of bowtie2 v2.2.5 at least).
As far as selecting which of our assemblies to move forward with, since they all performed reasonably well, mostly for the reason of maximizing insight into synteny as mentioned above, based on the QUAST output we might choose to move forward with the SPAdes assembly done with the --careful
setting as that gave us the most genes on the fewest contigs.
It’s pretty nice here because we have a known reference genome. When that’s not that case, it’s much harder to be confident we are making the best decisions we can. If we know something about the organism, even if we don’t have an ideal reference, we might have a rough idea of what size genome to expect. We can use the detected copy numbers of genes known to most often be present in single-copy in its nearest relatives to get a rough estimate of percent completion (and percent redundancy which gives us one window into if there may be contamination).
Also with no reference, we might go further down the path of processing and analysis with multiple assemblies. Some might be better than others depending on our particular purpose, and that might not always be apparent based on overall summary statistics. For instance, if recovering representative genomes from a metagenome is the goal, we might want to start that process and see which assembly is better suited for that. As mentioned above, in some cases it is possible to be able to better recover representative genomes from an assembly that has seemingly worse overall summary statistics. And if we just chose based on the assembly-level summary statistics, we wouldn’t find that out. Or, if our question is more about the functional potential of the whole community, or maybe looking for specific genes in a metagenomic assembly, then seeing which assembly yields the prediction of more open-reading frames, or better annotation results, might help us decide. The bottom line is it’s difficult, and there is no one answer, but we do the best we can 🙂
That’s as far as we can go for now. Some more things to do to check the quality of this assembly and begin exploring it – along with a mention of some common tasks after we’re happy with our new genome – are presented in this lesson 🙂
Here we’re going to take off from a de novo prokaryotic genome assembly performed in this lesson. That involved logging into our JetStream instances as noted here and setting up a specific conda environment as noted here. If you didn’t do that lesson, be sure to set up that conda environment as shown there. If you did, then be sure to activate that conda environment with the following after logged in:
conda activate de_novo_example
The data used here is described here. If you did not do the previous lesson, the following code block will download the data and copy over the required assembly. But otherwise you will have all of this already:
cd ~
curl -L https://ndownloader.figshare.com/files/16197626 -o genomics_de_novo_temp.tar.gz
tar -xzvf genomics_de_novo_temp.tar.gz
rm genomics_de_novo_temp.tar.gz
cp -r ~/genomics_de_novo_temp/downloaded_results/assemblies/spades-careful-assembly/ ~/genomics_de_novo_temp/working/spades-careful-assembly/
Now let’s change directories into our working directory:
cd ~/genomics_de_novo_temp/working/
Now that we’ve selected an assembly we’re going to move forward with, we can start to take a closer look at it. To do that we’re going to use a wonderful platform called anvi’o - which stands for analysis and visualization of omics data.
anvi’o is very expansive, but to try to summarize it in a sentence, it is a powerful and user-friendly data visualization and exploration platform. There is a ton of excellent documentation at the anvi’o site, including workflows, tutorials, and blog posts.
Here we’re going to put our isolate-genome assembly into the anvi’o framework and see just a few of the ways it can help us begin to look at our assembled genome - including a visual way of checking for possible contamination (which in this case would be anything in there that’s not our target Burkholderia cepacia). Many of the major steps we’re going to be performing here are laid out in the metagenomic workflow presented here, as much of the processing is the same even though we are not working with a metagenome here.
For us to get our assembly into anvi’o, first we need to generate what it calls a contigs database. This contains the contigs from our assembly and some information intrinsic to them (like GC content, tetranucleotide frequency, and open-reading frame coordinates). The following command will organize our contigs in an anvi’o-friendly way, generate some basic stats about them, and use the program Prodigal to identify open-reading frames (takes < 2 min.):
anvi-gen-contigs-database -f spades-careful-assembly/contigs.fasta \
-o contigs.db -n B_cepacia_assembly
Now that we have our contigs.db
that holds our sequences and some basic information about them, we can start adding more information. This is one of the places where the flexibility comes into play, but for now we’ll just move forward with some parts of a general anvi’o workflow, including:
using the program HMMER with profile hidden Markov models (HMMs) to scan for bacterial single-copy genes (from Campbell et al. 2013)
if new to HMMs, see the bottom of page 7 here for a good explanation of what exactly a “hidden Markov model” is in the realm of sequence data
to summarize, instead of treating all columns of an alignment equally (like something like BLAST does), HMMs can weight individual columns in an alignment differently. So the highly-conserved stretches within, say, a conserved domain of a protein will matter more than a stretch in between two domains that may vary more.
using NCBI COGs to functionally annotate the open-reading frames Prodigal predicted
using a tool called centrifuge for taxonomic classification of the identified open-reading frames
generate coverage information for our assembly so we can visualize it
The following block of code would take ~45 minutes to run, mostly because of needing to download and setup some files for their first use. So we are going to skip this code block, copy over the results, and pick up with mapping to generate the coverage information.
## DON'T RUN THIS CODE BLOCK; WE WILL COPY OVER THE RESULTS ##
# HMM searching for bacterial single-copy genes
# anvi-run-hmms -I Campbell_et_al -c contigs.db -T 6
# functional annotation with DIAMOND against NCBI's COGs
# anvi-setup-ncbi-cogs -T 6 # only needed the first time
# anvi-run-ncbi-cogs -c contigs.db --num-threads 6
# exporting Prodigal-identified open-reading frames from anvi'o
# anvi-get-sequences-for-gene-calls -c contigs.db -o gene_calls.fa
# setting up centrifuge for taxonomy (only needed the first time)
# wget ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data/p_compressed+h+v.tar.gz
# tar -xzvf p_compressed+h+v.tar.gz && rm -rf p_compressed+h+v.tar.gz
# running centrifuge taxonomy (this taxonomically classifies each identified coding sequence)
# centrifuge -f -x p_compressed+h+v gene_calls.fa -S centrifuge_hits.tsv -p 6
# importing the taxonomy results into our anvi'o contigs database
# anvi-import-taxonomy-for-genes -p centrifuge -c contigs.db \
# -i centrifuge_report.tsv \
# centrifuge_hits.tsv
Let’s copy over the new “contigs.db” that has the information stored from running those commands:
cp ../downloaded_results/anvio/contigs.db .
The last thing we want to add right now is mapping information from recruiting our reads to the assembly we built with them. Here we generate that mapping information with bowtie2 (another aligner, like the bwa
we used earlier) and prepare the output file for anvi’o (altogether takes < 5 min.):
# building bowtie index from our selected assembly fasta file
bowtie2-build spades-careful-assembly/contigs.fasta \
spades-careful-assembly.btindex
# mapping our reads (takes ~1 min.)
bowtie2 -q -x spades-careful-assembly.btindex \
-1 BCep_R1_err_corr.fq.gz -2 BCep_R2_err_corr.fq.gz \
-p 6 -S spades-careful-assembly.sam
# converting to a bam file (takes < 1 min.)
samtools view -bS spades-careful-assembly.sam > B-cep-assembly.bam
# sorting and indexing our bam file (takes < 1 min.)
anvi-init-bam B-cep-assembly.bam -o B-cep.bam
We can now integrate this mapping information into anvi’o with the anvi-profile
program, which generates another type of database anvi’o calls a “profile database” that holds extrinsic information about our contigs. For our purposes right now, this is mostly about storing coverage information for each contig. We’ll see what that looks like and why it’s helpful soon!
# integrating the read-mapping info into anvi'o (takes ~2 min.)
anvi-profile -i B-cep.bam -c contigs.db -M 1000 -T 6 --cluster-contigs -o B-cep-profiled/
Ok, great, so we've just generated and put quite a bit of information about our assembly into our contigs and profile databases. And this being anvi'o, it's now very easy to access specifics of that information when we want it (you can see all of the available programs by typing anvi- and hitting tab twice).
For example, one of the commands we just ran, anvi-run-hmms, searched for bacterial single-copy genes. We can ask anvi'o which genes were identified using anvi-get-sequences-for-hmm-hits (adding the flag -h will tell us all about how to use the program), including adding the -l flag to see which HMM sets we have:
anvi-get-sequences-for-hmm-hits -c contigs.db -l
Which shows us "Campbell_et_al", which we specified above. And adding the -L flag will list the names of the genes in that set:
anvi-get-sequences-for-hmm-hits -c contigs.db --hmm-sources Campbell_et_al -L
And say we want to pull out the “RecA” gene, which is a commonly targeted bacterial single-copy gene involved in DNA repair:
anvi-get-sequences-for-hmm-hits -c contigs.db --hmm-sources Campbell_et_al \
--gene-names RecA \
--get-aa-sequences \
--no-wrap -o RecA.faa
If we print out what's in that file with cat:
cat RecA.faa
If we copy it, paste it into NCBI's protein BLAST here, and BLAST it, we can see that the top hit is 100% identical across the entire query to B. cepacia, and all top hits are to the same species (which is a nice little sanity check).
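If we'd rather stay in the terminal, one hedged alternative is to send that same file to NCBI from the command line with blastp's -remote option (this is just a sketch: it assumes the BLAST+ command-line tools are available in our current environment, and remote searches can be slow depending on server load):
# hypothetical remote protein BLAST of our RecA sequence (requires the BLAST+ tools)
blastp -query RecA.faa -db nr -remote \
       -max_target_seqs 5 -outfmt "6 qseqid sseqid pident length evalue bitscore" \
       -out RecA-blastp-to-nr.tsv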
We can also generate some summary statistics on our assembly, including estimated percent completion and redundancy based on the presence/absence of the single-copy marker genes we scanned for above. Here are the two steps needed to summarize our isolate-genome assembly:
# this is adding all contigs to a group called "DEFAULT"
anvi-script-add-default-collection -p B-cep-profiled/PROFILE.db
# and here is our summary command
anvi-summarize -c contigs.db -p B-cep-profiled/PROFILE.db \
-C DEFAULT -o B-cepacia-assembly-summary/
A lot was generated with that, now found in our new directory, "B-cepacia-assembly-summary/". If we glance at the "bins_summary.txt" file, we can see some summary statistics of our assembled genome, including the completion/redundancy estimates (column -t will keep the columns lined up for us):
column -t B-cepacia-assembly-summary/bins_summary.txt
## output from above command ##
# bins taxon total_length num_contigs N50 GC_content percent_completion percent_redundancy
# EVERYTHING Burkholderia 8472061 92 194480 66.44548680306582 99.28057553956835 1.4388489208633093
The second column shows us that the majority of taxonomy calls by Centrifuge were for the right genus, which is always nice. Also, based on the Campbell et al. 2013 bacterial single-copy marker genes, our assembly is estimated to be ~99.3% complete with ~1.4% redundancy. This approach rarely, if ever, adds up to exactly 100% completion and 0% redundancy in any organism; for comparison's sake, running the ATCC 25416 reference genome through the same pipeline gives the same estimates with this single-copy gene set. We are still ~130 kbp short of the reference, though, as we saw with the QUAST output above.
But great, things look pretty much as they should so far. We can also visually inspect our assembly and how the reads that went into it recruit to it. In theory, if all the DNA in the assembly came from the same organism (i.e. it's a clean assembly), there should be pretty even coverage across the whole thing. And if there is contamination in there (something other than our target B. cepacia), it will likely have very different coverage. So let's finally take a look with anvi-interactive.
To be able to visualize our work from the server, we need to ssh in slightly differently than we did earlier. So now, let's type exit in our terminal to disconnect, press the up arrow to bring up our last login command, and modify it to look like this:
ssh -L 8080:localhost:8080 dibada@XXX.XXX.XXX.XXX
Where the -L 8080:localhost:8080 part stays exactly the same as it appears above, but dibada should be your room name, and the IP address at the end should be yours. You will also need to enter the password again.
Next we need to re-activate our conda environment and change back into our working directory:
conda activate de_novo_example
cd ~/genomics_de_novo_temp/working/
Next, let’s copy-and-paste this command to get the link we will need to view our anvi’o interactive interface:
echo "http://$(hostname -i):8080"
Now let’s start up the anvi’o interactive interface:
anvi-interactive -c contigs.db -p B-cep-profiled/PROFILE.db \
--server-only -P 8080
Now if we copy the link that was generated above and paste it into our browser, after the page loads and we click “Draw” in the bottom left corner, it should look something like this:
So there is a lot going on here at first glance, especially if we’re not yet familiar with how anvi’o organizes things. So here’s a quick crash course 🙂
At the center of the figure is a hierarchical clustering of the contigs from our assembly (here clustered based on tetranucleotide frequency and coverage). So each tip (leaf) represents a contig (or a fragment of a contig as each is actually broken down into a max of ~20,000bps, but for right now we’ll just be referring to them as contigs). Then radiating out from the center are layers of information (“Parent”, “Length”, “GC content”, etc.), with each layer displaying information for each contig.
One of the first things that might jump out here is the outer layer (purple in this image, but may be different on yours), labeled “Taxonomy”. There is actually a color for each contig for whatever taxonomy was assigned to the majority of genes in that particular contig. This solid bar all around tells us that the genes in almost the entire assembly were identified as Burkholderia – minus two contigs which were not classified as anything (and have no color there). The next thing that stands out is how stable the mean coverage is across almost all contigs, other than that area near the top in this image (yours may be oriented differently too). That likely holds things found in multiple copies in the genome. For instance, according to IMG, B. cepacia ATCC 25416 has 7 16S copies and 9 23S copies, which would complicate assembly if they aren’t all identical, and would inflate their coverage compared to the parts of the genome that exist in single copy. Overall this is great and shows the culture really seems to have been axenic.
Just for a quick comparison, here is the same type of figure (taxonomy at the inner-part of the circle here), but from an enrichment culture of a cyanobacterium (so “contaminant” genomes in there too), rather than axenic:
Here the highlighted contigs, labeled "Bin 1", represent the targeted cultivar from this sequencing run, demonstrating a nice example of how anvi'o can help you manually curate bins you're trying to derive from assemblies.
While we didn’t need much (any) manual curation in this case, it was still a good idea to visually inspect the coverage of our assembly to make sure nothing weird was going on. And if we wanted, we could further explore those parts with higher coverage to find out which parts of the genome seem to exist in greater than 1 copy.
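If we did want a quick, non-graphical peek at that, one option (a minimal sketch using the sorted, indexed bam file we made above) is samtools idxstats, which reports how many reads mapped to each contig:
# per-contig mapping summary from our sorted, indexed bam
# columns: contig name, contig length, # mapped reads, # unmapped reads
samtools idxstats B-cep.bam | head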
This and the prior lesson were basically all about getting a high-quality draft of our isolate genome that we could feel confident about investigating further. Once we feel comfortable with our assembled genome, we can go a lot of different ways. Going into individual approaches is beyond the scope of this particular page, but here are just a few examples.
Pull available reference genomes of close relatives and build a phylogenomic tree to get a robust estimate of where our newly acquired isolate(s) fit in evolutionarily with what is already known.
Pull available metagenomes from studies and recruit the reads to a reference library containing our isolate (and its close relatives if it has any) to begin assessing their distributions across samples. This example is done with ocean samples, but the same principle can be applied to any environments.
Start investigating differences in the genetic complement of our new isolate as compared to its known close relatives. This example figure is combining pangenomics (the core of the figure showing the presence or absence of genes within each genome) with metagenomics (distributions of the genomes across samples in the top right corner) to try to associate genomic variability with ecological delineations:
And so much more! This is where just pure data crunching slows down, and the actual science begins. The above are just some of the ways to get to the point where we can then consider our experimental design and our questions and let them guide where we go next 🙂
In many of our lessons we have been working with RNAseq data from Saccharomyces cerevisiae. In the transcriptome assembly lesson, which we didn't run through this year (yet, at least), we would have assembled these RNA reads into contigs using Trinity, a popular transcriptome assembler. Here we will compare the results of that assembly to the reference RNA sequences from NCBI (specifically the GCF_000146045.2_R64_rna.fna.gz file from here) while getting some more practice at the command line!
Explore some more bash commands in practical usage examples (e.g. grep, sed, cut, comm, cat)
Practice stringing together multiple commands with pipes (|) step-by-step
Use blast to compare an assembled transcriptome to its reference transcriptome
Run a “remote” blast from the command line
You should still have your Jetstream instance running; you can follow the instructions here to log in to Jetstream and find your instance. Then ssh into it following the instructions here.
We will be using blast here, and need one more package for being able to BLAST remotely (send sequences to their servers from the command line). So let’s make a new environment, install those, and enter our new environment:
conda create -y -n blast_env -c conda-forge -c bioconda blast gnutls
conda activate blast_env
Let’s set up a new working directory and download the files we’ll be working with:
cd ~
mkdir more-unix-fun/
cd more-unix-fun/
curl -L https://ndownloader.figshare.com/files/16200176 -o our-transcriptome-assembly.fa
curl -L https://ndownloader.figshare.com/files/16218335 -o ref-transcripts.fa
Out of curiosity, let's see how many contigs we assembled vs. how many transcripts there are in the reference we have:
grep -c ">" our-transcriptome-assembly.fa
grep -c ">" ref-transcripts.fa
QUICK QUESTION! What might be some of the reasons for the large difference between the number of transcripts we assembled and the number of reference transcripts?
Possible Causes (this is not an exhaustive list of possibilities):
- not all transcripts may have been expressed at the time of sampling
- even expressed transcripts at low relative abundance may not have been amplified and sequenced
- not all transcripts that were expressed and sequenced may have assembled successfully
Even though we assembled fewer transcripts than the reference holds, let’s use BLAST and the command line to try to find out if all of what we did assemble can actually be found in our reference!
Here we are going to make our blast database using our original reference fasta. Then we are going to blast the assembled transcripts against it.
makeblastdb -dbtype nucl -in ref-transcripts.fa -out ref-transcripts.blastdb
CODE BREAKDOWN
makeblastdb
- this is our command to make a blast database out of our reference fasta file
-dbtype nucl
- here we are specifying that the sequences are nucleotides
-in
- specifying the input file
-out
- naming the prefix of the output database files
Now we're going to try to align our assembled contigs against the reference. There are a lot of options for blastn that can be seen by running blastn -help, and there's an explanation of what we've done here in the following code breakdown block.
blastn -query our-transcriptome-assembly.fa -db ref-transcripts.blastdb \
-max_target_seqs 1 -max_hsps 1 \
-out assembly-to-ref-blastout.tsv \
-outfmt "6 qseqid qlen sseqid slen length pident evalue bitscore"
CODE BREAKDOWN
blastn
- this is our base command, to blast nucleotide queries against a nucleotide database (that’s what a blastn is)
-query
- specifies our input fasta
-db
- specifies our database we just made
-max_target_seqs 1
- sets the maximum number of reference sequences returned to 1
-max_hsps 1
- sets the maximum number of high-scoring pairs (HSPs) returned to 1 (since BLAST is a local aligner, you can get more than one alignment returned between one query and one reference)
-out
- specifies name of output file
-outfmt
- specifies the output format
Now let's take a look (remember we can exit less by pressing the q key):
column -t assembly-to-ref-blastout.tsv | less -NS
Code breakdown:
column
– the column command with the -t flag will attempt to keep columns together based on the tabs. This can be convenient sometimes.
less
– since there are a lot of lines here, we are piping (|) the output into less
– the -NS flags we are providing add line numbers and prevent lines from soft-wrapping at the edge of the terminal
Let's see how many of our assembled contigs successfully aligned to the reference fasta.
wc -l assembly-to-ref-blastout.tsv
Remember that wc -l tells us how many lines there are in the file. 3,262 of our contigs successfully aligned to the reference. But we had a total of 3,323 contigs from our assembly, as we saw with grep -c ">" our-transcriptome-assembly.fa, so some didn't successfully align to the reference.
Here is one way we can do a quick calculation at the command line:
echo "3323-3262" | bc
This tells us we have 61 contigs from our assembly that did not successfully align to the reference. Let’s find out what they are!
Here are the steps we’ll take to get the sequences that didn’t align, and then try to find out what they are:
get all the names of the contigs from our assembly
get all the names of the contigs from our assembly that were reported as “hits” in the blast output
compare these to figure out which contigs from our assembly are not in the list of contigs reported as successfully aligning to the reference
use this new list of contig names to pull out their sequences in fasta format
blast the sequences against NCBI’s nr database “remotely” (from the command line, sending our sequences to the NCBI servers)
grep ">" our-transcriptome-assembly.fa | tr -d ">" | cut -f1 -d " " | sort > all-assembly-contig-IDs.txt
Code breakdown:
grep
- just like we used grep above to count how many sequences we had by providing the -c flag, here we are leaving off the -c flag so that it will pull out the lines that match. Since fasta format dictates the only place the ">" character appears is in front of sequence headers, we can pull out all header lines with that.
tr
- we then "pipe" (|) that output into a command called tr, which is useful for manipulating individual characters. Here we are using it to delete all of the ">" characters (so our names are just names without the ">" in front of them, like they are in the blast output file).
cut
- we then pipe that into the cut command, which is good for manipulating columns. Here, we're going to use it to cut the first column (-f1), setting the delimiter to a space (-d " ")
- this is because blast automatically trims our sequence headers at the first space (keeping only the ID), and we want to match that format
- you can see this by running head assembly-to-ref-blastout.tsv and head our-transcriptome-assembly.fa
sort
- we use sort here because the command we're going to use later to compare our two lists of headers needs them to be sorted in the same fashion, and running this on both will ensure that
>
- we then redirect the output to a new file called "all-assembly-contig-IDs.txt"
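As a quick, optional sanity check on that header-formatting point, we could eyeball one header from each file side by side:
# first header in our assembly fasta (the ID is followed by " len=... path=...")
grep ">" our-transcriptome-assembly.fa | head -n 1
# first query ID in the blast output (just the ID, nothing after it)
head -n 1 assembly-to-ref-blastout.tsv | cut -f 1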
cut -f1 assembly-to-ref-blastout.tsv | sort > all-assembly-contig-IDs-with-hits.txt
Code breakdown:
cut
– here we are using cut to cut the first column (the sequence name) from the blast output file
– note that here we didn't need to provide the -d flag to set the delimiter, because by default cut sets the delimiter to tabs
sort
– we then pipe the output from cut into sort
– as mentioned above, comparing the two lists of names the way we are going to below requires that the lists be sorted in the same fashion, and running sort on both of them ensures that
>
– we then redirect the output to a new file called "all-assembly-contig-IDs-with-hits.txt"
We can use the comm command (compare, with an extra "m" for some reason…) to quickly find out which sequences we assembled that didn't successfully align to the reference transcriptome.
comm -23 all-assembly-contig-IDs.txt all-assembly-contig-IDs-with-hits.txt > all-assembly-contig-IDs-that-did-not-hit-ref.txt
Code breakdown:
comm
– this command compares the lines in two files and by default returns 3 columns:
The lines unique to file 1 (first file given)
The lines unique to file 2 (second file given)
The lines common to both files
– looking at the manual for it with man comm, we can see that you can suppress columns by providing them as arguments. So by providing the flags -23 we are telling comm to suppress columns 2 and 3; all we want are the lines (contig names) in file 1 that are not in file 2. This gives us all the contig names that we assembled but that did not successfully align to the reference.
>
– we then redirect the output to a new file called "all-assembly-contig-IDs-that-did-not-hit-ref.txt"
And this should hold the 61 contig names that we’re looking for (sanity check):
wc -l all-assembly-contig-IDs-that-did-not-hit-ref.txt
Stellar 🙂
Now let's use grep to pull out the sequences!
NOTE: This requires that the fasta file is in “true” fasta form – meaning that each entry (sequence) takes up two lines. Sometimes, fasta files are formatted with what are known as softwraps. There is a one-liner at the end of this page to convert them if that is the case with your data.
If we wanted to pull one sequence, we could do it like this:
grep -A1 "TRINITY_DN501_c0_g1_i1" our-transcriptome-assembly.fa
And to pull out all 61 of them, we can loop through each name in the file we just made:
for header in $(cat all-assembly-contig-IDs-that-did-not-hit-ref.txt)
do
grep -A1 "$header" our-transcriptome-assembly.fa
done
Code breakdown: The special words of the loop are for, in, do, and done.
for
– here we are specifying the variable name we want to use ("header"); this could be anything
in
– here we are specifying what we are going to loop through; in this case it is every line of the "all-assembly-contig-IDs-that-did-not-hit-ref.txt" file
$(cat ...)
– this is a special type of notation in shell. The operation within the parentheses is performed and the output of that replaces the whole thing. It would be the same as if we typed out each line in the file with a space in between them
do
– indicates we are about to state the command(s) we want to be performed on each item we are looping through
grep
– just like with the individual example above, for each sequence name in our file we are pulling out the header line and the following line containing the sequence
done
– tells the loop it is finished
This just printed to the screen for right now, but let’s grab one and take it to NCBI and do a web blast to see what we get.
>TRINITY_DN939_c0_g1_i1 len=900 path=[0:0-899]
TTCGACTATGATACTAATTCGCTGTGCACCCAGTGGGAGGATGTTATTTCGCTTACATGCCATATGCTGTATATTCTTGCGTTAACGGTTTCCGCGTGCGATCTCTCTTGTGTTCAGACGAGGCCCAATTGAGCACCATCCCCCTCGGGTAGTTTCCCGATCAAACTGGAAGATAGGCGTCTTTACTTACGCGCTCCTCTTGACCGAGACCCCCAATGCGCGATGTATCGAACCTTCACTAACCCTAGAAATTAGTGGTGGGAATCAGCGAAGTTACAATGTGGGGTTGGACCCAGGATGTTAGCCTGCAAGCTATACAATTCTCTTAGATTAGACGAGAACGGAGAATTTAACCCCTGCAGCATTGGAGGTATGGTCTTGGGCATACCCGATACATGCAACGCAGCTCGGGATGTTCATGGTAGCACCTAACTGTATGGCATAGTTATGCAGAAGTGCGCTGCTTAAGAGCGATACCCCATAAAGAACGATTTTGGTGGTATTGCCCAAAGATAATGTCCCACGTTATCATCTGGTCAACGATGAGGTGGGTTGTTTTGTGATTGTTTGAGATGCTGAGTGCTGTTTAATGCGGGACATAAGGAAGGATATTAGTAGGGAGAAACGCTTGATGCCGGAAATATCCTTGCCTGGTTAACTGCTCGAAGTTAATCTGCGACGCTCGCCCTCATTCGGATGCATCGAAGGGCTCCCCTGCAGTTGCAAAGTCTTTGTTCTGCGAACTCGTAAAGTCGTAATGCCGTTGGTGGACCGTGCTTGTTAGGGATATTAAATGTTTCCTGGCCTTTAAAGCTATTGGCACGGCGGTTTAGATGGGACACCCTATCTCGTTTTCTACTTGCGCTTCAAGCGTCCCAACGAAACGAAATTGCGGACCGG
Now let's write these all to a file instead of printing them to the screen. We're just adding the > redirector and a file name at the end of the loop to save the output.
for header in $(cat all-assembly-contig-IDs-that-did-not-hit-ref.txt)
do
grep -A1 "$header" our-transcriptome-assembly.fa
done > contigs-not-in-ref.fa
We can send our sequences to NCBI through the command line to blast them instead of doing it at their website. But in order to include taxonomic information with the output, we need to have another small blast database that holds that information.
We can download and unpack that database with these commands:
curl -L https://ndownloader.figshare.com/files/16219079 -o taxdb.tar.gz
tar -xzvf taxdb.tar.gz
Now we have the taxonomy database too so we can add taxonomy to our blast output. Let’s blast them!
nohup blastn -query contigs-not-in-ref.fa -remote -db nr \
-max_target_seqs 1 -max_hsps 1 \
-out contigs-not-in-ref-blast-to-nr.tsv \
-outfmt "6 qseqid qlen sseqid slen length pident evalue bitscore ssciname" &
CODE BREAKDOWN
The new things added here different from the previous blast are:
nohup
– this tells the computer not to stop the process if we disconnect from the server (very handy)
-db
– here we are specifying "nr", which is NCBI's "non-redundant" database (nucleotide in this case, which it knows because we used blastn)
-remote
– this tells the computer to send the BLAST job to the NCBI servers
&
– this at the end tells the computer to run the process in the background so we can still work in our terminal. We can check on the job with the command jobs. This will finish when it does, and it won't be interrupted if we disconnect or sign off before then.
We can press enter to get our regular prompt back.
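While we wait, here are a couple of ways to peek at the background job (this assumes nohup wrote its default nohup.out file in our current directory, which it does when we haven't redirected the command's screen output ourselves):
jobs            # lists our background jobs and whether they're still running
tail nohup.out  # any screen messages from the blast command end up in here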
This may take a few minutes (the amount of traffic the blast servers deal with fluctuates). So we can download a results file as follows and move on while that continues in the background:
curl -L https://ndownloader.figshare.com/files/16219238 -o contigs-not-in-ref-blast-to-nr-DL.tsv
The 9th column contains the taxonomy information, so we can use the cut command to look at just that.
cut -f 9 contigs-not-in-ref-blast-to-nr-DL.tsv | sort | uniq -c
CODE BREAKDOWN
cut
– this pulls out the specified column; here we're cutting column 9 (the taxonomy column, labeled "ssciname" by blast)
sort
– sorting is required in order for the following uniq command to function properly
uniq
– this removes all duplicates, leaving only single copies of the originals
– the -c flag provided to uniq also tells us how many of each were found
1 Bacillus subtilis subsp. subtilis str. NCIB 3610
2 Cloning vector pBAD-FLP
1 Methanocaldococcus jannaschii
33 Saccharomyces cerevisiae
1 Yeast expression vector pSCUDT2MFE6H
23 synthetic construct
We see there are quite a few synthetic constructs in here; if this were a real project, we might want to remove them and look further into the other sequences. And there are 33 S. cerevisiae sequences that didn't successfully align to our reference cDNA file.
QUICK QUESTION! Why might there be S. cerevisiae sequences assembled from our RNA that are not aligning to the reference cDNA file?
Possible causes: It's possible some of these might be non-coding RNAs, and therefore aren't in the cDNA reference. There may (probably are) other reasons too.
Isn’t Unix grand 🙂
One-liner that removes softwraps from fasta files:
Download and unzip example file with softwraps:
curl -LO ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_rna.fna.gz
gunzip GCF_000146045.2_R64_rna.fna.gz
Look at the file before removing softwraps:
head GCF_000146045.2_R64_rna.fna
Here's a command to fix this (it prints each header on its own line and joins the sequence lines of each entry onto a single line). It looks complicated, but it's easy to save somewhere for when needed, so no need to remember it:
awk '!/^>/ { printf "%s", $0; n="\n" } /^>/ { print n $0; n = "" } END { printf "%s", n }' GCF_000146045.2_R64_rna.fna > GCF_000146045.2_R64_rna_no_softwraps.fna
Looking at new file:
head GCF_000146045.2_R64_rna_no_softwraps.fna
Now we'd be able to grep out sequences if we wanted 🙂
Learning objectives
Learn about version control
Learn about Github repositories
Create local repositories
Backup your work online using git
First, you’ll need to sign up for a free account on GitHub.com. It’s as simple as signing up for any other social network. Keep the email you picked handy; we’ll be referencing it again in the lesson.
We will be following this lesson on our Jetstream instances, which come with Git installed. However, if you want to work on your project on your local computer, you need to have Git installed there too. Instructions to install Git for Windows, Mac, or Linux can be found here.
GitHub is a code hosting platform for version control and collaboration. It lets you and others work together on projects from anywhere. GitHub is now the largest online storage space of collaborative works in the world.
Why use something like Git? Say you and a coworker are both updating pages on the same website. You make your changes, save them, and upload them back to the website. So far, so good. The problem comes when your coworker is working on the same page as you at the same time. One of you is about to have your work overwritten and erased.
A version control application like Git keeps that from happening. You and your coworker can each upload your revisions to the same page, and Git will save two copies. Later, you can merge your changes together without losing any work along the way. You can even revert to an earlier version at any time, because Git keeps a “snapshot” of every change ever made.
Repository: A directory or storage space where your projects can live. Sometimes GitHub users shorten this to "repo." It can be local to a folder on your computer, or it can be a storage space on GitHub or another online host. You can keep code files, text files, image files, you name it, inside a repository.
Version control: Basically, the purpose Git was designed to serve. When you have a Microsoft Word file, you either overwrite every saved file with a new save, or you save multiple versions. With Git, you don't have to. It keeps "snapshots" of every point in time in the project's history, so you can never lose or overwrite it.
Commit: This is the command that gives Git its power. When you commit, you are taking a "snapshot" of your repository at that point in time, giving you a checkpoint to which you can reevaluate or restore your project to any previous state.
Branch: How do multiple people work on a project at the same time without Git getting them confused? Usually, they "branch off" of the main project with their own versions full of changes they themselves have made. After they're done, it's time to "merge" that branch back with the "master," the main directory of the project.
git init
: Initializes a new Git repository. Until you run this command inside a repository or directory, it’s just a regular folder. Only after you input this does it accept further Git commands.
git config
: Short for “configure,” this is most useful when you’re setting up Git for the first time.
git help
: Forgot a command? Type this into the command line to bring up the 21 most common git commands. You can also be more specific and type “git help init” or another term to figure out how to use and configure a specific git command.
git status
: Check the status of your repository. See which files are inside it, which changes still need to be committed, and which branch of the repository you’re currently working on.
git add
: This does not add new files to your repository. Instead, it brings new files to Git’s attention. After you add files, they’re included in Git’s “snapshots” of the repository.
git commit
: Git's most important command. After you make any sort of change, you input this in order to take a "snapshot" of the repository. Usually it goes git commit -m "Message here." The -m indicates that the following section of the command should be read as a message.
git branch
: Working with multiple collaborators and want to make changes on your own? This command will let you build a new branch, or timeline of commits, of changes and file additions that are completely your own. Your title goes after the command. If you wanted a new branch called "cats," you'd type git branch cats.
git checkout
: Literally allows you to "check out" a repository that you are not currently inside. This is a navigational command that lets you move to the repository you want to check. You can use this command as git checkout master to look at the master branch, or git checkout cats to look at another branch.
git merge
: When you're done working on a branch, you can merge your changes back to the master branch, which is visible to all collaborators. git merge cats would take all the changes you made to the "cats" branch and add them to the master.
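Putting those last few together, a minimal hypothetical branch-and-merge session (using the same example branch name "cats") might look like this:
git branch cats        # create a new branch called "cats"
git checkout cats      # move onto that branch
# ...edit files, then git add and git commit as usual...
git checkout master    # move back to the master branch
git merge cats         # bring the commits from "cats" into master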
git push
: If you’re working on your local computer, and want your commits to be visible online on GitHub as well, you “push” the changes up to GitHub with this command.
git pull
: If you’re working on your local computer and want the most up-to-date version of your repository to work with, you “pull” the changes down from GitHub with this command.
Log in to your Jetstream computer! You should see something like this:
diblynn@js-17-71:~$
Make sure you’re in your home directory:
cd ~
It’s time to introduce yourself to Git. Type in the following code:
git config --global user.name "Your Name Here"
Next, tell it your email and make sure it’s the same email you used when you signed up for a GitHub.com account
git config --global user.email "your_email@youremail.com"
Now that you’re all set up, it’s time to create a place for your project to live. Both Git and GitHub refer to this as a repository, or “repo” for short, a digital directory or storage space where you can access your project, its files, and all the versions of its files that Git saves.
On your Github profile, click the plus button and select a “New Repository”.
Give your repository a name & fill out the necessary information for your repository to be distinct and recognizable.
Don’t worry about clicking the checkbox next to “Initialize this repository with a README.” A Readme file is usually a text file that explains a bit about the project. But we can make our own Readme file locally for practice.
Click the green “Create Repository” button and you’re set. You now have an online space for your project to live in.
To begin, let’s create a new directory called MyProject.
mkdir ~/MyProject
Then we will move into this new directory.
cd ~/MyProject
To create a local repository, we will first initiate a new repository for “MyProject” by entering the following command:
git init
touch is a multi-purpose command, but one of its key uses is to create new, empty files. In our case, we will create a new file called Readme.txt.
touch Readme.txt
We can check the status of our new repository by using git status.
git status
When we want Git to track a file, we use git add followed by the file we want Git to "see". If we do not use git add, Git will not "see" this file.
git add Readme.txt
Lastly, to have Git track the current "snapshot" of our file, we enter git commit. The -m flag allows us to add a personal message with the files we are committing. In the following example, our message is "Add Readme.txt". Examples of other messages could include version information, changes made to a document, document descriptions, etc.
git commit -m "Add Readme.txt"
Now Git has a “snapshot” of this version of Readme.txt which you can return to at any time in the future!
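If we ever want to see the snapshots we've taken so far, git log will list our commits (newest first), each with its ID, author, date, and message:
git log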
This setup also makes it easy to have multiple collaborators working on the same project. Each of you can work alone on your own computers, but upload or “push” your changes up to the GitHub repository when they’re ready.
To tell Git the address of your remote repo on GitHub, type the following, replacing the address with that of your own repo:
git remote add origin https://github.com/username/myproject.git
Git now knows there’s a remote repository and it’s where you want your local repository changes to go. To confirm, type this to check:
git remote -v
Great, Git is able to connect with our remote on GitHub. So, let's go ahead and push our files to GitHub:
git push origin master
You will be prompted for your GitHub username and password at this point. You should then see output like the following, showing that git is sending packets of data to your GitHub repo; with this, all of your commits since the last time you pushed are now backed up online. FOR FREE!
Counting objects: 3, done.
Writing objects: 100% (3/3), 217 bytes | 217.00 KiB/s, done.
Total 3 (delta 0), reused 0 (delta 0)
To https://github.com/sateeshbio5/angus_test.git
* [new branch] master -> master
Note: To avoid having to type your username and password each time you push/pull from your github repos, read about Secure Login here
GitHub Issues: Issues are a great way to keep track of tasks, enhancements, and bugs for your projects. They’re kind of like email—except they can be shared and discussed with all. Read more about Mastering Issues on Github here
GitHub Pull-Requests: Pull requests let you tell others about changes you’ve pushed to a branch in a repository on GitHub. Once a pull request is opened, you can discuss and review the potential changes with collaborators and add follow-up commits before your changes are merged into the base branch.
Look at others’ repositories:
GitHub Pages is an awesome feature that lets you host websites/blogs for you and your projects.
Hosted directly from your GitHub repository. Just edit, push, and your changes are live.
Read more about GitHub Pages here
Introductory tutorial by Lauren Orsini here
ATTENTION!!
This is meant to be a guided, but very open discussion. Please feel free to jump in at any time with questions or thoughts on things!
This page presents a broad-level overview of amplicon sequencing and metagenomics as applied to microbial ecology. Both of these methods are most often applied for exploration and hypothesis generation and should be thought of as steps in the process of science rather than end-points – like all tools of science 🙂
Amplicon sequencing
Amplicon sequencing of marker genes (e.g. 16S, 18S, ITS) involves using specific primers that target a specific gene or gene fragment. It is one of the first tools in the microbial ecologist's toolkit. It is most often used as a broad-level survey of community composition, to generate hypotheses based on differences in recovered gene-copy numbers between samples.
Metagenomics
Shotgun metagenomic sequencing aims to amplify all the accessible DNA of a mixed community. It uses random primers and therefore suffers much less from PCR bias (discussed below). Metagenomics enables profiling of taxonomy and functional potential. Recently, the recovery of representative genomes from metagenomes has become a very powerful approach in microbial ecology, drastically expanding the known Tree of Life by granting us genomic access to as-yet unculturable microbial populations (e.g. Hug et al. 2016; Parks et al. 2017).
Here we’ll discuss some of the things each is useful and not useful for, and then look at some general workflows for each.
Useful for:
one metric of community composition
can say something about relative abundance of gene copies recovered
can track changes in community structure (as interpreted by recovered gene copy numbers) in response to a treatment and/or across environmental gradients/time
can provide strong support for further investigation of things
Not useful for:
abundance of organisms (or relative abundance of organisms)
recovered gene copies ≠ counts of organisms
gene-copy number varies per genome/organism (16S sequence can vary per genome)
PCR bias (small scale) -> under/over representation based on primer-binding efficiency
PCR bias (large scale) -> "universal" primers are only looking for what we know, and they don't even catch all of that
cell-lysis efficiencies
function
even if we can highly resolve the taxonomy of something from an amplicon sequence, it is still only one fragment of one gene
hypothesis generation is okay, e.g. speculating about observed shifts in gene-copy numbers based on what’s known about nearest relatives in order to guide further work
but, for example, writing a metagenomics paper based on 16S data would likely not be a good idea
A single gene has no strain-level resolving capability; all it tells you is how similar those genes are.
As noted above, amplicon data can still be very useful. Most often when people claim it isn’t, they are assessing that based on things it’s not supposed to do anyway, e.g.:
“Why are you doing 16S sequencing? That doesn’t tell you anything about function.”
“Why are you measuring nitrogen-fixation rates? That doesn’t tell you anything about the proteins that are doing it.”
We shouldn’t assess the utility of a tool based on something it’s not supposed to do anyway 🙂
Useful for:
functional potential
insights into the genomes of as-yet unculturable microbes
much better for “relative” abundance due to no confounding copy-number problem and no drastic PCR bias (still not true abundance)
still some caveats, like cell-lysis efficiencies
Not useful for:
abundance of organisms
“activity”
neither is transcriptomics or proteomics for that matter – each gives you insight into cellular regulation at different levels
QUICK QUESTION! With all that said, do you think we should expect relative abundance information from amplicon sequencing to match up with relative abundance from metagenomic sequencing?
Solution: No, and that's not a problem if we understand that neither is meant to tell us a true abundance anyway. They are providing different information is all. And the relative abundance metrics they do provide can still be informative when comparing multiple samples generated the same way 🙂
All sequencing technologies make mistakes, and (to a much lesser extent), polymerases make mistakes as well during the amplification process. These mistakes artificially increase the number of unique sequences in a sample, a lot. Clustering similar sequences together (generating OTUs) emerged as one way to mitigate these errors and to summarize data – though at the cost of resolution. The field as a whole is moving towards using solely ASVs, and there is pretty good reasoning for this. This Callahan et al. 2017 paper nicely lays out the case for that, summarized in the following points:
OTUs (operational taxonomic units)
cluster sequences into groups based on percent similarity
choose a representative sequence for that group
closed reference
+ can compare across studies
- reference biased and constrained
de novo
+ can capture novel diversity
- not comparable across studies
- diversity of sample affects what OTUs are generated
ASVs (amplicon sequence variants)
attempt to identify the original biological sequences by taking into account error
+ enables single-nucleotide resolution
+ can compare across studies
+ can capture novel diversity
If you happen to work with amplicon data, I highly recommend digging into the Callahan et al. 2017 paper sometime 🙂
Learning objectives:
* Learn what transcriptome assembly is
* Learn to differentiate different types of assemblies
* Discuss how assemblers work
* Learn to check the quality of a transcriptome assembly
What if you are working with a species with no existing reference genome or transcriptome…how do you construct a reference?
The answer is de novo assembly. The basic idea with de novo transcriptome assembly is you feed in your reads and you get out a bunch of contigs that represent transcripts, or stretches of RNA present in the reads that don’t have any long repeats or much significant polymorphism. You run a de novo transcriptome assembly program using the trimmed reads as input and get out a pile of assembled RNA.
Trinity, one of the leading de novo transcriptome assemblers, was developed at the Broad Institute and the Hebrew University of Jerusalem. We will be loosely following steps from the Eel Pond protocol for our guide to doing RNA-seq assembly.
We will be using a set of Nematostella vectensis mRNAseq reads from Tulin et al., 2013.
To avoid needing to go though the trimming steps as before, we’ll download a snakefile to take care of these steps for us. If you look at this file, you may notice it is very similar to the one you generated during the snakemake challenge.
Make sure you have snakemake installed in your base environment, where we will be running the snakefile:
conda install -c conda-forge snakemake-minimal
Download the snakefile and a corresponding conda environment file:
cd ~ # cd to home directory
curl -L https://osf.io/nqh6p/download -o nema_trim.snakefile
curl -L https://osf.io/xs6k7/download -o trim-env.yml
Let’s take a look at the environment file to see what software snakemake will be putting in the environment it creates:
less trim-env.yml
Run the snakefile to download and trim the Nematostella reads:
snakemake -s nema_trim.snakefile --use-conda --cores 6
Here, we run snakemake. We use the -s flag to tell snakemake where it can find our snakefile, tell it to build and use the environment above using the --use-conda flag, and have it execute the workflow over 6 cores using --cores 6.
The trimmed data should now be within the nema_trimmed folder.
The following commands will create a new folder assembly and link the trimmed data we prepared in our snakemake workflow into the assembly folder:
cd
mkdir assembly
cd assembly
ln -fs ../nema_trimmed/*.qc.fq.gz .
ls
Next we will combine all of our forward reads into a single file and all of our reverse reads into a single file. Usually, you want to combine reads from many samples, ideally all from a single individual, which minimizes polymorphism and improves the assembly. See this preprint for best practices for the care and feeding of your transcriptome.
Use the following command to combine all fq into 2 files (left.fq and right.fq).
cat *_1.pe.qc.fq.gz *se.qc.fq.gz > left.fq.gz
cat *_2.pe.qc.fq.gz > right.fq.gz
Note: this step just makes it easier for us to type out the trinity command. Trinity can accept comma-separated lists of files as inputs - check the trinity documentation for details.
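For example, a purely hypothetical version of the Trinity command we run below, listing two made-up samples directly instead of concatenating them first, might look like this (don't run it; the sample file names here are invented for illustration):
# hypothetical only -- the sample file names are made up
# Trinity --seqType fq --max_memory 16G --CPU 6 \
#         --left sampleA_1.pe.qc.fq.gz,sampleB_1.pe.qc.fq.gz \
#         --right sampleA_2.pe.qc.fq.gz,sampleB_2.pe.qc.fq.gz \
#         --output nema_trinity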
First, we’ll create and activate an environment where Trinity is installed:
conda create -y -n trinity-env trinity
conda activate trinity-env
Trinity works with both paired-end reads and single-end reads (including with both types of reads at the same time). In the general case, the paired-end files are defined as --left left.fq and --right right.fq respectively. Our single-end reads (orphans) have been concatenated onto the left.fq file.
So let’s run the assembler as follows:
time Trinity --seqType fq --max_memory 16G --CPU 6 --left left.fq.gz --right right.fq.gz --output nema_trinity
(This will take a few minutes)
The time command allows us to see how long Trinity takes to run, but is not a part of the Trinity command.
You should see something like:
** Harvesting all assembled transcripts into a single multi-fasta file...
Thursday, July 11, 2019: 18:21:18 CMD: find /home/dibada/assembly/nema_trinity/read_partitions/ -name '*inity.fasta' | /opt/miniconda/envs/trinity-env/opt/trinity-2.8.5/util/support_scripts/partitioned_trinity_aggregator.pl --token_prefix TRINITY_DN --output_prefix /home/dibada/assembly/nema_trinity/Trinity.tmp
-relocating Trinity.tmp.fasta to /home/dibada/assembly/nema_trinity/Trinity.fasta
Thursday, July 11, 2019: 18:21:18 CMD: mv Trinity.tmp.fasta /home/dibada/assembly/nema_trinity/Trinity.fasta
###################################################################
Trinity assemblies are written to /home/dibada/assembly/nema_trinity/Trinity.fasta
###################################################################
Thursday, July 11, 2019: 18:21:18 CMD: /opt/miniconda/envs/trinity-env/opt/trinity-2.8.5/util/support_scripts/get_Trinity_gene_to_trans_map.pl /home/dibada/assembly/nema_trinity/Trinity.fasta > /home/dibada/assembly/nema_trinity/Trinity.fasta.gene_trans_map
at the end.
First, save the assembly:
cp nema_trinity/Trinity.fasta nema-trinity.fa
Now, look at the beginning:
head nema-trinity.fa
These are the transcripts! Yay!
Let’s capture also some statistics of the Trinity assembly. Trinity provides a handy tool to do exactly that:
TrinityStats.pl nema-trinity.fa
The output should look something like the following:
################################
## Counts of transcripts, etc.
################################
Total trinity 'genes': 18
Total trinity transcripts: 46
Percent GC: 46.68
########################################
Stats based on ALL transcript contigs:
########################################
Contig N10: 3631
Contig N20: 2497
Contig N30: 2145
Contig N40: 2097
Contig N50: 1977
Median contig length: 1593
Average contig: 1459.98
Total assembled bases: 67159
#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################
Contig N10: 4447
Contig N20: 3631
Contig N30: 3631
Contig N40: 2497
Contig N50: 2151
Median contig length: 1107
Average contig: 1422.83
Total assembled bases: 25611
This is a set of summary stats about your assembly. Are they good? Bad? How would you know?
Lastly, we’ll deactivate the environment where Trinity is installed.
conda deactivate
After generating a de novo transcriptome assembly:
At the end of this lesson, you will be familiar with:
how to annotate a de novo transcriptome assembly
how to parse GFF3 output from the annotation to use for DE analysis
several methods for evaluating the completeness of a de novo transcriptome assembly
What a Jupyter notebook is and how to execute a few commands in Python
dammit!
dammit is an annotation pipeline written by Camille Scott. The dammit pipeline runs a relatively standard annotation protocol for transcriptomes: it begins by building gene models with Transdecoder, then uses the following protein databases as evidence for annotation: Pfam-A, Rfam, OrthoDB, and uniref90 (uniref is optional, enabled with --full).
If a protein dataset for your organism (or a closely-related species) is available, this can also be supplied to the dammit pipeline with the --user-databases flag as optional evidence for the annotation.
In addition, BUSCO v3 is run, which will compare the gene content in your transcriptome with a lineage-specific data set. The output is the proportion of your transcriptome that matches with the data set, which can be used as an estimate of the completeness of your transcriptome based on evolutionary expectation (Simão et al. 2015).
There are several lineage-specific datasets available from the authors of BUSCO. We will use the metazoa dataset for this transcriptome.
Annotation necessarily requires a lot of software! dammit attempts to simplify this and make it as reliable as possible, but we still have some dependencies.
dammit can be installed via bioconda, but the latest version is not up there yet. Let's download a working conda environment.yml file that will help us install the latest version of dammit.
cd
curl -L https://raw.githubusercontent.com/dib-lab/elvers/master/elvers/rules/dammit/environment.yml -o dammit-env.yml
Now let's build an environment from that yaml file:
conda env create -f dammit-env.yml -n dammit-env
conda activate dammit-env
To make sure your installation was successful, run
dammit -h
This will give a list of dammit’s commands and options:
The version (dammit --version) should be:
dammit 1.1
dammit has two major subcommands: dammit databases and dammit annotate. The databases command checks that databases are installed and prepared, and if run with the --install flag, it will perform that installation and preparation. If you just run dammit databases on its own, you should get a notification that some database tasks are not up-to-date. So, we need to install them!
Install databases (this will take a long time, usually >10 min):
dammit databases --install --busco-group metazoa
Note: dammit installs databases in your home directory by default. If you have limited space in your home directory or on your instance, you can install these databases in a different location (e.g. on an external volume) by running dammit databases --database-dir /path/to/install/databases before running the installation command.
We used the "metazoa" BUSCO group. We can use any of the BUSCO databases, so long as we install them with the dammit databases subcommand. You can see the whole list by running dammit databases -h. You should try to match your species as closely as possible for the best results. If we want to install another, for example:
dammit databases --install --busco-group protists
Phew, now we have everything installed!
Now, let's take a minute and thank Camille for making this process easy for us by maintaining a recipe on bioconda. This saves us a lot of hassle with having to install individual parts required for the pipeline. AND on top of the easy installation, there is this slick pipeline! Historically, transcriptome annotation involved many tedious steps, requiring bioinformaticians to keep track of parsing alignment output from multiple databases and summarizing across them. All of these steps have been standardized in the dammit pipeline, which uses the pydoit automation tool. Now, we can input our assembly fasta file -> query databases -> and get output annotations with gene names for each contig, all in one step. Thank you, Camille!
Keep things organized! Let’s make a project directory:
cd ~/
mkdir -p ~/annotation
cd ~/annotation
Let’s copy in the trinity assembly file we made earlier:
cp ../assembly/nema-trinity.fa ./
Now we'll download a custom Nematostella vectensis protein database. Somebody has already created a proper database for us (Putnam et al. 2007; reference proteome available through UniProt). If your critter is a non-model organism, you will likely need to grab proteins from a closely-related species. This will rely on your knowledge of your system!
curl -LO ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/UP000001593_45351.fasta.gz
gunzip -c UP000001593_45351.fasta.gz > nema.reference.prot.faa
rm UP000001593_45351.fasta.gz
Run the command:
dammit annotate nema-trinity.fa --busco-group metazoa --user-databases nema.reference.prot.faa --n_threads 6
While dammit runs, it will print out which task it is running to the terminal. dammit is written with a library called pydoit, which is a python workflow library similar to GNU Make. This not only helps organize the underlying workflow, but also means that if we interrupt it, it will properly resume!
After a successful run, you'll have a new directory called nema-trinity.fa.dammit. If you look inside, you'll see a lot of files:
ls nema-trinity.fa.dammit/
Expected output:
annotate.doit.db nema-trinity.fa.transdecoder.cds nema-trinity.fa.x.nema.reference.prot.faa.crbl.gff3 nema-trinity.fa.x.sprot.best.csv
dammit.log nema-trinity.fa.transdecoder.gff3 nema-trinity.fa.x.nema.reference.prot.faa.crbl.model.csv nema-trinity.fa.x.sprot.best.gff3
nema-trinity.fa nema-trinity.fa.transdecoder.pep nema-trinity.fa.x.nema.reference.prot.faa.crbl.model.plot.pdf nema-trinity.fa.x.sprot.maf
nema-trinity.fa.dammit.fasta nema-trinity.fa.transdecoder_dir nema-trinity.fa.x.pfam-A.csv run_nema-trinity.fa.metazoa.busco.results
nema-trinity.fa.dammit.gff3 nema-trinity.fa.x.OrthoDB.best.csv nema-trinity.fa.x.pfam-A.gff3 tmp
nema-trinity.fa.dammit.namemap.csv nema-trinity.fa.x.OrthoDB.best.gff3 nema-trinity.fa.x.rfam.gff3
nema-trinity.fa.dammit.stats.json nema-trinity.fa.x.OrthoDB.maf nema-trinity.fa.x.rfam.tbl
nema-trinity.fa.transdecoder.bed nema-trinity.fa.x.nema.reference.prot.faa.crbl.csv nema-trinity.fa.x.rfam.tbl.cmscan.out
The most important files for you are nema-trinity.fa.dammit.fasta, nema-trinity.fa.dammit.gff3, and nema-trinity.fa.dammit.stats.json.
If the above dammit command is run again, there will be a message:
**Pipeline is already completed!**
Camille wrote dammit in Python, and it includes a library to parse its GFF3 output. To turn this output into a useful table, we will need to open the Python environment.
To do this, we could use a Jupyter notebook, which can run Python (as well as R and many other languages). Similar to R Markdown (Rmd) files, Jupyter notebooks keep track of code and output. The output file format for Jupyter notebooks is .ipynb, which GitHub can render. See this gallery of interesting Jupyter notebooks. Here, though, we'll run the same Python commands directly in our terminal.
Let’s open python in our terminal!
python
This opens python in your terminal, allowing you to run commands in the python language.
Let’s import the libraries we need.
import pandas as pd
from dammit.fileio.gff3 import GFF3Parser
Now enter these commands:
gff_file = "nema-trinity.fa.dammit/nema-trinity.fa.dammit.gff3"
annotations = GFF3Parser(filename=gff_file).read()
names = annotations.sort_values(by=['seqid', 'score'], ascending=True).query('score < 1e-05').drop_duplicates(subset='seqid')[['seqid', 'Name']]
new_file = names.dropna(axis=0,how='all')
new_file.head()
Which will give an output that looks like this:
Try commands like annotations.columns and annotations.head() or annotations.head(50)
Questions:
What do these commands help you to see?
How might you use this information to modify the names line in the code above?
To save the file, enter:
new_file.to_csv("nema_gene_name_id.csv")
Now, to return to the bash terminal, type:
quit()
We can look at the output we just made, which is a table of genes with 'seqid' and 'Name' in a .csv file: nema_gene_name_id.csv.
less nema_gene_name_id.csv
Notice there are multiple transcripts per gene model prediction. This .csv file can be used in tximport as a tx2gene file in downstream DE analysis.
BUSCO aims to provide a quantitative measure of transcriptome (or genome/gene set) completeness by searching for near-universal single-copy orthologs. These ortholog lists are curated into different groups (e.g. genes found universally across all metazoa, or all fungi, etc), and are currently based off of OrthoDB v9.
Metazoa database used with 978 genes
“Complete” lengths are within two standard deviations of the BUSCO group mean length
Useful links:
Website: http://busco.ezlab.org/
Paper: Simão et al. 2015
We’ve already installed and ran the BUSCO command with the dammit pipeline. Let’s take a look at the results.
Check the output:
cat nema-trinity.fa.dammit/run_nema-trinity.fa.metazoa.busco.results/short_summary_nema-trinity.fa.metazoa.busco.results.txt
Challenge: How do the BUSCO results of the full transcriptome compare?
Run the BUSCO command by itself:
run_BUSCO.py \
-i nema-trinity.fa \
-o nema_busco_metazoa -l ~/.dammit/databases/busco2db/metazoa_odb9 \
-m transcriptome --cpu 4
When you’re finished, exit out of the conda environment:
conda deactivate
Create / log into an m1.medium Jetstream instance.
Discuss k-mers and their utility
Compare RNA-seq samples quickly
Detect eukaryotic contamination in raw RNA-seq reads
Compare reads to an assembly
Build your own database for searching
Other sourmash databases
A “k-mer” is a word of DNA that is k long:
ATTG - a 4-mer
ATGGAC - a 6-mer
Typically we extract k-mers from genomic assemblies or read data sets by running a k-length window across all of the reads and sequences – e.g. given a sequence of length 16, you could extract 11 k-mers of length six from it like so:
AGGATGAGACAGATAG
becomes the following set of 6-mers:
AGGATG
GGATGA
GATGAG
ATGAGA
TGAGAC
GAGACA
AGACAG
GACAGA
ACAGAT
CAGATA
AGATAG
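Just to make that sliding window concrete, here is a tiny bash sketch that prints the same 11 6-mers (purely illustrative; the tools below handle all of this for us):
seq="AGGATGAGACAGATAG"
k=6
# slide a window of length k across the sequence, one position at a time
for (( i=0; i<=${#seq}-k; i++ ))
do
    echo "${seq:$i:$k}"
done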
Today we will be using a tool called sourmash to explore k-mers!
Computers love k-mers because there’s no ambiguity in matching them. You either have an exact match, or you don’t. And computers love that sort of thing!
Basically, it’s really easy for a computer to tell if two reads share a k-mer, and it’s pretty easy for a computer to store all the k-mers that it sees in a pile of reads or in a genome.
k-mers are most useful when they're long, because then they're specific. That is, if you have a 31-mer taken from a human genome, it's pretty unlikely that another genome has that exact 31-mer in it. (You can calculate the probability if you assume genomes are random: there are 4^31 possible 31-mers, and 4^31 = 4,611,686,018,427,387,904. So, you know, a lot.)
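(If you want to double-check that number, bash can do the arithmetic for us:)
echo $(( 4**31 ))    # prints 4611686018427387904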
Essentially, long k-mers are species specific. Check out this figure from the MetaPalette paper:
Here, Koslicki and Falush show that k-mer similarity works to group microbes by genus, at k=40. If you go longer (say k=50) then you get only very little similarity between different species.
So, one thing you can do is use k-mers to compare read data sets to read data sets, or genomes to genomes: data sets that have a lot of similarity probably are similar or even the same genome.
One metric you can use for these comparisons is the Jaccard similarity (also called the Jaccard index), which is calculated by asking how many k-mers are shared between two samples versus how many k-mers there are in total across both samples:
only k-mers in both samples
----------------------------
all k-mers in either or both samples
A Jaccard similarity of 1 means the samples are identical; a Jaccard similarity of 0 means the samples are completely different.
Jaccard similarity works really well when we don’t care how many times we see a k-mer. When we keep track of the abundance of a k-mer, say for example in RNA-seq samples where the number of read counts matters, we use a cosine-based measure instead.
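To make those two measures concrete, here is a minimal Python sketch (for illustration only, not how sourmash computes them internally; the toy k-mer lists and abundance dictionaries are made up):
import math

def jaccard(kmers_a, kmers_b):
    # shared k-mers divided by all k-mers seen in either sample
    a, b = set(kmers_a), set(kmers_b)
    return len(a & b) / len(a | b)

def cosine_similarity(abund_a, abund_b):
    # cosine of the angle between two k-mer abundance profiles (dicts of k-mer -> count)
    dot = sum(abund_a[k] * abund_b[k] for k in set(abund_a) & set(abund_b))
    norm_a = math.sqrt(sum(c * c for c in abund_a.values()))
    norm_b = math.sqrt(sum(c * c for c in abund_b.values()))
    return dot / (norm_a * norm_b)

print(jaccard(["ATTGAC", "TTGACC"], ["ATTGAC", "TGACCA"]))   # 1 shared / 3 total ~ 0.33
print(cosine_similarity({"ATTGAC": 10, "TTGACC": 1}, {"ATTGAC": 8, "TGACCA": 2}))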
These two measures can be used to search databases, compare RNA-seq samples, and all sorts of other things! The only real problem is that there are a lot of k-mers in a genome – a 5 Mbp genome (like E. coli) has about 5 million k-mers!
In 2016, Ondov et al. showed that MinHash approaches could be used to estimate Jaccard similarity using only a small fraction (1 in 10,000 or so) of all the k-mers.
The basic idea behind MinHash is that you pick a small subset of k-mers to look at, and you use those as a proxy for all the k-mers. The trick is that you pick the k-mers randomly but consistently: so if a chosen k-mer is present in two data sets of interest, it will be picked in both. This is done using a clever trick that we can try to explain to you in class - but either way, trust us, it works!
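One way to picture the trick in Python: hash every k-mer with the same fixed hash function in every sample, and keep only the k-mers whose hash value falls below a cutoff, which is roughly what sourmash’s --scaled option does. The sketch below uses MD5 purely as a stand-in for the much faster hash the real implementation uses:
import hashlib

def hash_kmer(kmer):
    # map a k-mer to a large integer, the same way every time
    return int(hashlib.md5(kmer.encode()).hexdigest(), 16)

def scaled_sketch(kmers, scaled=2000):
    # keep only the k-mers whose hash lands in the lowest 1/scaled fraction of the hash space
    max_hash = 2**128 // scaled      # md5 hashes are 128-bit integers
    return {kmer for kmer in kmers if hash_kmer(kmer) < max_hash}

# Because every sample is hashed with the same function and cutoff, a k-mer that is kept
# in one sample will also be kept in any other sample that contains it, so the small
# sketches of different samples remain directly comparable.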
We have implemented a MinHash approach in our sourmash software, which can do some nice things with samples. We’ll show you some of these things next!
To install sourmash, run:
conda install -y -c conda-forge -c bioconda sourmash
A signature is a compressed representation of the k-mers in the sequence.
Depending on your application, we recommend different ways of preparing sequencing data to create a signature.
In a genome or transcriptome, we expect that the k-mers we see are accurate. We can create signatures from these types of sequencing data sets without any preparation. We demonstrate how to create a signature from high-quality sequences below.
First, download a genome assembly:
cd ~
mkdir sourmash_data
cd sourmash_data
curl -L https://osf.io/963dg/download -o ecoliMG1655.fa.gz
gunzip -c ecoliMG1655.fa.gz | head
Compute a scaled MinHash from the assembly:
sourmash compute -k 21,31,51 --scaled 2000 --track-abundance -o ecoliMG1655.sig ecoliMG1655.fa.gz
For raw sequencing reads, we expect that many of the unique k-mers we observe will be due to errors in sequencing. Unlike with high-quality sequences like transcriptomes and genomes, we need to think carefully about how we want to create each signature, as it will depend on the downstream application.
Comparing reads against high-quality sequences: because the references we are comparing or searching against contain only k-mers that are likely real, erroneous k-mers in our reads simply won’t match and do little harm. Although most of the low-abundance k-mers we might trim would indeed be errors, there is a chance we would accidentally remove real biological variation that is present at low abundance. So in this case we only want to trim adapters.
Comparing reads against other reads: Because both datasets likely have many erroneous k-mers, we want to remove the majority of these so as not to falsely deflate similarity between samples. Therefore, we want to trim what are likely erroneous k-mers from sequencing errors, as well as adapters.
Let’s download some raw sequencing reads and demonstrate what k-mer trimming looks like.
First, download a read file:
curl -L https://osf.io/pfxth/download -o ERR458584.fq.gz
gunzip -c ERR458584.fq.gz | head
Next, perform k-mer trimming using a library called khmer. K-mer trimming removes low-abundance (likely erroneous) k-mers from the sample.
trim-low-abund.py ERR458584.fq.gz -V -Z 10 -C 3 --gzip -M 3e9 -o ERR458584.khmer.fq.gz
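To build intuition for what the trimming is doing, here is a greatly simplified two-pass sketch in Python. It is not how trim-low-abund.py actually works (khmer uses a memory-efficient, semi-streaming approach); the reads below are a made-up toy example in which the third read carries a sequencing error:
from collections import Counter

def count_kmers(reads, k):
    # first pass: tally every k-mer across all reads
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def trim_read(read, counts, k, cutoff=3):
    # second pass: cut the read just before the last base of the first rare k-mer
    for i in range(len(read) - k + 1):
        if counts[read[i:i + k]] < cutoff:
            return read[:i + k - 1]
    return read

reads = ["ATGGCATGGCATGGC", "ATGGCATGGCATGGC", "ATGGCATGGCTTGGC"]  # third read has a likely error
counts = count_kmers(reads, k=5)
print([trim_read(r, counts, k=5) for r in reads])   # the erroneous tail of the third read is trimmed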
Finally, calculate a signature from the trimmed reads.
sourmash compute -k 21,31,51 --scaled 2000 --track-abundance -o ERR458584.khmer.sig ERR458584.khmer.fq.gz
We can prepare signatures like this for any sequencing data file! For the rest of the tutorial, we have prepared signatures for each sequencing data set we will be working with.
Use case: how similar are my samples to one another?
Traditionally in RNA-seq workflows, we use MDS plots to determine how similar our samples are. Samples that are closer together on the MDS plot are more similar. However, to get to this point, we have to trim our reads, download or build a reference transcriptome, quantify our reads using a tool like Salmon, and then read the counts into R and make an MDS plot. This is a lot of steps to go through just to figure out how similar your samples are!
Luckily, we can use sourmash to quickly compare how similar our samples are.
We generated signatures for the majority of the rest of the Schurch et al. experiment we have been working with this week. Below we download and compare the 647 signatures, and then produce a plot that shows how similar they are to one another.
First, download and uncompress the signatures.
curl -o schurch_sigs.tar.gz -L https://osf.io/p3ryg/download
tar xf schurch_sigs.tar.gz
Next, compare the signatures using sourmash.
sourmash compare -k 31 -o schurch_compare_matrix schurch_sigs/*sig
This outputs a comparison matrix and a set of labels. The matrix is symmetrical and contains values between 0 and 1 that capture the similarity between samples: 0 means two samples share no k-mers, while 1 means they share all of their k-mers.
Lastly, we plot the comparison matrix.
sourmash plot --labels schurch_compare_matrix
We see there are two major blocks of similar samples, which makes sense given that we have WT and SNF2 knockout samples. However, we also see that some of our samples are outliers! If this were our experiment, we would want to investigate the outliers further to see what caused them to be so dissimilar.
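If you would rather explore the matrix yourself, sourmash (in the 2.x versions used here, as far as we are aware) writes the comparison matrix as a NumPy binary file with a companion .labels.txt file listing one sample name per line. Assuming that layout, you can load it in Python:
import itertools
import numpy as np

# load the similarity matrix and matching sample labels written by `sourmash compare -o`
matrix = np.load("schurch_compare_matrix")
with open("schurch_compare_matrix.labels.txt") as fh:
    labels = [line.strip() for line in fh]

print(matrix.shape)   # one row and one column per signature
print(labels[:3])     # the first few sample names

# for example, find the pair of samples with the lowest similarity
i, j = min(itertools.combinations(range(len(labels)), 2), key=lambda ij: matrix[ij])
print(labels[i], labels[j], matrix[i, j])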
Use case: Search for the presence of unexpected organisms in raw RNA-seq reads
For most analysis pipelines, there are many steps that need to be executed before we get to the analysis and interpretation of what is in our sample. This often means we are 10-15 steps into our analysis before we find any problems. However, if our reads contain contamination, we want to know that as quickly as possible so we can remove the contamination and solve any issues that led to the contamination.
Using sourmash, we can quickly check if we have any unexpected organisms in our sequencing samples. We do this by comparing a signature from our reads against a database of known signatures from publicly available reference sequences.
We have generated sourmash databases for all publicly available eukaryotic RNA samples (we used the *rna_from_genomic* files from RefSeq and GenBank; however, keep in mind that not all sequenced genomes have these files!). This database includes fungi, plants, vertebrates, invertebrates, and protozoa. It does not include human, so we incorporate that separately. We also built another database of the ~700 recently re-assembled marine transcriptomes from the MMETSP project. These databases allow us to detect common organisms that might be unexpectedly present in our sequencing data.
First, let’s download and uncompress our three databases: human, MMETSP, and everything else!
wget -O sourmash_euk_rna_db.tar.gz https://osf.io/vpk8s/download
tar xf sourmash_euk_rna_db.tar.gz
Next, let’s download a signature from some sequencing reads. We’ll work with some sequencing reads from a wine fermentation.
wget -O wine.sig https://osf.io/5vsjq/download
We expected fungus and grape to be metabolically active in these samples. Let’s check which organisms we detect.
sourmash gather -k 31 --scaled 2000 -o wine.csv wine.sig sourmash_euk_rna_db/*sbt.json sourmash_euk_rna_db/*sig
If we take a look at the output, we see:
== This is sourmash version 2.0.1. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
loaded query: wine_fermentation... (k=31, DNA)
downsampling query from scaled=2000 to 2000
loaded 1 signatures and 2 databases total.
overlap p_query p_match avg_abund
--------- ------- ------- ---------
2.2 Mbp 79.0% 28.5% 122.2 Sc_YJM1477_v1 Saccharomyces cerevis...
0.8 Mbp 1.4% 2.2% 6.2 12X Vitis vinifera (wine grape)
2.1 Mbp 0.4% 1.6% 9.3 GLBRCY22-3 Saccharomyces cerevisiae...
124.0 kbp 0.1% 0.7% 3.1 Aureobasidium pullulans var. pullula...
72.0 kbp 0.0% 0.1% 1.9 Mm_Celera Mus musculus (house mouse)
1.9 Mbp 0.0% 0.5% 3.7 Sc_YJM1460_v1 Saccharomyces cerevis...
1.8 Mbp 0.1% 0.5% 14.1 ASM18217v1 Saccharomyces cerevisiae...
2.1 Mbp 0.1% 0.3% 17.7 R008 Saccharomyces cerevisiae R008 ...
1.9 Mbp 0.0% 0.1% 3.1 ASM32610v1 Saccharomyces cerevisiae...
found less than 18.0 kbp in common. => exiting
found 9 matches total;
the recovered matches hit 81.2% of the query
…which is almost exactly what we expect, except we see some house mouse! And I promise Ratatouille was not making this wine.
Using this method, we have now identified contamination in our reads. We could align to the mouse genome to remove these reads; however, the best removal strategy may vary case by case.
Use case: how much of the read content is contained in the reference genome?
First, we’ll download some reads from an E. coli genome and generate a signature from them:
curl -L https://osf.io/frdz5/download -o ecoli_ref-5m.fastq.gz
sourmash compute -k 31 --scaled 2000 ~/sourmash_data/ecoli_ref-5m.fastq.gz -o ecoli-reads.sig
Build a signature for an E. coli genome:
sourmash compute --scaled 2000 -k 31 ~/sourmash_data/ecoliMG1655.fa.gz -o ecoli-genome.sig
and now evaluate containment, that is, what fraction of the read content is contained in the genome:
sourmash search -k 31 ecoli-reads.sig ecoli-genome.sig --containment
and you should see:
loaded query: /home/diblions/data/ecoli_ref-... (k=31, DNA)
loaded 1 signatures.
1 matches:
similarity match
---------- -----
9.7% /home/diblions/data/ecoliMG1655.fa.gz
Why are only 10% or so of our k-mers from the reads in the genome!? Any ideas?
Try the reverse - why is it bigger?
sourmash search -k 31 ecoli-genome.sig ecoli-reads.sig --containment
(…but 100% of our k-mers from the genome are in the reads!?)
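The asymmetry makes sense once you write containment down: it divides the shared k-mers by the size of the query’s k-mer set only, rather than by the union as Jaccard does. A toy Python illustration with made-up k-mer sets:
def containment(query, subject):
    # fraction of the query's k-mers that also appear in the subject
    query, subject = set(query), set(subject)
    return len(query & subject) / len(query)

genome_kmers = {"AAA", "AAT", "ATG"}                      # error-free assembly: few, accurate k-mers
read_kmers = {"AAA", "AAT", "ATG", "AAC", "CCG", "GGT"}   # reads: same k-mers plus error k-mers

print(containment(read_kmers, genome_kmers))   # 3/6 = 0.5 (many read k-mers are errors)
print(containment(genome_kmers, read_kmers))   # 3/3 = 1.0 (every genome k-mer is in the reads)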
Suppose that we have a collection of signatures (made with sourmash compute as above) and we want to search it with our newly assembled genome (or the reads, even!). How would we do that?
Let’s grab a sample collection of 50 E. coli genomes and unpack it –
mkdir ecoli_many_sigs
cd ecoli_many_sigs
curl -O -L https://github.com/dib-lab/sourmash/raw/master/data/eschericia-sigs.tar.gz
tar xzf eschericia-sigs.tar.gz
rm eschericia-sigs.tar.gz
cd ../
This will produce 50 files named ecoli-N.sig in the ecoli_many_sigs directory:
ls ecoli_many_sigs
Let’s turn this into an easily searchable database with sourmash index:
sourmash index -k 31 ecolidb ecoli_many_sigs/*.sig
What does the database look like and how does the search work?
One point to make with this is that the search can quickly narrow down which signatures match your query, without losing any matches. It’s a clever example of how computer scientists can actually make life better :).
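For a feel of how such a search can prune, here is a heavily simplified Python sketch. It is not sourmash’s actual data structure (sourmash index builds a Sequence Bloom Tree, with Bloom filters at the internal nodes), but it shows the key idea: if a query shares too few k-mers with an internal node’s summary, no leaf beneath that node can match, so the whole subtree can be skipped.
class Node:
    # leaves hold the k-mer set of one signature; internal nodes hold the union of everything below
    def __init__(self, kmers, name=None, children=()):
        self.kmers = set(kmers)
        self.name = name
        self.children = list(children)

def make_internal(children):
    # an internal node summarises its children with the union of their k-mers
    return Node(set().union(*(c.kmers for c in children)), children=children)

def search(node, query, threshold):
    # if even the union shares too few k-mers with the query, nothing below can match
    if len(query & node.kmers) < threshold:
        return []
    if not node.children:
        return [node.name]
    hits = []
    for child in node.children:
        hits += search(child, query, threshold)
    return hits

# toy example with made-up 4-mer sets
leaf_a = Node({"AAAA", "AAAT", "AATG"}, name="genome_A")
leaf_b = Node({"CCCC", "CCCG", "CCGT"}, name="genome_B")
root = make_internal([leaf_a, leaf_b])
print(search(root, {"AAAA", "AAAT", "GGGG"}, threshold=2))   # prints ['genome_A']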
And now we can search!
sourmash search ecoli-genome.sig ecolidb.sbt.json -n 20
You should see output like this:
# running sourmash subcommand: search
select query k=31 automatically.
loaded query: /home/tx160085/data/ecoliMG165... (k=31, DNA)
loaded SBT ecolidb.sbt.json
Searching SBT ecolidb.sbt.json
49 matches; showing first 20:
similarity match
---------- -----
75.9% NZ_JMGW01000001.1 Escherichia coli 1-176-05_S4_C2 e117605...
73.0% NZ_JHRU01000001.1 Escherichia coli strain 100854 100854_1...
71.9% NZ_GG774190.1 Escherichia coli MS 196-1 Scfld2538, whole ...
70.5% NZ_JMGU01000001.1 Escherichia coli 2-011-08_S3_C2 e201108...
69.8% NZ_JH659569.1 Escherichia coli M919 supercont2.1, whole g...
59.9% NZ_JNLZ01000001.1 Escherichia coli 3-105-05_S1_C1 e310505...
58.3% NZ_JHDG01000001.1 Escherichia coli 1-176-05_S3_C1 e117605...
56.5% NZ_MIWF01000001.1 Escherichia coli strain AF7759-1 contig...
56.1% NZ_MOJK01000001.1 Escherichia coli strain 469 Cleandata-B...
56.1% NZ_MOGK01000001.1 Escherichia coli strain 676 BN4_676_1_(...
50.5% NZ_KE700241.1 Escherichia coli HVH 147 (4-5893887) acYxy-...
50.3% NZ_APWY01000001.1 Escherichia coli 178200 gec178200.conti...
48.8% NZ_LVOV01000001.1 Escherichia coli strain swine72 swine72...
48.8% NZ_MIWP01000001.1 Escherichia coli strain K6412 contig_00...
48.7% NZ_AIGC01000068.1 Escherichia coli DEC7C gecDEC7C.contig....
48.2% NZ_LQWB01000001.1 Escherichia coli strain GN03624 GCID_EC...
48.0% NZ_CCQJ01000001.1 Escherichia coli strain E. coli, whole ...
47.3% NZ_JHMG01000001.1 Escherichia coli O121:H19 str. 2010EL10...
47.2% NZ_JHGJ01000001.1 Escherichia coli O45:H2 str. 2009C-4780...
46.5% NZ_JHHE01000001.1 Escherichia coli O103:H2 str. 2009C-327...
This search identifies which genomes our signature matches. There are some pretty good matches, but nothing above 75%. Why? What are some things we should think about when we’re doing taxonomic classification?
First, let’s download and unpack the database we’ll use for classification:
cd ~/sourmash_data
curl -L https://osf.io/4f8n3/download -o genbank-k31.lca.json.gz
gunzip genbank-k31.lca.json.gz
This database is a GenBank index of all the microbial genomes – this one contains sketches of all 87,000 microbial genomes (including viral and fungal). See available sourmash databases for more information.
After this database is unpacked, it produces a file named genbank-k31.lca.json.
Next, run the ‘lca gather’ command to see what’s in your ecoli genome –
sourmash lca gather ecoli-genome.sig genbank-k31.lca.json
and you should get:
loaded 1 LCA databases. ksize=31, scaled=10000
loaded query: /home/diblions/data/ecoliMG165... (k=31)
overlap p_query p_match
--------- ------- --------
4.9 Mbp 100.0% 2.3% Escherichia coli
Query is completely assigned.
In this case, the output is kind of boring because this is a single genome. But! You can use this on metagenomes (assembled and unassembled) as well; you’ve just got to make the signature files.
To see this in action, here is gather running on a signature generated from some sequences that assemble (but don’t align to known genomes) from the Shakya et al. 2013 mock metagenome paper.
wget https://github.com/dib-lab/sourmash/raw/master/doc/_static/shakya-unaligned-contigs.sig
sourmash lca gather shakya-unaligned-contigs.sig genbank-k31.lca.json
This should yield:
loaded 1 LCA databases. ksize=31, scaled=10000
loaded query: mqc500.QC.AMBIGUOUS.99.unalign... (k=31)
overlap p_query p_match
--------- ------- --------
1.8 Mbp 14.6% 9.1% Fusobacterium nucleatum
1.0 Mbp 7.8% 16.3% Proteiniclasticum ruminis
1.0 Mbp 7.7% 25.9% Haloferax volcanii
0.9 Mbp 7.4% 11.8% Nostoc sp. PCC 7120
0.9 Mbp 7.0% 5.8% Shewanella baltica
0.8 Mbp 6.0% 8.6% Desulfovibrio vulgaris
0.6 Mbp 4.9% 12.6% Thermus thermophilus
0.6 Mbp 4.4% 11.2% Ruegeria pomeroyi
480.0 kbp 3.8% 7.6% Herpetosiphon aurantiacus
410.0 kbp 3.3% 10.5% Sulfitobacter sp. NAS-14.1
150.0 kbp 1.2% 4.5% Deinococcus radiodurans (** 1 equal matches)
150.0 kbp 1.2% 8.2% Thermotoga sp. RQ2
140.0 kbp 1.1% 4.1% Sulfitobacter sp. EE-36
130.0 kbp 1.0% 0.7% Streptococcus agalactiae (** 1 equal matches)
100.0 kbp 0.8% 0.3% Salinispora arenicola (** 1 equal matches)
100.0 kbp 0.8% 4.2% Fusobacterium sp. OBRC1
60.0 kbp 0.5% 0.7% Paraburkholderia xenovorans
50.0 kbp 0.4% 3.2% Methanocaldococcus jannaschii (** 2 equal matches)
50.0 kbp 0.4% 0.3% Bacteroides vulgatus (** 1 equal matches)
50.0 kbp 0.4% 2.6% Sulfurihydrogenibium sp. YO3AOP1
30.0 kbp 0.2% 0.7% Fusobacterium hwasookii (** 3 equal matches)
30.0 kbp 0.2% 0.0% Pseudomonas aeruginosa (** 2 equal matches)
30.0 kbp 0.2% 1.6% Persephonella marina (** 1 equal matches)
30.0 kbp 0.2% 0.4% Zymomonas mobilis
20.0 kbp 0.2% 1.1% Sulfurihydrogenibium yellowstonense (** 6 equal matches)
20.0 kbp 0.2% 0.5% Ruminiclostridium thermocellum (** 5 equal matches)
20.0 kbp 0.2% 0.1% Streptococcus parasanguinis (** 4 equal matches)
20.0 kbp 0.2% 0.8% Fusobacterium sp. HMSC064B11 (** 2 equal matches)
20.0 kbp 0.2% 0.4% Chlorobium phaeobacteroides (** 1 equal matches)
20.0 kbp 0.2% 0.7% Caldicellulosiruptor bescii
10.0 kbp 0.1% 0.0% Achromobacter xylosoxidans (** 53 equal matches)
10.0 kbp 0.1% 0.2% Geobacter sulfurreducens (** 17 equal matches)
10.0 kbp 0.1% 0.5% Fusobacterium sp. HMSC065F01 (** 15 equal matches)
10.0 kbp 0.1% 0.3% Nitrosomonas europaea (** 14 equal matches)
10.0 kbp 0.1% 0.5% Wolinella succinogenes (** 13 equal matches)
10.0 kbp 0.1% 0.5% Thermotoga neapolitana (** 12 equal matches)
10.0 kbp 0.1% 0.5% Thermus amyloliquefaciens (** 10 equal matches)
10.0 kbp 0.1% 0.1% Desulfovibrio desulfuricans (** 9 equal matches)
10.0 kbp 0.1% 0.4% Fusobacterium sp. CM22 (** 8 equal matches)
10.0 kbp 0.1% 0.2% Desulfovibrio piger (** 7 equal matches)
10.0 kbp 0.1% 0.5% Thermus kawarayensis (** 6 equal matches)
10.0 kbp 0.1% 0.5% Pyrococcus furiosus (** 5 equal matches)
10.0 kbp 0.1% 0.5% Aciduliprofundum boonei (** 4 equal matches)
10.0 kbp 0.1% 0.2% Desulfovibrio sp. A2 (** 3 equal matches)
10.0 kbp 0.1% 0.3% Desulfocurvus vexinensis (** 2 equal matches)
10.0 kbp 0.1% 0.0% Enterococcus faecalis
22.1% (2.8 Mbp) of hashes have no assignment.
What do the columns here mean?
Why might some of the things in a metagenome be unassigned?
It is straightforward to build your own databases for use with search and lca gather; this is of interest if you have dozens or hundreds of sequencing data sets in your group. Ping us if you want us to write that up.
There are many tools like Kraken and Kaiju that can do taxonomic classification of individual reads from metagenomes; these seem to perform well (albeit with high false positive rates) in situations where you don’t necessarily have the genome sequences that are in the metagenome. Sourmash, by contrast, can estimate which known genomes are actually present, so that you can extract them and map/align to them. It seems to have a very low false positive rate and is quite sensitive to strains.
Above, we’ve shown you a few things that you can use sourmash for. Here is a (non-exclusive) list of other uses that we’ve been thinking about –
detect contamination in sequencing data;
index and search private sequencing collections;
search all of SRA for overlaps in metagenomes
We are first going to talk about the different choices you make in an RNA-seq experiment, from library prep basics to the final steps of the analysis.
How do you choose your library pipeline?
Choice 1 | Choice 2
---|---
More replicates | Deeper sequencing
RIN values < 7 | RIN values > 7
ERCC spike-ins | No ERCC spike-ins
Poly(A) selection | Ribo-depletion
Fragment size 1 | Fragment size 2
Stranded | Non-stranded
Single-end | Paired-end
Short reads (50 bp) | Medium reads (150 bp)
Depth 1 | Depth 2
RNA-seq | Targeted RNA-seq
Biological vs Technical replicates
Schurch et al., 2016 “How many biological replicates are needed in an RNA-seq experiment and which differential expression tool should you use?”
How do you choose your bioinformatics pipeline?
Choice 1 | Choice 2
---|---
Trimmomatic | cutadapt
Salmon, kallisto... | STAR, HISAT2, bowtie2
DESeq2 | edgeR
Clustering 1 (correlation) | Clustering 2 (PCA)
Pathway analysis | GSET
Visualization 1 | Visualization 100!
Upload them before paper submission! | Upload them before paper submission!
We have pre-installed the ‘heaviest’ packages needed for this lesson on our Jetstream image. For future work, these are the installations you will need, collected in one place:
source("https://bioconductor.org/biocLite.R") # calls the package from the source
biocLite('Biobase') # newer versions of R use 'library()' as function instead of biocLite
install.packages("tidyverse") # we have this in our Image
library(tidyverse)
library(tximport)
library(DESeq2)
install.packages("devtools")
library(devtools)
install.packages('pheatmap')
library(RColorBrewer)
library(pheatmap)
biocLite("GOstats")
biocLite("org.Sc.sgd.db")
library(GOstats)
library(org.Sc.sgd.db)
Bioconductor is a project to develop innovative software tools for use in computational biology. Bioconductor is an ‘umbrella project’ that includes many packages employed in biological analyses. Biobase is part of the Bioconductor project and is used by many other packages; it contains standardized data structures to represent genomic data. You can find help for Biobase by typing:
browseVignettes("Biobase")
We are going to follow the lesson by Mike Love in the DESeq2 documentation.
This code chunk assumes that you have a count matrix called cts and a table of sample information called coldata. The design argument indicates how to model the samples; here, we want to measure the effect of the condition, controlling for batch differences. The two factor variables batch and condition should then be columns of coldata.
dds <- DESeqDataSetFromMatrix(countData = cts,
colData = coldata,
design= ~ batch + condition)
dds <- DESeq(dds) # runs the differential expression analysis (size factors, dispersions, model fitting)
resultsNames(dds) # lists the coefficients
res <- results(dds, name="condition_trt_vs_untrt")
# or to shrink log fold changes association with condition:
res <- lfcShrink(dds, coef="condition_trt_vs_untrt", type="apeglm")
The DESeqDataSet class extends the RangedSummarizedExperiment class of the SummarizedExperiment package. “Ranged” refers here to counts associated with genomic ranges (exons), which means we can also make use of other Bioconductor packages that offer range-based functionality (e.g. ChIP-seq peaks).
A DESeqDataSet must have a design formula: the variables that we will fit in the model to test for differential expression. The formula should be a tilde (~) followed by the variables with plus (+) signs between them (it will be coerced into a formula if it is not already).
Note: to get the most out of the package with default settings, put the variable of interest at the end of the formula and make sure the control level is the first level. The starting functions are:
If you have transcript quantification files, as produced by Salmon, Sailfish, or kallisto, the first line would use DESeqDataSetFromTximport.
If you have htseq-count files, the first line would use DESeqDataSetFromHTSeq.
If you have a RangedSummarizedExperiment, the first line would use DESeqDataSet.
DESeq2 internally corrects for library size, so transformed or normalized values such as counts scaled by library size should not be used as input.
Depending on the upstream pipeline, a DESeqDataSet can be constructed:
from transcript abundance files and tximport (this is the pipeline that we will use in this tutorial)
from a count matrix
from htseq-count files
from a SummarizedExperiment object
You can import transcript abundance files from salmon, sailfish, kallisto, and RSEM using tximport, which will create gene-level count matrices for use with DESeq2.
Advantages:
correction for potential changes in gene length across samples (e.g. from differential isoform usage) (Trapnell et al. 2013);
Salmon, Sailfish, and kallisto are substantially faster and require less memory and disk usage compared to alignment-based methods that require creation and storage of BAM files;
it is possible to avoid discarding fragments that can align to multiple genes with homologous sequence, thus increasing sensitivity (Robert and Watson 2015).
Note: the tximport-to-DESeq2 approach uses estimated gene counts from the transcript abundance quantifiers, not normalized counts.
Log into your Jetstream instance and launch RStudio. We will use both the Console (for R) and the Terminal (for bash) there.
Using the Terminal, we are first going to make a new directory called rna_seq_r where we will do all of the analysis, and inside it create two other directories called figures and r_script:
mkdir rna_seq_r
cd rna_seq_r
mkdir figures
mkdir r_script
We are going to set our working directory in the R Console so that we can find and store all our analysis there:
setwd("~/rna_seq_r")
This is how you start any project in R: set your working directory, where you will find your input files (unless you download them directly as in this lesson) and where you will output all your data (and your RScript!).
We are now going to download the rest of the lesson and follow it as an R Markdown document, which is a great way of making and sharing documents. Type in your Terminal:
wget https://raw.githubusercontent.com/maggimars/YeastTutorial/master/R_RNA.Rmd
and let’s get started!