Using seqtk to trim and process reads at an insanely high speed

seqtk was developed by Heng Li and is available from his GitHub page: https://github.com/lh3/seqtk

Installation

To obtain the code you need git installed; then clone the seqtk repository:

# download with git (GitHub no longer serves the unauthenticated git:// protocol)
git clone https://github.com/lh3/seqtk.git

Switch to the seqtk directory, build it, and copy the binary into /usr/local/bin (the copy may require root privileges):

cd seqtk
make
cp seqtk /usr/local/bin

We will use E. coli data from snapshot snap-000d346e in this tutorial. Create a volume from the snapshot, attach it to the instance, and mount it:

mkdir /ebs/
mount /dev/xvdf /ebs

Basic Usage

Nearly all seqtk tools accept FASTA or FASTQ input, either plain or gzipped; gzipped files are decompressed on the fly.

# extract a random sample
seqtk sample

# apply a seed to extract the same reads from two paired-end files
seqtk sample -s 10

# trim reads with the modified Mott trimming algorithm
seqtk trimfq ...

The algorithm is described at http://www.phrap.org/phredphrap/phred.html (scroll down to the Algorithm section for details).
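As a rough illustration of the idea behind that algorithm, the sketch below scores each base as the error threshold minus its error probability, resets the running sum whenever it drops below zero, and reports the highest-scoring contiguous segment. This is not seqtk's actual implementation, which differs in details. The function name mott_segment is hypothetical, and the sketch assumes Phred+33 quality encoding and a POSIX awk.

```shell
# Sketch of modified Mott trimming for a single Phred+33 quality string.
# Usage: mott_segment QUALITY_STRING [ERROR_THRESHOLD]
# Prints the 1-based start and end of the segment to keep ("0 0" if none).
mott_segment() {
  printf '%s\n' "$1" | awk -v thr="${2:-0.05}" '
    # Build a character-to-ASCII-code lookup table for printable characters.
    BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }
    {
      best = 0; sum = 0; start = 1; bs = 0; be = 0
      for (i = 1; i <= length($0); i++) {
        q = ord[substr($0, i, 1)] - 33          # Phred quality score
        sum += thr - 10 ^ (-q / 10)             # threshold minus error probability
        if (sum < 0) { sum = 0; start = i + 1 } # restart the candidate segment
        else if (sum > best) { best = sum; bs = start; be = i }
      }
      print bs, be
    }'
}

mott_segment '!!IIII!!'   # prints: 3 6
```

In this example the two low-quality bases at each end (quality 0) are excluded, and bases 3 to 6 (quality 40) are kept.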

Beyond this, seqtk has other useful features. For example, you can extract subsequences from a file (say, a certain region of your reference genome) with seqtk subseq.
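One way to pull out a region is the subseq command, which takes a FASTA/FASTQ file and a BED file of regions. The sketch below is only an illustration: the file names ref.fa, region.bed and region.fa, the sequence name chr1, and the coordinates are all placeholders, not files from this tutorial's data set.

```shell
# Describe the region in BED format: sequence name, 0-based start, end (exclusive).
# "chr1" and the coordinates are placeholders; use a name from your own FASTA file.
printf 'chr1\t0\t1000\n' > region.bed

# Extract the region; ref.fa is a placeholder for your reference genome.
# The guard only keeps this sketch harmless when seqtk or ref.fa is missing.
if command -v seqtk >/dev/null 2>&1 && [ -f ref.fa ]; then
  seqtk subseq ref.fa region.bed > region.fa
fi
```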

Tools

Using your Amazon EC2 instance:

cd /mnt

To see a list of tools:

seqtk

Extract a random sample:

# sample 1000 reads from a fastq file
seqtk sample /ebs/ecoli/SRR001666_1.fastq.gz 1000 > SRR001666_1_1000.fastq

Convert fastq to fasta:

seqtk seq -A /ebs/ecoli/SRR001666_1.fastq.gz > sample.fa
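To make clear what this conversion does, here is a rough awk equivalent: it keeps the header and sequence of each record and drops the separator and quality lines. It is only a sketch that assumes uncompressed FASTQ with exactly four lines per record; seqtk itself also handles gzipped and multi-line input. The function name fastq_to_fasta_sketch is hypothetical.

```shell
# Convert a plain four-line-per-record FASTQ file to FASTA.
# Usage: fastq_to_fasta_sketch reads.fastq > reads.fa
fastq_to_fasta_sketch() {
  awk 'NR % 4 == 1 { print ">" substr($0, 2) }   # header line: swap "@" for ">"
       NR % 4 == 2 { print }                     # sequence line, unchanged
      ' "$1"
}
```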

Apply the same seed to extract the same reads from two paired-end files:

seqtk sample -s 10 /ebs/ecoli/SRR001666_1.fastq.gz 1000 > SRR001666_1_1000.fastq
seqtk sample -s 10 /ebs/ecoli/SRR001666_2.fastq.gz 1000 > SRR001666_2_1000.fastq
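A quick sanity check after sampling is to confirm that both output files contain the same read names in the same order. The helper functions below are hypothetical and are only a sketch: they assume uncompressed FASTQ with four lines per record, and they ignore an optional trailing /1 or /2 mate suffix on the read name.

```shell
# Print read names from a plain FASTQ file, without the leading "@"
# and without a trailing /1 or /2 mate suffix.
fastq_names() {
  awk 'NR % 4 == 1 { name = substr($1, 2); sub(/\/[12]$/, "", name); print name }' "$1"
}

# Succeed (exit status 0) if both files list the same read names in the same order.
same_reads() {
  [ "$(fastq_names "$1")" = "$(fastq_names "$2")" ]
}

# Example use with the files produced above:
if [ -f SRR001666_1_1000.fastq ] && [ -f SRR001666_2_1000.fastq ]; then
  same_reads SRR001666_1_1000.fastq SRR001666_2_1000.fastq && echo "pairs match"
fi
```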

Trim reads with the modified Mott trimming algorithm:

# trim with default error threshold = 0.05
seqtk trimfq /ebs/ecoli/SRR001666_1.fastq.gz > trimmed.fq

# trim with an error threshold = 0.01
seqtk trimfq -q 0.01 /ebs/ecoli/SRR001666_1.fastq.gz > trimmed.fq

# trim the first 3 bases and the last 5 bases
seqtk trimfq -b 3 -e 5 /ebs/ecoli/SRR001666_1.fastq.gz > trimmed.fq