================================================================
Using seqtk to trim and process reads at an insanely high speed
================================================================

seqtk was developed by Heng Li and is available from his GitHub page:
https://github.com/lh3/seqtk

Installation
------------

To obtain the code you need `git` installed. Clone the `seqtk` repository::

    # download from GitHub
    git clone https://github.com/lh3/seqtk.git

Switch to the seqtk directory, build it, and copy the binary onto your path::

    cd seqtk
    make
    cp seqtk /usr/local/bin

We will use the E. coli data in snapshot snap-000d346e in this tutorial.
Create a volume from the snapshot, attach it to the instance, and mount it::

    mkdir /ebs/
    mount /dev/xvdf /ebs

Basic Usage
-----------

Almost all tools accept FASTA or FASTQ input, gzipped or not; compressed
input is decompressed on the fly::

    # extract a random sample
    seqtk sample

    # apply a seed to extract the same reads from two paired-end files
    seqtk sample -s 10

    # trim reads with the modified Mott trimming algorithm
    seqtk trimfq

    ...

The algorithm is described on this page:
http://www.phrap.org/phredphrap/phred.html. Scroll down to the Algorithm
section for details.

Beyond this usage there are other interesting features. For example, seqtk
can extract subsequences from a file, say a certain part of your reference
genome.
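
To make the trimming idea concrete, here is a minimal sketch of modified
Mott trimming as described on the phred page, written as a shell function
around awk. The helper name ``mott_trim`` is hypothetical and not part of
seqtk. It assumes Phred+33 quality encoding, and seqtk's own implementation
may differ in details such as minimum read length handling.

```shell
# Sketch of modified Mott trimming on a single Phred+33 quality string.
# Usage: mott_trim QUAL [CUTOFF]
# Prints "BEGIN END": the 0-based, end-exclusive interval of bases to keep,
# or "0 0" when no stretch of the read scores above zero.
# (Hypothetical helper for illustration; not part of seqtk.)
mott_trim() {
  printf '%s\n' "$1" | awk -v cutoff="${2:-0.05}" '
    BEGIN {
      # awk has no ord(), so build a character-to-ASCII-code lookup table
      for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i
    }
    {
      n = length($0); sum = 0; best = 0; start = 0; b = 0; e = 0
      for (i = 1; i <= n; i++) {
        q = ord[substr($0, i, 1)] - 33       # Phred quality score
        sum += cutoff - 10 ^ (-q / 10)       # cutoff minus error probability
        if (sum > best) { best = sum; b = start; e = i }  # best segment so far
        if (sum < 0)    { sum = 0; start = i }            # restart after this base
      }
      print b, e
    }'
}
```

For example, ``mott_trim 'IIII####'`` reports that the first four
high-quality bases are kept, while the low-quality tail is dropped.
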

Tools
-----

Using your Amazon EC2 instance::

    cd /mnt

To see a list of tools::

    seqtk

Extract a random sample::

    # sample 1000 reads from a fastq file
    seqtk sample /ebs/ecoli/SRR001666_1.fastq.gz 1000 > SRR001666_1_1000.fastq

Convert fastq to fasta::

    seqtk seq -A /ebs/ecoli/SRR001666_1.fastq.gz > sample.fa

Apply a seed to extract the same reads from two paired-end files::

    seqtk sample -s 10 /ebs/ecoli/SRR001666_1.fastq.gz 1000 > SRR001666_1_1000.fastq
    seqtk sample -s 10 /ebs/ecoli/SRR001666_2.fastq.gz 1000 > SRR001666_2_1000.fastq

Trim reads with the modified Mott trimming algorithm::

    # trim with the default error threshold = 0.05
    seqtk trimfq /ebs/ecoli/SRR001666_1.fastq.gz > trimmed.fq

    # trim with an error threshold = 0.01
    seqtk trimfq -q 0.01 /ebs/ecoli/SRR001666_1.fastq.gz > trimmed.fq

    # trim the first 3 bases and the last 5 bases
    seqtk trimfq -b 3 -e 5 /ebs/ecoli/SRR001666_1.fastq.gz > trimmed.fq
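
After sampling paired-end files with the same seed, it can be worth checking
that both outputs really contain the same reads in the same order. The
sketch below is one possible way to do that. The helper names ``read_ids``
and ``same_read_ids`` are hypothetical, and the code assumes uncompressed
FASTQ with exactly four lines per record and read names that differ at most
by a trailing ``/1`` or ``/2``.

```shell
# read_ids FILE: print the read name of every 4-line FASTQ record,
# without the leading "@" and without a trailing /1 or /2 mate suffix.
# (Hypothetical helper for illustration; not part of seqtk.)
read_ids() {
  awk 'NR % 4 == 1 { id = substr($1, 2); sub(/\/[12]$/, "", id); print id }' "$1"
}

# same_read_ids FILE1 FILE2: exit status 0 only if both files list
# the same read names in the same order.
same_read_ids() {
  [ "$(read_ids "$1")" = "$(read_ids "$2")" ]
}
```

With the two sampled files from above, this could be run as
``same_read_ids SRR001666_1_1000.fastq SRR001666_2_1000.fastq && echo "pairs in sync"``.
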