Monday, May 27, 2013

Unstable output from fastq-mcf (adaptor removal)

Below is what I wrote to the author regarding the problem I just found. Erik suggested to use "-t 0" to fix the problem (see info at the end)

================

I am really frustrated with a problem I am facing while using the fastq-mcf for adaptor removal -- the same read in different files with different number of reads will get different trimming result. WHAT'S THE PROBLEM?

Here is the test example:

adaptor:

$ cat adaptor.fa 
>adapter1
CAAGCAGAAGACGGCATACGAGATTCAAGTGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
>RC_adapter1
TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACACTTGAATCTCGTATGCCGTCTTCTGCTTG

Input file:
$ head test.fa
@HWI-ST560:107:D25EJACXX:3:1301:11693:29455 1:N:0:GCCTAA
CAGTGCAATGTTAAAAGGGCATTTGGAATTCTCGGGTGCCAAGGAACTCC
+
CCCDFFFFHHHHGJJJJJJJJJJJJJJJJJJJJJJIHHJJIJJJIJJIIJ
@HWI-ST560:107:D25EJACXX:3:1101:1420:1954 1:N:0:TTAGGC
ATACAGTCCGACGATCTGGAATTCTCGGGTGCCAAGGAACTCCAGTCACT
+
@@@DDDDDHHG?FDGIIBEHAFGIIHFHIHGIIEHGGGHFHIIGGGGGHI

When I include different number of reads, the result are differnt for the 1st read:

$ head -n10000 test.fa > test1.fq; ~/bin/ea-utils.1.1.2-537/fastq-mcf -o test1.out -l 16 -q 15 -w 4 -x 20 -u -S -P 33 adaptor.fa test1.fq; head test1.out

@HWI-ST560:107:D25EJACXX:3:1301:11693:29455 1:N:0:GCCTAA
CAGTGCAATGTTAAAAGGGCATTTGGAATTCTCGGGTGCCAAGGAACTCC
+
CCCDFFFFHHHHGJJJJJJJJJJJJJJJJJJJJJJIHHJJIJJJIJJIIJ

This is not correct, as the yellow highlight part is from RC_adapter1 obviously. 

But if I set a smaller number of sample, it will be correct. See below for "head -n100":

$ head -n100 test.fa > test1.fq; ~/bin/ea-utils.1.1.2-537/fastq-mcf -o test1.out -l 16 -q 15 -w 4 -x 20 -u -S -P 33 adaptor.fa test1.fq; head test1.out

@HWI-ST560:107:D25EJACXX:3:1301:11693:29455 1:N:0:GCCTAA
CAGTGCAATGTTAAAAGGGCATT
+
CCCDFFFFHHHHGJJJJJJJJJJ

And if I set "-C 0", it will get correct result as well (even for a larger number of sample):

$ head -n10000 test.fa > test1.fq; ~/bin/ea-utils.1.1.2-537/fastq-mcf -o test1.out -l 16 -q 15 -w 4 -x 20 -u -S -P 33 -C 0 adaptor.fa test1.fq; head test1.out

@HWI-ST560:107:D25EJACXX:3:1301:11693:29455 1:N:0:GCCTAA
CAGTGCAATGTTAAAAGGGCATT
+
CCCDFFFFHHHHGJJJJJJJJJJ

This is very confusing! Please help me on it.

=========== reply from Erik========
fastq-mcf will not bother to clip if the adaptor is rare (< .25% presence in sampling).   The reason is that the dangers of clipping (removing valid sequence) can outweigh the benefits (removing invalid sequence) as the number of adaptor sequences get fewer and fewer.

To override this, you set -t 0 (always clip everything)

No comments:

Post a Comment