Below is what I wrote to the author regarding the problem I just found. Erik suggested to use "-t 0" to fix the problem (see info at the end)
================
I am really frustrated with a problem I am facing while using the fastq-mcf for adaptor removal -- the same read in different files with different number of reads will get different trimming result. WHAT'S THE PROBLEM?
Here is the test example:
adaptor:
$ cat adaptor.fa
>adapter1
CAAGCAGAAGACGGCATACGAGATTCAAGT GTGACTGGAGTTCCTTGGCACCCGAGAATT CCA
>RC_adapter1
TGGAATTCTCGGGTGCCAAGGAACTCCAGT CACACTTGAATCTCGTATGCCGTCTTCTGC TTG
Input file:
$ head test.fa
@HWI-ST560:107:D25EJACXX:3: 1301:11693:29455 1:N:0:GCCTAA
CAGTGCAATGTTAAAAGGGCATTTGGAATT CTCGGGTGCCAAGGAACTCC
+
CCCDFFFFHHHHGJJJJJJJJJJJJJJJJJ JJJJJIHHJJIJJJIJJIIJ
@HWI-ST560:107:D25EJACXX:3: 1101:1420:1954 1:N:0:TTAGGC
ATACAGTCCGACGATCTGGAATTCTCGGGT GCCAAGGAACTCCAGTCACT
+
@@@DDDDDHHG? FDGIIBEHAFGIIHFHIHGIIEHGGGHFHI IGGGGGHI
When I include different number of reads, the result are differnt for the 1st read:
$ head -n10000 test.fa > test1.fq; ~/bin/ea-utils.1.1.2-537/ fastq-mcf -o test1.out -l 16 -q 15 -w 4 -x 20 -u -S -P 33 adaptor.fa test1.fq; head test1.out
@HWI-ST560:107:D25EJACXX:3: 1301:11693:29455 1:N:0:GCCTAA
CAGTGCAATGTTAAAAGGGCATTTGGAATT CTCGGGTGCCAAGGAACTCC
+
CCCDFFFFHHHHGJJJJJJJJJJJJJJJJJ JJJJJIHHJJIJJJIJJIIJ
This is not correct, as the yellow highlight part is from RC_adapter1 obviously.
But if I set a smaller number of sample, it will be correct. See below for "head -n100":
$ head -n100 test.fa > test1.fq; ~/bin/ea-utils.1.1.2-537/ fastq-mcf -o test1.out -l 16 -q 15 -w 4 -x 20 -u -S -P 33 adaptor.fa test1.fq; head test1.out
@HWI-ST560:107:D25EJACXX:3: 1301:11693:29455 1:N:0:GCCTAA
CAGTGCAATGTTAAAAGGGCATT
+
CCCDFFFFHHHHGJJJJJJJJJJ
And if I set "-C 0", it will get correct result as well (even for a larger number of sample):
$ head -n10000 test.fa > test1.fq; ~/bin/ea-utils.1.1.2-537/ fastq-mcf -o test1.out -l 16 -q 15 -w 4 -x 20 -u -S -P 33 -C 0 adaptor.fa test1.fq; head test1.out
@HWI-ST560:107:D25EJACXX:3: 1301:11693:29455 1:N:0:GCCTAA
CAGTGCAATGTTAAAAGGGCATT
+
CCCDFFFFHHHHGJJJJJJJJJJ
This is very confusing! Please help me on it.
=========== reply from Erik========
fastq-mcf will not bother to clip if the adaptor is rare (< .25% presence in sampling). The reason is that the dangers of clipping (removing valid sequence) can outweigh the benefits (removing invalid sequence) as the number of adaptor sequences get fewer and fewer.
=========== reply from Erik========
fastq-mcf will not bother to clip if the adaptor is rare (< .25% presence in sampling). The reason is that the dangers of clipping (removing valid sequence) can outweigh the benefits (removing invalid sequence) as the number of adaptor sequences get fewer and fewer.
To override this, you set -t 0 (always clip everything)
No comments:
Post a Comment