Wednesday, March 14, 2012

adaptor remover/clipper with mismatch allowed


$ head -n12 RNAseq/mouse_adult_wt_smallRNAseq_76nt_strand_ox.fastq 
@HWUSI-EAS1533_0026_FC:6:1:1204:14021#0/1
TTCCTATCAATAAATCTCATGTGANNGCAGCTCGTANNCCNTCTNCNNNTNNNNANNNNANANCNNNNNANNNNNN
+HWUSI-EAS1533_0026_FC:6:1:1204:14021#0/1
gggaegggggfdggfcaedbBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@HWUSI-EAS1533_0026_FC:6:1:1204:19652#0/1
TGGTATAAACTCTTCTAAAAGACTNNAGCTCGTATGNNGCNTTCNGNNNGNNNNANNNNANANANNNNNANNNNNN
+HWUSI-EAS1533_0026_FC:6:1:1204:19652#0/1
c^^[^a`ddYaddadf_YafSOTZDDYZK\M`W_`BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@HWUSI-EAS1533_0026_FC:6:1:1205:19274#0/1
TCAGAAATGTCCTGTGAAACTCGCNNAATTTCGTATGCCGCCTTCTGNNTGNAAANNNAAAACAANNNNANNNNNN
+HWUSI-EAS1533_0026_FC:6:1:1205:19274#0/1
fV^V]b]`ZSTXdRbc]dac`BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB


The basic question is to trim/clip the part after the adaptor sequence (e.g. TCGTATGCCGTCTTCTGCTTG in above example) in the FASTQ reads. It's easy if there is no mismatch. For case with mismatch, here is few solutions I got from google:
  1. FAR (http://sourceforge.net/apps/mediawiki/theflexibleadap/index.php?title=Main_Page), which uses Needleman-Wunsch alignment (global alignment) to align the adaptor sequence(s) to the read sequences and find the first best alignment.

    far -s stdin.fastq -t stdout.fq -f fastq-sanger -as TCGTATGCCGTCTTCTGCTTG --cut-off 5 --min-overlap 15 --phred-pre-trim 20 --min-readlength 20 --trim-end right --adaptive-overlap yes --max-uncalled 30 --nr-threads 8 --log-level ALL
  2.  Scythe only checks for 3'-end contaminants, up to the adapter's length into the 3'-end. For reads with contamination in any position, the program TagDust (http://genome.gsc.riken.jp/osc/english/dataresource/) is recommended. Scythe has the advantages of allowing fuzzier matching and being base quality-aware, while TagDust has the advantages of very fast matching (but allowing few mismatches, and not considering quality) and FDR. TagDust also removes contaminated reads entirely, while Scythe trims off contaminants.

  3. A possible pipeline would run FASTQ reads through Scythe, then TagDust, then a quality-based trimmer, and finally through a read quality statistics program such as qrqc (http://bioconductor.org/packages/devel/bioc/html/qrqc.html) or FASTqc (http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/).
  4. trimLRPatterns() function in BioConductor. Here is an example.
  5. Another very useful toolkit is fastx_toolkit, which includes a set of tools for handling fasta and fastq format. fastx_clipper  is for clipping adaptor, but from what I checked it does not allow mismatch (e.g. N is counted as 'match' in fastx_clipper, but in FAR it's mismatch).  Running the following code on above reads only clips the last read. The first one should also be clipped. Don't know why. (Just get reply from Gordon that fastx_clipper does allow mismatch and also count 'N' as mismatch. Just the first two sequences have high error rate (=number_of_mismatch/adaptor_alignment_length, e.g. 33% in this case), so it's not counted as an adaptor. 
$ head -n12 RNAseq/mouse_adult_wt_smallRNAseq_76nt_strand_ox.fastq | fastx_clipper -a TCGTATGCCGTCTTCTGCTTG -c -n -i - 
@HWUSI-EAS1533_0026_FC:6:1:1205:19274#0/1
TCAGAAATGTCCTGTGAAACTCGCNNAATT
+HWUSI-EAS1533_0026_FC:6:1:1205:19274#0/1
fV^V]b]`ZSTXdRbc]dac`BBBBBBBBB

1 comment:

  1. Kevin1:29 PM

    Is there a threshold for fastx clipper (ie. the highest % mismatch (number of mismatch/adaptor_alignment_length) allowable for adaptor clipping to ensue?)

    ReplyDelete