Thursday, March 07, 2013

how RNAseq quantification tools deal with multi-mapping reads?

Just to give an overview of how the commonly used RNAseq tools deal with multiple-mapping reads.

1. Cufflinks (http://cufflinks.cbcb.umd.edu/howitworks.html#hmul)
By default, Cufflinks will uniformly divide each multi-mapped read to all of the positions it maps to. In other words, a read mapping to 10 positions will count as 10% of a read at each position. If multi-mapped read correction is enabled (-u/--multi-read-correct), Cufflinks will improve its estimation in a 'rescue' manner. That is, Cufflinks will first calculate initial abundance estimates for all transcripts using the uniform dividing scheme. Cufflinks will then re-estimate the abundances dividing each multi-mapped read probabilistically based on the initial abundance estimation of the genes it maps to, the inferred fragment length, and fragment bias (if bias correction is enabled).

2. HTSeq-count (http://www-huber.embl.de/users/anders/HTSeq/doc/count.html)
HTSeq-count can be quite conservative in the sense that it discards reads that cannot be unambiguously be assigned to a gene. This can cause values to be lower than reported by other tools. (This is done on purpose: For differential expression analysis, you need to discount ambiguous reads to avoid that differential signal on one gene shows up in another one that overlaps.) (http://seqanswers.com/forums/showthread.php?t=9129)  Note that, HTSeq-count also reports alignment_not_unique: reads with more than one reported alignment. These reads are recognized from the NH optional SAM field tag. (If the aligner does not set this field, multiply aligned reads will be counted multiple times.)

3. RSEM (http://www.biomedcentral.com/1471-2105/12/323)
It's like the above Cufflinks rescue model, but using EM method to estimate. It also provides a way to generate normalized or weighted bigwig.

4. bedtools multicov
bedtools gives you raw hits which includes every 1 hit for every intersection of every alignment with any features in the GTF no matter how many times it aligned or how many features it hit. So, you have to deal with the apportion yourself, either uniformly or probabilistically.

4 comments:

  1. But upstream, during the alignment process, by default, bowtie assigns randomly to one location reads that can be mapped to multiple locations, right ?

    ReplyDelete
    Replies
    1. The setting you said is for bowtie -M option (http://bowtie-bio.sourceforge.net/manual.shtml#bowtie-options-M). The default setting is -k 1, I think. See here: http://bowtie-bio.sourceforge.net/manual.shtml#example-4-default--k-1

      Delete
  2. Chirag7:47 PM

    I guess, bowtie allows a single read to map 10/20 times, by default, unless you specify single mapping. When you specify single mapping, it will map randomly. Normally, do people often use multi mapping (N=??) reads in RNA-seq.

    ReplyDelete
    Replies
    1. Tophat by default allows up to 20 multihits (see: --max-multihits option in tophat manual), not bowtie.

      // Normally, do people often use multi mapping (N=??) reads in RNA-seq.
      I think that depends on your research purpose. For long RNAseq, maybe unique mapping is fine enough (also multi-mapping is not a big issue there since the reads are long). For smallRNA, from my experience you definitely want to consider the multi-mapping.

      Delete