Wednesday, October 08, 2014

Cufflinks mask option (-M/--mask-file) works when ...

Obviously I am not the only one who had questions on the "-M/--mask-file" mask GTF option in Cufflinks:

http://seqanswers.com/forums/showthread.php?t=8190
https://www.biostars.org/p/110289/
http://seqanswers.com/forums/showthread.php?t=29975

Too bad that no one from the Tuxedo group ever offered a clue!

Here are a few tips I found necessary to share in order to make it work:

1. The mask GTF file should have all 9 fields in the required format. For example, the strand column should be '+', '-', or '.', and nothing else. A GTF/GFF file can be extracted from GENCODE (http://www.gencodegenes.org/) or downloaded from the UCSC Table Browser. It can also be converted from a BED file using Kent's bedToGenePred --> genePredToGtf. But be aware that the BED file should have at least 6 columns (i.e. including the strand column); otherwise the converted GTF file will have a "^@" in the strand column, which makes the GTF invalid.

For example, if you want to exclude all reads mapped to the human mitochondrial genome, you can use:
echo "chrM 0 16571 mt 0 ." | bedToGenePred stdin stdout | genePredToGtf file stdin chrM.gtf

2. The "-M" option also works for assembly of novel transcripts guided by a reference annotation (cufflinks -g).

3. Using the "-M" option should theoretically increase the FPKM values (compared with no mask), presumably because the masked reads no longer count toward the total used for normalization. So if you observe the opposite trend, something must be wrong.

4. If you expect a lot of reads from the masked regions (e.g. chrM, rRNAs), you can subtract the masked reads from your BAM file before feeding it to cufflinks, for example using "samtools view -L retained_region.bed".
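
For tip 1, a quick sanity check can catch a malformed mask file before Cufflinks sees it. The sketch below is only an illustration: the function name check_mask_gtf is made up here, and it tests just the two things mentioned above (9 tab-separated fields, and a strand of '+', '-', or '.'), not full GTF validity.

```shell
# Hypothetical helper (not part of Cufflinks or the Kent tools):
# report GTF lines that do not have 9 tab-separated fields or whose
# strand (column 7) is not '+', '-', or '.'. Exits non-zero on any problem.
check_mask_gtf() {
    awk -F'\t' '
        /^#/ || NF == 0 { next }    # skip comment and empty lines
        NF != 9 {
            printf "line %d: expected 9 fields, found %d\n", NR, NF
            bad = 1
            next
        }
        $7 != "+" && $7 != "-" && $7 != "." {
            printf "line %d: invalid strand \"%s\"\n", NR, $7
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}
```

Running "check_mask_gtf chrM.gtf" prints nothing and returns 0 when the file passes both checks.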
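
For tip 3, a bit of arithmetic shows the expected direction of the change. This sketch uses the usual definition FPKM = 1e9 * fragments / (total mapped fragments * transcript length in bp) and assumes that masked fragments are simply dropped from the total. The fpkm helper and all the numbers are made up for illustration.

```shell
# Hypothetical helper: FPKM = 1e9 * fragments / (total_mapped * length_bp)
# Arguments: fragments on the transcript, total mapped fragments, length in bp.
fpkm() {
    awk -v c="$1" -v n="$2" -v l="$3" \
        'BEGIN { printf "%.3f\n", 1e9 * c / (n * l) }'
}

# Made-up example: 500 fragments on a 2000 bp transcript.
fpkm 500 20000000 2000   # no mask: all 20M fragments in the denominator -> 12.500
fpkm 500 16000000 2000   # mask removes 4M chrM/rRNA fragments           -> 15.625
```

The transcript's own fragment count is unchanged, so a smaller denominator can only push the value up.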
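
For tip 4, the retained_region.bed file can be built from the sequence names and lengths in the BAM itself. A minimal sketch, assuming an indexed BAM and that the only thing to drop is chrM; the function names and the in.bam / in.noMT.bam file names are placeholders.

```shell
# Hypothetical helper: read "name<TAB>length" lines on stdin and write one BED
# interval per whole sequence, skipping the sequence named in $1.
# Lines with length 0 (e.g. the "*" line from samtools idxstats) are ignored.
make_retained_bed() {
    awk -F'\t' -v OFS='\t' -v skip="$1" \
        '$1 != skip && $2 > 0 { print $1, 0, $2 }'
}

# Hypothetical wrapper: keep only alignments overlapping the retained regions.
# Arguments: BED file, input BAM, output BAM.
filter_bam() {
    samtools view -b -L "$1" "$2" > "$3"
}

# Example (requires samtools and an indexed in.bam):
#   samtools idxstats in.bam | cut -f1,2 | make_retained_bed chrM > retained_region.bed
#   filter_bam retained_region.bed in.bam in.noMT.bam
```

The filtered BAM can then be given to cufflinks in place of the original one.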
