Friday, October 10, 2014

Memory issues with large alignment files when doing de novo assembly

Memory becomes a bottleneck as sequencing files get bigger and bigger nowadays. This is especially an issue for de novo transcriptome assembly using RNA-seq data from species like human. For Trinity, "a typical configuration is a multi-core server with 256 GB to 1 TB of RAM". "Trinity partitions RNA-Seq data into many independent de Bruijn graphs, ideally one graph per expressed gene, and uses parallel computing to reconstruct transcripts from these graphs, including alternatively spliced isoforms." If Trinity constructs one de Bruijn graph per gene, I don't know why it still needs such a large amount of memory. For Cufflinks (-g option), I can already see it consuming 40 GB of memory for 1/10 of chr1 (given a 4 GB BAM file). Cufflinks constructs a DAG in memory for the given alignment. Hopefully it is one DAG per chromosome, not a DAG for the whole genome. But even so, chr1 has about 249 million base pairs, so it still requires a lot of memory... So sad!

Of course, if you have a machine with a very large amount of memory, this won't be a problem, but that is usually not the case. For me, most of our nodes have at most 90 GB of memory.

So, how can we get it to run?

Here are a few tips I can think of:

1. Split the alignment by chromosome, e.g. chr1.bam, chr2.bam, etc., and then run the de novo assembly on each file separately (see the example commands after this list).

2. Down-sample the BAM file if it has high coverage. Here is a post about how to down-sample a BAM file (https://www.biostars.org/p/4332/). Basically, you can use "samtools view -s" or "sambamba view -s", or GATK's randomSampleFromStream.pl for downsampling (a sketch is given after this list).

3. You may also want to remove PCR duplicates, with "samtools rmdup" or "sambamba markdup" (example below).
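
For tip 1, here is a minimal sketch using samtools; aligned.bam is just a placeholder name, and your chromosome names may differ depending on the reference (e.g. "1" instead of "chr1"):

    # index the BAM so that region queries work
    samtools index aligned.bam
    # extract the reads mapped to chr1 into their own BAM
    samtools view -b aligned.bam chr1 > chr1.bam
    # or loop over every reference sequence listed in the BAM header
    for c in $(samtools view -H aligned.bam | awk '/^@SQ/ {sub("SN:","",$2); print $2}'); do
        samtools view -b aligned.bam "$c" > "$c".bam
    done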
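
For tip 2, a possible samtools command for subsampling (again with placeholder file names), assuming you want to keep roughly 10% of the reads:

    # -s takes seed.fraction: here the seed is 42 and ~10% of the read pairs are kept
    samtools view -b -s 42.1 aligned.bam > subsampled.bam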
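
For tip 3, something along these lines should work; if I remember correctly, the -r flag of sambamba markdup removes the duplicates instead of only marking them:

    # remove PCR duplicates with samtools
    samtools rmdup aligned.bam dedup.bam
    # or mark and remove duplicates with sambamba
    sambamba markdup -r aligned.bam dedup.bam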
