Friday, March 15, 2013

How to quickly sort a 20G BED file?

We are often advised to sort the input BED file with "sort -k1,1 -k2,2n" so that tools such as bedtools intersect can invoke a memory-efficient algorithm designed for large files (http://bedtools.readthedocs.org/en/latest/content/tools/intersect.html).
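For reference, the recommended recipe looks roughly like this (a.bed and b.bed are just placeholder file names):

# sort lexicographically by chromosome, then numerically by start position
sort -k1,1 -k2,2n a.bed > a.sorted.bed
sort -k1,1 -k2,2n b.bed > b.sorted.bed
# -sorted tells bedtools intersect to use its low-memory algorithm for position-sorted input
bedtools intersect -sorted -a a.sorted.bed -b b.sorted.bed > overlaps.bed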

But a plain "sort" is slow for a large file, say one up to 20G in size.

Here are some quicker solutions.

Solution #1:
inputbed=$1
# split the BED file into one file per chromosome (column 1)
awk -v inputbed=$inputbed '{print $0 >> inputbed"."$1}' $inputbed

# or, use "split -l 100000 $inputbed $inputbed." to split the large file into small chunks
# (if you split by line count rather than by chromosome, sort each chunk with the full key -k1,1 -k2,2n)
# sort each per-chromosome file in the background, in parallel
for i in $inputbed.*; do sort -k2,2n $i > $i.s & done
# wait for all background sorts to finish
wait
# merge the already-sorted pieces into the final sorted file
sort -m -k1,1 -k2,2n $inputbed.*.s > $inputbed.sorted

The trick is to split the large file into small ones (in this case, one per chromosome), sort each of them in parallel, and finally merge them with "sort -m". (Note: "sort -m" does not simply concatenate the files; it merges already-sorted inputs into a single sorted output.)
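To see that "sort -m" really merges rather than concatenates, here is a tiny toy example (made-up coordinates):

# two inputs, each already sorted by start position
printf 'chr1\t10\t20\nchr1\t100\t200\n' > a.bed
printf 'chr1\t50\t60\n' > b.bed
# -m interleaves them into one sorted stream: starts 10, 50, 100
sort -m -k1,1 -k2,2n a.bed b.bed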
Solution #2: 
Another parallel option is GNU coreutils sort with --parallel=N. See my post here: http://www.biostars.org/p/66927/
LC_ALL=C sort --parallel=24 --buffer-size=2G -k1,1 -k2,2n allmap.bed > allmap.sorted.bed
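If you don't want to hard-code the thread count, something like this should also work (assuming GNU coreutils; --temporary-directory just points sort's spill files at a fast local disk):

# use all available cores and keep temporary chunks on local storage
LC_ALL=C sort --parallel=$(nproc) --buffer-size=2G --temporary-directory=/tmp \
    -k1,1 -k2,2n allmap.bed > allmap.sorted.bed

Setting LC_ALL=C makes sort compare raw bytes instead of locale-aware characters, which is usually much faster.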
Solution #3:
Use the sort-bed tool from BEDOPS:
sort-bed --max-mem 5G input.bed > sorted.bed
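Whichever solution you use, you can sanity-check the output order with GNU sort's check mode (the file name is just an example):

# exits non-zero and reports the first out-of-order line if the file is not sorted
sort -c -k1,1 -k2,2n sorted.bed && echo "properly sorted"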
