But this is slow for large files, up to 20 GB in size.
Here are some quicker solutions for that.
Solution #1:
inputbed=$1
# split the large file into one small file per chromosome (field 1)
awk -v inputbed=$inputbed '{print $0 >> inputbed"."$1}' $inputbed
# or, use "split -l 100000 $inputbed $inputbed." to split the large file into small files
# sort each small file in the background, in parallel
for i in $inputbed.*; do sort -k2,2n $i > $i.s & done
# wait blocks until all the background sorts are done
wait
sort -m -k1,1 -k2,2n $inputbed.*s > $inputbed.sorted
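The splitting step above may be easier to see on a toy input. This is a sketch for illustration only; the file name toy.bed is made up, not from the original script:

```shell
# toy demonstration of the per-chromosome split step
printf 'chr1\t5\nchr2\t1\nchr1\t2\n' > toy.bed
# each line is appended to a file named after its first field
awk -v inputbed=toy.bed '{print $0 >> inputbed"."$1}' toy.bed
wc -l < toy.bed.chr1   # 2 -- both chr1 lines, in original order
wc -l < toy.bed.chr2   # 1
```

Because awk appends with ">>", rows for the same chromosome stay in their original order within each small file.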
The trick is to split the large file into small ones (in this case, one per chromosome), then sort the individual files in parallel. Finally, we merge them with "sort -m" (note: sort's merge mode does not simply concatenate the files; it interleaves the already-sorted inputs into one sorted output).
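The difference between merging and concatenating can be shown on two tiny pre-sorted files (the file names a.sorted and b.sorted are made up for this illustration):

```shell
# "sort -m" interleaves pre-sorted inputs into one sorted stream
printf '1\n3\n5\n' > a.sorted
printf '2\n4\n6\n' > b.sorted
sort -m -n a.sorted b.sorted   # 1 2 3 4 5 6, one per line
cat a.sorted b.sorted          # 1 3 5 2 4 6 -- plain concatenation
```

Since each input is already sorted, the merge is a single linear pass, which is why it is so much cheaper than re-sorting the concatenated data.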
Solution #2:
A parallel solution is to use sort from coreutils: sort --parallel=N. See my post here: http://www.biostars.org/p/66927/

LC_ALL=C sort --parallel=24 --buffer-size=2G -k1,1 -k2,2n allmap.bed > allmap.sorted.bed
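The -k1,1 -k2,2n keys used throughout this post sort by chromosome name lexicographically, then by start coordinate numerically. A toy fragment (invented for illustration) makes the ordering concrete:

```shell
# sort by field 1 (text), then field 2 (numeric)
printf 'chr2\t10\nchr1\t200\nchr1\t30\n' | LC_ALL=C sort -k1,1 -k2,2n
# chr1  30
# chr1  200
# chr2  10
```

LC_ALL=C makes sort compare raw bytes instead of applying locale collation rules, which is both faster and reproducible across machines. Note that lexicographic chromosome order puts chr10 before chr2; many downstream tools expect exactly this order, but check what yours wants.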
Solution #3:
Use the sort-bed tool from BEDOPS:

sort-bed --max-mem 5G input.bed > sorted.bed