Showing posts with label sorting. Show all posts
Showing posts with label sorting. Show all posts

Friday, March 15, 2013

how to quick sort a 20G bed file?

We are often suggested to sort the input bed file by "sort -k1,1 -k2,2n" in order to invokes a memory-efficient algorithm designed for large files, for example, bedtools intersect ( http://bedtools.readthedocs.org/en/latest/content/tools/intersect.html)

But this is slow for a large file up to 20G in size.

Here are some quicker solutions for that

Solution #1:
inputbed=$1
awk -v inputbed=$inputbed '{print $0 >> inputbed"."$1}'
 
#or, use "split -l 100000 $inputbed $inputbed." to split large file into small files 
for i in $inputbed.*; do sort -k2,2n $i > $i.s & done
# when it's done
sort -m -k1,1 -k2,2n $inputbed.*s > $inputbed.sorted

The trick is to split large file into small ones (in this case, one per chromosome), then sort individual one in a parallel way. Finally we merge them by "sort -m" (Note: sort-merge is not just to concatenate files, but rather sort of sorting-and-merging)
Solution #2: 
A parallel solution is to use sort from coreultils: sort --parallel=N. See my post here: http://www.biostars.org/p/66927/
LC_ALL=C sort --parallel=24 --buffer-size=2G -k1,1 -k2,2n allmap.bed > allmap.sorted.bed
Solution #3:
Use the sort-bed tool from BEDOPSsort-bed --max-mem 5G input.bed > sorted.bed

Wednesday, March 13, 2013

faster way to get number of reads species

Let's call the unique reads sequences as 'species' in this topic.  So, an easy way to get the number of reads species for a RNAseq (for example) sam file can be

cut -f10 alignment.sam | sort -u | wc -l

But it's not the faster one if the sam file is large. We can improve it by not sorting it (which is unnecessary for this task). Here is it:

cut -f10 alignment.sam | awk '{r[$1]++;}END{for(i in r)j++; print "number of species:", j;}'
or
awk '{r[$10]++;}END{for(i in r)j++; print "number of species:", j;}' alignment.sam

The point is, you don't really need to "sort" the key, but instead to "group" the lines by the key. For example, when using groupBy in bedtools, it requires the input is pre-sorted by the key which is to be grouped. To do that in a quick way (esp. important for large file), we can use array of array for awk or hash of array in perl. Here is the hash of array solution in perl for "sort -k10" of a large file:

perl -e 'while (<>) {$l=$_; @a=split("\t", $l); push(@{$HoA{$a[17]}}, $l);}{foreach $i (sort keys %HoA) {print join("", @{$HoA{$i}});}}'

btw, I found this awk array of array does not work:

awk '{r[$10][length(r[$10])+1]=$0;}END{for(i in r) for (j in r[i]) print r[i][j];}' alignment.sam

Wednesday, August 29, 2012

quick sort in bash line

There are varieties of sorting algorithms (see wiki for details: http://en.wikipedia.org/wiki/Sorting_algorithm). The time complexity for the quick ones are like nlog(n).

In unix, we usually use the 'sort' command to sort input. I cannot find direct info for the computational complexity of unix sort command, but definitely it shows quick slow esp. when the dataset is large.

Here are several feasible solutions for quick sorting I googled:

1. "split & merge" http://arnab.org/blog/quick-and-easy-multicore-sort

split -l5000000 data.tsv '_tmp';
ls -1 _tmp* | while read FILE; do sort $FILE -o $FILE & done;

Followed by:
sort -m _tmp* -o data.tsv.sorted

arnab suggested to split the large file into small pieces and sort them individually, then use 'sort -m' to merge them. I've tried it, works very well, at least the "sort -m" is worthy to try. (Note: -merge is actually sort instead of merge. Here is what 'info sort' says:

`--merge'
Merge the given files by sorting them as a group. Each input file
must always be individually sorted. It always works to sort
instead of merge; merging is provided because it is faster, in the
case where it works.


This is another implementation for quicksort in bash. I've not tested it yet.

3. BSD seems already have lib for other quick sorting algorithms: