Wednesday, March 13, 2013

faster way to get number of reads species

Let's call the unique reads sequences as 'species' in this topic.  So, an easy way to get the number of reads species for a RNAseq (for example) sam file can be

cut -f10 alignment.sam | sort -u | wc -l

But it's not the faster one if the sam file is large. We can improve it by not sorting it (which is unnecessary for this task). Here is it:

cut -f10 alignment.sam | awk '{r[$1]++;}END{for(i in r)j++; print "number of species:", j;}'
or
awk '{r[$10]++;}END{for(i in r)j++; print "number of species:", j;}' alignment.sam

The point is, you don't really need to "sort" the key, but instead to "group" the lines by the key. For example, when using groupBy in bedtools, it requires the input is pre-sorted by the key which is to be grouped. To do that in a quick way (esp. important for large file), we can use array of array for awk or hash of array in perl. Here is the hash of array solution in perl for "sort -k10" of a large file:

perl -e 'while (<>) {$l=$_; @a=split("\t", $l); push(@{$HoA{$a[17]}}, $l);}{foreach $i (sort keys %HoA) {print join("", @{$HoA{$i}});}}'

btw, I found this awk array of array does not work:

awk '{r[$10][length(r[$10])+1]=$0;}END{for(i in r) for (j in r[i]) print r[i][j];}' alignment.sam

No comments:

Post a Comment