cut -f10 alignment.sam | sort -u | wc -l
But it's not the faster one if the sam file is large. We can improve it by not sorting it (which is unnecessary for this task). Here is it:
cut -f10 alignment.sam | awk '{r[$1]++;}END{for(i in r)j++; print "number of species:", j;}'
or
awk '{r[$10]++;}END{for(i in r)j++; print "number of species:", j;}' alignment.sam
The point is, you don't really need to "sort" the key, but instead to "group" the lines by the key. For example, when using groupBy in bedtools, it requires the input is pre-sorted by the key which is to be grouped. To do that in a quick way (esp. important for large file), we can use array of array for awk or hash of array in perl. Here is the hash of array solution in perl for "sort -k10" of a large file:
perl -e 'while (<>) {$l=$_; @a=split("\t", $l); push(@{$HoA{$a[17]}}, $l);}{foreach $i (sort keys %HoA) {print join("", @{$HoA{$i}});}}'
btw, I found this awk array of array does not work:
awk '{r[$10][length(r[$10])+1]=$0;}END{for(i in r) for (j in r[i]) print r[i][j];}' alignment.sam
The point is, you don't really need to "sort" the key, but instead to "group" the lines by the key. For example, when using groupBy in bedtools, it requires the input is pre-sorted by the key which is to be grouped. To do that in a quick way (esp. important for large file), we can use array of array for awk or hash of array in perl. Here is the hash of array solution in perl for "sort -k10" of a large file:
perl -e 'while (<>) {$l=$_; @a=split("\t", $l); push(@{$HoA{$a[17]}}, $l);}{foreach $i (sort keys %HoA) {print join("", @{$HoA{$i}});}}'
btw, I found this awk array of array does not work:
awk '{r[$10][length(r[$10])+1]=$0;}END{for(i in r) for (j in r[i]) print r[i][j];}' alignment.sam
No comments:
Post a Comment