Here is the code snip:
# for median
# Note: thanks to anonymous reply below, put c, j in arguments to define that they are local variables.
function median(v, c, j)
{
c=asort(v,j);
if (c % 2) return j[(c+1)/2];
else return (j[c/2+1]+j[c/2])/2;
}
# for trimmed mean (where p is the percentage of data to be trimmed)
function trimmedMean(v, p, c, j)
function trimmedMean(v, p, c, j)
{
c=asort(v,j);
a=int(c*p);
for(i=a+1;i<=(c-a);i++) s+=j[i];
return s/(c-2*a);
}
To use it, for example if we want to generate a merged bigwig track for a list of samples, we can take median value of all samples at each genomic position, here is it:
unionBedGraphs -i `ls *normalized.bedGraph` | awk 'function median(v) {c=asort(v,j); if (c % 2) return j[(c+1)/2]; else return (j[c/2+1]+j[c/2])/2.0; } {OFS="\t"; n=1; for(i=4;i<=NF;i++) S[n++]=$i; print $1,$2,$3, median(S)}' > AllSamples.median.normalized.bedGraph
bedGraphToBigWig AllSamples.median.normalized.bedGraph ChromInfo.txt AllSamples.median.normalized.bw
bedGraphToBigWig AllSamples.median.normalized.bedGraph ChromInfo.txt AllSamples.median.normalized.bw
Nice tip. Note that variables in awk are global by default, so this could have some unintended side-effects.
ReplyDeleteIf you want to avoid this, you should use
function median(v ,c,j)
rather than
function median(v)
Similarly for trimmedMean
Thanks for letting me know that. It's fixed.
DeleteFor my own reference, here is the relevent detail: https://www.gnu.org/software/gawk/manual/html_node/Variable-Scope.html