Tuesday, April 15, 2014

Trimmed mean and median in AWK

This can be easily done in R, but sometime you want to get it in scripting language like awk or perl in order to process the big data line by line.

Here is the code snip:

# for median
# Note: thanks to anonymous reply below, put c, j in arguments to define that they are local variables.
function median(v, c, j
    c=asort(v,j); 
    if (c % 2) return j[(c+1)/2]; 
    else return (j[c/2+1]+j[c/2])/2; 
}

# for trimmed mean (where p is the percentage of data to be trimmed)
function trimmedMean(v, p, c, j
    c=asort(v,j); 
    a=int(c*p);
    for(i=a+1;i<=(c-a);i++) s+=j[i];
    return s/(c-2*a); 
}
To use it, for example if we want to generate a merged bigwig track for a list of samples, we can take median value of all samples at each genomic position, here is it:

unionBedGraphs -i `ls *normalized.bedGraph` | awk 'function median(v) {c=asort(v,j); if (c % 2) return j[(c+1)/2]; else return (j[c/2+1]+j[c/2])/2.0; } {OFS="\t"; n=1; for(i=4;i<=NF;i++) S[n++]=$i; print $1,$2,$3, median(S)}' > AllSamples.median.normalized.bedGraph
bedGraphToBigWig AllSamples.median.normalized.bedGraph ChromInfo.txt AllSamples.median.normalized.bw

2 comments:

  1. Anonymous12:18 AM

    Nice tip. Note that variables in awk are global by default, so this could have some unintended side-effects.

    If you want to avoid this, you should use

    function median(v ,c,j)

    rather than

    function median(v)

    Similarly for trimmedMean

    ReplyDelete
    Replies
    1. Thanks for letting me know that. It's fixed.

      For my own reference, here is the relevent detail: https://www.gnu.org/software/gawk/manual/html_node/Variable-Scope.html

      Delete