Thursday, July 25, 2013

How to use "textHistogram -aveCol" and textHist2 in the UCSC executables

Jim Kent's executable tools are generally quite friendly to use, but the help text for some options of some tools is a bit too terse. For example, the "-aveCol" option in textHistogram says

-aveCol=N - A second column to average over. The averages
will be output in place of counts of primary column.


Also, for textHist2, it says
textHist2 - Make two dimensional histogram table out
of a list of 2-D points, one per line.
usage:
   textHist2 input
options:
   -xBins=N - number of bins in x dimension
   -yBins=N - number of bins in y dimension
   -xBinSize=N - size of bins in x dimension
   -yBinSize=N - size of bins in x dimension
   -xMin=N - minimum x number to record
   -yMin=N - minimum y number to record
   -ps=output.ps - make PostScript output
   -psSize=N - Size in points (1/72th of inch)
   -labelStep=N - How many bins to skip between labels
   -margin=N - Margin in points for PostScript output
   -log    - Logarithmic output (only works with ps now)
   -postScale=N (default 1.000000) - What to scale by after normalization

What does this mean? For example, I have a userFile with 2 columns like this:
1 8
2 9
3 6
5 3

Here is the reply from Brooke Rhead of the UCSC Genome Bioinformatics Group:
The -aveCol option prints the average values of the items in each bin instead of the number of items in each bin. (The default bin size is 1.) So, if you run textHistogram on your example input without the -aveCol option and the default bin size, you get the number of items in each bin, which is 1 item for all bins except #4:

$ textHistogram userFile
1 ************************************************************ 1
2 ************************************************************ 1
3 ************************************************************ 1
4 0
5 ************************************************************ 1


If you instead use -aveCol=2, you get the average of the *value* in column 2. Since bin size is still equal to 1, this amounts to printing the value in column 2 in each bin:

$ textHistogram -aveCol=2 userFile
1 ***************************************************** 8.000000
2 ************************************************************ 9.000000
3 **************************************** 6.000000
4 0.000000
5 ******************** 3.000000


If you specify a larger bin size, the option might seem more useful:

$ textHistogram -binSize=2 -aveCol=2 userFile
0 ************************************************************ 8.000000
2 ******************************************************** 7.500000
4 *********************** 3.000000


The bins are taken from the first column: 0-1, 2-3, and 4-5, and the values are from the second column. So, for instance, the bin labeled "2" above contains the average of the values from this part of your file:
2 9
3 6
. . . bin "2" contains the average of 9 and 6, or 7.5.
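
The arithmetic behind these outputs can be checked with a short Python sketch. This is not the textHistogram source; it is a minimal re-implementation that assumes non-negative integer values in column 1, assigns each row to the bin labeled floor(x / binSize) * binSize, and lists empty bins between the lowest and highest used bin with an average of 0. The function name average_by_bin is made up for this illustration.

```python
from collections import defaultdict

def average_by_bin(rows, bin_size=1):
    """Average the second column within bins of the first column.

    Assumption: a bin is labeled by its lower edge, floor(x / bin_size) * bin_size.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, value in rows:
        bin_start = (x // bin_size) * bin_size  # label of the bin holding x
        sums[bin_start] += value
        counts[bin_start] += 1
    first, last = min(sums), max(sums)
    # Walk every bin from the lowest to the highest one used;
    # a bin with no rows is reported as 0.0, as in the output above.
    return {
        b: (sums[b] / counts[b] if counts[b] else 0.0)
        for b in range(first, last + bin_size, bin_size)
    }

rows = [(1, 8), (2, 9), (3, 6), (5, 3)]  # the contents of userFile

print(average_by_bin(rows, bin_size=2))  # {0: 8.0, 2: 7.5, 4: 3.0}
print(average_by_bin(rows, bin_size=1))  # {1: 8.0, 2: 9.0, 3: 6.0, 4: 0.0, 5: 3.0}
```

With a bin size of 1 every bin holds at most one row, so the "average" is just the column 2 value; with a bin size of 2 the rows "2 9" and "3 6" share a bin and average to 7.5.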


The textHist2 program makes more sense if you specify the number of bins to use. For example, if we run it with -xBins=6 and -yBins=10 on your same input file:

1 8
2 9
3 6
5 3

we get:

$ textHist2 -xBins=6 -yBins=10 userFile
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 1
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 1 0 0
0 0 0 0 0 0
0 1 0 0 0 0
0 0 1 0 0 0


The first column in the input file specifies the x-axis in the two dimensional histogram, and the second column specifies the y-axis. The numbering of the rows and columns starts at 0 in both cases. The first line of your input:

1 8

causes a 1 to appear in the second column from the left and the ninth row from the top, or column 1 and row 8, if you start counting from zero. The "2 9" line causes a 1 to appear in column 2, row 9, and so on.
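
The same table can be rebuilt with a few lines of Python, which may make the indexing easier to see. This is only a sketch, not the textHist2 source: it assumes a bin size of 1 in both dimensions, a minimum of 0, and that points falling outside the grid are simply ignored. The function name histogram_2d is hypothetical.

```python
def histogram_2d(points, x_bins, y_bins, x_bin_size=1, y_bin_size=1):
    """Count (x, y) points into a grid with y_bins rows and x_bins columns.

    grid[row][col] holds the count for y-bin `row` and x-bin `col`.
    """
    grid = [[0] * x_bins for _ in range(y_bins)]
    for x, y in points:
        col = x // x_bin_size  # column index comes from the first input column
        row = y // y_bin_size  # row index comes from the second input column
        # Assumption: points outside the requested grid are skipped.
        if 0 <= col < x_bins and 0 <= row < y_bins:
            grid[row][col] += 1
    return grid

points = [(1, 8), (2, 9), (3, 6), (5, 3)]  # the contents of userFile

# Print row 0 first, so the layout matches the textHist2 table above.
for row in histogram_2d(points, x_bins=6, y_bins=10):
    print(" ".join(str(count) for count in row))
```

Each input line increments exactly one cell, so the four lines of userFile produce four 1s in a grid of 10 rows by 6 columns.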
