Wednesday, September 03, 2014

a bigWigSummary bug

Write down a bigwigsummary bug I found today. It's found when I attempted to get the max value in a region using -type=max and dataPoints=1:

$ bigWigSummary test.bw -type=max chr12 54070173 54072173 1
13.3672
$ bigWigSummary test.bw -type=max chr12 54070173 54072173 10
0.944904 1.02475 0.568405 0.741671 1.43119 1.08896 0.705965 0.542034 0.380971 0.591934

As you see, if I use dataPoints=1, the max value is 13.3672 and when I use dataPoints=10 the max value is 1.43119. So there must be something wrong, since the max value should not change no matter how many data points we check. Visualizing the bigwig in UCSC Genome Browser shows that 1.43119 is correct for this case. Interestingly, 13.3672 is indeed a peak summit, but not for this region, rather a region upstream. I don’t know why bigWigSummary take the summit from region outside. This only happened when dataPoints=1. 

I put the data below. You can download to test:


People reported similar error for bigwigSummary, for example

bigWigSummary outputs different values as bigWigAverageOverBed: UCSC team explained "The reason for this is that the summary levels have some rounding error and some border conditions when extracting data over relatively small regions." They suggested to use bigWigAverageOverBed if you want the highest level of accuracy. But bigWigAverageOverBed won't output the max and also it's nothing with high or low accuracy, but rather a bug. 

Tao Liu also reported another bug for bigwig when it's converted from wig in compressed manner (by default), and suggested to fix it by using -unc when converting wig to bigwig. I've tried to use -unc when converting from bedGraph to bigwig, the bug is still there. 

Still looking for workout, and also report to UCSC:
https://groups.google.com/a/soe.ucsc.edu/forum/#!topic/genome/pWjcov-xQyQ

Update: One workout I found is, to use intersectBed and groupBy in bedtools on bedGraph file (rather than bigwig). Here is pseudocode:

intersectBed -a regions.bed -b signal.bedGraph -wo -sorted | groupBy -g 1,2,3 -c 7 -o max > regions.maxSignal.bed

Update2:  use the -minMax option from the latest version of bigWigAverageOverBed (from v304). See reply from UCSC group:
Thank you for contacting us. One of our engineers was able to reproduce this and says the bigWigSummary program uses the bigWig summary levels, which are lower resolution versions of the data, so when you ask for a single value in a particular range, the range that is used may include bases that are before and after the range.
If you want only the values exactly within the range, you can use (the latest version from v304) bigWigAverageOverBed like so:
echo "chr12 54070173 54072173 one" | bigWigAverageOverBed myTest.bw stdin stdout -minMax | cut -f 8

No comments:

Post a Comment