Tuesday, May 21, 2013

smart binindex system of storing genomic data

XP explained to me what the bin indexing system is yesterday.  Here is what's originally explained in Jim Kent's paper

Figure 7.
Binning scheme for optimizing database accesses for genomic annotations that cover a particular region of the genome. This diagram shows bins of three different sizes. Features are put in the smallest bin in which they fit. A feature covering the range indicated by line Awould go in bin 1. Similarly, line B goes in bin 4 and lineC in bin 20. When the browser needs to access features in a region, it must look in bins of all different sizes. To access all the features that overlapped or were enclosed by line A, the browser looks in bins 1, 2, 3, 7, 8, 9, 10, and 11. For B the browser looks in bins 1, 4, 14, 15, 16, 17. For C, the browser looks in bins 1, 5, and 20.
For example, to select features (e.g. ESTs) overlapping with region of chr2:10000-20000, the query would be:

select * from chr2_est where chromStart <20000 and chromEnd >10000 and (bin = 1 or bin = 2 or bin = 10 or bin = 74 or bin = 586)

The additional bin=xxx part is to limit the searching only on the bins which can be only possible overlapped with the queried region. It can enhance the performance.

This is query part. For encoding, each range supposed to have a bin number, which is the smallest bin (in order to save time for searching?) that it can fit. How to find the smallest bin?

Here is C code in kent/src/lib/binRange.c to explain how to find the smallest bin that the range will fit in:


static int binFromRangeStandard(int start, int end)
/* Given start,end in chromosome coordinates assign it
 * a bin.   There's a bin for each 128k segment, for each
 * 1M segment, for each 8M segment, for each 64M segment,
 * and for each chromosome (which is assumed to be less than
 * 512M.)  A range goes into the smallest bin it will fit in. */
{
int startBin = start, endBin = end-1, i;
startBin >>= _binFirstShift;
endBin >>= _binFirstShift;
for (i=0; i<ArraySize(binOffsets); ++i)
    {
    if (startBin == endBin)
        return binOffsets[i] + startBin;
    startBin >>= _binNextShift;
    endBin >>= _binNextShift;
    }
errAbort("start %d, end %d out of range in findBin (max is 512M)", start, end);
return 0;
}

where _binNextShift=3, _binFirstShift=17, and binOffsets[] = {512+64+8+1, 64+8+1, 8+1, 1, 0}.

You can understand binOffsets as the 0-based index of the first rooms of every floor, but in descending order. By adding the offset, you will get the global index of the bin in the whole structure.

No comments:

Post a Comment