Tuesday, March 11, 2014

Confusing dbSNP ID

There has been plenty of confusion around ID and naming systems in the popular biological databases. I mentioned one such example in an earlier post, where two genes (on totally different chromosomes) share the same symbol, and where the same gene has different Ensembl Gene IDs. Other confusions include Entrez ID vs. gene symbol vs. Ensembl ID. People who do GO analysis with tools like DAVID must have had painful experiences with this (btw, I don't understand why most GO analysis tools convert to Entrez ID at the end).

And now I have another example of a confusing ID system in biology or bioinformatics.

You would expect each SNP to have its own unique ID in the dbSNP database, right? But that is not the case. Look at this example:

chr1    13837   13838   -      rs7164031, rs79531918
chr1    13837   13838   +     rs200683566, rs28391190, rs28428499, rs71252448, rs79817774

At the same location, multiple SNP IDs are assigned, on both strands.

Here is the explanation I got from the NCBI user service:

This is expected for many reasons:
- short probes could have multiple mapping locations on the genome
- certain genes could have duplicate/pseudogene/paralogs
- variations found in repeat regions would be difficult to map to a unique location

An rsID represents a cluster of reported variations submitted to dbSNP.
In ideal situation, they can be mapped to a unique location in the genome. 
There are cases where such unique mapping is NOT attainable.

But I am still not clear on how a cluster of variants is defined. [UPDATE to add]

P.S. A code snippet to merge SNPs at the same location into a single line:
grep single snp137.bed | sort -k1,1 -k2,2n -k6,6 | bedtools groupby -g 1,2,3,6 -c 4 -o collapse
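If bedtools is not at hand, roughly the same grouping can be sketched with sort and awk alone. This is only a sketch under assumptions: it takes a tab-separated 6-column BED layout (chrom, start, end, name, score, strand), which is what the column numbers in the command above imply, and the function name merge_snps is made up for illustration. The toy rows reuse four IDs from the example above, with a placeholder score of 0.

```shell
# merge_snps (hypothetical helper): read 6-column BED on stdin and print
# one line per (chrom, start, end, strand), with the names in column 4
# joined by commas.
merge_snps() {
  # Sort so that rows sharing the same grouping key end up adjacent.
  LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k2,2n -k3,3n -k6,6 |
  awk 'BEGIN { FS = OFS = "\t" }
       {
         key = $1 OFS $2 OFS $3 OFS $6
         if (key != prev) {
           # New group: flush the previous one (if any) and start over.
           if (NR > 1) print prev, ids
           prev = key
           ids = $4
         } else {
           ids = ids "," $4
         }
       }
       END { if (NR > 0) print prev, ids }'
}

# Toy input: printf repeats the format for each (name, strand) pair.
printf 'chr1\t13837\t13838\t%s\t0\t%s\n' \
  rs7164031 - rs200683566 + rs79531918 - rs28391190 + | merge_snps
```

On the real file this would be used the same way as the bedtools version, e.g. grep single snp137.bed | merge_snps. Sorting on column 3 as well is a small precaution so that entries sharing a start but not an end are never merged.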
