Showing posts with label gencode. Show all posts

Tuesday, March 11, 2014

Confusing dbSNP ID

There is already plenty of confusion in the ID and naming systems of popular biological databases. I mentioned one such example in an earlier post, where two genes (on totally different chromosomes) share the same symbol, and the same gene has different Ensembl gene IDs. Other sources of confusion include Entrez ID vs. gene symbol vs. Ensembl ID. People who do GO analysis with tools like DAVID must have had painful experiences with this (btw, I don't understand why most GO analysis tools convert to Entrez ID at the end).

And now I have found another example of a confusing ID system in biology and bioinformatics.

You would expect each SNP to have its own unique ID in the dbSNP database, right? But that is not the case. Look at this example:

chr1    13837   13838   -      rs7164031, rs79531918
chr1    13837   13838   +     rs200683566, rs28391190, rs28428499, rs71252448, rs79817774

Multiple SNP IDs are assigned to the same location, on both strands.

Here is the explanation I got from the NCBI user service:

This is expected for many reasons:
- short probes could have multiple mapping locations on the genome
- certain genes could have duplicate/pseudogene/paralogs
- variations found in repeat regions would be difficult to map to a unique location

An rsID represents a cluster of reported variations submitted to dbSNP.
In ideal situation, they can be mapped to a unique location in the genome. 
There are cases where such unique mapping is NOT attainable.

But I am still not clear on how a cluster of variants is defined. [UPDATE to add]
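The reply above says that an rsID may map to more than one location. One rough way to check how often that happens is to count distinct locations per rsID. This is only a sketch: the function name `multi_mapped_rsids` is made up for illustration, and it assumes tab-separated, BED-like input with the rsID in column 4.

```shell
# List rsIDs that appear at more than one distinct (chrom, start, end) location.
# Assumption: tab-separated BED-like input with $1=chrom $2=start $3=end $4=rsID.
multi_mapped_rsids() {
    awk -F'\t' '
    !(($4, $1, $2, $3) in seen) {      # count each rsID/location pair once
        seen[$4, $1, $2, $3] = 1
        nloc[$4]++
    }
    END {
        for (id in nloc)
            if (nloc[id] > 1) print id "\t" nloc[id]
    }' | sort
}
```

It could be run as `multi_mapped_rsids < snp137.bed`, printing each multi-mapped rsID with its number of locations.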

P.S. A code snippet to merge SNPs at the same location into a single line:
grep single snp137.bed | sort -k1,1 -k2,2n -k6,6 | bedtools groupby -g 1,2,3,6 -c 4 -o collapse
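If bedtools is not at hand, roughly the same collapse can be sketched in plain awk. This is not a drop-in replacement: the function name `collapse_snps` is hypothetical, the column layout (rsID in column 4, strand in column 6) is assumed from the command above, and groups come out in first-seen order rather than sorted.

```shell
# Collapse rsIDs that share chrom, start, end and strand onto one line.
# Assumption: tab-separated BED-like input with
#   $1=chrom $2=start $3=end $4=rsID $6=strand.
collapse_snps() {
    awk 'BEGIN { FS = OFS = "\t" }
    {
        key = $1 OFS $2 OFS $3 OFS $6
        if (!(key in ids)) {
            order[++n] = key            # remember first-seen order of groups
            ids[key] = $4
        } else {
            ids[key] = ids[key] "," $4  # append further rsIDs, comma-separated
        }
    }
    END { for (i = 1; i <= n; i++) print order[i], ids[order[i]] }'
}
```

For example, `grep single snp137.bed | collapse_snps` would print one line per location and strand, without needing the input to be sorted first.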

Monday, August 12, 2013

The same gene name, but different Ensembl/Gencode ID

I became aware of this while working with the latest GENCODE annotation (v17). Here it is:

annotationURL=ftp://ftp.sanger.ac.uk/pub/gencode/release_17/gencode.v17.annotation.gtf.gz 
curl -0 $annotationURL | zcat | fgrep -w gene | sed 's/[";]//g;' | awk '{OFS="\t"; print $1,$2,$4,$5,$7,$10,$14,$16,$18;}'  | cut -f9 | sort | uniq -d
5S_rRNA
61E3.4
A-575C2.4
AAK1
ACA59
ACA64
ACE
AJ271736.10
AK3
AKAP17A
AL732314.1
ALG9
AMDP1
APOC4
AQP1
ARHGAP19
...
...
...
ZNF585B
ZNF607
ZNF625
ZNF668
ZNF709
ZNF747
ZNF763
ZNF788

In total, 400 gene names are associated with more than one GENCODE ID. For example,

curl -0 $annotationURL | zcat | fgrep -w gene | sed 's/[";]//g;' | awk '{OFS="\t"; print $1,$2,$4,$5,$7,$10,$14,$16,$18;}' |  grep ZNF788
chr19 HAVANA 12203078 12248050 + ENSG00000214189.3 protein_coding KNOWN ZNF788
chr19 ENSEMBL 12203078 12225491 + ENSG00000188474.6 protein_coding KNOWN ZNF788

In this case, the two records come from different sources (HAVANA vs. ENSEMBL). But even within a single source, the same gene name can be given different IDs, for example:

curl -0 $annotationURL | zcat | fgrep -w gene | sed 's/[";]//g;' | awk '{OFS="\t"; print $1,$2,$4,$5,$7,$10,$14,$16,$18;}' | fgrep -w ENSEMBL | grep GRIA1
chr5 ENSEMBL 152871759 153190785 + ENSG00000270065.1 protein_coding KNOWN GRIA1
chr5 ENSEMBL 152873613 153190785 + ENSG00000269977.1 protein_coding KNOWN GRIA1

Obviously, cases like this should be merged under a single GENCODE ID.

Some are even on different chromosomes:

curl -0 $annotationURL | zcat | fgrep -w gene | sed 's/[";]//g;' | awk '{OFS="\t"; print $1,$2,$4,$5,$7,$10,$14,$16,$18;}' | fgrep -w ENSEMBL | grep ACA59
chr2 ENSEMBL 64110383 64110525 - ENSG00000251775.1 snoRNA NOVEL ACA59
chr2 ENSEMBL 179887984 179888105 - ENSG00000252000.1 snoRNA NOVEL ACA59
chr11 ENSEMBL 114998938 114999093 - ENSG00000252870.1 snoRNA NOVEL ACA59

How can different genes/loci be assigned the same gene name?

This looks confusing! I hope someone from GENCODE or Ensembl can help explain it.