One Tip Per Day: The same gene name, but different Ensembl/Gencode ID

Monday, August 12, 2013

The same gene name, but different Ensembl/Gencode ID

I became aware of this while working on the latest GENCODE annotation, v17. Here it is:

annotationURL=ftp://ftp.sanger.ac.uk/pub/gencode/release_17/gencode.v17.annotation.gtf.gz
curl -0 $annotationURL | zcat | fgrep -w gene | sed 's/[";]//g;' | awk '{OFS="\t"; print $1,$2,$4,$5,$7,$10,$14,$16,$18;}' | cut -f9 | sort | uniq -d
5S_rRNA
61E3.4
A-575C2.4
AAK1
ACA59
ACA64
ACE
AJ271736.10
AK3
AKAP17A
AL732314.1
ALG9
AMDP1
APOC4
AQP1
ARHGAP19
...
...
...
ZNF585B
ZNF607
ZNF625
ZNF668
ZNF709
ZNF747
ZNF763
ZNF788

In total, 400 gene names are each assigned more than one GENCODE ID. For example,

curl -0 $annotationURL | zcat | fgrep -w gene | sed 's/[";]//g;' | awk '{OFS="\t"; print $1,$2,$4,$5,$7,$10,$14,$16,$18;}' | grep ZNF788
chr19 HAVANA 12203078 12248050 + ENSG00000214189.3 protein_coding KNOWN ZNF788
chr19 ENSEMBL 12203078 12225491 + ENSG00000188474.6 protein_coding KNOWN ZNF788
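
Instead of grepping names one at a time, the duplicated names can be listed together with all of their IDs in one pass. This is only a sketch: dup_gene_names is a hypothetical helper name (not part of any GENCODE tool), and it assumes every gene record carries gene_id "..." and gene_name "..." in the ninth GTF column.

```shell
# Hypothetical helper: read a GTF on stdin and print, for every gene name
# that maps to more than one gene_id: name, number of IDs, comma-joined IDs.
dup_gene_names() {
  awk -F'\t' '
    $3 == "gene" {                      # gene records only (skips ## headers too)
      id = ""; name = ""
      n = split($9, attrs, ";")         # attributes are separated by ";"
      for (i = 1; i <= n; i++) {
        a = attrs[i]
        sub(/^ +/, "", a)               # drop the space that follows each ";"
        if (a ~ /^gene_id /)   { id = a;   sub(/^gene_id /, "", id);     gsub(/"/, "", id) }
        if (a ~ /^gene_name /) { name = a; sub(/^gene_name /, "", name); gsub(/"/, "", name) }
      }
      if (id != "" && name != "") {
        if (count[name]++) ids[name] = ids[name] "," id
        else               ids[name] = id
      }
    }
    END {
      for (name in count)
        if (count[name] > 1) print name "\t" count[name] "\t" ids[name]
    }
  ' | sort
}
```

Usage would be something like: curl -0 $annotationURL | zcat | dup_gene_names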

In this case the gene is annotated differently by different sources (HAVANA vs. ENSEMBL). But even within the same source, some genes are annotated differently, for example:

curl -0 $annotationURL | zcat | fgrep -w gene | sed 's/[";]//g;' | awk '{OFS="\t"; print $1,$2,$4,$5,$7,$10,$14,$16,$18;}' | fgrep -w ENSEMBL | grep GRIA1
chr5 ENSEMBL 152871759 153190785 + ENSG00000270065.1 protein_coding KNOWN GRIA1
chr5 ENSEMBL 152873613 153190785 + ENSG00000269977.1 protein_coding KNOWN GRIA1

Obviously, cases like this should be merged into one GENCODE ID.
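
To find such candidates systematically, one could flag same-name genes whose intervals overlap on the same chromosome and strand. The sketch below assumes its input is the tab-separated nine-column table printed by the awk step above (chr, source, start, end, strand, ID, type, status, name); overlapping_same_name is a hypothetical name, and overlap alone is of course only a hint that two records describe one locus.

```shell
# Hypothetical helper: read the 9-column table on stdin and print
# name, chr, earlier ID, overlapping ID for same-name records that overlap.
overlapping_same_name() {
  # Group by name, chromosome and strand; order each group by start position.
  LC_ALL=C sort -t "$(printf '\t')" -k9,9 -k1,1 -k5,5 -k3,3n |
  awk -F'\t' -v OFS='\t' '
    {
      key = $9 FS $1 FS $5
      # Same group and start inside the interval seen so far: report the pair.
      if (key == prev_key && $3 + 0 <= max_end) {
        print $9, $1, max_id, $6
      }
      # New group, or this record reaches further: remember its end and ID.
      if (key != prev_key || $4 + 0 > max_end) {
        max_end = $4 + 0
        max_id = $6
      }
      prev_key = key
    }'
}
```

It would be appended to the pipeline in place of the final grep, e.g. ... | awk '{OFS="\t"; print $1,$2,$4,$5,$7,$10,$14,$16,$18;}' | overlapping_same_name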

Some are even on different chromosomes:

curl -0 $annotationURL | zcat | fgrep -w gene | sed 's/[";]//g;' | awk '{OFS="\t"; print $1,$2,$4,$5,$7,$10,$14,$16,$18;}' | fgrep -w ENSEMBL | grep ACA59
chr2 ENSEMBL 64110383 64110525 - ENSG00000251775.1 snoRNA NOVEL ACA59
chr2 ENSEMBL 179887984 179888105 - ENSG00000252000.1 snoRNA NOVEL ACA59
chr11 ENSEMBL 114998938 114999093 - ENSG00000252870.1 snoRNA NOVEL ACA59
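
To get a feel for how common the multi-chromosome cases are, the duplicated names can be tabulated with their gene type and the number of distinct chromosomes they occur on. Again a sketch under assumptions: the input is the same tab-separated nine-column table as above, each row is a distinct gene ID, and dup_names_by_chrom is a made-up helper name.

```shell
# Hypothetical helper: read the 9-column table on stdin and print, for each
# duplicated name: name, gene type, number of IDs, number of chromosomes.
dup_names_by_chrom() {
  awk -F'\t' -v OFS='\t' '
    {
      n_ids[$9]++                       # one row per gene ID
      type[$9] = $7                     # gene type of the last row seen
      if (!(($9, $1) in seen)) {        # first time this name is on this chr
        seen[$9, $1] = 1
        n_chr[$9]++
      }
    }
    END {
      for (name in n_ids)
        if (n_ids[name] > 1) print name, type[name], n_ids[name], n_chr[name]
    }' | LC_ALL=C sort
}
```

A last column greater than 1 marks names shared by loci on different chromosomes.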

How come different genes/loci are assigned the same gene name?

Looks confusing! I hope someone from GENCODE or Ensembl can help explain this.

6 comments:

Unknown 12:00 AM
I'm also confused by the same issue in the RefSeq annotation file, as in this Biostar thread: http://www.biostars.org/p/18091/. When I get different annotation files (for example RefSeq, Ensembl, or a GTF file from Cufflinks), I want to compare them. However, duplicate IDs make this complicated. What do you think?
Unknown 5:23 PM
The gene names are the same because they are small non-coding RNAs. Different ENSEMBL IDs mean the loci are different, but the sequences are the same and they are the same genes. To some extent this is like miRNAs: some miRNAs have multiple copies in the genome and are distinguished only as mir-XXX-1, mir-XXX-2. snoRNAs, like tRNAs, are too numerous in the genome, so only the family name is kept as the gene name, because they have the same sequence and the same transcripts.
ygc 6:09 AM
I hate this.

We need to convert ID very often, and this is really confusing.
Anonymous 10:12 AM
Did you ever find out the reason for this? I am having the same trouble with the Ensembl GTF: the same gene name but different ENSG numbers and different positions, coming from different sources (e.g. havana or ensembl_havana). Thank you.