Monday, August 12, 2013

The same gene name, but different Ensembl/Gencode ID

I became aware of this while working on the latest GENCODE annotation (v17). Here it is:

annotationURL=ftp://ftp.sanger.ac.uk/pub/gencode/release_17/gencode.v17.annotation.gtf.gz 
curl -0 $annotationURL | zcat | fgrep -w gene | sed 's/[";]//g;' | awk '{OFS="\t"; print $1,$2,$4,$5,$7,$10,$14,$16,$18;}'  | cut -f9 | sort | uniq -d
5S_rRNA
61E3.4
A-575C2.4
AAK1
ACA59
ACA64
ACE
AJ271736.10
AK3
AKAP17A
AL732314.1
ALG9
AMDP1
APOC4
AQP1
ARHGAP19
...
...
...
ZNF585B
ZNF607
ZNF625
ZNF668
ZNF709
ZNF747
ZNF763
ZNF788

In total, 400 gene names are associated with more than one GENCODE ID. For example,

curl -0 $annotationURL | zcat | fgrep -w gene | sed 's/[";]//g;' | awk '{OFS="\t"; print $1,$2,$4,$5,$7,$10,$14,$16,$18;}' |  grep ZNF788
chr19 HAVANA 12203078 12248050 + ENSG00000214189.3 protein_coding KNOWN ZNF788
chr19 ENSEMBL 12203078 12225491 + ENSG00000188474.6 protein_coding KNOWN ZNF788
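
The `uniq -d` step above reports only the duplicated names. As a rough sketch (not part of the original analysis), the hypothetical helper `dup_gene_ids` below also lists the distinct IDs behind each name. It assumes its input is the 9-column table produced by the pipeline above, with the gene ID in column 6 and the gene name in column 9.

```shell
# Hypothetical helper: reads the 9-column table on stdin and prints
# name, number of distinct IDs, and the IDs themselves (tab-separated)
# for every gene name that has more than one distinct gene ID.
dup_gene_ids() {
  awk '
    # count each (name, ID) pair only once
    !(($9, $6) in seen) {
      seen[$9, $6] = 1
      n[$9]++
      ids[$9] = ids[$9] (n[$9] > 1 ? "," : "") $6
    }
    END {
      for (g in n) if (n[g] > 1) print g "\t" n[g] "\t" ids[g]
    }
  ' | LC_ALL=C sort
}

# Demo on the two ZNF788 rows shown above
printf '%s\n' \
  'chr19 HAVANA 12203078 12248050 + ENSG00000214189.3 protein_coding KNOWN ZNF788' \
  'chr19 ENSEMBL 12203078 12225491 + ENSG00000188474.6 protein_coding KNOWN ZNF788' \
  | dup_gene_ids
```

On the full annotation, the function could take the place of `cut -f9 | sort | uniq -d` at the end of the pipeline.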

In this case the two entries come from different sources (HAVANA vs. ENSEMBL). But even within the same source, the same gene name can appear under different IDs, for example:

curl -0 $annotationURL | zcat | fgrep -w gene | sed 's/[";]//g;' | awk '{OFS="\t"; print $1,$2,$4,$5,$7,$10,$14,$16,$18;}' | fgrep -w ENSEMBL | grep GRIA1
chr5 ENSEMBL 152871759 153190785 + ENSG00000270065.1 protein_coding KNOWN GRIA1
chr5 ENSEMBL 152873613 153190785 + ENSG00000269977.1 protein_coding KNOWN GRIA1

Obviously, cases like this should be merged into one GENCODE ID.
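
To find other candidates of this kind, one option is to flag entries that share name, source, chromosome and strand and whose coordinates overlap. The sketch below is only an illustration under the same assumption as before (input is the 9-column table from the pipeline: chromosome, source, start, end, strand, ID, type, status, name); `same_source_overlaps` is a made-up name.

```shell
# Hypothetical helper: prints name, source and the two IDs (tab-separated)
# for entries with the same name, source, chromosome and strand whose
# intervals overlap.
same_source_overlaps() {
  # sort so that rows of one group are adjacent and ordered by start
  LC_ALL=C sort -k9,9 -k2,2 -k1,1 -k5,5 -k3,3n | awk '
    {
      grp = $9 SUBSEP $2 SUBSEP $1 SUBSEP $5
      # same group and start inside the interval seen so far: overlap
      if (grp == prev_grp && $3 + 0 <= max_end)
        print $9 "\t" $2 "\t" prev_id "\t" $6
      # new group, or this row extends further right: track it instead
      if (grp != prev_grp || $4 + 0 > max_end) {
        max_end = $4 + 0
        prev_id = $6
      }
      prev_grp = grp
    }
  '
}

# Demo on the two GRIA1 rows shown above
printf '%s\n' \
  'chr5 ENSEMBL 152871759 153190785 + ENSG00000270065.1 protein_coding KNOWN GRIA1' \
  'chr5 ENSEMBL 152873613 153190785 + ENSG00000269977.1 protein_coding KNOWN GRIA1' \
  | same_source_overlaps
```

Each overlapping entry is reported against the entry in its group that reaches furthest to the right so far, so this is a screening list rather than a complete pairwise comparison.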

Some are even on different chromosomes:

curl -0 $annotationURL | zcat | fgrep -w gene | sed 's/[";]//g;' | awk '{OFS="\t"; print $1,$2,$4,$5,$7,$10,$14,$16,$18;}' | fgrep -w ENSEMBL | grep ACA59
chr2 ENSEMBL 64110383 64110525 - ENSG00000251775.1 snoRNA NOVEL ACA59
chr2 ENSEMBL 179887984 179888105 - ENSG00000252000.1 snoRNA NOVEL ACA59
chr11 ENSEMBL 114998938 114999093 - ENSG00000252870.1 snoRNA NOVEL ACA59
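
A quick way to list every name that shows up on more than one chromosome might look like the following. Again this is a sketch with a made-up function name (`multi_chrom_names`), assuming the 9-column table from the pipeline with the chromosome in column 1 and the gene name in column 9.

```shell
# Hypothetical helper: prints each gene name found on more than one
# chromosome, followed by the chromosomes in order of first appearance.
multi_chrom_names() {
  awk '
    # count each (name, chromosome) pair only once
    !(($9, $1) in seen) {
      seen[$9, $1] = 1
      nchr[$9]++
      chrs[$9] = chrs[$9] (nchr[$9] > 1 ? "," : "") $1
    }
    END {
      for (g in nchr) if (nchr[g] > 1) print g "\t" chrs[g]
    }
  ' | LC_ALL=C sort
}

# Demo on the three ACA59 rows shown above
printf '%s\n' \
  'chr2 ENSEMBL 64110383 64110525 - ENSG00000251775.1 snoRNA NOVEL ACA59' \
  'chr2 ENSEMBL 179887984 179888105 - ENSG00000252000.1 snoRNA NOVEL ACA59' \
  'chr11 ENSEMBL 114998938 114999093 - ENSG00000252870.1 snoRNA NOVEL ACA59' \
  | multi_chrom_names
```

Names that are duplicated only within a single chromosome, such as the second chr2 copy here, do not add to the count.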

How come different genes/loci are assigned the same gene name?

This is confusing! I hope someone from GENCODE or Ensembl can help explain it.

6 comments:

  1. I'm also confused by the same issue in the RefSeq annotation file, as in this Biostar post: http://www.biostars.org/p/18091/. When I get different annotation files (for example RefSeq, Ensembl, or a GTF file from Cufflinks), I want to compare them. However, duplicate IDs make things complicated. What do you think?

    1. Hi, 焕威.
      Thanks for sharing your experience.
      The case you linked looks more like a gene duplicated on a patch chromosome (e.g. chr6_qbl_hap6), which is also confusing but might be easy to solve. Cases where different loci share the same gene name/symbol are even more troublesome, like the one user sdriscoll posted here (http://seqanswers.com/forums/showthread.php?t=24903). It's especially a headache for computation. I also don't know how the biologists who work on these genes tolerate the confusion.
      The community should fix it. I've written to Tim Hubbard (at GENCODE) about the confusion. Let's see how he replies.

    2. Thank you for your answer; it's really very helpful. There are many duplicate IDs in the RefSeq gene annotation. One reason is that a gene can be located on both a normal chromosome and a patch chromosome (like "NM_000593"); the other is that a gene can lie in a duplicated region. The gene "NM_000344" has two positions (chr5:69345349-69373418 and chr5:70220767-70248838), and the sequences of the two regions are the same. My concern is that if we filter out duplicates during alignment, no reads will be mapped to the regions of duplicated genes. If so, is it reasonable to filter out duplicated genes and patch-chromosome genes during analysis?

  2. The gene names are the same because these are small non-coding RNAs. Different ENSEMBL IDs mean the loci are different, but the sequences are the same and they are the same genes. To some extent this is like miRNAs: some miRNAs have multiple copies in the genome and are distinguished only as mir-XXX-1, mir-XXX-2. snoRNAs, like tRNAs, have too many copies in the genome, so only the family name is kept as the gene name, because they have the same sequences and the same transcripts.

  3. I hate this.

    We need to convert IDs very often, and this is really confusing.

  4. Anonymous, 10:12 AM

    Did you ever find out the reason for this? I am having the same trouble with the Ensembl GTF: entries with the same gene name but different ENSG numbers and different positions, coming from different sources (e.g. havana or ensembl_havana). Thank you.
