Wednesday, December 09, 2015

SNP allele coding schema

There are three types of different SNP allele coding schemes (so far). No idea why the scientists have to make things so confusing sometimes...

Here you go:
  • TOP/BOT from Illumina: Simply saying, for unambiguous cases (incl. A/G, A/C, T/G, T/C), A is always allele A (or allele 1) and on TOP strand, and T is also allele A (or allele 1) but on BOT strand. The other variation is allele B (or allele 2). For ambiguous cases (e.g. A/T and G/C), strand is easy to assign, but allele A and B is ambiguous. So, Illumina use a 'sequence walking' to assign allele A/B by the first unambiguous pair of the flanking sequence. Detail explanation in their tech doc
  • Forward/Reverse for dbSNP: dbSNP use the orientation determined by mapping SNP flanking sequence (of the longest submission) to a contig sequence using BLAST. A refSNP can have multiple submissions, each of which might have their own orientation depending on whether the flanking sequence is from plus strand or minus strand. The submission with the longest flanking will be taken as instantiate sequence for the refSNP during BLAST analysis for the current build. So, the strand of a dbSNP refSNP is determined by (1) the sequence of the longest submission (or the exemplar), and (2) the strand of contig it maps to. A good example can be found here. See more details here:  http://www.ncbi.nlm.nih.gov/books/NBK44455/#Build.define_the_term__orientation_as_us
  • Plus/Minus strand for Genome Annotation, HapMap, 1000 Genome Project: The plus (+) strand is defined as the strand with its 5′ end at the tip of the short arm (see Ref #3 below). SNP alleles reported on the same strand as the (+) strand are called ‘plus’ alleles and those on the (−) strand are called ‘minus’ alleles. This also applies to the imputed results from 1000 Genome, e.g. a0 is for REF (reference) allele and a1 for ALT (alternate) allele
Several notes:
  1. dbSNP does not use TOP/BOT schema: http://www.ncbi.nlm.nih.gov/books/NBK44393/#Submit.does_dbsnp_use_topbot_nomenclatur_1
  2. Minor/Major alleles are just a relative term. They can change between different populations/dataset. So, it really makes no sense to use/think that allele2 is minor allele, G in A>G is minor allele, or so. Minor allele can be reference allele. 
  3. In the Illumina genotyping final report (e.g. "FinalReport.txt"), use "Allele1 - Forward" and "Allele2 - Forward" to construct .lgen file for plink. See post: https://www.biostars.org/p/16982/  -- Forward is from dbSNP, Plus is from 1000 Genome. See #6 below. 
  4. SNPs from dbSNP reverse strand need to flip before merging with imputed SNPs, using flip_strand option in plink.
  5. Allele 1 (in allele 1/2 setting) need to be forced to reference allele before merging with imputed SNPs, using --reference-allele option in plink.
  6. Important note: The FinalReport.txt from Illumina genotype tool GenomeStudio can also export other columns like "Allele1 - Forward", "Allele2 - Forward", "Allele1 - Plus", "Allele2 - Plus". I called Illumina tech that the Forward is for dbSNP, and Plus is for 1000 Genome. Be cautious that dbSNP and +/- can also be change between different version of dbSNP or genome building. (Full description is here, login required). To get to know which version your allele is referred to, you have to get the manifest file from your base caller. 
References:
  1. https://triticeaetoolbox.org/wheat/snps.php
  2. http://gengen.openbioinformatics.org/en/latest/tutorial/coding/
  3. http://www.sciencedirect.com/science/article/pii/S0168952512000704
  4. https://mathgen.stats.ox.ac.uk/impute/1000GP%20Phase%203%20haplotypes%206%20October%202014.html
  5. http://genome.sph.umich.edu/wiki/IMPUTE2:_1000_Genomes_Imputation_Cookbook

1 comment:

  1. Anonymous6:09 AM

    So for annotating with dbsnp database. Should we consider "Allele1 - Forward", "Allele2 - Forward" strands?

    ReplyDelete