Showing posts with label Illumina. Show all posts

Wednesday, December 09, 2015

SNP allele coding schema

There are (so far) three different SNP allele coding schemes. No idea why scientists have to make things so confusing sometimes...

Here you go:
  • TOP/BOT from Illumina: Simply put, for unambiguous cases (A/G, A/C, T/G, T/C), if one allele is A, the SNP is on the TOP strand and A is allele A (or allele 1); if one allele is T, the SNP is on the BOT strand and T is allele A (or allele 1). The other allele is allele B (or allele 2). For ambiguous cases (A/T and C/G), the two alleles alone cannot determine the strand, so allele A/B is ambiguous too. Illumina therefore uses 'sequence walking': it steps outward through the flanking sequence, one base on each side of the SNP at a time, until it reaches the first unambiguous pair, and assigns the strand (and hence allele A/B) from that pair. A detailed explanation is in their tech doc.
  • Forward/Reverse for dbSNP: dbSNP uses the orientation determined by mapping the SNP flanking sequence (of the longest submission) to a contig sequence using BLAST. A refSNP can have multiple submissions, each of which may have its own orientation depending on whether its flanking sequence comes from the plus strand or the minus strand. The submission with the longest flanking sequence is taken as the instantiated sequence (the exemplar) for the refSNP during BLAST analysis for the current build. So the strand of a dbSNP refSNP is determined by (1) the sequence of the longest submission (the exemplar), and (2) the strand of the contig it maps to. A good example can be found here. See more details here: http://www.ncbi.nlm.nih.gov/books/NBK44455/#Build.define_the_term__orientation_as_us
  • Plus/Minus strand for genome annotation, HapMap and the 1000 Genomes Project: The plus (+) strand is defined as the strand with its 5′ end at the tip of the short arm of the chromosome (see Ref #3 below). SNP alleles reported on the (+) strand are called ‘plus’ alleles and those on the (−) strand are called ‘minus’ alleles. This also applies to imputed results from 1000 Genomes, e.g. a0 is the REF (reference) allele and a1 is the ALT (alternate) allele.
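
To make the TOP/BOT rule concrete, here is a minimal Python sketch of the strand and allele A assignment as I understand it from Illumina's description. The function names (top_bot_strand, allele_a) and the example flanking sequences are made up for illustration; this is not Illumina's code, so check real designs against the manifest.

```python
# Sketch of Illumina-style TOP/BOT designation (illustrative, not official code).
AT = {"A", "T"}
CG = {"C", "G"}
UNAMBIGUOUS_TOP = ({"A", "C"}, {"A", "G"})  # an A paired with C or G
UNAMBIGUOUS_BOT = ({"T", "C"}, {"T", "G"})  # a T paired with C or G


def top_bot_strand(left_flank, alleles, right_flank):
    """Return 'TOP' or 'BOT' for a SNP written as left_flank[alleles]right_flank (5'->3')."""
    pair = {a.upper() for a in alleles}
    if pair in UNAMBIGUOUS_TOP:
        return "TOP"
    if pair in UNAMBIGUOUS_BOT:
        return "BOT"
    if pair not in (AT, CG):
        raise ValueError("expected two different alleles from A/C/G/T")
    # Ambiguous SNP (A/T or C/G): sequence walking. Compare the n-th base
    # 5' of the SNP with the n-th base 3' of the SNP, moving outward.
    for five, three in zip(reversed(left_flank.upper()), right_flank.upper()):
        if five in AT and three in CG:
            return "TOP"  # A or T sits on the 5' side of the first unambiguous pair
        if five in CG and three in AT:
            return "BOT"  # A or T sits on the 3' side of the first unambiguous pair
    raise ValueError("no unambiguous pair found in the flanking sequence")


def allele_a(strand, alleles):
    """Return the base designated allele A for the given strand call."""
    pair = {a.upper() for a in alleles}
    if strand == "TOP":
        return "A" if "A" in pair else "C"
    return "T" if "T" in pair else "G"


if __name__ == "__main__":
    strand = top_bot_strand("ACGTA", ("A", "T"), "CGGAT")
    print(strand, allele_a(strand, ("A", "T")))
```

Running the same SNP through its reverse complement (ATCCG[A/T]TACGT) gives the opposite strand call, which is the point of the scheme: the designation does not depend on which strand the sequence was submitted on.
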
Several notes:
  1. dbSNP does not use TOP/BOT schema: http://www.ncbi.nlm.nih.gov/books/NBK44393/#Submit.does_dbsnp_use_topbot_nomenclatur_1
  2. Minor/major allele is a relative term: it can change between populations or datasets. So it makes no sense to assume that allele 2 is the minor allele, or that G in A>G is the minor allele, and so on. The minor allele can be the reference allele.
  3. In the Illumina genotyping final report (e.g. "FinalReport.txt"), use "Allele1 - Forward" and "Allele2 - Forward" to construct the .lgen file for plink. See post: https://www.biostars.org/p/16982/  -- Forward is from dbSNP, Plus is from 1000 Genomes. See #6 below.
  4. SNPs on the dbSNP reverse strand need to be flipped before merging with imputed SNPs, using the --flip option in plink.
  5. Allele 1 (in the allele 1/2 setting) needs to be forced to the reference allele before merging with imputed SNPs, using the --reference-allele option in plink.
  6. Important note: FinalReport.txt from Illumina's genotyping tool GenomeStudio can also export other columns such as "Allele1 - Forward", "Allele2 - Forward", "Allele1 - Plus", "Allele2 - Plus". I called Illumina tech support and was told that Forward refers to dbSNP and Plus refers to 1000 Genomes. Be aware that the dbSNP orientation and the +/- strand can also change between versions of dbSNP or genome builds. (Full description is here, login required.) To find out which version your alleles refer to, you have to get the manifest file from your base caller.
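
For note 3, here is a rough Python sketch of turning the [Data] section of a final report into .lgen lines (family ID, individual ID, SNP ID, allele 1, allele 2). Assumptions on my side: the report is tab-delimited, it has "SNP Name" and "Sample ID" columns next to the "Allele1 - Forward" / "Allele2 - Forward" columns mentioned above, and no-calls are written as "-". The function name and the placeholder family ID are made up. Check these against your own export before using it.

```python
import csv


def final_report_to_lgen(report_text, family_id="FAM1"):
    """Convert the [Data] section of a FinalReport-style text into LGEN lines.

    Assumed (not guaranteed) layout: tab-delimited, with columns
    'SNP Name', 'Sample ID', 'Allele1 - Forward', 'Allele2 - Forward'.
    family_id is a hypothetical placeholder used for every sample.
    """
    lines = [line.strip("\r\n") for line in report_text.splitlines()]
    # Skip the [Header] block; the table starts right after the [Data] marker.
    start = [line.strip() for line in lines].index("[Data]") + 1
    reader = csv.DictReader(lines[start:], delimiter="\t")
    lgen = []
    for row in reader:
        a1 = row["Allele1 - Forward"]
        a2 = row["Allele2 - Forward"]
        # plink codes a missing allele as 0; '-' as the no-call symbol is an assumption.
        a1 = "0" if a1 == "-" else a1
        a2 = "0" if a2 == "-" else a2
        lgen.append(" ".join([family_id, row["Sample ID"], row["SNP Name"], a1, a2]))
    return lgen


if __name__ == "__main__":
    example = (
        "[Header]\n"
        "Content\texample\n"
        "[Data]\n"
        "SNP Name\tSample ID\tAllele1 - Forward\tAllele2 - Forward\n"
        "rs0001\tS1\tA\tG\n"
        "rs0002\tS1\t-\t-\n"
    )
    print("\n".join(final_report_to_lgen(example)))
```

The matching .map and .fam files still have to be built separately (e.g. from the manifest and the sample sheet).
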
References:
  1. https://triticeaetoolbox.org/wheat/snps.php
  2. http://gengen.openbioinformatics.org/en/latest/tutorial/coding/
  3. http://www.sciencedirect.com/science/article/pii/S0168952512000704
  4. https://mathgen.stats.ox.ac.uk/impute/1000GP%20Phase%203%20haplotypes%206%20October%202014.html
  5. http://genome.sph.umich.edu/wiki/IMPUTE2:_1000_Genomes_Imputation_Cookbook

Thursday, May 07, 2015

A clarification on the Illumina TruSeq Small RNA prep kit manual

In the TruSeq® Small RNA Library Prep Guide, below Figure 1, there is a sentence: "The RNA 3' adapter is modified to target microRNAs and other small RNAs that have a 3' hydroxyl group resulting from enzymatic cleavage by Dicer or other RNA processing enzymes." It's right, but it could be very misleading if you are not clear about the diverse makeup of the transcriptome (scroll down for more detail). I want to emphasize that the 3' hydroxyl group (and the 5' phosphate group) is NOT specific to microRNAs or any other small RNAs, and it doesn't necessarily result from enzymatic cleavage by Dicer. Sonic fragmentation can also break full-length mRNA (with 5' cap and 3' poly(A)) into truncated RNA pieces with free 5' phosphate and 3' hydroxyl ends. I just called Illumina to confirm that the 3' and 5' ligation steps don't guarantee the selection of miRNAs (more accurately, they select any RNAs with 5' phosphate and 3' hydroxyl ends). The last step, gel purification, is the key to selecting (or, more accurately, enriching) miRNAs.


OK. Here is what I learned from my colleagues about the different RNA species in the transcriptome:

There are 4 species in the transcriptome, distinguished by their 5' ends; the latter 3 are intermediates of transcription (or partial products of degradation).
  • m7Gppp--------------------------3' (1) 
  •                 p------------- 3' (2) 
  •                 OH-------------3' (3) 
  •        ppp---------------------3' (4) 
Only group (2) will ligate to the 5' adaptor. The 3' end can also take different forms, at least two:
  • 5' ----------------- AAAAAA (1) 
  • 5' ---------- OH (2) 
Also note that there are two enzymes used to modify the 5' ends: CIP and TAP. CIP (calf intestinal alkaline phosphatase) removes the 5' phosphate group from a DNA or RNA strand. TAP (tobacco acid pyrophosphatase) removes the 5' cap structure (the 5'-5' triphosphate linkage) and leaves a monophosphate at the 5' end. So applying first CIP and then TAP will convert the above (2) and (4) to (3), and then convert (1) to (2). After this treatment, only the originally capped RNAs carry the 5' monophosphate needed for ligation. That's one way to capture the 5' cap structure, with the same purpose as CAGE (but CAGE uses the type IIS restriction enzyme MmeI and the type III restriction enzyme EcoP15I).
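
The CIP-then-TAP logic can be written down as a tiny state model. This is only a bookkeeping sketch in Python of the reactions described above; the names (FivePrime, cip, tap, ligatable) are made up, and it models nothing beyond what this post states.

```python
from enum import Enum


class FivePrime(Enum):
    """The four 5' end types listed above."""
    CAP = "cap"     # (1) capped, full-length
    MONO_P = "p"    # (2) 5' monophosphate
    OH = "OH"       # (3) 5' hydroxyl
    TRI_P = "ppp"   # (4) 5' triphosphate


def cip(end):
    """CIP strips 5' phosphates: (2) and (4) become (3); cap and OH are untouched."""
    if end in (FivePrime.MONO_P, FivePrime.TRI_P):
        return FivePrime.OH
    return end


def tap(end):
    """TAP removes the cap and leaves a monophosphate: (1) becomes (2).

    Only the reaction described in the post is modelled here.
    """
    if end is FivePrime.CAP:
        return FivePrime.MONO_P
    return end


def ligatable(end):
    """Only a 5' monophosphate ligates to the 5' adaptor."""
    return end is FivePrime.MONO_P


if __name__ == "__main__":
    for end in FivePrime:
        treated = tap(cip(end))
        print(f"{end.name:7s} -> {treated.name:7s} ligatable={ligatable(treated)}")
```

Without treatment only (2) is ligatable; after CIP followed by TAP only (1) is, which is why the order of the two enzymes matters.
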

Tuesday, December 03, 2013

Illumina HiSeq2000 adaptor and sequencing


I referred to the following material when making the figure:

Just one learning note:

If the insert is not long enough (i.e. shorter than the read length), R1 will contain contamination from the Rd2 SP (e.g. AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC) and R2 will contain contamination from the reverse complement of the Rd1 SP (e.g. AGATCGGAAGAGCG).

So basically you just need to check whether the shared complementary part of Rd1 SP and Rd2 SP (which is AGATCGGAAGAGC) occurs in the reads. If yes, simply trim it and everything after it (if any).

Note: if you don't understand the "shared complementary part", please refer to my previous post on the Illumina adaptor. Here is the link: http://onetipperday.blogspot.com/2013/06/illumina-hiseq2000-adaptor.html

Here is one solution for how to remove adaptor contamination:
1. save the complementary part into a fasta file, e.g. adaptor.fa
>adaptor_complementary_part
AGATCGGAAGAGC
2. run fastq-mcf to remove adaptor
fastq-mcf -o filted -x 10 -l 15 -w 4 -u adaptor.fa input.fq.gz
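
To show what the trimming itself does, here is a minimal Python sketch (not what fastq-mcf does internally, and not a replacement for it). It looks for the shared part AGATCGGAAGAGC in a read and cuts the read there; the function name and the min_overlap parameter for a partial adaptor at the very end of the read are my own additions.

```python
ADAPTER_CORE = "AGATCGGAAGAGC"  # shared part of Rd1 SP / Rd2 SP from the note above


def trim_adapter(seq, qual, adapter=ADAPTER_CORE, min_overlap=5):
    """Trim the adaptor and everything after it from a read and its quality string.

    min_overlap is a hypothetical knob: the shortest adaptor prefix at the 3' end
    of the read that is still treated as contamination.
    """
    pos = seq.find(adapter)
    if pos == -1:
        # No full match: look for the read ending in a prefix of the adaptor,
        # trying the longest prefix first.
        for k in range(len(adapter) - 1, min_overlap - 1, -1):
            if seq.endswith(adapter[:k]):
                pos = len(seq) - k
                break
    if pos == -1:
        return seq, qual  # nothing to trim
    return seq[:pos], qual[:pos]


if __name__ == "__main__":
    read = "ACGTACGTAC" + ADAPTER_CORE + "ACACG"
    print(trim_adapter(read, "I" * len(read)))
```

This only does exact matching; real reads have sequencing errors, which is one reason to use a dedicated tool on real data.
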

Monday, September 12, 2011

about high-throughput sequencing

multiplexing: This term refers to sequencing multiple samples at the same time. For example, for small genomes like yeast, C. elegans and Drosophila, the number of reads generated in one sequencing unit (e.g. one of the 8 lanes of the Illumina Genome Analyzer) may be several times the number of reads needed to provide sufficient coverage of the genome, so several samples can share one lane.
barcode/index: To distinguish the different samples in the same run/lane, a short sequence, called a barcode/index, is added to the adaptor that is ligated to each sample during sample preparation, with a different barcode for each sample. "This is generally added to the 3' end of the upstream adapter so that this is the first 4-6 bases read during the sequencing run, then you sort the data by these first bases into your groups. "(http://www.umassmed.edu/DeepSeq_FAQ.aspx).
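
The "sort the data by these first bases" step can be sketched in a few lines of Python. This assumes in-line barcodes of equal length at the start of each read and exact matching only; the function name, the example barcodes and the sample names are made up for illustration.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample):
    """Group reads by their leading barcode and strip the barcode off.

    reads: iterable of read sequences (strings).
    barcode_to_sample: dict mapping barcode -> sample name; all barcodes
    are assumed to have the same length.
    """
    lengths = {len(b) for b in barcode_to_sample}
    if len(lengths) != 1:
        raise ValueError("this sketch expects barcodes of one common length")
    n = lengths.pop()
    bins = defaultdict(list)
    for read in reads:
        sample = barcode_to_sample.get(read[:n])
        if sample is None:
            bins["undetermined"].append(read)  # keep unmatched reads intact
        else:
            bins[sample].append(read[n:])      # drop the barcode bases
    return dict(bins)


if __name__ == "__main__":
    barcodes = {"ACGT": "sample1", "TGCA": "sample2"}  # hypothetical barcodes
    reads = ["ACGTGGGTTTAA", "TGCACCCAAATT", "NNNNGGGTTTAA"]
    print(demultiplex(reads, barcodes))
```

Real demultiplexers usually tolerate a mismatch or two in the barcode; this sketch does not.
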

Here is the barcoding system for SOLiD:

For Illumina, see the slides in presentation:


https://www.illumina.com/documents/seminars/presentations/2010-06_sq_21_morris_multiplex_target_enrichment.pdf