Monday, July 30, 2012

How to tell which library type to use (fr-firststrand or fr-secondstrand)?

First of all, as a bioinformatician, you should ask the data producer (e.g. the person who prepared the RNA-seq library) which protocol they used to generate the data.

The TopHat manual lists the general strand-specific protocols:

Library Type | Examples | Description
fr-unstranded | Standard Illumina | Reads from the left-most end of the fragment (in transcript coordinates) map to the transcript strand, and the right-most end maps to the opposite strand.
fr-firststrand | dUTP, NSR, NNSR | Same as above except we enforce the rule that the right-most end of the fragment (in transcript coordinates) is the first sequenced (or only sequenced for single-end reads). Equivalently, it is assumed that only the strand generated during first strand synthesis is sequenced.
fr-secondstrand | Ligation, Standard SOLiD | Same as above except we enforce the rule that the left-most end of the fragment (in transcript coordinates) is the first sequenced (or only sequenced for single-end reads). Equivalently, it is assumed that only the strand generated during second strand synthesis is sequenced.

If you don't know the library type, you can still work it out yourself. The TopHat FAQ page offers one solution (http://tophat.cbcb.umd.edu/faq.html#library_type). A simpler approach than first running 1M reads is to pick a few reads, BLAT them against the genome, and infer the library type from the mapping result.

Generally, the read from the left-most end of the RNA fragment (always read 5´ to 3´) maps to the transcript strand, and (for paired-end sequencing) the read from the right-most end maps to the opposite strand. See the direction of the arrows in the schema below. This is because the sequencer always reads from 5´ to 3´.
Summary of library type protocols (for Tophat/Bowtie)

Which strand the sequenced fragment is synthesized from depends on the strand-specific protocol. Thanks to the illustration from Zhao Zhang (see below), we can see that, for example, the dUTP method sequences only the strand from first-strand synthesis (the second strand is degraded because of the dUTP it incorporates), so the /2 read has the same orientation as the original RNA strand.
Strand-specific library protocols (Credit: Zhao Zhang)
As a real example, first extract some reads (in FASTA format) from the paired-end FASTQ files using commands like:

$ zcat ~/nearline/rnaseq/BU/Jul2012/Sample_3576_H_01.R1.fastq.gz | sed 's/@//g;s/ /_/g' | awk '{if(NR%4==1)print ">"$0;if(NR%4==2) print $0;}' | head

$ zcat ~/nearline/rnaseq/BU/Jul2012/Sample_3576_H_01.R2.fastq.gz | sed 's/@//g;s/ /_/g' | awk '{if(NR%4==1)print ">"$0;if(NR%4==2) print $0;}' | head
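The two commands above can also be wrapped in a small helper that tags each read name with its mate number, which makes it easier to tell /1 from /2 in the BLAT results. This is only a sketch: the function name `fastq_to_fasta` and its two arguments (mate suffix, number of reads) are made up for illustration, and it assumes plain four-line FASTQ records.

```shell
# fastq_to_fasta SUFFIX N   (hypothetical helper, not part of any tool)
# Reads four-line FASTQ records on stdin and prints the first N reads
# as FASTA, naming each read ">ID/SUFFIX" (ID = header up to first space).
fastq_to_fasta() {
    awk -v suffix="$1" -v n="$2" '
        # Header line: drop the leading "@" and keep only the read ID
        NR % 4 == 1 { sub(/^@/, ""); split($0, f, " "); printf(">%s/%s\n", f[1], suffix) }
        # Sequence line: print it and stop once N reads have been written
        NR % 4 == 2 { print; if (++count >= n) exit }
    '
}
```

With a hypothetical file name, usage would look like `zcat sample.R1.fastq.gz | fastq_to_fasta 1 5` for the first mate and `zcat sample.R2.fastq.gz | fastq_to_fasta 2 5` for the second.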

Then BLAT them in the UCSC Genome Browser.

Below is a screenshot of the top hits for one pair of reads. They map to exons of the OS9 gene (the left one is /1 and the right one is /2, in opposite directions). Read /1 maps in the transcript direction and /2 in the opposite direction, which means the library can only be fr-secondstrand or fr-unstranded (it cannot be fr-firststrand).
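The reasoning for a single read pair can be written down as a tiny rule. The sketch below is illustrative only: `compatible_library_types` is a made-up name, and it assumes you have already read off the strand that /1 aligned to and the strand of the annotated gene (each given as `+` or `-`).

```shell
# compatible_library_types READ1_STRAND GENE_STRAND   (hypothetical helper)
# Prints the TopHat library types consistent with one read pair, given
# the strand /1 aligned to and the strand of the overlapping gene.
compatible_library_types() {
    if [ "$1" = "$2" ]; then
        # /1 is on the transcript strand: rules out fr-firststrand
        echo "fr-secondstrand or fr-unstranded"
    else
        # /1 is on the opposite strand: rules out fr-secondstrand
        echo "fr-firststrand or fr-unstranded"
    fi
}
```

A single pair can only exclude one of the stranded types; it takes pairs in both orientations to point to an unstranded library.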


Continuing to look at other reads in the file, we can find examples like these:

where /2 maps to the transcript strand and /1 maps to the opposite strand. Combined with the observation above, we can conclude that this is an fr-unstranded library.
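If you inspect more than a handful of pairs, the same conclusion can be reached by tallying. The following is a rough sketch, not a TopHat feature: the function name `infer_library_type`, the two-column input format (strand of /1, strand of the gene, one pair per line), and the 90% cutoff are all assumptions chosen for illustration.

```shell
# infer_library_type   (hypothetical helper; reads stdin)
# Input: one line per read pair, "READ1_STRAND GENE_STRAND" (+ or -).
# Output: the library type suggested by the tally.
infer_library_type() {
    awk -v cutoff=0.9 '          # cutoff is an arbitrary assumption
        NF < 2   { next }        # skip blank or incomplete lines
        $1 == $2 { sense++ }     # /1 on the transcript strand
        $1 != $2 { antisense++ } # /1 on the opposite strand
        END {
            total = sense + antisense
            if (total == 0) { print "undetermined"; exit }
            if (sense / total >= cutoff)          print "fr-secondstrand"
            else if (antisense / total >= cutoff) print "fr-firststrand"
            else                                  print "fr-unstranded"
        }
    '
}
```

For example, `printf '%s\n' '+ +' '- +' | infer_library_type` describes one pair in each orientation, the situation found in this post. With only a few pairs the call is weak evidence, so it is worth checking more reads before settling on a library type.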

5 comments:

  1. Anonymous, 8:33 AM

    Actually, the dUTP and the ligation methods both sequence from the left-most site. The nice picture above is right, but the text is wrong.

  2. This comment has been removed by the author.

  3. Hi there:

    Thanks for the great post. Could you share the title of Zhao Zhang's paper, please? I am very interested in reading it.

    Thanks in advance.

    Replies
    1. Hi German. Thanks for your interest in my blog. Please contact him directly here: https://emb.carnegiescience.edu/science/faculty/zhao-zhang

  4. Hi Xianjun,
    By the definition on Wikipedia (https://en.wikipedia.org/wiki/Coding_strand), the transcribed strand refers to the non-coding strand, so "transcript strand" in the blog is a little confusing. Perhaps you could use coding/non-coding or sense/antisense.

    Best,
    Tao
