Tuesday, April 03, 2012

understand the FLAG code of SAM format


Here is what SAM specification stated for the FLAG column:
FLAG: bitwise FLAG. Each bit is explained in the following table:
Bit Description
0x1 template having multiple segments in sequencing (1)
0x2 each segment properly aligned according to the aligner (2)
0x4 segment unmapped (4)
0x8 next segment in the template unmapped (8)
0x10 SEQ being reverse complemented (16)
0x20 SEQ of the next segment in the template being reversed (32)
0x40 the rst segment in the template (64)
0x80 the last segment in the template (128)
0x100 secondary alignment (256)
0x200 not passing quality controls (512)
0x400 PCR or optical duplicate (1024)
These codes are hexadecimal (also base 16, or hex) code, e.g. 0x10 = 16 in base ten format (decimal). It's easy to convert hex to binary format, simply convert each bit in hex format to 4 bits, e.g. 1=0001, 0=0000, so, 0x10=00010000 in binary way.

In SAM format, the 2nd column (i.e. FLAG above) is one of the mandatory fields in the alignment section. It can be used to indicate the alignment, for example, 0x4 means unmapped reads, 0x10 means the reads mapped to the minus strand.  In the SAM format, it uses decimal format, so imagine 64+16+2+1=83 means it's first read (0x40) of pair-end reads (0x1) and it's mapped on minus strand (0x10) and both reads mapped (0x2).  Here is a very nice post from Kamalakar giving the full explanation of the code:

http://seqanswers.com/forums/showthread.php?t=17314

For paired reads, 0'th bit HAS to be set. Hence all flags for paired reads HAVE to be odd. In other words, all even-numbered flags other than the above three (0, 4 and 16) are meaningless (Note: this may not totally true; for multiple hits, 256 is used for alignments beyond the first one. See below). (kgulukota)

The four good numbers to remember are 64+16+2+1=83, 64+32+2+1=99, 128+16+2+1=147 and 128+32+2+1=163. Something is very wrong if you ever see both 128 and 64 together, and with most current technologies, you should see 16 or 32 (Note: as kgulukota stated above, you should not see 32 alone in the SAM FLAG column), but not both. If you see both, or don't see either, your reads are paired strangely.  (swbarnes2)

Each reported read or pair alignment beyond the first has the SAM 'secondary' bit (which equals 256) set in its FLAG field. So, for multiple mapping reads, SAM alignments also contain their strand information.

btw, For strand-specific RNAseq, the Tophat output SAM (converted from BAM) does not contain XS:A:+/- tag (which is required by cufflinks) for non-spliced reads. In order to get the proper strand info for the assembly, you need to manually add the tags. Here is an example code:



samtools view -h accepted_hits.bam | awk '{if($0 ~ /XS:A:/ || $1 ~ /^@/) print $0; else {if($2==16 || $2==272) print $0"\tXS:A:-"; if($2==0 || $2==256) print $0"\tXS:A:+";}}' accepted_hits.sam

FLAG=256 and 272 is the corresponding version of 0 and 16 for multiple mapped hits.

UPDATE: This only works for single-end lib. For pair-end lib, the FLAG should be odd number, but in any case, reads on minus strand always have 1 on the 5th bit of binary code (e.g. 0x10 =10000). Thanks to Wei's suggestion on this. Here is the updated code:

samtools view -h accepted_hits.bam | awk '{if($0 ~ /XS:A:/ || $1 ~ /^@/) print $0; else {if(and($2,0x10)==16) print $0"\tXS:A:-"; else print $0"\tXS:A:+";}}' accepted_hits.sam


UPDATE2: (from the  final version of RNAseq lecture slides)ØFor non-strand-specific lib, you’re actually sequencing cDNA from both strands. So, the ‘strand’ info in the alignment is senseless (except for the spliced-reads).
ØFor strand-specific lib, if FLAG contains 0x10 (e.g. 0x53=0x40+0x10+0x2+0x1), reads map to ‘minus’ strand, otherwise, ‘plus’ strand.
ØCufflinks requires XS:A:+/- tag (which Tophat doesnt have for some reads). You need to manually add it by command, e.g. for dUTP library (where /2 is from transcript strand, /1 from the opposite strand):
samtools view -h accepted_hits.bam | awk '{if($0 ~ /XS:A:/ || $1 ~ /^@/) print $0; else {if(and($2,0x40) || and($2,0x90)) print $0"\tXS:A:-"; else print $0"\tXS:A:+";}}' accepted_hits.sam


 See my post here:

http://seqanswers.com/forums/showthread.php?p=69643#post69643

If you used a non-strand-specific protocol (which most people still do) you're not actually sequencing transcripts, you're sequencing cDNA with two strands. So the read could come from either of the two strands of a cDNA and you don't have any information which of the two strands corresponds to the original mRNA strand. This can be inferred when a read spans a splice junctions because splice site are highly conserved at the first 2 bases and last 2 bases of an intron. (Thomas Doktor)


# So, the strand info in the SAM alignment for non-strand-specific RNAseq lib cannot be used as evidence of transcript strand.

No comments:

Post a Comment