Wednesday, November 04, 2020

Explaination of "samtools flagstat" output

==> CRR119891.sam.flagstat <==

192013910 + 0 in total (QC-passed reads + QC-failed reads)

0 + 0 secondary

38830640 + 0 supplementary

0 + 0 duplicates

191681231 + 0 mapped (99.83% : N/A)

153183270 + 0 paired in sequencing

76591635 + 0 read1

76591635 + 0 read2

118230182 + 0 properly paired (77.18% : N/A)

152555312 + 0 with itself and mate mapped

295279 + 0 singletons (0.19% : N/A)

10160332 + 0 with mate mapped to a different chr

5532624 + 0 with mate mapped to a different chr (mapQ>=5)

Here are the notes from our careful check:

1. What's secondary, supplementary, and singleton? I was using "bwa mem" with default parameters. In that case, it only reports one alignment if the read can be mapped to multiple locations with equal quality. See discussion here: https://www.biostars.org/p/304614/#304620. That's why you don't see the secondary alignment in above example. Supplementary alignments are those alignments that are part of a chimeric alignment. For example, in a circRNA case, one read is broke into two pieces, mapped to different locations in back-splicing order. Then one of the two fragment will be supplementary (the other is not). Singleton is a mapped read whose mate is unmapped. So its flag has 0x1 and 0x8 but 0x4 is unset. Note that a singleton's mate (which is unmapped) can have a flag of 0110.  

2. What's the total number of reads? The number in the output is NOT for reads, but rather for alignment in the SAM files. The samtools flagstat only check the FLAG, not the read ID. So, in the above example, the total number of reads (R1+R2) should be 192013910 - 38830640. (If there is secondary or duplicate, it should also be excluded). It's also indicated by "paired in sequencing", 153183270 (R1 +R2). This can also be calculated by checking the read ID column e.g. cat CRR119891.sam | cut -f1 | sort -u | wc -l, then times 2 for paired-end reads.

3. How many reads are unmapped? The total alignment is sum of "mapped" + "singletons" + unmapped. So, the number of unmapped reads should be calcualted as 192013910-191681231-295279 = 37400 in above case. The unmapped reads can also be counted by checking the RNAME column in the SAM file (e.g. cat CRR119891.sam | awk '$3=="*"' | wc -l). 

4. Mappability? The above mapped % is based on "mapped" / "total" alignment, not based on reads. If for reads, it should be calculated based on above two numbers. 

No comments:

Post a Comment