Showing posts with label BAM. Show all posts

Wednesday, April 18, 2012

sorting BAM makes the file smaller

Yes, I'm not the only one who has noticed this change. See here: http://seqanswers.com/forums/archive/index.php/t-13652.html

As Heng said, "BAM is compressed. Sorting helps to give a better compression ratio because similar sequences are grouped together." So the smaller size is not due to unmapped reads being removed (they are kept and put at the end).
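
To see the effect Heng describes without any BAM at hand, here is a toy sketch using plain gzip (BAM's BGZF blocks use the same deflate compression). Everything in it is made up for illustration: the 500 random 100-base sequences, the 4 copies of each, and the file names. Copies of the same sequence start out 500 lines apart, which is beyond the 32 KB window deflate can look back over, and sorting brings them next to each other.

```shell
#!/bin/sh
# Toy model of "similar sequences grouped together" (not real BAM data).
tmpdir=$(mktemp -d)

# 500 random 100-base sequences, each written 4 times.
# Line i carries sequence (i * 37) % 500, so identical lines are 500 lines apart.
awk 'BEGIN {
    srand(1)
    split("A C G T", base, " ")
    for (k = 0; k < 500; k++) {
        s = ""
        for (j = 0; j < 100; j++) s = s base[int(rand() * 4) + 1]
        payload[k] = s
    }
    for (i = 0; i < 2000; i++) {
        k = (i * 37) % 500
        printf "%04d\t%s\n", k, payload[k]
    }
}' > "$tmpdir/toy_unsorted.txt"

# Sorting puts the 4 copies of each sequence on adjacent lines.
sort "$tmpdir/toy_unsorted.txt" > "$tmpdir/toy_sorted.txt"

# Compress both and compare the sizes in bytes.
unsorted_size=$(gzip -c "$tmpdir/toy_unsorted.txt" | wc -c | tr -d ' ')
sorted_size=$(gzip -c "$tmpdir/toy_sorted.txt" | wc -c | tr -d ' ')
echo "unsorted: $unsorted_size bytes, sorted: $sorted_size bytes"
```

The two files hold exactly the same lines, so any size difference comes from the ordering alone. Real reads overlapping the same region are similar rather than identical, so expect a smaller gain on real data.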

The tip: always sort the output BAM after converting from SAM, e.g.

samtools view -Sbu in.sam | samtools sort - in.sorted
mv in.sorted.bam in.bam
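
The commands above use the samtools 0.1.x syntax, where sort takes an output prefix. As a hedged sketch for samtools 1.x (an assumption about which version you have installed), sort takes the output file through -o, and view no longer needs -S. The function name sam_to_sorted_bam is made up for illustration, and the input SAM is assumed to carry a header.

```shell
# Convert a SAM file (with header) to a coordinate-sorted BAM.
# Assumes samtools 1.x is on PATH; sam_to_sorted_bam is a hypothetical helper name.
sam_to_sorted_bam() {
    in_sam=$1
    out_bam=$2
    # -u: uncompressed BAM on the pipe, so the data is not compressed twice.
    # The trailing "-" tells sort to read from standard input.
    samtools view -u "$in_sam" | samtools sort -o "$out_bam" -
}

# Usage (needs samtools and a real in.sam):
#   sam_to_sorted_bam in.sam in.bam
```

Writing straight to the final name removes the need for the mv step.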

A sorted BAM is smaller and better for searching: once coordinate-sorted, it can be indexed with samtools index for fast lookup by region.

---------------------------
Other tips are:

1. SAM->BAM conversion does not require a sorted header, or any header at all.
If there is a header: samtools view -Sb in.sam > in.bam
If there is no header: samtools view -Sbt genome.fa.fai in.sam > in.bam
But the reads in in.bam will not follow the order of the sequence dictionary (genome.fa.fai) unless you sort the file with samtools sort.
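
A small sketch for choosing between the two commands above: look at the first line of the SAM. Header lines start with @ plus a two-letter code (@HD, @SQ, ...) followed by a tab, and they come before any alignment. The helper names has_sam_header and suggest_conversion and the two toy files are made up for illustration; the script only prints the command, it does not run samtools.

```shell
# Succeeds (exit 0) if the first line of the SAM file looks like a header line.
has_sam_header() {
    awk 'NR == 1 { found = ($0 ~ /^@[A-Z][A-Z]\t/); exit }
         END { exit !found }' "$1"
}

# Print the conversion command that fits the file.
# genome.fa.fai is the index file name used in the post.
suggest_conversion() {
    if has_sam_header "$1"; then
        echo "samtools view -Sb $1 > ${1%.sam}.bam"
    else
        echo "samtools view -Sbt genome.fa.fai $1 > ${1%.sam}.bam"
    fi
}

# Two toy SAM files: one with an @SQ header line, one with alignments only.
tmpdir=$(mktemp -d)
printf '@SQ\tSN:chr1\tLN:1000\nr1\t0\tchr1\t1\t255\t4M\t*\t0\t0\tACGT\tIIII\n' > "$tmpdir/with_header.sam"
printf 'r1\t0\tchr1\t1\t255\t4M\t*\t0\t0\tACGT\tIIII\n' > "$tmpdir/no_header.sam"

suggest_conversion "$tmpdir/with_header.sam"
suggest_conversion "$tmpdir/no_header.sam"
```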

2. When the input is BAM, cufflinks requires a proper header in the file, especially the line

@HD VN:1.0 SO:coordinate

Without this line, cufflinks cannot tell that your BAM (or SAM) is sorted, even if it is; the only way to tell it is through the @HD line. So, I guess

@HD VN:1.0 SO:unsorted 

won't work.
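
To check what a file declares before handing it to cufflinks, here is a minimal sketch that pulls the SO value out of the @HD line. It reads header text on standard input, so with a real file you would feed it the output of samtools view -H in.bam. The helper name sam_sort_order and the toy headers are made up for illustration; note that the fields in a real header are separated by tabs.

```shell
# Print the SO (sort order) value from the @HD line of a SAM header on stdin.
# Prints "missing" when there is no @HD line or it has no SO tag.
sam_sort_order() {
    awk -F '\t' '
        $1 == "@HD" {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^SO:/) so = substr($i, 4)
            exit
        }
        END { print (so == "" ? "missing" : so) }
    '
}

# Toy headers; real input would be: samtools view -H in.bam | sam_sort_order
printf '@HD\tVN:1.0\tSO:coordinate\n@SQ\tSN:chr1\tLN:1000\n' | sam_sort_order
printf '@SQ\tSN:chr1\tLN:1000\n' | sam_sort_order
```

If the value is missing or wrong although the file really is coordinate-sorted, one option is to write a corrected header to a file and swap it in with samtools reheader new_header.sam in.bam > fixed.bam.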