Tuesday, December 29, 2020

compression level in samtools and gzip

samtools fastq converts a BAM file to FASTQ format, e.g. samtools fastq input.bam -o output.fastq

The output file is compressed automatically if its name has a .gz or .bgzf extension, e.g.

samtools fastq input.bam -o output1.fastq.gz

Alternatively, you can pipe stdout to a compressor explicitly, e.g.

samtools fastq input.bam | gzip > output2.fastq.gz

Interestingly, I noticed that output2.fastq.gz is significantly smaller than output1.fastq.gz, even though the uncompressed file content is the same. 
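To check an observation like this, a small helper can print the compressed size of each file next to a checksum of its decompressed content. This is only a sketch: compare_gz is a hypothetical name I made up, and it assumes gzip, cksum, wc, cut and tr are available on the system.

```shell
# compare_gz (hypothetical helper): for each .gz file, print its compressed
# size and a checksum of the decompressed content.
# Same checksum + different size = same data, different compression.
compare_gz() {
  for f in "$@"; do
    size=$(wc -c < "$f" | tr -d ' ')
    sum=$(gzip -dc "$f" | cksum | cut -d ' ' -f 1)
    printf '%s\t%s bytes\tcksum %s\n' "$f" "$size" "$sum"
  done
}

# Usage with the file names from the commands above:
#   compare_gz output1.fastq.gz output2.fastq.gz
```

If the two checksums match while the byte counts differ, the difference comes from how the data was compressed, not from the reads themselves.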

This is because samtools and gzip use different default compression levels.

The default compression level of samtools fastq is 1 (in the range [0..9]), while gzip's default is 6 (in the range [1..9]). A lower level compresses faster but produces a larger file.
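The effect of the level alone can be reproduced without samtools. The sketch below generates synthetic FASTQ-like reads (pseudo-random and purely illustrative, not real sequencing data) and compresses them with gzip at level 1 and level 6. Here gzip -1 only stands in for the samtools default; the actual samtools output is written by a different code path, so its exact size will likely differ.

```shell
# Sketch: compare gzip levels 1 and 6 on synthetic FASTQ-like data.
tmpdir=$(mktemp -d)

# 2000 made-up reads of 100 bases each, with made-up quality strings.
awk 'BEGIN {
  srand(42)
  for (r = 0; r < 2000; r++) {
    seq = ""; qual = ""
    for (i = 0; i < 100; i++) {
      seq  = seq  substr("ACGT", int(rand() * 4) + 1, 1)
      qual = qual substr("FGHI", int(rand() * 4) + 1, 1)
    }
    printf "@read%d\n%s\n+\n%s\n", r, seq, qual
  }
}' > "$tmpdir/reads.fastq"

# Level 1 is the samtools fastq default; level 6 is the gzip default.
gzip -1 -c "$tmpdir/reads.fastq" > "$tmpdir/level1.fastq.gz"
gzip -6 -c "$tmpdir/reads.fastq" > "$tmpdir/level6.fastq.gz"

size1=$(wc -c < "$tmpdir/level1.fastq.gz" | tr -d ' ')
size6=$(wc -c < "$tmpdir/level6.fastq.gz" | tr -d ' ')
sum1=$(gzip -dc "$tmpdir/level1.fastq.gz" | cksum)
sum6=$(gzip -dc "$tmpdir/level6.fastq.gz" | cksum)

echo "level 1: $size1 bytes"
echo "level 6: $size6 bytes"
[ "$sum1" = "$sum6" ] && echo "decompressed content is identical"

rm -rf "$tmpdir"
```

On real data, the two routes can be brought closer together by setting the level explicitly: samtools fastq -c 6 input.bam -o output1.fastq.gz raises the samtools level to gzip's default, and samtools fastq input.bam | gzip -1 > output2.fastq.gz lowers the gzip level to the samtools default. I would not expect byte-identical sizes even then, only similar ones.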

Usage:

$ samtools fastq
Usage: samtools fastq [options...] <in.bam>

Description:
Converts a SAM, BAM or CRAM into either FASTQ or FASTA format depending on the command invoked.

Options:
  -0 FILE              write reads designated READ_OTHER to FILE
  -1 FILE              write reads designated READ1 to FILE
  -2 FILE              write reads designated READ2 to FILE
  -o FILE              write reads designated READ1 or READ2 to FILE
                       note: if a singleton file is specified with -s, only
                       paired reads will be written to the -1 and -2 files.
  -f INT               only include reads with all  of the FLAGs in INT present [0]
  -F INT               only include reads with none of the FLAGS in INT present [0x900]
  -G INT               only EXCLUDE reads with all  of the FLAGs in INT present [0]
  -n                   don't append /1 and /2 to the read name
  -N                   always append /1 and /2 to the read name
  -O                   output quality in the OQ tag if present
  -s FILE              write singleton reads designated READ1 or READ2 to FILE
  -t                   copy RG, BC and QT tags to the FASTQ header line
  -T TAGLIST           copy arbitrary tags to the FASTQ header line
  -v INT               default quality score if not given in file [1]
  -i                   add Illumina Casava 1.8 format entry to header (eg 1:N:0:ATCACG)
  -c                   compression level [0..9] to use when creating gz or bgzf fastq files [1]
  --i1 FILE            write first index reads to FILE
  --i2 FILE            write second index reads to FILE
  --barcode-tag TAG    Barcode tag [default: BC]
  --quality-tag TAG    Quality tag [default: QT]
  --index-format STR   How to parse barcode and quality tags

      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]
  -@, --threads INT
               Number of additional threads to use [0]
      --verbosity INT
               Set level of verbosity

The files will be automatically compressed if the file names have a .gz or .bgzf extension.


GZIP(1)                                                                GZIP(1)

NAME
       gzip, gunzip, zcat - compress or expand files

SYNOPSIS
       gzip [ -acdfhlLnNrtvV19 ] [-S suffix] [--rsyncable] [ name ...  ]
       gunzip [ -acfhlLnNrtvV ] [-S suffix] [ name ...  ]
       zcat [ -fhLV ] [ name ...  ]

OPTIONS
       -# --fast --best
              Regulate  the speed of compression using the specified digit #, where -1 or --fast indicates the fastest compression method (less compression) and -9 or --best indicates the slowest compression method (best compression).  The default compression level is -6 (that is, biased towards high compression at expense of speed).
