One Tip Per Day: Tophat

Showing posts with label Tophat. Show all posts

Thursday, May 29, 2014

samtools flagstat won't output the number of reads mapped

I have been using samtools flagstat to get statistics summary of my BAM/SAM file, like many tutorial suggested (e.g. Dave Tang's note).

But then I noticed that flagstat reports the number of mapping (or alignments/hits, whatever you like to call), but not the number of reads mapped. This can be very different if your alignment contains multiple mappers (and this is especially true if you use default setting or without setting "-g 1" in Tophat). To get the number of reads mapped (e.g. in order to get mappability), you can use the following command:

samtools view -cF 0x100 accepted_hits.bam

Note: Even with multiple mappers, only one hit has flag of 0x100 in the output SAM file. I've already explained this in one of my previous posts.

Tuesday, September 18, 2012

--report-secondary-alignments in Tophat is nothing with 0x100 SAM flag of SAM output

--report-secondary-alignments option in Tophat is off by default, which means only best or primary alignments will be reported. So, if a read mapped to multiple loci with perfect match (or equal best score), then they all should be reported by default (within the limit of -g/--max-multihits). For example, when using "-g 100" and the read mapped to 40 positions with perfect match, then these 40 alignments are output in the SAM file. (If "-g 40" and the read mapped to 100 positions perfectly, then only 40 hits will be randomly output).

However, if --report-secondary-alignments is on, Tophat will continue to find the secondary best alignments to get the 100 hits output, out of which 40 are called best/primary alignments.

In either case above, the SAM file only has one hit without 0x100 FLAG (i.e. all of the rest have 0x100 SAM flag, they all are called "secondary alignment" by SAM specification).

This is from discussion with Jessie. I will check to confirm.

Wednesday, August 22, 2012

How Tophat call junctions from deletions?

We noticed that in the output SAM file of Tophat, there are two kinds of CIGAR line indicating the same type of alignment - deletion:

xxxxMxxxxDxxxxM and xxxxMxxxxNxxxxM

Here is what I parsed from the source code of Tophat

-------------- long_spanning_reads.cpp --------------

/*
* FIXME, currently just differentiating between a deletion and a
* reference skip based on length. However, would probably be better
* to denote the difference explicitly, this would allow the user
* to supply their own (very large) deletions
*/
if ((lb->right - lb->left - 1) <= max_deletion_length)
{
new_cigar.push_back(CigarOp(DEL, lb->right - lb->left - 1));
antisense_closure = prev_hit->is_spliced() ? prev_hit->antisense_splice() : curr_hit->antisense_splice();
}
else
{
new_cigar.push_back(CigarOp(REF_SKIP, lb->right - lb->left - 1));
antisense_closure = lb->antisense;
}

As the author commented, this is still to fix.

From the code above, it seems that max_deletion_length is the one to control the DEL from REF_SKIP. But it does not change the output if I tune the value of --max-deletion-length in Tophat.

I also tried to confirm this by compiling the code myself, but have an error from making at the moment ("tophat_reports.cpp:105: error: ‘boost::random::mt19937’ has not been declared"). mt19937 is found in ~/include/boost/tr1/random.hpp, but even I include <boost/tr1/random.hpp> in tophat_reports.cpp, it still doesn't work. Don;t know why...

Friday, August 10, 2012

awk script to correct XS:A tag of Tophat output for strand-specific paired-end reads

It's been noticed that the current Tophat (v2.0.3) can assign wrong XS:A tag for strand-specific paired-end library, at least for some reads. Here is such an example:

HWI-ST560:74:D14ELACXX:4:1201:8827:168386 pr1 chr1 10024104 50 50M = 10020756 -3398 TGGTTCTTGAAACTGCTGGTTCAGCATCTGTGTACTAACATCAATCCCGG IJJJJJJJIIIGJJJJJJJJJJJJJJJJJJJJIJJJJHHHHHFFFFFCCC AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:50 YT:Z:UU NH:i:1 XS:A:+
HWI-ST560:74:D14ELACXX:4:1201:8827:168386 pR2 chr1 10020756 50 36M1628N14M = 10024104 3398 GATGATTTGAAATATGAGACTTCTAAGGCATAATATTGTTTGCAGTGCAC CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJIIIIJJJJIJIJHJGEC AS:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50 NM:i:0 XS:A:- NH:i:1

This is a dUTP protocol where "R2/r1" FLAG for the read pair indicate the reads are a transcript on the + strand, which means both reads should be assigned as XS:A:+. However /2 is assigned to XS:A:-.

I've written an awk script to solve the problem:

#!/bin/awk -f

BEGIN{

if(save_discrepancy_to_file!="") system("[ -e " save_discrepancy_to_file " ] && rm " save_discrepancy_to_file);

}

{

if($1 ~ /^@/) print;

else

{

for(i=1;i<=NF;i++) if($i!~/^XS/) printf("%s\t",$i); else XS0=$i;

XS1=XS0;

if($2~/^0x/ || $2~/^[0-9]+$/){ # FLAG in HEX or Decimal format

if(libtype=="fr-firststrand") XS1=((and($2, 0x10) && and($2, 0x40)) || (and($2,0x80) && !and($2,0x10)))?"XS:A:+":"XS:A:-";

if(libtype=="fr-secondstrand") XS1=((and($2, 0x10) && and($2, 0x80)) || (and($2,0x40) && !and($2,0x10)))?"XS:A:+":"XS:A:-";

}

else if($2~/^[:alpha:]/){ # FLAG in string

if(libtype=="fr-firststrand") XS1=($2~/r.*1/ || ($2~/2/ && $2!~/r/))?"XS:A:+":"XS:A:-";

if(libtype=="fr-secondstrand") XS1=($2~/r.*2/|| ($2~/1/ && $2!~/r/))?"XS:A:+":"XS:A:-";

}

print XS1;

if(save_discrepancy_to_file!="" && XS1!=XS0) print >> save_discrepancy_to_file;

}

Monday, July 30, 2012

How to tell which library type to use (fr-firststrand or fr-secondstrand)?

First of all, as a bioinformatian, you should ask the data producer (e.g. the one who prepared the RNAseq library) which protocol they used to generate the data.

Tophat manual page has listed the general strand-specific protocol:

Library Type	Examples	Description
fr-unstranded	`Standard Illumina`	Reads from the left-most end of the fragment (in transcript coordinates) map to the transcript strand, and the right-most end maps to the opposite strand.
fr-firststrand	`dUTP, NSR, NNSR`	Same as above except we enforce the rule that the right-most end of the fragment (in transcript coordinates) is the first sequenced (or only sequenced for single-end reads). Equivalently, it is assumed that only the strand generated during first strand synthesis is sequenced.
fr-secondstrand	`Ligation, Standard SOLiD`	Same as above except we enforce the rule that the left-most end of the fragment (in transcript coordinates) is the first sequenced (or only sequenced for single-end reads). Equivalently, it is assumed that only the strand generated during second strand synthesis is sequenced.

In case you don't know the library-type, you can still figure it out by yourself. Tophat FAQ page provided a solution for that (http://tophat.cbcb.umd.edu/faq.html#library_type). But more simply (comparing to running 1M reads first), you can choose few reads and BLAT to genome and infer the library-type from the mapping result.

Generally, reads from the left-most end of RNA fragment (always from 5´ to 3´) are always mapped to transcript-strand, and (for pair-end sequencing) reads from the right-most end are always mapped to the opposite strand. See the arrows direction in the below schema. This is because the sequencer always read from 5´ to 3´.

Summary of library type protocols (for Tophat/Bowtie)

But regarding to which strand the RNA fragment is synthesized from, this involves different strand-specific protocols. Thanks to the illustration figure (see below) from Zhao Zhang, we could see that for example dUTP method is to only sequence the strand from the first strand synthesis (the original RNA strand is degradated due to the dUTP incorporated), so the /2 read is from the original RNA strand.

Strand-specific library protocols (Credit: Zhao Zhang)

Taking a real example, first getting some reads (in fasta format) from the paired-end sequencing fastq file using command like:

$ zcat ~/nearline/rnaseq/BU/Jul2012/Sample_3576_H_01.R1.fastq.gz | sed 's/@//g;s/ /_/g' | awk '{if(NR%4==1)print ">"$0;if(NR%4==2) print $0;}' | head

$ zcat ~/nearline/rnaseq/BU/Jul2012/Sample_3576_H_01.R2.fastq.gz | sed 's/@//g;s/ /_/g' | awk '{if(NR%4==1)print ">"$0;if(NR%4==2) print $0;}' | head

Blatting them in UCSC Genome Browser

Below is screenshot for top hits of one pair of reads. They mapped to exons of OS9 genes (the left one is /1 and right one is /2, with opposite direction). We see that /1 mapped to transcript direction, /2 mapped to opposite direction, which means it can only be fr-secondstrand or fr-unstrand (cannot be fr-firststrand).

Continuing to look at other reads in the file, we can find examples like these:

where /2 mapped to transcript strand and /1 mapped to the opposite strand. Combining with the observation from above, we can conclude that this is a fr-unstrand library.

Pages