Thursday, August 08, 2013

How to select MACS peaks based on p-value, fold_enrichment and FDR?

In the output peaks.xls file from MACS(v1.4), there are 3 columns: -10log10(pvalue), fold_enrichment, and FDR(%). We can use these three columns to sort/filter peaks. Here are what they means:
The '−10*log10(Pvalue)' column lists the transformed P value of each peak, which makes peak sorting easier. For example, a P value of 1e−5 would be transformed to 50. The 'fold_enrichment' column shows the ratio of the ChIP-seq read count to the local value of lambda within each peak. The 'FDR(%)' column contains the empirical FDR percentage for each peak. For example, the fourth peak in the list has an 'FDR(%)' value of '6.45' and '−10*log10(Pvalue)' value of '83.67'; using the same P value cutoff of 4.4e−09 = 1083.67/−10, the ratio of the number of peaks identified by MACS after and before exchanging control and ChIP-seq samples is 6.45:100. The FDR column is only available when the control sample is available.
As the MACS paper (Zhang et al., Genome Biology, 2008) clearly stated, MACS uses Poisson distribution to measure the distribution of ChIP-seq tag. Poisson distribution is characterized as having only one parameter, λ, for both mean and variance. The origin of using Poisson distribution to measure ChIPseq tag can be referred to a Nature paper (see its supplementary note):
The number of sequence reads required to map a chromatin feature can be estimated from a simple model.
Suppose that the genome is divided into N non-overlapping bins of fixed size, that a fraction of these bins contain a particular chromatin feature and that one performs ChIP-Seq with an antibody that enriches the sequence in these bins by a factor of e. If one collects a total of R sequence reads, the number of reads in a bin should approximately follow a Poisson distribution with mean eM for bins containing the feature and M for the other bins, where M = R/N(ef+(1-f))
Xianjun: For those who have difficulty to understand the formula, I would suggest to just understand it like this: if there is a single Poisson distribution, the λ should be R/N; when there are two parts/distribution, we just divide N into Nf and N(1-f) and give different weight to them. For the unenriched part, you want to weight the enriched bins with weight e (because it's more enriched). That is R/(Nfe+N(1-f))=M; For the enriched part, you want to weight unenriched bins with 1/e, then the λ is R/(Nf+N(1-f)/e)=eM. (Thanks to Shikui and Sowmya for helping understanding Poisson distribution and the formula!)

MACS uses a dynamic λ to compensate the local fluctuations and biases even in control sample: 


λlocal = max(λBG, [λ1k,] λ5k, λ10k)
where λBG is a uniform estimation for the whole genome, λ1k, λ5k and λ10k are λ estimated from the 1 kb, 5 kb or 10 kb window centered at the peak location in the control sample, or the ChIP-Seq sample when a control sample is not available (in which case λ1k is not used). λlocal captures the influence of local biases, and is robust against occasional low tag counts at small local regions. Xianjun: That's why you will get different p-value and fold_enrichment for the same peak when using different control samples.  MACS uses λlocal to calculate the p-value of each candidate peak (Xianjun: Once you have λ, you have the Poisson distribution; once you have a distribution and the observed value, which is the observed reads count in the peak region for this case, then you can get p-value based on the distribution. See wikipedia for p-value computation) and removes potential false positives due to local biases (that is, peaks significantly under λBG, but not under λlocal). Candidate peaks with p-values below a user-defined threshold p-value (default 10-5) are called, and the ratio between the ChIP-Seq tag count and λlocal is reported as the fold_enrichment

The question is: how to select MACS peaks based on the measurements?

To illustrate the relationship of the three measurements, I plot them as below. Note: the figures below are generated by Excel. Looking not so nice, but only for illustration purpose. 


X-axis: -10log10(p-value); Y-axis: FDR (%)

X-axis: fold_enrichment;  Y-axis: FDR (%)

I don't know why FDR has a functional relationship with the p-value (Neither Tao).

But we can choose a smaller FDR cutoff (e.g. 5%), since it's underestimated due to a small control size in our case. That would correspond to a p-value of 10^-50, significant enough.

Here is the fold_enrichment histogram if we choose FDR cutoff as 10%.

Most peaks remained have a >5 fold enrichment, should be fine as well. 

No comments:

Post a Comment