Monday, November 22, 2021

EASE vs. Fisher's exact test

EASE is a modified version of Fisher's exact test that is used in DAVID. We are often asked how the two differ. Here is the explanation.

From the DAVID website: https://david.ncifcrf.gov/content.jsp?file=functional_annotation.html

A Hypothetical Example 

In the human genome background (30,000 genes total; Population Total (PT)), 40 genes are involved in the p53 signaling pathway (Population Hits (PH)). A given gene list contains three genes (List Hits (LH)) out of 300 total genes in the list (List Total (LT)) that belong to the p53 signaling pathway. We then ask whether 3/300 is more than would be expected by random chance, compared to the human background of 40/30000.

A 2 x 2 contingency table is built based on the above numbers: 

List Hits (LH) = 3 
List Total (LT) = 300 
Population Hits (PH) = 40 
Population Total (PT) = 30,000


Exact p-value = 0.007. Since the p-value is below 0.05, this user's gene list is considered specifically associated with (enriched in) the p53 signaling pathway beyond what random chance would explain.
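For reference, here is one way to write that exact p-value down. This is a sketch that assumes the gene list is treated as a random draw of LT genes from the PT background genes, in which case the one-sided Fisher exact p-value is the upper tail of a hypergeometric distribution. DAVID may lay out its table slightly differently, so treat the formula as illustrative rather than as DAVID's definition.

```latex
p_{\text{Fisher}}
  = \sum_{k=LH}^{\min(LT,\,PH)}
    \frac{\binom{PH}{k}\,\binom{PT-PH}{LT-k}}{\binom{PT}{LT}}
  = \sum_{k=3}^{40}
    \frac{\binom{40}{k}\,\binom{29960}{300-k}}{\binom{30000}{300}}
```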

What about the EASE Score?
The EASE Score is more conservative: it subtracts one gene from the List Hits (LH) before the exact p-value is computed. If LH = 1 (only one gene in the user's list is annotated to the term), the EASE Score is automatically set to 1.



For our hypothetical example involving the p53 signaling pathway, the EASE Score is more conservative, giving a p-value of 0.06 (using 3-1 instead of 3). Since this p-value is above 0.05, the user's gene list is no longer considered specifically associated with (enriched in) the p53 signaling pathway beyond random chance.
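Both numbers can be approximated with a short calculation. The following is a minimal sketch, not DAVID's actual code. It assumes the gene list is a random draw from the background, so the one-sided Fisher p-value is a hypergeometric upper tail, and it models the EASE Score by removing one pathway gene from the list (LH - 1 hits in a list of LT - 1 genes) before recomputing. The function names `upper_tail` and `ease_score` are made up for illustration.

```python
from math import comb


def upper_tail(hits, list_total, pop_hits, pop_total):
    """P(X >= hits) when list_total genes are drawn without replacement
    from pop_total genes, pop_hits of which carry the annotation."""
    ways_total = comb(pop_total, list_total)
    max_hits = min(list_total, pop_hits)
    ways_tail = sum(
        comb(pop_hits, k) * comb(pop_total - pop_hits, list_total - k)
        for k in range(hits, max_hits + 1)
    )
    # Python divides arbitrarily large integers exactly before rounding.
    return ways_tail / ways_total


def ease_score(list_hits, list_total, pop_hits, pop_total):
    """Assumed EASE model: drop one annotated gene from the list,
    then take the same upper tail."""
    if list_hits <= 1:
        return 1.0  # a single supporting gene is never significant
    return upper_tail(list_hits - 1, list_total - 1, pop_hits, pop_total)


# Numbers from the hypothetical p53 example.
LH, LT, PH, PT = 3, 300, 40, 30_000
fisher_p = upper_tail(LH, LT, PH, PT)
ease_p = ease_score(LH, LT, PH, PT)
print(f"Fisher exact p = {fisher_p:.4f}")
print(f"EASE Score     = {ease_p:.4f}")
```

Under these assumptions the two printed values should come out close to the 0.007 and 0.06 quoted above, falling on opposite sides of the 0.05 cutoff. Small differences from DAVID's own output are possible if its table is built differently.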

From https://www.ncbi.nlm.nih.gov/pmc/articles/PMC328459/:

The EASE score is offered as a conservative adjustment to the Fisher exact probability that weights significance in favor of themes supported by more genes. The theoretical basis of the EASE score lies in the concept of jackknifing a probability. The stability of any given statistic can be ascertained by a procedure called jackknifing, in which a single data point is removed and the statistic is recalculated many times to give a distribution of probabilities that is broad if the result is highly variable and tight if the result is robust [9]. The EASE score is calculated by penalizing (removing) one gene within the given category from the list and calculating the resulting Fisher exact probability for that category. It therefore represents the upper bound of the distribution of jackknife Fisher exact probabilities and has advantages in terms of penalizing the significance of categories supported by few genes. For example, assume a list of 206 genes is selected from a population of 13,679 genes. If there is only one gene in the population in some rare category, X, and that gene happens to appear on the list of 206 genes, the Fisher exact would consider category X significant (p = 0.0152). At the same time, the Fisher exact probability would deem a more common category, Y, with 787 members in the population and 20 members on the list, as slightly less significant (p = 0.0154). From the perspective of global biological themes, however, a theme based on the presence of a single gene is neither global nor stable and is rarely interesting. If the single gene happens to be a false positive, then the significance of the dependent theme is entirely false. However, the EASE score for these two situations is p = 1 for category X and p < 0.0274 for category Y, and thus the EASE score eliminates the significance of the 'unstable' category X while only slightly penalizing the significance of the more global theme Y. 
By extrapolating between these two extremes, the EASE score penalizes the significance of categories supported by fewer genes and thus favors more robust categories than the Fisher exact probability.
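The X versus Y comparison from the paper can be explored numerically. This is a sketch under stated assumptions, not the EASE software itself: the list is treated as a random draw from the population (a hypergeometric tail stands in for the Fisher exact probability), and the EASE score is modeled as removing one category gene from the list before recomputing. The helper names `tail_probability` and `ease` are hypothetical, and the results may differ slightly from the published figures.

```python
from math import comb


def tail_probability(hits, list_size, category_size, population):
    """P(X >= hits) for a list drawn without replacement from the population."""
    total = comb(population, list_size)
    top = min(list_size, category_size)
    tail = sum(
        comb(category_size, k) * comb(population - category_size, list_size - k)
        for k in range(hits, top + 1)
    )
    return tail / total


def ease(hits, list_size, category_size, population):
    """Assumed EASE model: remove one category gene from the list first."""
    if hits <= 1:
        return 1.0  # nothing left to support the category
    return tail_probability(hits - 1, list_size - 1, category_size, population)


LIST_SIZE, POPULATION = 206, 13_679

# Category X: 1 member in the population, 1 on the list.
fisher_x = tail_probability(1, LIST_SIZE, 1, POPULATION)
ease_x = ease(1, LIST_SIZE, 1, POPULATION)

# Category Y: 787 members in the population, 20 on the list.
fisher_y = tail_probability(20, LIST_SIZE, 787, POPULATION)
ease_y = ease(20, LIST_SIZE, 787, POPULATION)

print(f"X: Fisher = {fisher_x:.4f}, EASE = {ease_x:.4f}")
print(f"Y: Fisher = {fisher_y:.4f}, EASE = {ease_y:.4f}")
```

The point to look for is the pattern rather than the exact digits: the single-gene category X goes from nominally significant to a score of 1, while the well-supported category Y is penalized only mildly.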
