Tuesday, May 03, 2022

Why don't you see all GO terms in GSEA-MSigDB?

 This is why...

This procedure has been modified from that described previously for MSigDB v5.2. First, for each GO term we got the corresponding human genes from the gene2go file. Next, we have applied the path rule. Gene products are associated with the most specific GO terms possible. All parent terms up to the root automatically apply to the gene product. Thus, the parent GO term gene sets should include all genes associated with the children GO terms. Then we removed sets with fewer than 5 or more than 2,000 Gene IDs. Finally, we resolved redundancies as follows. We computed Jaccard coefficients for each pair of sets, and marked a pair as highly similar if its Jaccard coefficient was greater than 0.85. We then clustered highly similar sets into "chunks" using the hclust function from the R stats package according to their GO terms and applied two rounds of filtering for every "chunk". First, we kept the largest set in the "chunk" and discarded the smaller sets. This left "chunks" of highly similar sets of identical sizes, which we further pruned by preferentially keeping the more general set (i.e., the set closest to the root of the GO ontology tree).

Ref: https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/MSigDB_v5.2_Release_Notes


No comments:

Post a Comment