This is the question I faced when trying to check the correlation between PEER-inferred factors and known covariates (e.g. batch).

One solution I found is to use ANOVA to calculate the R-squared between a categorical input and a continuous output.

Here is my R code snippet:

```r
## correlation of inferred factors vs. known factors

# name PEER factors
colnames(factors) <- paste0("PEER_top_factor_", 1:bestK)

# continuous known covariates
covs2 <- subset(covs, select = c(RIN, PMI, Age))

# re-generate the batch categorical variable from the individual binary indicators (required by PEER)
covs2 <- cbind(covs2, batch = paste0("batch", apply(covs[, 1:6], 1, which.max)))
covs2 <- cbind(covs2,
               Sex = ifelse(covs$Sex, "M", "F"),
               readLength = ifelse(covs$readsLength_75nt, "75nt", "50nt"))

library("plyr")

# ref: http://stackoverflow.com/a/11421267
xvars <- covs2
yvars <- as.data.frame(factors)

r2 <- laply(xvars, function(x) {
  laply(yvars, function(y) {
    summary.lm(aov(y ~ x))$r.squared
  })
})
rownames(r2) <- colnames(xvars)
colnames(r2) <- colnames(yvars)

pvalue <- laply(xvars, function(x) {
  laply(yvars, function(y) {
    anova(lm(y ~ x))$`Pr(>F)`[1]
  })
})
rownames(pvalue) <- colnames(xvars)
colnames(pvalue) <- colnames(yvars)

require(pheatmap)
pheatmap(-log10(t(pvalue)),
         color = colorRampPalette(c("white", "blue"))(10),
         cluster_rows = FALSE, cluster_cols = FALSE,
         display_numbers = as.matrix(t(round(r2, 2))),
         filename = "peer.factor.correlation.pdf")
```

The core part is the call to `aov()`: we can use the `aov()` function in R to run ANOVA, and its result can be summarized with the `summary.lm()` function, which shows output like:
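(For reference, a `results` object matching the output below can be built from R's built-in `PlantGrowth` dataset — my assumption, but the `weight ~ group` formula and the numbers line up with that classic example:)

```r
# One-way ANOVA of plant weight by treatment group (built-in PlantGrowth data)
results <- aov(weight ~ group, data = PlantGrowth)
```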

```r
> summary.lm(results)

Call:
aov(formula = weight ~ group)

Residuals:
    Min      1Q  Median      3Q     Max
-1.0710 -0.4180 -0.0060  0.2627  1.3690

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   5.0320     0.1971  25.527   <2e-16
grouptrt1    -0.3710     0.2788  -1.331   0.1944
grouptrt2     0.4940     0.2788   1.772   0.0877

Residual standard error: 0.6234 on 27 degrees of freedom
Multiple R-squared: 0.2641, Adjusted R-squared: 0.2096
F-statistic: 4.846 on 2 and 27 DF,  p-value: 0.01591
```

Note: the `summary.lm()` object doesn't contain the p-value directly, but we can compute it from the stored F statistic:

```r
F <- summary.lm(results)$fstatistic
F <- as.numeric(F)
pf(F[1], F[2], F[3], lower.tail = FALSE)  # upper tail of the F distribution = p-value
```
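As a sanity check (using R's built-in `PlantGrowth` data as a stand-in), the upper-tail `pf()` probability agrees with the p-value that `anova()` reports directly:

```r
# Two routes to the same ANOVA p-value (toy example: built-in PlantGrowth data)
fit <- lm(weight ~ group, data = PlantGrowth)

F <- as.numeric(summary(fit)$fstatistic)           # F value, df1, df2
p_from_pf    <- pf(F[1], F[2], F[3], lower.tail = FALSE)
p_from_anova <- anova(fit)$`Pr(>F)`[1]

all.equal(p_from_pf, p_from_anova)  # TRUE (both ~0.0159)
```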

The table below is a nice summary of the methods applicable to each combination of data types.

| PREDICTOR VARIABLE(S) | OUTCOME: Categorical | OUTCOME: Continuous |
|---|---|---|
| Categorical | Chi-square, Log-linear, Logistic regression | t-test, ANOVA (Analysis of Variance), Linear regression |
| Continuous | Logistic regression | Linear regression, Pearson correlation |
| Mixture of categorical and continuous | Logistic regression | Linear regression, Analysis of Covariance |

The next thing I need to refresh my mind on is how calculating the correlation with `cor()` differs from the ANOVA method above.

I know the correlation coefficient r can be inferred from the regression coefficient and the standard deviations of the two variables. For example, if we know sd(x) and sd(y), and regressing y ~ x gives the line y = b0 + b1*x, then we can calculate r as

r = b1 * SDx / SDy

When x and y are both standardized (mean = 0, sd = 1), r = b1.
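A quick numeric check of this identity on simulated data (the variable names here are my own); squaring r also recovers exactly the R-squared reported by the regression/ANOVA route above:

```r
# Check r = b1 * sd(x) / sd(y), and that r^2 equals the regression R-squared
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)

b1 <- coef(lm(y ~ x))[["x"]]        # slope of the fitted line y = b0 + b1*x
r_from_slope <- b1 * sd(x) / sd(y)  # rescale the slope by the two sds
r_direct     <- cor(x, y)           # Pearson correlation via cor()

all.equal(r_from_slope, r_direct)                    # TRUE
all.equal(r_direct^2, summary(lm(y ~ x))$r.squared)  # TRUE
```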

https://people.richland.edu/james/ictcm/2004/weight.html

http://ww2.coastal.edu/kingw/statistics/R-tutorials/oneway.html