This is the question I was facing when attempting to check the correlation of PEER inferred factors vs. known covariates (e.g. batch).
One solution I found is, I can use ANOVA to calculate the R-square between categorical input and continuous output.
Here is my R code snip:
## correlation of inferred factors vs. known factors
# name PEER factors
colnames(factors)=paste0("PEER_top_factor_",1:bestK)
# continuous known covariates:
covs2=subset(covs, select=c(RIN, PMI, Age));
# re-generate batch categorical variable from individual binary indicators (required by PEER)
covs2=cbind(covs2, batch=paste0("batch",apply(covs[,1:6],1,which.max)))
covs2=cbind(covs2, Sex=ifelse(covs$Sex,"M","F"), readLength=ifelse(covs$readsLength_75nt, "75nt", "50nt"))
library("plyr")
# ref: http://stackoverflow.com/a/11421267
xvars=covs2; yvars=as.data.frame(factors);
r2 <- laply(xvars, function(x) {
laply(yvars, function(y) {
summary.lm(aov(y~x))$r.squared
})
})
rownames(r2) <- colnames(xvars)
colnames(r2) <- colnames(yvars)
pvalue <- laply(xvars, function(x) {
laply(yvars, function(y) {
anova(lm(y~x))$`Pr(>F)`[1]
})
})
rownames(pvalue) <- colnames(xvars)
colnames(pvalue) <- colnames(yvars)
require(pheatmap);
pheatmap(-log10(t(pvalue)),color= colorRampPalette(c("white", "blue"))(10), cluster_row =F, cluster_col=F, display_numbers=as.matrix(t(round(r2,2))), filename="peer.factor.correlation.pdf")
I highlighted the core part in yellow color. As it shows, we can use aov() function in R to run ANOVA. Its result can be summarized with summary.lm() function, which show output like:
> summary.lm(results)
Call:
aov(formula = weight ~ group)
Residuals:
Min 1Q Median 3Q Max
-1.0710 -0.4180 -0.0060 0.2627 1.3690
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.0320 0.1971 25.527 <2e-16
grouptrt1 -0.3710 0.2788 -1.331 0.1944
grouptrt2 0.4940 0.2788 1.772 0.0877
Residual standard error: 0.6234 on 27 degrees of freedom
Multiple R-squared: 0.2641, Adjusted R-squared: 0.2096
F-statistic: 4.846 on 2 and 27 DF, p-value: 0.01591
Note: the summary.lm() object doesn't contain value of p-value directly. But we can compute p-value in command like:
> F=summary.lm(results)$fstatistic
> F=as.numeric(F)
> pf(F[1],F[2],F[3])
Below table is a nice summary the methods applicable to corresponding data type.
PREDICTOR VARIABLE (S) |
OUTCOME VARIABLE
| |
Categorical | Continuous | |
Categorical | Chi Square, Log linear, Logistic | t-test, ANOVA (Analysis of Varirance), Linear regression |
Continuous | Logistic regression | Linear regression, Pearson correlation |
Mixture of Categorical and Continuous | Logistic regression | Linear regression, Analysis of Covariance |
Next thing I need to refresh my mind is how different in calculating the correlation using cor() and the above ANOVA method above.
I know the correlation coefficient r can be inferred from the coefficient and sd of two variables. For example, we know sd(x) and sd(y), then when regressing y~x, we got regression line e.g. y=b0 + b1x. Then we can calculate r as
r = b1 * SDx / SDy
When x and y are in standard normal distribution, e.g. u=0, sd=1, then r=b1.
https://people.richland.edu/james/ictcm/2004/weight.html
http://ww2.coastal.edu/kingw/statistics/R-tutorials/oneway.html
Seemed to be very useful. But to be honest, it's a little bit difficult for me to understand.
ReplyDelete-Caroline
http://www.creativebiomart.net/