Showing posts with label regression. Show all posts
Showing posts with label regression. Show all posts

Sunday, August 16, 2015

Using ANOVA to get correlation between categorical and continuous variables

How to calculate the correlation between categorical variables and continuous variables?

This is the question I was facing when attempting to check the correlation of PEER inferred factors vs. known covariates (e.g. batch).

One solution I found is, I can use ANOVA to calculate the R-square between categorical input and continuous output.

Here is my R code snip:

## correlation of inferred factors vs. known factors
# name PEER factors
colnames(factors)=paste0("PEER_top_factor_",1:bestK)
# continuous known covariates:
covs2=subset(covs, select=c(RIN, PMI, Age));
# re-generate batch categorical variable from individual binary indicators (required by PEER)
covs2=cbind(covs2, batch=paste0("batch",apply(covs[,1:6],1,which.max)))
covs2=cbind(covs2, Sex=ifelse(covs$Sex,"M","F"), readLength=ifelse(covs$readsLength_75nt, "75nt", "50nt"))

library("plyr")
# ref: http://stackoverflow.com/a/11421267
xvars=covs2; yvars=as.data.frame(factors);
r2 <- laply(xvars, function(x) {
  laply(yvars, function(y) {
    summary.lm(aov(y~x))$r.squared
  })
})
rownames(r2) <- colnames(xvars)
colnames(r2) <- colnames(yvars)

pvalue <- laply(xvars, function(x) {
  laply(yvars, function(y) {
    anova(lm(y~x))$`Pr(>F)`[1]
  })
})
rownames(pvalue) <- colnames(xvars)
colnames(pvalue) <- colnames(yvars)

require(pheatmap);
pheatmap(-log10(t(pvalue)),color= colorRampPalette(c("white", "blue"))(10), cluster_row =F, cluster_col=F, display_numbers=as.matrix(t(round(r2,2))), filename="peer.factor.correlation.pdf")

I highlighted the core part in yellow color. As it shows, we can use aov() function in R to run ANOVA. Its result can be summarized with summary.lm() function, which show output like:

> summary.lm(results)

Call:
aov(formula = weight ~ group)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.0710 -0.4180 -0.0060  0.2627  1.3690 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   5.0320     0.1971  25.527   <2e-16
grouptrt1    -0.3710     0.2788  -1.331   0.1944
grouptrt2     0.4940     0.2788   1.772   0.0877

Residual standard error: 0.6234 on 27 degrees of freedom
Multiple R-squared: 0.2641,     Adjusted R-squared: 0.2096 
F-statistic: 4.846 on 2 and 27 DF,  p-value: 0.01591

R^2 and p-value are shown at the end of output.

Note: the summary.lm() object doesn't contain value of p-value directly. But we can compute p-value in command like:

> F=summary.lm(results)$fstatistic
> F=as.numeric(F)
> pf(F[1],F[2],F[3])

Below table is a nice summary the methods applicable to corresponding data type.

PREDICTOR VARIABLE (S)
OUTCOME VARIABLE
CategoricalContinuous
CategoricalChi Square, Log linear, Logistict-test, ANOVA (Analysis of Varirance)Linear regression
ContinuousLogistic regressionLinear regression,  Pearson correlation
Mixture of Categorical and ContinuousLogistic regressionLinear regressionAnalysis of Covariance
Ref:http://www.tulane.edu/~panda2/Analysis2/sidebar/stats.htm

Next thing I need to refresh my mind is how different in calculating the correlation using cor() and the above ANOVA method above.

I know the correlation coefficient r can be inferred from the coefficient and sd of two variables. For example, we know sd(x) and sd(y), then when regressing y~x, we got regression line e.g. y=b0 + b1x. Then we can calculate r as

r = b1 * SDx / SDy

When x and y are in standard normal distribution, e.g. u=0, sd=1, then r=b1.
https://people.richland.edu/james/ictcm/2004/weight.html
http://ww2.coastal.edu/kingw/statistics/R-tutorials/oneway.html