Monday, October 8, 2012

Two ways that correlation and stepwise regression can give different results



In general, a correlation test is used to test the association between two variables (y and z). However, if a third variable (x) might be related to z or y, it makes sense to use stepwise regression (or partial correlation), which here means comparing a model that contains only x against one that also contains z. There are two quite different situations in which correlation and stepwise regression will produce different results. Here are some examples using made-up data.


Case 1. z is not associated with y, but is correlated with x, which is associated with y:
set.seed(1)
x <- rnorm(100)
z <- x + rnorm(100)
y <- x + rnorm(100, sd = 0.1)

dat.1 <- data.frame(x = x, y = y, z = z)
cor.test(~y + z, data = dat.1)
## 
##  Pearson's product-moment correlation
## 
## data:  y and z 
## t = 9.058, df = 98, p-value = 1.332e-14
## alternative hypothesis: true correlation is not equal to 0 
## 95 percent confidence interval:
##  0.5518 0.7694 
## sample estimates:
##   cor 
## 0.675 
## 
anova(lm(y ~ x, data = dat.1), lm(y ~ x + z, data = dat.1))
## Analysis of Variance Table
## 
## Model 1: y ~ x
## Model 2: y ~ x + z
##   Res.Df  RSS Df Sum of Sq    F Pr(>F)
## 1     98 1.06                         
## 2     97 1.06  1    0.0026 0.24   0.63
In this case, the correlation test showed an association between z and y, but that association was really just a by-product of both variables' association with x. The stepwise regression revealed that z made no independent contribution to y once x was already in the model.
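
The same point can be checked with a partial correlation. The following is a minimal sketch, not part of the original analysis: it regenerates the Case 1 data and correlates the residuals of y and z after each has been regressed on x. The names y.res, z.res, r.simple and r.partial are illustrative choices.

```r
# Recreate the Case 1 data (same seed and settings as above)
set.seed(1)
x <- rnorm(100)
z <- x + rnorm(100)
y <- x + rnorm(100, sd = 0.1)

# Remove the part of y and of z that is linearly predictable from x
y.res <- resid(lm(y ~ x))
z.res <- resid(lm(z ~ x))

# Simple correlation vs. partial correlation (controlling for x)
r.simple  <- cor(y, z)
r.partial <- cor(y.res, z.res)
print(c(simple = r.simple, partial = r.partial))
```

If z contributes nothing beyond x, the partial correlation should be close to zero even though the simple correlation is large.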

The second, perhaps less obvious, case is when the relationship between z and y is masked by variance in y due to x. In other words, x and z are completely unrelated and both are related to y, but the variance in y due to x is very large. Here is an example of this situation:

Case 2. z and x are completely unrelated, both x and z contribute to y, but x varies much more than z (its standard deviation is 10 times larger, so its variance is 100 times larger):
set.seed(1)
x <- rnorm(100, sd = 10)
z <- rnorm(100)
y <- x + z + rnorm(100, sd = 0.1)

dat.2 <- data.frame(x = x, z = z, y = y)
cor.test(~y + z, data = dat.2)
## 
##  Pearson's product-moment correlation
## 
## data:  y and z 
## t = 1.04, df = 98, p-value = 0.3009
## alternative hypothesis: true correlation is not equal to 0 
## 95 percent confidence interval:
##  -0.09387  0.29484 
## sample estimates:
##    cor 
## 0.1045 
## 
anova(lm(y ~ x, data = dat.2), lm(y ~ x + z, data = dat.2))
## Analysis of Variance Table
## 
## Model 1: y ~ x
## Model 2: y ~ x + z
##   Res.Df  RSS Df Sum of Sq    F Pr(>F)    
## 1     98 90.9                             
## 2     97  1.1  1      89.9 8254 <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
In this situation, the simple (bivariate) correlation between y and z did not reach significance, but their relationship emerged in a stepwise regression once the variance due to x was accounted for (for this simple case, partial correlation would also work). The bottom line is that when you're considering the relationship between an outcome and a predictor, if you know that your outcome variable has an important relationship with some third variable, stepwise regression (or partial correlation) can (1) make sure an observed association is not due to the third variable and (2) reveal an association that could be masked by the third variable.
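
For Case 2, the partial correlation can also be computed directly from the three pairwise correlations, using the standard first-order formula r_yz.x = (r_yz - r_yx * r_zx) / sqrt((1 - r_yx^2) * (1 - r_zx^2)). This is a hedged sketch rather than part of the original analysis; the variable names r.yz, r.yx, r.zx and r.partial are illustrative.

```r
# Recreate the Case 2 data (same seed and settings as above)
set.seed(1)
x <- rnorm(100, sd = 10)
z <- rnorm(100)
y <- x + z + rnorm(100, sd = 0.1)

# Pairwise (simple) correlations
r.yz <- cor(y, z)
r.yx <- cor(y, x)
r.zx <- cor(z, x)

# First-order partial correlation of y and z, controlling for x
r.partial <- (r.yz - r.yx * r.zx) / sqrt((1 - r.yx^2) * (1 - r.zx^2))
print(c(simple = r.yz, partial = r.partial))
```

Here the simple correlation is weak because most of the variance in y comes from x, while the partial correlation should be strong because z accounts for nearly all of what remains.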
