Regression and correlation

Problem

You want to perform linear regressions and/or correlations.

Solution

Some sample data to work with:

  # Make some data
  # X increases (noisily)
  # Z increases slowly
  # Y is constructed so it is inversely related to xvar and positively related to xvar*zvar
  set.seed(955)
  xvar <- 1:20 + rnorm(20, sd = 3)
  zvar <- 1:20/4 + rnorm(20, sd = 2)
  yvar <- -2*xvar + xvar*zvar/5 + 3 + rnorm(20, sd = 4)

  # Make a data frame with the variables
  dat <- data.frame(x = xvar, y = yvar, z = zvar)

  # Show first few rows
  head(dat)
  #>           x           y           z
  #> 1 -4.252354   4.5857688  1.89877152
  #> 2  1.702318  -4.9027824 -0.82937359
  #> 3  4.323054  -4.3076433 -1.31283495
  #> 4  1.780628   0.2050367 -0.28479448
  #> 5 11.537348 -29.7670502 -1.27303976
  #> 6  6.672130 -10.1458220 -0.09459239

Correlation

  # Correlation coefficient
  cor(dat$x, dat$y)
  #> [1] -0.7695378
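
If you also want a significance test and a confidence interval for the correlation, base R's cor.test() provides both. The calls below are a sketch (output not shown); a rank-based (Spearman) coefficient can also be requested via the method argument of cor().

  # Test whether the correlation differs from zero, with a confidence interval
  # (output not shown)
  cor.test(dat$x, dat$y)

  # Rank-based correlation, useful when the relationship is monotonic
  # but not necessarily linear
  cor(dat$x, dat$y, method = "spearman")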

Correlation matrices (for multiple variables)

It is also possible to run correlations between many pairs of variables, using a matrix or data frame.

  # A correlation matrix of the variables
  cor(dat)
  #>            x            y           z
  #> x  1.0000000 -0.769537849 0.491698938
  #> y -0.7695378  1.000000000 0.004172295
  #> z  0.4916989  0.004172295 1.000000000

  # Print with only two decimal places
  round(cor(dat), 2)
  #>       x     y    z
  #> x  1.00 -0.77 0.49
  #> y -0.77  1.00 0.00
  #> z  0.49  0.00 1.00

To visualize a correlation matrix, see ../../Graphs/Correlation matrix.
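
For a quick text-only impression directly in the console, base R's symnum() encodes the magnitude of each correlation symbolically. This is just a convenience sketch, not a substitute for the graphs linked above:

  # Symbolic summary of the correlation matrix (output not shown)
  symnum(cor(dat))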

Linear regression

Linear regression, where dat$x is the predictor and dat$y is the outcome. This can be done using two columns from a data frame, or with numeric vectors directly.

  # These two commands will have the same outcome:
  fit <- lm(y ~ x, data = dat)  # Using the columns x and y from the data frame
  fit <- lm(dat$y ~ dat$x)      # Using the vectors dat$x and dat$y
  fit
  #>
  #> Call:
  #> lm(formula = dat$y ~ dat$x)
  #>
  #> Coefficients:
  #> (Intercept)        dat$x
  #>     -0.2278      -1.1829

  # This means that the predicted y = -0.2278 - 1.1829*x

  # Get more detailed information:
  summary(fit)
  #>
  #> Call:
  #> lm(formula = dat$y ~ dat$x)
  #>
  #> Residuals:
  #>      Min       1Q   Median       3Q      Max
  #> -15.8922  -2.5114   0.2866   4.4646   9.3285
  #>
  #> Coefficients:
  #>             Estimate Std. Error t value Pr(>|t|)
  #> (Intercept)  -0.2278     2.6323  -0.087    0.932
  #> dat$x        -1.1829     0.2314  -5.113 7.28e-05 ***
  #> ---
  #> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  #>
  #> Residual standard error: 6.506 on 18 degrees of freedom
  #> Multiple R-squared:  0.5922, Adjusted R-squared:  0.5695
  #> F-statistic: 26.14 on 1 and 18 DF,  p-value: 7.282e-05

To visualize the data with regression lines, see ../../Graphs/Scatterplots (ggplot2) and ../../Graphs/Scatterplot.
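
As a minimal base-graphics sketch (no extra packages assumed), the points and the fitted line can also be drawn directly; the plot itself is not shown here:

  # Scatterplot of y against x with the fitted regression line (plot not shown)
  fit <- lm(y ~ x, data = dat)
  plot(y ~ x, data = dat)
  abline(fit)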

Linear regression with multiple predictors

Linear regression with y as the outcome, and x and z as predictors.

Note that the formula specified below does not test for interactions between x and z.

  # These have the same result
  fit2 <- lm(y ~ x + z, data = dat)  # Using the columns x, y, and z from the data frame
  fit2 <- lm(dat$y ~ dat$x + dat$z)  # Using the vectors x, y, and z
  fit2
  #>
  #> Call:
  #> lm(formula = dat$y ~ dat$x + dat$z)
  #>
  #> Coefficients:
  #> (Intercept)        dat$x        dat$z
  #>      -1.382       -1.564        1.858

  summary(fit2)
  #>
  #> Call:
  #> lm(formula = dat$y ~ dat$x + dat$z)
  #>
  #> Residuals:
  #>    Min     1Q Median     3Q    Max
  #> -7.974 -3.187 -1.205  3.847  7.524
  #>
  #> Coefficients:
  #>             Estimate Std. Error t value Pr(>|t|)
  #> (Intercept)  -1.3816     1.9878  -0.695  0.49644
  #> dat$x        -1.5642     0.1984  -7.883 4.46e-07 ***
  #> dat$z         1.8578     0.4753   3.908  0.00113 **
  #> ---
  #> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  #>
  #> Residual standard error: 4.859 on 17 degrees of freedom
  #> Multiple R-squared:  0.7852, Adjusted R-squared:  0.7599
  #> F-statistic: 31.07 on 2 and 17 DF,  p-value: 2.1e-06
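
Once a model has been fit, its components can be pulled out with base R's extractor functions; a brief sketch (output not shown):

  # Extract pieces of the fitted model (output not shown)
  coef(fit2)       # coefficient estimates as a named vector
  confint(fit2)    # 95% confidence intervals for the coefficients
  fitted(fit2)     # fitted values
  residuals(fit2)  # residuals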

Interactions

The topic of how to properly do multiple regression and test for interactions can be quite complex and is not covered here. Here we just fit a model with x, z, and the interaction between the two.

To model the interaction between x and z, an x:z term must be added to the formula. Alternatively, the formula x*z expands to x + z + x:z.

  # These are equivalent; the x*z expands to x + z + x:z
  fit3 <- lm(y ~ x * z, data = dat)
  fit3 <- lm(y ~ x + z + x:z, data = dat)
  fit3
  #>
  #> Call:
  #> lm(formula = y ~ x + z + x:z, data = dat)
  #>
  #> Coefficients:
  #> (Intercept)            x            z          x:z
  #>      2.2820      -2.1311      -0.1068       0.2081

  summary(fit3)
  #>
  #> Call:
  #> lm(formula = y ~ x + z + x:z, data = dat)
  #>
  #> Residuals:
  #>     Min      1Q  Median      3Q     Max
  #> -5.3045 -3.5998  0.3926  2.1376  8.3957
  #>
  #> Coefficients:
  #>             Estimate Std. Error t value Pr(>|t|)
  #> (Intercept)  2.28204    2.20064   1.037   0.3152
  #> x           -2.13110    0.27406  -7.776    8e-07 ***
  #> z           -0.10682    0.84820  -0.126   0.9013
  #> x:z          0.20814    0.07874   2.643   0.0177 *
  #> ---
  #> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  #>
  #> Residual standard error: 4.178 on 16 degrees of freedom
  #> Multiple R-squared:  0.8505, Adjusted R-squared:  0.8225
  #> F-statistic: 30.34 on 3 and 16 DF,  p-value: 7.759e-07
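
One common way to ask whether the interaction term improves the fit is an F-test comparing the nested models with anova(). This is only a sketch of that approach, refitting both models the same way so they are directly comparable (output not shown):

  # Compare the models with and without the x:z interaction (output not shown)
  fit2 <- lm(y ~ x + z, data = dat)
  fit3 <- lm(y ~ x + z + x:z, data = dat)
  anova(fit2, fit3)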