- 1 Motivation
- 2 Load packages
- 3 Some data
- 4 Research question
- 5 Regression with unstandardized input variables
- 6 Standardize input variables
- 7 Regression with standardized input variables
- 8 The models (lm1 and lm2) are identical
- 9 Interpretation of a standardized regression coefficient
- 10 Reproducibility
1 Motivation
Running a regression in R yields unstandardized coefficients, not standardized ones. However, as spelled out by, e.g., Gelman and Hill (2007), standardizing input variables is advantageous in many situations. This post shows how to run a regression in R using standardized values as inputs (“standardized regression” for short, as some dub it).
The advantage of standardizing input variables is that their importance becomes easier to compare. It can be seen as undesirable that the scaling (SD) of the input variable determines (in part) the regression coefficient. For instance, measuring the “power” of a car in horsepower or in kilowatts will change the value of the regression coefficient. Similarly, measuring the distance walked in kilometers or in millimeters will have a profound effect on the respective regression coefficient for, say, the amount of fat burned (in grams or in kilograms…).
Hence, putting all input variables on the same scale makes it easier to compare the “importance” of each variable.
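The unit dependence described above can be illustrated with a small sketch. It assumes the built-in `datasets::mtcars` data (the same data that is loaded from a CSV file further below) and a conversion factor of roughly 0.7457 kW per hp; the names `cars_units`, `kw`, and the helper `z()` are made up for this illustration.

```r
# Same predictor on two scales: horsepower and kilowatt
# (assumption: 1 hp is roughly 0.7457 kW)
cars_units <- transform(datasets::mtcars, kw = hp * 0.7457)

# Unstandardized slopes depend on the unit of measurement
b_hp <- coef(lm(mpg ~ hp, data = cars_units))[["hp"]]
b_kw <- coef(lm(mpg ~ kw, data = cars_units))[["kw"]]

# Hypothetical helper: z transformation of a numeric vector
z <- function(x) (x - mean(x)) / sd(x)

# Standardized slopes do not depend on the unit
b_hp_z <- coef(lm(mpg ~ z(hp), data = cars_units))[[2]]
b_kw_z <- coef(lm(mpg ~ z(kw), data = cars_units))[[2]]

c(hp = b_hp, kw = b_kw, hp_z = b_hp_z, kw_z = b_kw_z)
```

The two unstandardized slopes should differ by exactly the conversion factor, whereas the two standardized slopes should coincide.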
The most common way to standardize a variable \(X\) is the \(z\) transformation, which subtracts the mean \(\bar{x}\) from each value and divides by the standard deviation \(sd_X\):
\[z_i = \frac{x_i - \bar{x}}{sd_X}\]
2 Load packages
library(tidyverse) # data wrangling
library(broom) # tidy regression output
library(mosaic) # standardizing variables
3 Some data
mtcars <- read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/datasets/mtcars.csv")
4 Research question
Say we are interested in the association of horsepower (hp) and fuel consumption (mpg; miles per gallon): What’s the difference in fuel consumption between cars that differ in their horsepower?
5 Regression with unstandardized input variables
lm1 <- lm(mpg ~ hp, data = mtcars)
tidy(lm1)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 30.0988605 | 1.6339210 | 18.421246 | 0e+00 |
hp | -0.0682283 | 0.0101193 | -6.742388 | 2e-07 |
As can be seen in the output, our model lm1 estimates that cars which differ by 1 hp differ by -0.07 miles per gallon, on average (and given the model is true). That is, a car with 1 hp more goes about 0.07 miles less per gallon (compared to a car with 1 hp less).
6 Standardize input variables
mtcars_standardized <-
mtcars %>%
mutate(hp_s = scale(hp))
As we see, scale() does the trick, that is, the z transformation. For example:
x <- c(0,10, 20)
scale(x)
-1 |
0 |
1 |
Let’s double check:
x_mean <- mean(x)
x_sd <- sd(x)
z <- (x - x_mean) / x_sd
z
#> [1] -1 0 1
It’s not so nice that scale() takes a vector as input, but hands back a matrix.
A similar function, zscore(), is provided by the package {mosaic}; this function gives back a vector, which is more comfortable:
zscore(x)
#> [1] -1 0 1
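If loading {mosaic} just for this is not wanted, a base R alternative is to strip the matrix structure from the result of scale(). The following is a minimal sketch; the names `z_matrix` and `z_vector` are chosen for illustration only.

```r
x <- c(0, 10, 20)

# scale() returns a one-column matrix carrying centering/scaling attributes
z_matrix <- scale(x)
is.matrix(z_matrix)

# as.numeric() drops the dimensions and attributes, leaving a plain vector
z_vector <- as.numeric(z_matrix)
z_vector
#> [1] -1  0  1
```

The same idea should work inside mutate(), e.g. `mutate(hp_s = as.numeric(scale(hp)))`, which stores an ordinary numeric column instead of a matrix column; a plain column tends to cause fewer surprises in downstream functions.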
7 Regression with standardized input variables
lm2 <- lm(mpg ~ hp_s, data = mtcars_standardized)
tidy(lm2)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 20.090625 | 0.6828817 | 29.420359 | 0e+00 |
hp_s | -4.677926 | 0.6938085 | -6.742388 | 2e-07 |
8 The models (lm1 and lm2) are identical
Have a look at the p-values and the model fit values of both models (lm1 and lm2) to reassure yourself that both models are indeed equivalent, as they should be:
glance(lm1)
r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
---|---|---|---|---|---|---|---|---|---|---|---|
0.6024373 | 0.5891853 | 3.862962 | 45.4598 | 2e-07 | 1 | -87.61931 | 181.2386 | 185.6358 | 447.6743 | 30 | 32 |
glance(lm2)
r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
---|---|---|---|---|---|---|---|---|---|---|---|
0.6024373 | 0.5891853 | 3.862962 | 45.4598 | 2e-07 | 1 | -87.61931 | 181.2386 | 185.6358 | 447.6743 | 30 | 32 |
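The equivalence can also be checked on the level of the coefficients. The sketch below refits both models on the built-in `datasets::mtcars` data (assumed to match the CSV used above); `cars_std`, `fit_raw`, and `fit_std` are illustrative names, not objects from this post.

```r
# Standardize hp as a plain numeric column
cars_std <- transform(datasets::mtcars, hp_s = as.numeric(scale(hp)))

fit_raw <- lm(mpg ~ hp,   data = cars_std)  # corresponds to lm1
fit_std <- lm(mpg ~ hp_s, data = cars_std)  # corresponds to lm2

# Slope per SD equals slope per unit times the SD of the predictor
all.equal(coef(fit_std)[["hp_s"]], coef(fit_raw)[["hp"]] * sd(cars_std$hp))

# With a centered predictor, the intercept is the mean of the outcome
all.equal(coef(fit_std)[["(Intercept)"]], mean(cars_std$mpg))

# Both models yield the same fitted values
all.equal(fitted(fit_raw), fitted(fit_std))
```

All three comparisons should return TRUE: standardizing the predictor only rescales the slope and shifts the intercept, while the predictions stay the same.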
9 Interpretation of a standardized regression coefficient
“According to our model, lm2, cars differ in their fuel economy (measured in miles per gallon) such that a car with a 1 SD higher horsepower value travels, on average, approx. 4.7 fewer miles per gallon.”
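One way to make this interpretation tangible is to compare two predictions: one for a car with average horsepower (standardized value 0) and one for a car 1 SD above average (standardized value 1). This is a sketch under the assumption that the built-in `datasets::mtcars` data matches the CSV used above; `cars_std`, `fit_std`, and `pred` are illustrative names.

```r
cars_std <- transform(datasets::mtcars, hp_s = as.numeric(scale(hp)))
fit_std  <- lm(mpg ~ hp_s, data = cars_std)

# Predicted mpg at average hp (hp_s = 0) and at 1 SD above average (hp_s = 1)
pred <- predict(fit_std, newdata = data.frame(hp_s = c(0, 1)))
pred

# The difference between the two predictions is the standardized slope
diff(pred)
coef(fit_std)[["hp_s"]]
```

The first prediction should equal the intercept (the mean of mpg), and the difference between the two predictions should equal the coefficient of hp_s.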
10 Reproducibility
#> ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.0.2 (2020-06-22)
#> os macOS 10.16
#> system x86_64, darwin17.0
#> ui X11
#> language (EN)
#> collate en_US.UTF-8
#> ctype en_US.UTF-8
#> tz Europe/Berlin
#> date 2021-02-26
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
#> package * version date lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.0)
#> backports 1.2.1 2020-12-09 [1] CRAN (R 4.0.2)
#> blogdown 1.1 2021-01-19 [1] CRAN (R 4.0.2)
#> bookdown 0.21.6 2021-02-02 [1] Github (rstudio/bookdown@6c7346a)
#> broom * 0.7.5 2021-02-19 [1] CRAN (R 4.0.2)
#> bslib 0.2.4.9000 2021-02-02 [1] Github (rstudio/bslib@b3cd7a9)
#> cachem 1.0.4 2021-02-13 [1] CRAN (R 4.0.2)
#> callr 3.5.1 2020-10-13 [1] CRAN (R 4.0.2)
#> cellranger 1.1.0 2016-07-27 [1] CRAN (R 4.0.0)
#> cli 2.3.1 2021-02-23 [1] CRAN (R 4.0.2)
#> codetools 0.2-16 2018-12-24 [2] CRAN (R 4.0.2)
#> colorspace 2.0-0 2020-11-11 [1] CRAN (R 4.0.2)
#> crayon 1.4.1 2021-02-08 [1] CRAN (R 4.0.2)
#> curl 4.3 2019-12-02 [1] CRAN (R 4.0.0)
#> DBI 1.1.1 2021-01-15 [1] CRAN (R 4.0.2)
#> dbplyr 2.1.0 2021-02-03 [1] CRAN (R 4.0.2)
#> debugme 1.1.0 2017-10-22 [1] CRAN (R 4.0.0)
#> desc 1.2.0 2018-05-01 [1] CRAN (R 4.0.0)
#> devtools 2.3.2 2020-09-18 [1] CRAN (R 4.0.2)
#> digest 0.6.27 2020-10-24 [1] CRAN (R 4.0.2)
#> dplyr * 1.0.4 2021-02-02 [1] CRAN (R 4.0.2)
#> ellipsis 0.3.1 2020-05-15 [1] CRAN (R 4.0.0)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.0)
#> fansi 0.4.2 2021-01-15 [1] CRAN (R 4.0.2)
#> fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.0.2)
#> forcats * 0.5.1 2021-01-27 [1] CRAN (R 4.0.2)
#> fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.2)
#> generics 0.1.0 2020-10-31 [1] CRAN (R 4.0.2)
#> ggplot2 * 3.3.3 2020-12-30 [1] CRAN (R 4.0.2)
#> glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.2)
#> gtable 0.3.0 2019-03-25 [1] CRAN (R 4.0.0)
#> haven 2.3.1 2020-06-01 [1] CRAN (R 4.0.0)
#> hms 1.0.0 2021-01-13 [1] CRAN (R 4.0.2)
#> htmltools 0.5.1.1 2021-01-22 [1] CRAN (R 4.0.2)
#> httr 1.4.2 2020-07-20 [1] CRAN (R 4.0.2)
#> jquerylib 0.1.3 2020-12-17 [1] CRAN (R 4.0.2)
#> jsonlite 1.7.2 2020-12-09 [1] CRAN (R 4.0.2)
#> knitr 1.31 2021-01-27 [1] CRAN (R 4.0.2)
#> lifecycle 1.0.0 2021-02-15 [1] CRAN (R 4.0.2)
#> lubridate 1.7.9.2 2020-11-13 [1] CRAN (R 4.0.2)
#> magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.0.2)
#> memoise 2.0.0 2021-01-26 [1] CRAN (R 4.0.2)
#> modelr 0.1.8 2020-05-19 [1] CRAN (R 4.0.0)
#> munsell 0.5.0 2018-06-12 [1] CRAN (R 4.0.0)
#> pillar 1.5.0 2021-02-22 [1] CRAN (R 4.0.2)
#> pkgbuild 1.2.0 2020-12-15 [1] CRAN (R 4.0.2)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.0)
#> pkgload 1.2.0 2021-02-23 [1] CRAN (R 4.0.2)
#> prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.0.0)
#> processx 3.4.5 2020-11-30 [1] CRAN (R 4.0.2)
#> ps 1.5.0 2020-12-05 [1] CRAN (R 4.0.2)
#> purrr * 0.3.4 2020-04-17 [1] CRAN (R 4.0.0)
#> R6 2.5.0 2020-10-28 [1] CRAN (R 4.0.2)
#> Rcpp 1.0.6 2021-01-15 [1] CRAN (R 4.0.2)
#> readr * 1.4.0 2020-10-05 [1] CRAN (R 4.0.2)
#> readxl 1.3.1 2019-03-13 [1] CRAN (R 4.0.0)
#> remotes 2.2.0 2020-07-21 [1] CRAN (R 4.0.2)
#> reprex 1.0.0 2021-01-27 [1] CRAN (R 4.0.2)
#> rlang 0.4.10 2020-12-30 [1] CRAN (R 4.0.2)
#> rmarkdown 2.7 2021-02-19 [1] CRAN (R 4.0.2)
#> rprojroot 2.0.2 2020-11-15 [1] CRAN (R 4.0.2)
#> rstudioapi 0.13.0-9000 2021-02-11 [1] Github (rstudio/rstudioapi@9d21f50)
#> rvest 0.3.6 2020-07-25 [1] CRAN (R 4.0.2)
#> sass 0.3.1 2021-01-24 [1] CRAN (R 4.0.2)
#> scales 1.1.1 2020-05-11 [1] CRAN (R 4.0.0)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.0)
#> stringi 1.5.3 2020-09-09 [1] CRAN (R 4.0.2)
#> stringr * 1.4.0 2019-02-10 [1] CRAN (R 4.0.0)
#> testthat 3.0.2 2021-02-14 [1] CRAN (R 4.0.2)
#> tibble * 3.0.6 2021-01-29 [1] CRAN (R 4.0.2)
#> tidyr * 1.1.2 2020-08-27 [1] CRAN (R 4.0.2)
#> tidyselect 1.1.0 2020-05-11 [1] CRAN (R 4.0.0)
#> tidyverse * 1.3.0 2019-11-21 [1] CRAN (R 4.0.0)
#> usethis 2.0.1 2021-02-10 [1] CRAN (R 4.0.2)
#> utf8 1.1.4 2018-05-24 [1] CRAN (R 4.0.0)
#> vctrs 0.3.6 2020-12-17 [1] CRAN (R 4.0.2)
#> withr 2.4.1 2021-01-26 [1] CRAN (R 4.0.2)
#> xfun 0.21 2021-02-10 [1] CRAN (R 4.0.2)
#> xml2 1.3.2 2020-04-23 [1] CRAN (R 4.0.0)
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.0)
#>
#> [1] /Users/sebastiansaueruser/Rlibs
#> [2] /Library/Frameworks/R.framework/Versions/4.0/Resources/library