What are data transformations good for? Why do we bother to transform variables for regression analysis? This post explores some nuances around these themes.
Simulate an exponentially distributed association
library(ggformula)  # provides gf_point(), gf_smooth(), gf_histogram(), gf_fun()
set.seed(42)        # for reproducibility

len <- 42                                            # 42 x values
x <- rep(runif(len), 30)                             # each x value repeated 30 times
y <- dexp(x) + rnorm(length(x), mean = 0, sd = .01)  # add some noise
gf_point(y ~ x) %>% gf_smooth()
If we now do a log transformation on \(y\), we get this:
gf_point(log(y) ~ x) %>% gf_smooth()
A straight line. That means we have a linear trend, so additivity holds, and additivity is a central assumption in linear models. This is no surprise: \(y\) was simulated as (roughly) \(e^{-x}\), the exponential density with rate 1, so \(\log(y) \approx -x\), which is linear in \(x\).
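We can check this with a quick regression on the log scale. A minimal sketch, assuming the simulated data from above (lm_check is just an illustrative name, not from the original analysis); the slope should come out close to \(-1\) and the intercept close to 0:

lm_check <- lm(log(y) ~ x)   # illustrative helper name
coef(lm_check)               # expect slope near -1, intercept near 0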
For comparison, let’s log \(x\):
gf_point(y ~ log(x)) %>% gf_smooth()
We get a different curve, but the trend is still nonlinear: logging \(x\) does not restore additivity here.
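One way to see this beyond the smoother is to inspect the residuals of the model with logged \(x\); a sketch on the simulated data, with lm3 as an illustrative name. If logging \(x\) had linearized the relationship, these residuals would show no systematic pattern:

lm3 <- lm(y ~ log(x))                                # illustrative name
gf_point(resid(lm3) ~ fitted(lm3)) %>% gf_smooth()   # look for systematic curvature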
Gelman’s comments on the use of transformations (Chap. 5, p. 53ff)
Gelman and Hill (2007) point out three reasons for transformations:
- Easier interpretation (more on that later)
- Improved additivity (as seen above)
- Improved normality of the residuals (see below)
Improved normality of the residuals
Consider the residuals of the untransformed regression:
lm1 <- lm(y ~ x)
gf_histogram(~ resid(lm1), bins = 10)
Well, not toooo much of a normal distribution.
Compare the logged lm:
lm2 <- lm(log(y) ~ x)
gf_histogram(~ resid(lm2), bins = 10)
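Histograms with only a few bins can be hard to judge. As an additional check, QQ plots compare each residual distribution to a normal distribution; a minimal sketch using ggformula's gf_qq() and gf_qqline() (points close to the reference line indicate approximate normality):

gf_qq(~ resid(lm1)) %>% gf_qqline()   # residuals of the raw-scale model
gf_qq(~ resid(lm2)) %>% gf_qqline()   # residuals of the log-scale model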
Assume we have a regression model on the log scale. As can be seen in the figure below, for arguments close to zero the exponential function is well approximated by the straight line \(y = x + 1\). That means we can interpret a log-scale coefficient of, say, 0.06 as a growth factor of roughly 1.06, i.e., an increase of about 6%. Note that this correspondence deteriorates for larger values.
gf_fun(exp(x) ~ x, xlim = c(-1, 1)) %>% gf_fun(x + 1 ~ x, color = "red")
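As a quick numeric check in base R:

exp(0.06)   # about 1.062, close to 1 + 0.06
exp(0.50)   # about 1.649, noticeably larger than 1 + 0.50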