# Load packages

```
library(tidyverse)
library(mosaic)
```

# Motivation

What are data transformations good for? Why do we bother transforming variables for regression analysis? This post explores some nuances around these questions.

# Simulate an exponential association

```
set.seed(42) # for reproducibility
len <- 42 # 42 distinct x values
x <- rep(runif(len), 30) # each x value appears 30 times
y <- dexp(x) + rnorm(length(x), mean = 0, sd = .01) # exponential curve plus some noise
```

Plot it:

```
gf_point(y ~ x) %>%
  gf_smooth()
```

If we now do a log transformation on \(y\), we get this:

```
gf_point(log(y) ~ x) %>%
  gf_smooth()
```

A straight line. In this simulation \(y \approx e^{-x}\) (the exponential density with rate 1), so \(\log y \approx -x\): the trend is linear, hence additivity holds, and additivity is a central assumption of linear models.
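As a quick empirical check (a sketch reusing the `x` and `y` vectors simulated above), the slope of the logged regression should come out close to \(-1\):

```
# slope should be near -1, matching log(y) ≈ -x
coef(lm(log(y) ~ x))
```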

For comparison, let’s log \(x\):

```
gf_point(y ~ log(x)) %>%
  gf_smooth()
```

We get a different shape, but the trend is still curved, so additivity does not hold.
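One way to make this concrete is to fit the corresponding linear model and plot its residuals against the fitted values; a systematic pattern indicates remaining nonlinearity. A minimal sketch (the name `lm_logx` is just for illustration):

```
lm_logx <- lm(y ~ log(x))
gf_point(resid(lm_logx) ~ fitted(lm_logx)) %>%
  gf_smooth()
```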

# Gelman’s comment on the use of transformations (Chap. 4, p. 53ff)

Gelman and Hill (2007) point out three reasons for transformations:

- Easier interpretation (more on that later)
- Improved additivity (as seen above)
- Improved normality of the residuals (see below)

# Improved normality of the residuals

Consider the residuals of the *non*-transformed regression:

```
lm1 <- lm(y ~ x)
gf_histogram(~ resid(lm1), bins = 10)
```

Well, not toooo much of a normal distribution.
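A histogram with few bins can be deceptive, so a quantile-quantile plot is a useful complement. Assuming ggformula's `gf_qq()` and `gf_qqline()` are available (a sketch, not part of the original analysis):

```
gf_qq(~ resid(lm1)) %>%
  gf_qqline()
```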

Compare the residuals of the logged model:

```
lm2 <- lm(log(y) ~ x)
gf_histogram(~ resid(lm2), bins = 10)
```

Much better.
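To attach a rough number to the visual impression, one could add a Shapiro-Wilk test on both residual vectors (a quick companion check, not part of the original post):

```
# a larger p value for lm2 would support the visual impression
shapiro.test(resid(lm1))
shapiro.test(resid(lm2))
```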

# Easier interpretation

Assume we have a regression model on the log scale. As can be seen in the figure below, close to zero the exponential function is well approximated by the straight line \(y = x + 1\). That means we can interpret a coefficient on the log scale of, say, 0.06 as a growth factor of about 1.06, i.e., 6% growth. Note that this correspondence deteriorates for larger values.

```
gf_fun(exp(x) ~ x, xlim = c(-1, 1)) %>%
  gf_fun(x + 1 ~ x, color = "red")
```
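The quality of the approximation \(e^b \approx 1 + b\) can also be checked numerically:

```
exp(0.06) # ~1.062, close to 1.06
exp(0.5)  # ~1.649, noticeably off from 1.5
```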