Visualizing error distribution in regression analysis

Errors and residuals in regression

A residual is defined as

\(r_i = y_i - X_i \hat{\beta}\).

That is, a residual is a tangible quantity in the sense that it describes observables (cf. Gelman 2021, chap. 11.3, p. 161): residuals are the difference between observed and predicted values.

In contrast, the error term is defined as the difference between the observed value and the true (but unobservable) expected value:

\(e_i = y_i - X_i \beta\).

The residuals can be thought of as estimates of the errors.
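A quick base-R sanity check of the residual definition (a throwaway model, just for illustration):

fit <- lm(mpg ~ hp, data = mtcars)
# residuals are the observed values minus the fitted (predicted) values
all.equal(unname(resid(fit)), mtcars$mpg - unname(predict(fit)))  # TRUE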


Frequently, it is assumed that the errors are distributed normally.
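A common way to eyeball that assumption is a QQ plot of the residuals; a minimal base-R sketch, reusing the throwaway fit from above:

qqnorm(resid(fit))  # points close to the line suggest approximately normal errors
qqline(resid(fit))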

Let’s visualize error terms assuming normality.

Setup

library(tidyverse)
library(glue)

data(mtcars)

Illustrative data set

Let’s say we’d like to illustrate the errors for a regression of mpg on z-standardized horsepower. First, standardize hp:

mtcars <-
  mtcars %>% 
  mutate(hp_z = (hp - mean(hp)) / sd(hp))  # z-standardize horsepower
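As a cross-check (equivalent, not required), this is exactly what base R’s scale() computes:

all.equal(mtcars$hp_z, as.numeric(scale(mtcars$hp)))  # TRUE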

Any NAs?

mtcars %>% 
  summarise(sum(is.na(hp_z)))
##   sum(is.na(hp_z))
## 1                0

No.
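For reference, base R has a terser check for the same thing:

anyNA(mtcars$hp_z)  # FALSE, matching the count above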

lm1 <- lm(mpg ~ hp_z, data = mtcars)
coef(lm1)
## (Intercept)        hp_z 
##   20.090625   -4.677926

Define some constants:

z <- 3  # go up to 3 SDs away from the mean
avg <- 0  # mean of the error distribution
sigma <- sigma(lm1)  # residual standard deviation of the model
b0 <- coef(lm1)[1]  # intercept
b1 <- coef(lm1)[2]  # slope
preds <- predict(lm1, data.frame(hp_z = c(0, 1)))  # predictions at hp_z = 0 and 1

p1 <- 
  mtcars %>% 
  ggplot() +
  aes(x = hp_z, y = mpg) +
  geom_point() +
  geom_abline(slope = b1, intercept = b0, color = "blue") +
  # mark the model predictions at hp_z = 0 and 1
  annotate("point", x = 0, y = preds[1],
           color = "red", size = 5, alpha = .7) +
  annotate("point", x = 1, y = preds[2],
           color = "red", size = 5, alpha = .7) +
  ylim(0, 30)

p1

Let’s visualize the errors

Here’s a data frame holding the densities of a normal distribution along with the corresponding x values:

d <-
  tibble(
    x = seq(-z * sigma, z * sigma, length.out = 100),
    y = dnorm(x, mean = avg, sd = sigma)
  )

Let’s plot that:

ggplot(d) +
  aes(x = x, y = y) +
  geom_line()

In all its glory.

But not what we wanted: the curve lies flat along the x axis.
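What we want is that same curve standing upright at a given predictor value. The trick is to swap the axes, so that the density runs along x and the error values along y, and then to shift the curve onto the regression line. A minimal sketch for hp_z = 0, reusing d and preds from above:

d_rot <-
  tibble(
    x = d$y + 0,        # the density becomes a horizontal offset from hp_z = 0
    y = d$x + preds[1]  # the error values are centered on the predicted mpg
  )

ggplot(d_rot) +
  aes(x = x, y = y) +
  geom_path()

This swap-and-shift is exactly what the following function wraps up.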

Put it in a function

vertical_dnorm <- function(predictor_value, mod, pred_y, k = 2){
  
  sigma_mod <- sigma(mod)  # residual SD of the model
  
  out <-
    tibble(
      # error values, up to k residual SDs away from zero
      y = seq(-k * sigma_mod, k * sigma_mod, length.out = 100),
      # density of the errors, drawn along the x axis
      x = dnorm(y, mean = 0, sd = sigma_mod)
    ) %>% 
    # shift the curve to the predictor value and the predicted y value
    mutate(x = x + predictor_value,
           y = y + pred_y)
  
  return(out)
}

We have not standardized the height of the normal curve in the plot. The curve of the errors should appear equally tall, no matter what the scale of the x axis is. Let’s take care of that some other time.
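(A sketch of one possible fix, not used below: rescale the raw densities to a fixed maximum width before shifting them, e.g. with a small hypothetical helper.)

rescale_density <- function(dens, width = 0.8) {
  # hypothetical helper: scale densities so that the curve's peak extends
  # exactly `width` x-axis units from the regression line
  dens / max(dens) * width
}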

d1 <- vertical_dnorm(predictor_value = 0, mod = lm1, pred_y = predict(lm1, tibble(hp_z = 0))) 
d2 <- vertical_dnorm(predictor_value = -2, mod = lm1, pred_y = predict(lm1, tibble(hp_z = -2))) 
d3 <- vertical_dnorm(predictor_value = 2, mod = lm1, pred_y = predict(lm1, tibble(hp_z = 2))) 
head(d)
## # A tibble: 6 × 2
##       x       y
##   <dbl>   <dbl>
## 1 -11.6 0.00115
## 2 -11.4 0.00137
## 3 -11.1 0.00164
## 4 -10.9 0.00195
## 5 -10.7 0.00231
## 6 -10.4 0.00272

p1 +
  geom_path(data = d1, aes(x,y)) +
  geom_path(data = d2, aes(x,y)) +
  geom_path(data = d3, aes(x,y)) 

Yeah.

The sigmas look quite wide though. Let’s check:

sigma(lm1)
## [1] 3.862962

Ok: with \(\sigma \approx 3.86\) and the curves drawn out to \(k = 2\) SDs, each curve spans roughly \(\pm 7.7\) mpg units. The plot appears reasonable.