Modeling is a central part not only of statistical inquiry, but also of everyday human sense-making. We use models as metaphors for the world, in a broader sense. Of course, a model that explains the world better (than some other model) is to be preferred, all other things being equal. In this post, we demonstrate that a more “clever” statistical model reduces the residual variance. It should be noted that this “noise reduction” comes at a cost, however: The model gets more complex; there a more parameters in the model.

First, let’s load some data and usual packages:

```
library(tidyverse)
library(mosaic)
library(gridExtra)
data(tips, package = "reshape2")
```

We filter only the extreme groups in order to carve out the effect more lucidly. In addition, let’s compute the residuals and the square residuals:

```
tips2 <- filter(tips, size %in% c(1, 6)) %>%
mutate(ID = 1:nrow(.),
tip_resid = tip - mean(tip),
tips_resid_quad =tip_resid^2) %>%
group_by(size) %>%
mutate(tip_mean_group = mean(tip))
glimpse(tips2)
```

```
## Rows: 8
## Columns: 11
## Groups: size [2]
## $ total_bill <dbl> 3.07, 10.07, 7.25, 29.80, 34.30, 27.05, 48.17, 8.58
## $ tip <dbl> 1.00, 1.83, 1.00, 4.20, 6.70, 5.00, 5.00, 1.92
## $ sex <fct> Female, Female, Female, Female, Male, Female, Male, Ma…
## $ smoker <fct> Yes, No, No, No, No, No, No, Yes
## $ day <fct> Sat, Thur, Sat, Thur, Thur, Thur, Sun, Fri
## $ time <fct> Dinner, Lunch, Dinner, Lunch, Lunch, Lunch, Dinner, Lu…
## $ size <int> 1, 1, 1, 6, 6, 6, 6, 1
## $ ID <int> 1, 2, 3, 4, 5, 6, 7, 8
## $ tip_resid <dbl> -2.33125, -1.50125, -2.33125, 0.86875, 3.36875, 1.6687…
## $ tips_resid_quad <dbl> 5.4347266, 2.2537516, 5.4347266, 0.7547266, 11.3484766…
## $ tip_mean_group <dbl> 1.4375, 1.4375, 1.4375, 5.2250, 5.2250, 5.2250, 5.2250…
```

`tip_mean_group`

refers to the mean value of the response variable `tip`

for each conditioned group.

Next, we compute a helper data frame that stores the group means only:

```
tips_sum <- tips2 %>%
group_by(size) %>%
summarise(tip = mean(tip))
```

Now, let’s plot:

That’s the plot for the *simpler* model without consideration of subgroups. We model the target variable `tip`

as the grand mean (of the whole data set).

```
p2 <- ggplot(tips2) +
geom_hline(aes(yintercept = mean(tip))) +
geom_segment(aes(x = ID,
xend = ID,
y = tip,
yend = mean(tip)),
color = "grey80") +
geom_point(aes(x = ID, y = tip)) +
labs(title = "tip ~ 1")
p2
```

In contrast, consider the residuals when conditioned on the subgroup means:

```
p1 <- ggplot(tips2) +
geom_hline(data = tips_sum,
aes(yintercept = tip,
color = factor(size))) +
geom_segment(aes(x = ID,
xend = ID,
y = tip,
yend = tip_mean_group),
color = "grey80") +
geom_point(aes(x = ID, y = tip,
color = factor(size))) +
theme(legend.position = "none") +
labs(title = "tip ~ size")
p1
```

As can be seen, the residuals become *smaller* in this model: The more complex model, where the target variable is modeled as the subgroup average, explains the data better. This models shows a better fit to the data.

`grid.arrange(p2, p1, nrow = 1)`