Modeling is a central part not only of statistical inquiry, but also of everyday human sense-making: in a broad sense, we use models as metaphors for the world. Of course, a model that explains the world better than some other model is to be preferred, all other things being equal. In this post, we demonstrate that a “cleverer” statistical model reduces the residual variance. Note that this noise reduction comes at a cost, however: the model becomes more complex, i.e., it has more parameters.
First, let’s load some data and usual packages:
library(tidyverse)
library(mosaic)
library(gridExtra)
data(tips, package = "reshape2")
We keep only the extreme groups (party sizes 1 and 6) in order to carve out the effect more clearly. In addition, let’s compute the residuals (relative to the grand mean) and the squared residuals:
tips2 <- filter(tips, size %in% c(1, 6)) %>%
  mutate(ID = 1:nrow(.),
         tip_resid = tip - mean(tip),
         tips_resid_quad = tip_resid^2) %>%
  group_by(size) %>%
  mutate(tip_mean_group = mean(tip))

glimpse(tips2)
## Rows: 8
## Columns: 11
## Groups: size [2]
## $ total_bill      <dbl> 3.07, 10.07, 7.25, 29.80, 34.30, 27.05, 48.17, 8.58
## $ tip             <dbl> 1.00, 1.83, 1.00, 4.20, 6.70, 5.00, 5.00, 1.92
## $ sex             <fct> Female, Female, Female, Female, Male, Female, Male, Ma…
## $ smoker          <fct> Yes, No, No, No, No, No, No, Yes
## $ day             <fct> Sat, Thur, Sat, Thur, Thur, Thur, Sun, Fri
## $ time            <fct> Dinner, Lunch, Dinner, Lunch, Lunch, Lunch, Dinner, Lu…
## $ size            <int> 1, 1, 1, 6, 6, 6, 6, 1
## $ ID              <int> 1, 2, 3, 4, 5, 6, 7, 8
## $ tip_resid       <dbl> -2.33125, -1.50125, -2.33125, 0.86875, 3.36875, 1.6687…
## $ tips_resid_quad <dbl> 5.4347266, 2.2537516, 5.4347266, 0.7547266, 11.3484766…
## $ tip_mean_group  <dbl> 1.4375, 1.4375, 1.4375, 5.2250, 5.2250, 5.2250, 5.2250…
tip_mean_group refers to the mean value of the response variable tip for each conditioned group.
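As a quick sanity check, the subgroup means can be reproduced in base R. The snippet below hard-codes the eight tip and size values printed by glimpse() above:

# tip and size values as printed by glimpse(tips2) above
tip  <- c(1.00, 1.83, 1.00, 4.20, 6.70, 5.00, 5.00, 1.92)
size <- c(1, 1, 1, 6, 6, 6, 6, 1)

# mean tip per subgroup; matches the tip_mean_group column
tapply(tip, size, mean)
##      1      6
## 1.4375 5.2250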
Next, we compute a helper data frame that stores the group means only:
tips_sum <- tips2 %>%
  group_by(size) %>%
  summarise(tip = mean(tip))
Now, let’s plot:
This is the plot for the simpler model, which ignores the subgroups. We model the target variable tip as the grand mean (of the whole data set).
p2 <- ggplot(tips2) +
  geom_hline(aes(yintercept = mean(tip))) +
  geom_segment(aes(x = ID, xend = ID,
                   y = tip, yend = mean(tip)),
               color = "grey80") +
  geom_point(aes(x = ID, y = tip)) +
  labs(title = "tip ~ 1")

p2
In contrast, consider the residuals when conditioned on the subgroup means:
p1 <- ggplot(tips2) +
  geom_hline(data = tips_sum,
             aes(yintercept = tip, color = factor(size))) +
  geom_segment(aes(x = ID, xend = ID,
                   y = tip, yend = tip_mean_group),
               color = "grey80") +
  geom_point(aes(x = ID, y = tip, color = factor(size))) +
  theme(legend.position = "none") +
  labs(title = "tip ~ size")

p1
As can be seen, the residuals are smaller in this model: the more complex model, which models the target variable as the subgroup average, explains the data better and thus shows a better fit.
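To put a number on this, we can compare the residual sum of squares of the two models. This is a quick base-R sketch that again hard-codes the eight observations printed by glimpse() above:

# tip and size values as printed by glimpse(tips2) above
tip  <- c(1.00, 1.83, 1.00, 4.20, 6.70, 5.00, 5.00, 1.92)
size <- c(1, 1, 1, 6, 6, 6, 6, 1)

# Model 1: tip ~ 1, residuals relative to the grand mean
ssr_null <- sum((tip - mean(tip))^2)

# Model 2: tip ~ size, residuals relative to the subgroup means
group_mean <- ave(tip, size)  # subgroup mean for each observation
ssr_group  <- sum((tip - group_mean)^2)

With these eight observations, the residual sum of squares drops from about 32.79 under the grand-mean model to about 4.10 under the subgroup model.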
grid.arrange(p2, p1, nrow = 1)