It is well-known that the empirical variance underestimates the population variance. Specifically, the empirical variance is defined as: \(var_{emp} = \frac{\sum_i (x_i - \bar{x})^2}{n-1}\). But why \(n-1\), why not just \(n\), as intuition (of some) dictates? Put shortly, as the variance of a sample tends to underestimate the population variance we have to inflate it artificially, to enlarge it, that’s why we do put a *smaller* number (the “n-1”) in the denominator, resulting in a *larger* value of the whole fraction. This larger value is called the empirical variance, it estimates the “real” population variance well.

In this post, we will not prove these ideas, but we will test it empirically. That is, we will let R draw samples and check whether the variance of the samples is slightly smaller than the variance in the population.

Load some packages:

`library(tidyverse)`

`## ── Attaching packages ──────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──`

```
## ✔ ggplot2 2.2.1.9000 ✔ purrr 0.2.4
## ✔ tibble 1.4.2 ✔ dplyr 0.7.4
## ✔ tidyr 0.8.0 ✔ stringr 1.3.0
## ✔ readr 1.1.1 ✔ forcats 0.3.0
```

```
## ── Conflicts ─────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
```

`library(magrittr)`

```
##
## Attaching package: 'magrittr'
```

```
## The following object is masked from 'package:purrr':
##
## set_names
```

```
## The following object is masked from 'package:tidyr':
##
## extract
```

Then we’ll define a function that computes the empirical variance `var_emp`

from a sample taken of a standard normal distribution.

```
var_emp <- function(n = 10) {
rnorm(n = n, mean = 0, sd = 1) %>%
var()
}
# test it:
set.seed(42)
var_emp()
```

`## [1] 0.6979747`

By the way, to get the variance (or sd) of a *sample* as a description of the sample, we can use this computation:

```
n <- 10 # sample size
n_max <- 200 # max sample size to consider later on
var_sample <- function(n = 10) {
rnorm(n = n, mean = 0, sd = 1) %>%
var(.) %>%
`*`((n-1)/n)
}
set.seed(42)
var_sample()
```

`## [1] 0.6281773`

The “correction factor” of \((n-1)/n\) amounts to a factor of .9 in the case of \(n=10\), ie., \((10-1)/10\).

But one sample is not sufficient to gauge an overall picture. Rather let’s draw many samples, say \(m=1000\), and compute the empirical variance and the sample variance each time.

```
m <- 1000
df_variance <- data_frame(id = 1:m)
df_variance %<>%
mutate(var_emp_vec = replicate(n = m, var_emp()),
var_sample_vec = replicate(n = m,
var_sample())) %>%
gather(key = variance_type, value = variance, -id)
```

See here the distributions:

```
df_variance %>%
ggplot() +
aes(x = variance, fill = variance_type) +
geom_density(alpha = .5)
```

The sample variance distribution leans towards slightly smaller values, it appears.

What’s the central tendency in each case?

```
df_variance %>%
group_by(variance_type) %>%
summarise(mean_variance = mean(variance))
```

```
## # A tibble: 2 x 2
## variance_type mean_variance
## <chr> <dbl>
## 1 var_emp_vec 1.01
## 2 var_sample_vec 0.916
```

What about the quantiles of this sampling distribution?

```
df_variance %>%
split(.$variance_type) %>%
map(~quantile(.$variance, probs = c(.055, .5, .945)))
```

```
## $var_emp_vec
## 5.5% 50% 94.5%
## 0.3597041 0.9429192 1.8472943
##
## $var_sample_vec
## 5.5% 50% 94.5%
## 0.3542062 0.8542238 1.6649809
```

Our results show that the variance of the sample is smaller than the empirical variance; however even the empirical variance too is a little too small compared with the population variance (which is 1). Note that sample size was \(n=10\) in each draw of the simulation. With sample size increasing, both should get closer to the “real” (population) sample size (although the bias is negligible for the empirical variance). Let’s check that.

First let’s define a function which does the simulation for a given sample size \(n\) and a given number of samples, \(m\).

```
simu_mean <- function(n = 10, m = 1000, correction = 1){
replicate(n = m, var_emp(n = n) * correction) %>%
mean()
}
```

This function draws a sample of size \(n\), computes the (empirical) variance, and multiply the result with a correction factor if needed (to estimate the sample variance). This procedure is repeated for \(m\) times. Finally, all \(m\) variance values are averaged:

```
set.seed(42)
simu_mean()
```

`## [1] 1.006739`

Now we let this function run for different samples sizes:

```
sizes <- 2:n_max
sizes %>%
map_dbl(~simu_mean(n = .)) %>%
as_tibble -> simu_df
simu_df %<>%
mutate(sample_size = sizes)
names(simu_df)[1] <- "variance_empirical"
```

Finally, let’s plot the resulting curve:

```
simu_df %>%
ggplot +
aes(x = sample_size, y = variance_empirical) +
geom_line() +
scale_y_continuous(limits = c(0.5, 1.5))
```

Even with small samples, the variance is close to the real variance in the population (1).

Compare this to the sample size curve when the variance is computing with \(n\) in the denominator (instead of \(n-1\)).

```
sizes <-2:n_max
sizes %>%
map_dbl(~simu_mean(n = ., correction = (. -1) / .)) %>%
as_tibble -> simu_df_sample
simu_df_sample %<>%
mutate(sample_size = sizes)
names(simu_df_sample)[1] <-"variance_sample"
```

```
simu_df %>%
left_join(simu_df_sample, by = "sample_size") %>%
gather(key = variance_type, value = variance_value, -sample_size) %>%
ggplot() +
aes(x = sample_size, y = variance_value, color = variance_type) +
geom_line()
```

The sample variance underestimates the true value of the variance - at least for small samples. This bias diminishes when sample size increases.

Of interest, what is the variability of the mean variance for a given sample size? Let’s add these estimates of uncertainty.

This function computes the sd of the variances in the \(m\) samples (additionally to the mean values of the variances):

```
simu_mean_sd <- function(n = 10, m = 1000, correction = 1){
simu_var_vec <- replicate(n = m, var_emp(n = n) * correction)
simu_var_mean <- mean(simu_var_vec)
simu_var_sd <- sd(simu_var_vec)
return(list(simu_var_mean = simu_var_mean,
simu_var_sd = simu_var_sd))
}
```

Again, we run the function for different samples sizes:

```
sizes <- 2:n_max
df_variance_sds <- data_frame(sample_size = sizes)
df_variance_sds %<>%
rowwise() %>%
mutate(simu_mean_sd = simu_mean_sd(n = sample_size)[[2]])
df_variance_sds %<>%
left_join(simu_df)
```

`## Joining, by = "sample_size"`

Again, let’s plot the resulting curve:

```
df_variance_sds %>%
ggplot +
aes(x = sample_size, y = variance_empirical) +
geom_ribbon(aes(ymin = variance_empirical - simu_mean_sd,
ymax = variance_empirical + simu_mean_sd),
fill = "grey80") +
geom_line() +
scale_y_continuous(limits = c(0.5, 1.5))
```

Hey, the sd is still relatively huge. Would be interesting too know at which sample size the sd gets “small”, smaller than a desired width, \(w\). Say, we would like to know the sample size for \(w = 0.2\) which amounts to an sd of 0.1.

Let’s run our simulation a longer distance:

```
n_max <- 1000
sizes <- 2:n_max
df_variance_sds <- data_frame(sample_size = sizes)
df_variance_sds %<>%
rowwise() %>%
mutate(simu_mean_sd = simu_mean_sd(n = sample_size)[[2]],
simu_mean_avg = simu_mean_sd(n = sample_size)[[1]])
```

Pooh! Needs some computation time.

How let’s glimpse at the course of the sd interval:

```
df_variance_sds %>%
ggplot +
aes(x = sample_size, y = simu_mean_avg) +
geom_ribbon(aes(ymin = simu_mean_avg - simu_mean_sd,
ymax = simu_mean_avg + simu_mean_sd),
fill = "grey80") +
geom_line() +
scale_y_continuous(limits = c(0.5, 1.5))
```

What about the actual numbers. What are the smallest widths \(w\)?

```
df_variance_sds %<>%
mutate(width = 2 * simu_mean_sd) %>%
arrange(width)
df_variance_sds
```

```
## # A tibble: 999 x 4
## sample_size simu_mean_sd simu_mean_avg width
## <int> <dbl> <dbl> <dbl>
## 1 972 0.0432 1.000 0.0864
## 2 968 0.0434 1.00 0.0867
## 3 992 0.0439 1.00 0.0877
## 4 981 0.0441 0.998 0.0882
## 5 996 0.0441 0.999 0.0882
## 6 938 0.0442 0.998 0.0885
## 7 995 0.0442 0.998 0.0885
## 8 982 0.0442 1.00 0.0885
## 9 984 0.0443 1.00 0.0885
## 10 976 0.0443 0.998 0.0886
## # ... with 989 more rows
```

Straightforward `max`

and `min`

values:

`max(df_variance_sds$width)`

`## [1] 2.59451`

`min(df_variance_sds$width)`

`## [1] 0.08639865`

And, the answer to the actual question asked, at what sample size falls \(w\) below 0.2?

`quantile(df_variance_sds$width, probs = c(0.1, 0.5, 0.9)) `

```
## 10% 50% 90%
## 0.09420279 0.12617887 0.28190605
```

```
df_variance_sds %>%
filter(width < 0.201 & width > 0.199) %>%
arrange(sample_size)
```

```
## # A tibble: 4 x 4
## sample_size simu_mean_sd simu_mean_avg width
## <int> <dbl> <dbl> <dbl>
## 1 190 0.100 0.997 0.200
## 2 191 0.100 0.996 0.200
## 3 204 0.0999 1.00 0.200
## 4 207 0.0995 0.997 0.199
```

So, at about \(n=200\), our sample size falls within the \(w=0.2\) corridor.

Same question for sd:

`quantile(df_variance_sds$simu_mean_sd, probs = c(0.01, 0.1, 0.5, 0.9, .99)) `

```
## 1% 10% 50% 90% 99%
## 0.04430252 0.04710139 0.06308943 0.14095303 0.43023127
```

# Debrief

We have not really discussed “why” the sample variance is smaller than the “real” variance, that is why it is somewhat biased in small samples. But we have gone through some evidence that it is the case. One quite nice intuitive explanation “why” this happens can be found in the book of Chris Bishop.