---
title: Eliminating a factor reduces variance
author: ''
date: '2018-12-10'
slug: eliminating-a-factor-reduces-variance
draft: TRUE
categories:
  - rstats
tags:
  - tutorial
  - plotting
---

A well-known way to reduce variability and increase power in experimental (and observational) research designs is to eliminate a factor that may influence the outcome variable.

“Eliminating” a factor means, by and large, holding it constant.

Consider the following example. Say an experiment is performed with two groups, and the experimental group shows higher values than the control group. Assume that only male subjects were selected for this experiment. More formally:

$$
\begin{aligned}
y_{i \mid G=e, S=m} &\sim N(1, 1) \\
y_{i \mid G=c, S=m} &\sim N(0, 1)
\end{aligned}
$$

Here, $G$ is the group and $S$ the sex: $e$ refers to the experimental group and $c$ to the control group; $f$ ($m$) denotes females (males).

Now suppose that in another trial of the same experiment, only females were selected, and the distributions are as follows:

$$
\begin{aligned}
y_{i \mid G=e, S=f} &\sim N(2, 1) \\
y_{i \mid G=c, S=f} &\sim N(1, 1)
\end{aligned}
$$

That is, the means for women are one SD higher than those for men.

First, we simulate the data in R:

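library(tidyverse)

n <- 1e05     # observations per group
set.seed(42)  # make the simulation reproducible

# One column per group: e/c = experimental/control, m/f = male/female.
# tibble() replaces the deprecated data_frame().
d <- tibble(
  e_m = rnorm(n = n, mean = 1, sd = 1),
  c_m = rnorm(n = n, mean = 0, sd = 1),
  e_f = rnorm(n = n, mean = 2, sd = 1),
  c_f = rnorm(n = n, mean = 1, sd = 1)
)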

Next, convert this “wide” format to “long” (i.e., tidy) format.

d_long <-
  d %>%
  gather(key = group, value = value) %>%
  separate(col = group, into = c("exp", "sex"))

# Same values without any grouping columns
d_no_group <-
  d_long %>%
  select(value)
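As a side note, `gather()` is superseded in current versions of tidyr. The following is a minimal sketch of the same reshaping step with `pivot_longer()`, assuming tidyr 1.0.0 or later. The toy data frame `d_toy` and its values are made up for illustration and are not part of the simulation.

```r
library(tidyr)   # pivot_longer() requires tidyr >= 1.0.0
library(tibble)

# Hypothetical stand-in for the wide data frame `d` (two rows, made-up values)
d_toy <- tibble(
  e_m = c(1.2, 0.8),
  c_m = c(0.1, -0.3),
  e_f = c(2.1, 1.9),
  c_f = c(1.0, 1.4)
)

# Split column names such as "e_m" at the underscore into `exp` and `sex`
d_toy_long <- pivot_longer(
  d_toy,
  cols = everything(),
  names_to = c("exp", "sex"),
  names_sep = "_",
  values_to = "value"
)

d_toy_long
```

The row order differs from that of `gather()` (rows are stacked row by row rather than column by column), which does not affect the grouped summaries used in this post.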

Now let’s plot. First, as density:

d_long %>%
  ggplot(aes(x = value)) +
  facet_grid(~sex) +
  # grey: all data pooled, repeated in every facet for comparison
  geom_density(data = d_no_group, alpha = .3, fill = "grey20") +
  geom_density(aes(color = exp))

Similarly, as boxplot:

d_long %>%
  ggplot(aes(x = exp, y = value, fill = exp)) +
  facet_grid(~sex) +
  geom_boxplot()

As can be seen, the ungrouped data frame shows larger variability than the individual groups.

d_no_group %>%
  ggplot(aes(x = value)) +
  geom_density()

d_no_group %>%
  ggplot(aes(x = "all", y = value)) +
  geom_boxplot()

Let’s check the figures exactly. SDs for the individual groups:

d_long %>%
  group_by(sex, exp) %>%
  summarise(sd(value))
## # A tibble: 4 x 3
## # Groups:   sex 
##   sex   exp   sd(value)
##   <chr> <chr>       <dbl>
## 1 f     c           0.997
## 2 f     e           1.00
## 3 m     c           1.00
## 4 m     e           1.00

For the whole, ungrouped data:

d_long %>%
  summarise(sd(value))
## # A tibble: 1 x 1
##   sd(value)
##         <dbl>
## 1        1.23
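This value can be anticipated from the law of total variance. The following is a short worked derivation, assuming four equally sized groups as in the simulation; the symbols $\mu_g$ and $\sigma_g^2$ (mean and variance of group $g$) and $\bar{\mu}$ (mean of the four group means, here 1) are introduced only for this sketch.

$$
\begin{aligned}
\operatorname{Var}(y) &= \underbrace{\frac{1}{4}\sum_{g} \sigma_g^2}_{\text{within groups}} + \underbrace{\frac{1}{4}\sum_{g} (\mu_g - \bar{\mu})^2}_{\text{between groups}} \\
&= 1 + \frac{(0-1)^2 + (1-1)^2 + (1-1)^2 + (2-1)^2}{4} \\
&= 1.5
\end{aligned}
$$

Hence the expected SD is $\sqrt{1.5} \approx 1.22$, close to the simulated value.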

And for the “half” grouped data (i.e., only one grouping variable):

d_long %>%
  group_by(sex) %>%
  summarise(sd(value))
## # A tibble: 2 x 2
##   sex   sd(value)
##   <chr>       <dbl>
## 1 f            1.12
## 2 m            1.12

d_long %>%
  group_by(exp) %>%
  summarise(sd(value))
## # A tibble: 2 x 2
##   exp   sd(value)
##   <chr>       <dbl>
## 1 c            1.12
## 2 e            1.12
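To connect these numbers back to the opening claim about power, here is a minimal sketch in base R. It computes the theoretical SDs from the group means of the model (equal group sizes assumed) and then compares the approximate power of a two-sample t-test when sex varies versus when it is held constant. The helper `mixture_sd()` and the sample size of 20 per group are assumptions chosen for illustration. Note that `power.t.test()` assumes normal data, so the result for the mixed-sex case is only an approximation.

```r
# Group means from the model above; every group has SD 1 (variance 1)
mu <- c(c_m = 0, e_m = 1, c_f = 1, e_f = 2)
var_within <- 1

# Hypothetical helper: SD of an equal-weight mixture of groups,
# sqrt(within-group variance + variance of the group means)
mixture_sd <- function(means, var_within) {
  sqrt(var_within + mean((means - mean(means))^2))
}

sd_all    <- mixture_sd(mu, var_within)                   # all four groups pooled
sd_by_sex <- mixture_sd(mu[c("c_m", "e_m")], var_within)  # one sex, both groups pooled
sd_by_exp <- mixture_sd(mu[c("c_m", "c_f")], var_within)  # one group, both sexes pooled
round(c(all = sd_all, by_sex = sd_by_sex, by_exp = sd_by_exp), 3)

# Approximate power of a two-sample t-test for a mean difference of 1,
# with an assumed n = 20 per group (illustrative choice, not from the post)
power_mixed <- power.t.test(n = 20, delta = 1, sd = sd_by_exp)$power  # sex varies
power_fixed <- power.t.test(n = 20, delta = 1, sd = 1)$power          # sex held constant
c(sex_varies = power_mixed, sex_constant = power_fixed)
```

The theoretical SDs come out as roughly 1.225, 1.118, and 1.118, in line with the simulated values. Holding sex constant lowers the within-group SD from about 1.118 to 1, which is what raises the power for the same sample size.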