Load packages
library(tidyverse)
Simulation based inference
Simulation based inference (SBI) is an elegant way of subsuming a wide array of statistical inference methods under one umbrella. In addition, it is simple, which helps learners get to grips with it.
Here’s a summary of the central ideas.
However, this post does not aim to explain simulation based inference; that is done elsewhere.
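To fix the central idea before we apply it, here is a minimal base-R sketch of the permutation ("shuffle") logic behind SBI. The data and variable names are illustrative, not from this post:

```r
# Minimal sketch of a permutation test in base R (illustrative data)
set.seed(42)
group <- rep(c("a", "b"), each = 20)
value <- c(rnorm(20, mean = 0), rnorm(20, mean = 0.5))

# Observed difference in group means
obs_diff <- diff(tapply(value, group, mean))

# Shuffle the group labels many times to build the null distribution
null_diffs <- replicate(1000, {
  shuffled <- sample(group)
  diff(tapply(value, shuffled, mean))
})

# p-value: how often does a shuffled difference reach the observed one?
mean(abs(null_diffs) >= abs(obs_diff))
```

The same shuffle-compute-compare loop works for any test statistic; below we apply it to \(\chi^2\).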
Testing the association of two categorical variables
One application of statistical tests – simulation based or classical – is testing the association of two categorical variables.
Let’s try that using SBI.
Setup
library(mosaic)
data(tips, package = "reshape2")
Research question: Is there an association of sex and day
Let's test whether there's an association of sex
and day
.
First, the sample descriptives:
tally( sex ~ day, data = tips, format = "proportion")
#> day
#> sex Fri Sat Sun Thur
#> Female 0.4736842 0.3218391 0.2368421 0.5161290
#> Male 0.5263158 0.6781609 0.7631579 0.4838710
Interesting: On Saturday and Sunday, mostly men pay. On Thursday and Friday, the sexes pay about equally often. At least in our data. That brings us to the question: What about the population?
Chi-Square statistic
The \(\chi^2\) statistic gives us a measure of the deviation of the observed frequencies from the expected frequencies.
chi_value <- chisq(sex ~ day, data = tips)
Alas, this value is not so intuitive. We can extract the p-value from the theoretical distribution like this:
pchisq(chi_value, 3, lower.tail=FALSE)
#> X.squared
#> 0.004180302
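To see what `chisq()` actually measures, we can compute the statistic by hand from the observed and expected counts (base R only; the intermediate variable names are my own):

```r
# Observed 2x4 contingency table of sex by day
observed <- table(tips$sex, tips$day)

# Expected counts under independence: (row total * column total) / n
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-square: sum of squared deviations, scaled by the expected counts
sum((observed - expected)^2 / expected)
#> should match chisq(sex ~ day, data = tips), about 13.22
```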
Let’s compare it to the result of the \(\chi^2\) test:
x <- xchisq.test(sex ~ day, data = tips)
#>
#> Pearson's Chi-squared test
#>
#> data: x
#> X-squared = 13.222, df = 3, p-value = 0.00418
#>
#> 9 28 18 32
#> ( 6.77) (31.02) (27.10) (22.11)
#> [0.73] [0.29] [3.05] [4.43]
#> < 0.86> <-0.54> <-1.75> < 2.10>
#>
#> 10 59 58 30
#> (12.23) (55.98) (48.90) (39.89)
#> [0.41] [0.16] [1.69] [2.45]
#> <-0.64> < 0.40> < 1.30> <-1.57>
#>
#> key:
#> observed
#> (expected)
#> [contribution to X-squared]
#> <Pearson residual>
This gives (approximately) the same result.
\(\chi^2\) via SBI
This statistic can easily be submitted to SBI, i.e., the permutation test:
null1 <- do(1000) * chisq(sex ~ shuffle(day), data = tips)
Let’s check the distribution of the \(\chi^2\) statistic:
gf_histogram(~ X.squared, data = null1)
Now let’s add a marker for our empirical value:
x$statistic
#> X-squared
#> 13.222
Let's plot it:
gf_histogram(~ X.squared, data = null1) %>% gf_vline(xintercept = ~ x$statistic)
As can easily be seen, the empirical value is an infrequent event in the null distribution.
More precisely:
p_value <- prop(~ X.squared >= x$statistic, data = null1)
p_value
#> prop_TRUE
#> 0
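A simulated p-value of exactly 0 only means that none of the 1000 shuffled statistics reached the observed one. A common adjustment (not used in the post itself) adds one pseudo-observation so the estimate is never exactly zero:

```r
# "Add one" correction for resampling p-values: counts the observed
# statistic itself as one extreme draw, so the p-value is at least 1/(B+1)
(sum(null1$X.squared >= x$statistic) + 1) / (nrow(null1) + 1)
```

With 1000 resamples this gives roughly 1/1001, i.e., still well below the theoretical p-value's order of magnitude would suggest is implausible.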
Incidentally, this simulated p-value is approximately the value we got from the theoretical distribution.
We might want to add the critical value to this graph:
crit_value <- qchisq(.05, df = 3, lower.tail=FALSE)
crit_value
#> [1] 7.814728
crit_value
is the critical value of a \(\chi^2\)-distribution with \(df=3\) degrees of freedom at the 5% level.
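The critical value gives a simple decision rule, which we can check directly (comparison added for illustration):

```r
# Reject independence at the 5% level if the empirical statistic
# exceeds the critical value
x$statistic > crit_value
#> TRUE, since 13.222 > 7.815
```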
gf_histogram(~ X.squared, data = null1) %>%
gf_vline(xintercept = ~ x$statistic) %>%
gf_vline(xintercept = ~ crit_value, linetype = "dashed")