Simulation-based inference

Simulation-based inference (SBI) is an elegant way of subsuming a wide array of statistical inference methods under one umbrella. In addition, it is conceptually simple, which helps learners get to grips with it.
Here’s a summary of the central ideas.
Note, however, that this post does not aim to explain simulation-based inference itself; that is done elsewhere.
Testing the association of two categorical variables
One application of statistical tests – simulation based or classical – is testing the association of two categorical variables.
Let’s try that using SBI.
```r
library(mosaic)
data(tips, package = "reshape2")
```
Research question: Is there an association of day and sex?

Let's test whether these two variables are associated.
First, the sample descriptives:
```r
tally(sex ~ day, data = tips, format = "proportion")
#>         day
#> sex            Fri       Sat       Sun      Thur
#>   Female 0.4736842 0.3218391 0.2368421 0.5161290
#>   Male   0.5263158 0.6781609 0.7631579 0.4838710
```
Interesting: On Saturday and Sunday, it is mostly men who pay. On Thursday and Friday, the sexes pay about equally often. At least in our data. That brings us to the point: what about the population?
The \(\chi^2\) statistic gives us a measure of deviation from expected value.
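In symbols, the \(\chi^2\) statistic sums the squared deviations of observed from expected cell counts, scaled by the expected counts:

\[
\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}
\]

where \(O_i\) are the observed and \(E_i\) the expected frequencies of the cells of the contingency table. With 2 rows and 4 columns here, the degrees of freedom are \((2-1)(4-1) = 3\).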
```r
chi_value <- chisq(sex ~ day, data = tips)
```
Alas, this value is not so intuitive. We can extract the p-value from the theoretical distribution like this:
```r
pchisq(chi_value, 3, lower.tail = FALSE)
#> X.squared
#> 0.004180302
```
Let’s compare it to the result of the \(\chi^2\) test:
```r
x <- xchisq.test(sex ~ day, data = tips)
#>
#>  Pearson's Chi-squared test
#>
#> data:  x
#> X-squared = 13.222, df = 3, p-value = 0.00418
#>
#>     9        28       18       32
#> ( 6.77)  (31.02)  (27.10)  (22.11)
#> [0.73]   [0.29]   [3.05]   [4.43]
#> < 0.86>  <-0.54>  <-1.75>  < 2.10>
#>
#>    10        59       58       30
#> (12.23)  (55.98)  (48.90)  (39.89)
#> [0.41]   [0.16]   [1.69]   [2.45]
#> <-0.64>  < 0.40>  < 1.30>  <-1.57>
#>
#> key:
#>  observed
#>  (expected)
#>  [contribution to X-squared]
#>  <Pearson residual>
```
This p-value is (approximately) the same as the one computed from the theoretical distribution above.
\(\chi^2\) via SBI
This statistic can easily be submitted to SBI, i.e., a permutation test:
```r
null1 <- do(1000) * chisq(sex ~ shuffle(day), data = tips)
```
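For readers who prefer to see the mechanics, what `shuffle()` and `do()` accomplish can be sketched in base R (this is a sketch, not the mosaic implementation; `chisq.test()` from the stats package stands in for mosaic's `chisq()`):

```r
# Sketch of the permutation scheme: repeatedly permute one variable,
# recompute the test statistic, and collect the resulting null distribution.
perm_chisq_null <- function(x, y, B = 1000) {
  replicate(B, {
    y_shuffled <- sample(y)  # permuting y breaks any x-y association
    suppressWarnings(chisq.test(table(x, y_shuffled))$statistic)
  })
}

# e.g.: null_manual <- perm_chisq_null(tips$sex, tips$day)
```

Because the permutation destroys any real association while keeping the marginal counts intact, the collected statistics describe what \(\chi^2\) values to expect if the null hypothesis were true.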
Let’s check the distribution of the \(\chi^2\) statistic:
```r
gf_histogram(~ X.squared, data = null1)
```
Now let’s add a marker for our empirical value:
```r
x$statistic
#> X-squared
#>    13.222
```
```r
gf_histogram(~ X.squared, data = null1) %>%
  gf_vline(xintercept = ~ x$statistic)
```
As can easily be seen, the empirical value is an infrequent event in the null distribution.
```r
p_value <- prop(~ X.squared >= x$statistic, data = null1)
p_value
#> prop_TRUE
#>         0
```
Incidentally, this simulated p-value is consistent with the small value we got from the theoretical distribution.
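A caveat: with 1000 permutations, a simulated p-value of exactly zero only means "smaller than about 1/1000". A common correction is to report \((r+1)/(B+1)\), where \(r\) counts the null statistics at least as extreme as the observed one and \(B\) is the number of permutations. A minimal sketch (the function name is mine, not part of mosaic):

```r
# Corrected permutation p-value: never exactly zero, since the observed
# arrangement itself counts as one of the B + 1 possible arrangements.
perm_p_value <- function(null_stats, observed) {
  (sum(null_stats >= observed) + 1) / (length(null_stats) + 1)
}

# e.g.: perm_p_value(null1$X.squared, x$statistic)
```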
We might want to add the critical value to this graph:
```r
crit_value <- qchisq(.05, df = 3, lower.tail = FALSE)
crit_value
#>  7.814728
```
crit_value gives us the critical value at the 5% level for a \(\chi^2\) distribution with \(df = 3\) degrees of freedom.
```r
gf_histogram(~ X.squared, data = null1) %>%
  gf_vline(xintercept = ~ x$statistic) %>%
  gf_vline(xintercept = ~ crit_value, linetype = "dashed")
```