# Load packages

`library(tidyverse)`

# Simulation based inference

Simulation-based inference (SBI) is an elegant way of subsuming a wide array of statistical inference methods under one umbrella. In addition, it is conceptually simple, which helps learners get to grips with statistical inference.

The central idea in brief: enforce the null hypothesis on many resamples of the data (e.g., by permuting labels), compute the test statistic on each resample, and compare the empirical statistic to the resulting null distribution. However, this post does not aim to explain simulation-based inference in depth; that is done elsewhere.

# Testing the association of two categorical variables

One application of statistical tests – simulation based or classical – is testing the association of two categorical variables.

Let’s try that using SBI.

# Setup

```
library(mosaic)
data(tips, package = "reshape2")
```

# Research question: Is there an association of sex and day?

Let’s test whether there’s an association of `sex` and `day`.

First, the sample descriptives:

```
tally(sex ~ day, data = tips, format = "proportion")
#>         day
#> sex            Fri       Sat       Sun      Thur
#>   Female 0.4736842 0.3218391 0.2368421 0.5161290
#>   Male   0.5263158 0.6781609 0.7631579 0.4838710
```

Interesting: on Saturday and Sunday, mostly men pay. On Thursday and Friday, the payer’s sex is roughly balanced. At least in our sample. That brings us to the point: what about the population?

# Chi-Square statistic

The \(\chi^2\) statistic measures the deviation of the observed frequencies from the frequencies expected under independence.

`chi_value <- chisq(sex ~ day, data = tips)`
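As a reminder, the statistic sums the squared deviations of observed (\(O\)) from expected (\(E\)) counts, scaled by the expected counts: \(\chi^2 = \sum \frac{(O - E)^2}{E}\). As a sketch of what `chisq()` computes, here is the same value by hand in base R; the counts are the observed sex-by-day frequencies from `tips`:

```
# Observed sex-by-day frequencies (as counted in the tips data)
observed <- matrix(c(9, 28, 18, 32,
                     10, 59, 58, 30),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(sex = c("Female", "Male"),
                                   day = c("Fri", "Sat", "Sun", "Thur")))

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-square: sum of squared deviations, scaled by the expected counts
sum((observed - expected)^2 / expected)
#> [1] 13.22201
```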

Alas, this value is not very intuitive by itself. We can extract the p-value from the theoretical distribution, with \(df = (2-1)(4-1) = 3\) degrees of freedom, like this:

```
pchisq(chi_value, 3, lower.tail=FALSE)
#> X.squared
#> 0.004180302
```

Let’s compare it to the result of the \(\chi^2\) test:

```
x <- xchisq.test(sex ~ day, data = tips)
#>
#> Pearson's Chi-squared test
#>
#> data: x
#> X-squared = 13.222, df = 3, p-value = 0.00418
#>
#> 9 28 18 32
#> ( 6.77) (31.02) (27.10) (22.11)
#> [0.73] [0.29] [3.05] [4.43]
#> < 0.86> <-0.54> <-1.75> < 2.10>
#>
#> 10 59 58 30
#> (12.23) (55.98) (48.90) (39.89)
#> [0.41] [0.16] [1.69] [2.45]
#> <-0.64> < 0.40> < 1.30> <-1.57>
#>
#> key:
#> observed
#> (expected)
#> [contribution to X-squared]
#> <Pearson residual>
```

That is (approximately) the same result.

# \(\chi^2\) via SBI

This statistic can easily be submitted to SBI, i.e., a permutation test:

`null1 <- do(1000) * chisq(sex ~ shuffle(day), data = tips)`
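Under the hood, `shuffle()` permutes the `day` labels, which enforces the null hypothesis of no association while keeping the marginal frequencies intact, and `do(1000) *` repeats this 1000 times. A minimal base-R sketch of the same idea, without mosaic; the raw labels are reconstructed here from the observed counts:

```
set.seed(42)  # reproducibility of the resampling

# Reconstruct the raw sex and day labels from the observed frequencies
sex <- rep(c("Female", "Male"), times = c(87, 157))
day <- c(rep(c("Fri", "Sat", "Sun", "Thur"), times = c(9, 28, 18, 32)),   # Female
         rep(c("Fri", "Sat", "Sun", "Thur"), times = c(10, 59, 58, 30)))  # Male

# Helper: chi-square statistic of a two-way table (no continuity correction
# is applied by chisq.test here, as the table is larger than 2x2)
chisq_stat <- function(x, y) unname(chisq.test(table(x, y))$statistic)

# Null distribution: shuffle the day labels, recompute the statistic
null_dist <- replicate(1000, chisq_stat(sex, sample(day)))

# Compare the empirical value to the null distribution
observed_stat <- chisq_stat(sex, day)  # about 13.22
mean(null_dist >= observed_stat)
```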

Let’s check the distribution of the \(\chi^2\) statistic:

`gf_histogram(~ X.squared, data = null1)`

Now let’s add a marker for our empirical value:

```
x$statistic
#> X-squared
#> 13.222
```

# Plot it

`gf_histogram(~ X.squared, data = null1) %>% gf_vline(xintercept = ~ x$statistic)`

As can easily be seen, the empirical value is an infrequent event in the null distribution.

More precisely:

```
p_value <- prop(~ X.squared >= x$statistic, data = null1)
p_value
#> prop_TRUE
#> 0
```

Incidentally, this simulated p-value is consistent with the value we got from the theoretical distribution: none of the 1000 resamples reached the empirical statistic, in line with the small theoretical p-value of about 0.004.
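A side note on the zero: with a finite number of resamples, a simulated p-value of exactly 0 only means that no resample reached the empirical statistic. A common convention is to report \((k + 1)/(n + 1)\) instead, where \(k\) is the number of resamples at least as extreme and \(n\) the number of resamples; this avoids claiming an impossible p-value of exactly zero. A sketch:

```
k <- 0     # resamples with X.squared >= the empirical statistic
n <- 1000  # number of resamples
(k + 1) / (n + 1)
#> [1] 0.000999001
```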

We might want to add the critical value to this graph:

```
crit_value <- qchisq(.05, df = 3, lower.tail=FALSE)
crit_value
#> [1] 7.814728
```

This is the critical value of a \(\chi^2\)-distribution with \(df = 3\) degrees of freedom at the 5% level. Since the empirical value (13.22) exceeds it, we would reject the null hypothesis of no association.

```
gf_histogram(~ X.squared, data = null1) %>%
  gf_vline(xintercept = ~ x$statistic) %>%
  gf_vline(xintercept = ~ crit_value, linetype = "dashed")
```