# Chi-squared test using simulation-based inference

library(tidyverse)

# Simulation based inference

Simulation-based inference (SBI) is an elegant way of subsuming a wide array of statistical inference methods under one umbrella. In addition, it is simple, which helps learners get to grips with it.

However, this post does not aim to explain simulation-based inference; a summary of the central ideas is given elsewhere.

# Testing the association of two categorical variables

One application of statistical tests – simulation-based or classical – is testing the association of two categorical variables.

Let’s try that using SBI.

# Setup

library(mosaic)
data(tips, package = "reshape2")

# Research question: Is there an association of sex and day?

Let’s test whether there’s an association of sex and day.

First, the sample descriptives:

tally( sex ~ day, data = tips, format = "proportion")
#>         day
#> sex            Fri       Sat       Sun      Thur
#>   Female 0.4736842 0.3218391 0.2368421 0.5161290
#>   Male   0.5263158 0.6781609 0.7631579 0.4838710
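The same cross-table can be reproduced without mosaic; a base-R sketch using `prop.table`, assuming the tips data is loaded as in the setup above:

```r
# Base-R equivalent of the tally() call above
data(tips, package = "reshape2")   # same data as in the setup
tab <- table(tips$sex, tips$day)   # raw counts
prop.table(tab, margin = 2)        # column-wise proportions, as in tally()
```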

Interesting: On Saturday and Sunday, mostly men pay. On Thursday and Friday, payment is split roughly evenly between the sexes. At least in our data. That brings us to the point: what about the population?

# Chi-Square statistic

The $$\chi^2$$ statistic gives us a measure of the deviation of the observed counts from the counts expected under the null hypothesis of no association.

chi_value <- chisq(sex ~ day, data = tips)

Alas, this value is not so intuitive. We can get the p-value from the theoretical $$\chi^2$$ distribution with $$df = (2-1)\cdot(4-1) = 3$$ degrees of freedom like this:

pchisq(chi_value, 3, lower.tail=FALSE)
#>   X.squared
#> 0.004180302

Let’s compare it to the result of the $$\chi^2$$ test:

x <- xchisq.test(sex ~ day, data = tips)
#>
#>  Pearson's Chi-squared test
#>
#> data:  x
#> X-squared = 13.222, df = 3, p-value = 0.00418
#>
#>     9       28       18       32
#> ( 6.77)  (31.02)  (27.10)  (22.11)
#>  [0.73]   [0.29]   [3.05]   [4.43]
#> < 0.86>  <-0.54>  <-1.75>  < 2.10>
#>
#>    10       59       58       30
#> (12.23)  (55.98)  (48.90)  (39.89)
#>  [0.41]   [0.16]   [1.69]   [2.45]
#> <-0.64>  < 0.40>  < 1.30>  <-1.57>
#>
#> key:
#>  observed
#>  (expected)
#>  [contribution to X-squared]
#>  <Pearson residual>

This is (approximately) the same result.
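To see where the 13.222 comes from, we can recompute the statistic by hand; a base-R sketch with the observed counts hard-coded from the cross-table printed above:

```r
# Observed counts (rows: Female, Male; columns: Fri, Sat, Sun, Thur)
obs <- matrix(c( 9, 28, 18, 32,
                10, 59, 58, 30), nrow = 2, byrow = TRUE)

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Chi-squared statistic: squared deviations, scaled by the expected counts
chi_sq <- sum((obs - expected)^2 / expected)
chi_sq
#> [1] 13.222
```

The cell-wise terms of this sum are exactly the "contribution to X-squared" values shown in brackets in the `xchisq.test` output.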

# $$\chi^2$$ via SBI

This statistic can easily be submitted to SBI, i.e., to a permutation test:

null1 <- do(1000) * chisq(sex ~ shuffle(day), data = tips)
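Under the hood, `shuffle()` simply permutes the day labels, which breaks any real association with sex. A self-contained base-R sketch of the same permutation scheme (with synthetic data, so the values are illustrative only):

```r
set.seed(1)
# Synthetic stand-ins for the sex and day variables
sex <- sample(c("Female", "Male"), 244, replace = TRUE)
day <- sample(c("Fri", "Sat", "Sun", "Thur"), 244, replace = TRUE)

# Observed statistic on the (synthetic) data
obs_stat <- suppressWarnings(chisq.test(table(sex, day))$statistic)

# Null distribution: recompute the statistic on shuffled labels
null_dist <- replicate(1000, {
  shuffled <- sample(day)
  suppressWarnings(chisq.test(table(sex, shuffled))$statistic)
})

# Permutation p-value: share of shuffles at least as extreme as observed
p_value <- mean(null_dist >= obs_stat)
```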

Let’s check the distribution of the $$\chi^2$$ statistic:

gf_histogram(~ X.squared, data = null1)

Now let’s add a marker for our empirical value:

x$statistic
#> X-squared
#>    13.222

# Plot it
gf_histogram(~ X.squared, data = null1) %>%
  gf_vline(xintercept = ~ x$statistic)

As can easily be seen, the empirical value is an infrequent event in the null distribution.

More precisely:

p_value <- prop(~ X.squared >= x$statistic, data = null1)
p_value
#> prop_TRUE
#>         0

Incidentally, this simulated p-value approximately matches the value we got from the theoretical distribution.

We might want to add the critical value to this graph:

crit_value <- qchisq(.05, df = 3, lower.tail = FALSE)
crit_value
#> [1] 7.814728

`crit_value` gives us the critical value for a $$\chi^2$$-distribution with $$df=3$$ degrees of freedom.

gf_histogram(~ X.squared, data = null1) %>%
  gf_vline(xintercept = ~ x$statistic) %>%
  gf_vline(xintercept = ~ crit_value, linetype = "dashed")
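The dashed line supports a simple decision rule; a base-R sketch, with the observed statistic hard-coded from the test above:

```r
chi_obs <- 13.222                                     # X-squared from xchisq.test above
crit_value <- qchisq(.05, df = 3, lower.tail = FALSE) # critical value at alpha = .05

# Reject the null of no association iff the statistic exceeds the critical value
chi_obs > crit_value
#> [1] TRUE
```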
