Chi-squared test using simulation-based inference

Load packages

library(tidyverse)

Simulation-based inference

Simulation-based inference (SBI) is an elegant way of subsuming a wide array of statistical inference methods under one umbrella. In addition, it is simple, which helps learners get to grips with it.

This post does not aim at explaining simulation-based inference in depth; that is done elsewhere. In a nutshell, the central idea is: compute the test statistic on the observed data, then repeatedly shuffle one variable to destroy any association, recompute the statistic each time, and check how extreme the observed statistic is relative to this simulated null distribution.
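To make this concrete, here is a minimal, self-contained sketch of a permutation test in base R (the variables and the mean-difference statistic are purely illustrative, not part of this post's analysis):

set.seed(42)
x <- rep(c("A", "B"), each = 50)           # a grouping variable
y <- c(rnorm(50, 0), rnorm(50, 0.5))       # an outcome that differs between groups
obs_stat <- diff(tapply(y, x, mean))       # statistic on the observed data
# shuffle x to destroy the association, recompute the statistic 1000 times
null_dist <- replicate(1000, diff(tapply(y, sample(x), mean)))
mean(abs(null_dist) >= abs(obs_stat))      # how extreme is the observed statistic?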

Testing the association of two categorical variables

One application of statistical tests – simulation-based or classical – is testing the association of two categorical variables.

Let’s try that using SBI.

Setup

library(mosaic)
data(tips, package = "reshape2")

Research question: Is there an association of sex and day?

Let’s test whether there’s an association of sex and day.

First, the sample descriptives:

tally( sex ~ day, data = tips, format = "proportion")
#>         day
#> sex            Fri       Sat       Sun      Thur
#>   Female 0.4736842 0.3218391 0.2368421 0.5161290
#>   Male   0.5263158 0.6781609 0.7631579 0.4838710

Interesting: On Saturday and Sunday, mostly men pay. On Thursday and Friday, the bill is paid about equally often by women and men. At least in our data. That brings us to the point: what about the population?

The \(\chi^2\) statistic

The \(\chi^2\) statistic gives us a measure of the deviation of the observed frequencies from the frequencies expected under independence.
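More precisely, it sums the squared deviations of the observed from the expected cell frequencies, scaled by the expected frequencies:

\[
\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i},
\]

where \(O_i\) is the observed and \(E_i\) the expected frequency of cell \(i\), the expectation being computed under the assumption of no association.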

chi_value <- chisq(sex ~ day, data = tips)

Alas, this value is not so intuitive. We can compute the p-value from the theoretical \(\chi^2\) distribution with \(df = (2-1)(4-1) = 3\) degrees of freedom (rows minus one times columns minus one) like this:

pchisq(chi_value, df = 3, lower.tail = FALSE)
#>   X.squared 
#> 0.004180302
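As a cross-check (a hand-rolled sketch, not part of the original analysis), the statistic can also be computed directly from the contingency table:

observed <- table(tips$sex, tips$day)     # observed cell counts
# expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
sum((observed - expected)^2 / expected)   # should reproduce chi_value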

Let’s compare it to the result of the \(\chi^2\) test:

x <- xchisq.test(sex ~ day, data = tips)
#> 
#>  Pearson's Chi-squared test
#> 
#> data:  x
#> X-squared = 13.222, df = 3, p-value = 0.00418
#> 
#>     9       28       18       32   
#> ( 6.77)  (31.02)  (27.10)  (22.11) 
#>  [0.73]   [0.29]   [3.05]   [4.43] 
#> < 0.86>  <-0.54>  <-1.75>  < 2.10> 
#>        
#>    10       59       58       30   
#> (12.23)  (55.98)  (48.90)  (39.89) 
#>  [0.41]   [0.16]   [1.69]   [2.45] 
#> <-0.64>  < 0.40>  < 1.30>  <-1.57> 
#>        
#> key:
#>  observed
#>  (expected)
#>  [contribution to X-squared]
#>  <Pearson residual>

This is (up to rounding) the same p-value.

\(\chi^2\) via SBI

This statistic can easily be submitted to SBI, i.e., the permutation test (you may want to call set.seed() first to make the simulation reproducible):

null1 <- do(1000) * chisq(sex ~ shuffle(day), data = tips)

Let’s check the distribution of the \(\chi^2\) statistic:

gf_histogram(~ X.squared, data = null1)
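As a sanity check (an addition to the original post, assuming ggformula's gf_dhistogram() and gf_dist(), which come along with mosaic), the simulated null distribution should resemble the theoretical \(\chi^2\) distribution with 3 degrees of freedom:

gf_dhistogram(~ X.squared, data = null1) %>%   # density-scaled histogram
  gf_dist("chisq", df = 3)                     # overlay the theoretical density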

Now let’s add a marker for our empirical value. First, extract it:

x$statistic
#> X-squared 
#>    13.222

Then plot it:

gf_histogram(~ X.squared, data = null1) %>%  gf_vline(xintercept = ~ x$statistic)

As can easily be seen, the empirical value is a rare event under the null distribution.

More precisely:

p_value <- prop(~ X.squared >= x$statistic, data = null1)
p_value
#> prop_TRUE 
#>         0

Incidentally, this simulated p-value is close to the value we got from the theoretical distribution: with only 1000 permutations, a simulated p-value of 0 simply means that no shuffled statistic reached the empirical value.
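A common refinement (an addition, not from the original post) counts the observed statistic as one of the permutations, so that the estimated p-value can never be exactly zero:

# add-one corrected permutation p-value; with 1000 permutations the
# smallest reportable value is 1/1001, not 0
(sum(null1$X.squared >= x$statistic) + 1) / (length(null1$X.squared) + 1)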

We might want to add the critical value to this graph:

crit_value <- qchisq(0.05, df = 3, lower.tail = FALSE)
crit_value
#> [1] 7.814728

crit_value gives us the critical value at the 5% level for a \(\chi^2\) distribution with \(df=3\) degrees of freedom. Since our empirical value (13.22) exceeds it, the result is significant at the 5% level.

gf_histogram(~ X.squared, data = null1) %>% 
  gf_vline(xintercept = ~ x$statistic) %>% 
  gf_vline(xintercept = ~ crit_value, linetype = "dashed")
