Simulation based inference for non-parametric tests, and a trick

Load packages

library(tidyverse)
library(mosaic)

Data

data("tips", package = "reshape2")

Non-parametric tests and simulation based inference

Simulation-based inference (SBI) is an old tool that has seen a surge in research interest in recent years probably due to the large amount of computational powers at the hands of researchers.

SBI is less prone to violations of assumptions, particularly with distributional assumptions. This is because inference is not based on the idea that some variable follows a – for example – normal distribution.

As a consequence there is much less need to seek shelter at nonparametric tests in case distributional assumptions are violated.

In this post, I’ll show how the Kurskal-Wallis test can be applied using SBI and how to translate it to a linear model.

Research question

Is there a difference in mean tip across different days?

Kruskall test the classic way

k_test <- kruskal.test(tip ~ day, data = tips)
k_test
#> 
#>  Kruskal-Wallis rank sum test
#> 
#> data:  tip by day
#> Kruskal-Wallis chi-squared = 8.5656, df = 3, p-value = 0.03566

Let’s have a closer look to the resulting object:

str(k_test)
#> List of 5
#>  $ statistic: Named num 8.57
#>   ..- attr(*, "names")= chr "Kruskal-Wallis chi-squared"
#>  $ parameter: Named int 3
#>   ..- attr(*, "names")= chr "df"
#>  $ p.value  : num 0.0357
#>  $ method   : chr "Kruskal-Wallis rank sum test"
#>  $ data.name: chr "tip by day"
#>  - attr(*, "class")= chr "htest"

We can extract the statistic in this way:

k_test$statistic
#> Kruskal-Wallis chi-squared 
#>                   8.565588
pluck(k_test, "statistic")
#> Kruskal-Wallis chi-squared 
#>                   8.565588

Or, alternatively:

kruskal.test(tip ~ day, data = tips)$statistic
#> Kruskal-Wallis chi-squared 
#>                   8.565588

The statistic is defined as shown here:

\[ H = (N-1)\frac{\sum_{i=1}^g n_i(\bar{r}_{i\cdot} - \bar{r})^2}{\sum_{i=1}^g\sum_{j=1}^{n_i}(r_{ij} - \bar{r})^2} \]

And this statistic is asymptotically \(\chi^2\) distributed.

Kruskall test via SBI

null1 <- mosaic::do(1000) * kruskal.test(tip ~ shuffle(day), 
                                 data = tips)

Let’s check the results:

null1 %>% 
  head() %>% 
  knitr::kable()
Kruskal.Wallis.chi.squared df p.value method alternative data .row .index
3.1688452 3 0.3663180 Kruskal-Wallis rank sum test NA tip by shuffle(day) 1 1
0.0094803 3 0.9997552 Kruskal-Wallis rank sum test NA tip by shuffle(day) 1 2
11.7318235 3 0.0083606 Kruskal-Wallis rank sum test NA tip by shuffle(day) 1 3
4.3075100 3 0.2301161 Kruskal-Wallis rank sum test NA tip by shuffle(day) 1 4
1.5435937 3 0.6722461 Kruskal-Wallis rank sum test NA tip by shuffle(day) 1 5
1.4203264 3 0.7007771 Kruskal-Wallis rank sum test NA tip by shuffle(day) 1 6

Plot it

Let’s add the empirical value as a vertical line.

gf_histogram( ~ Kruskal.Wallis.chi.squared, data = null1) %>% 
  gf_vline(xintercept = ~ k_test$statistic)

Let’s color the top-5 percent red, to highlight the critical value.

null1 <- null1 %>% 
  mutate(is_top5percent = percent_rank(Kruskal.Wallis.chi.squared)  > .95)
gf_histogram( ~ Kruskal.Wallis.chi.squared, 
              data = null1,
              fill = ~ is_top5percent) %>% 
  gf_vline(xintercept = ~ k_test$statistic)

Nice, but …

As outlined above, there may be no need to bypass the test of the mean. Using SBI one does not have to assume normal distribution of the variable tested if one would like to compute p-values.

Linear model for comparing means

lm1 <- lm(tip ~ day, data = tips)
lm1_sum <- summary(lm1)

lm1_sum
#> 
#> Call:
#> lm(formula = tip ~ day, data = tips)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -2.2451 -0.9931 -0.2347  0.5382  7.0069 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  2.73474    0.31612   8.651 7.46e-16 ***
#> daySat       0.25837    0.34893   0.740    0.460    
#> daySun       0.52039    0.35343   1.472    0.142    
#> dayThur      0.03671    0.36132   0.102    0.919    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 1.378 on 240 degrees of freedom
#> Multiple R-squared:  0.02048,    Adjusted R-squared:  0.008232 
#> F-statistic: 1.672 on 3 and 240 DF,  p-value: 0.1736

As can be seen, the \(F\) value is not significant thereby questioning the difference in means hypothesis. Put differently: The models sees no strong evidence for rejecting the null.

The F value, along with its 3 df values, can be plucked out like this:

emp_F_value <- pluck(lm1_sum, "fstatistic")
emp_F_value
#>      value      numdf      dendf 
#>   1.672355   3.000000 240.000000

Note that this is a vector of three 3 elements.

Linear model using SBI

Now the trick.

boot1 <- do(1000) * lm(tip ~ day, data = resample(tips))

confint(boot1)
#>        name        lower     upper level     method   estimate
#> 1 Intercept  2.270318182 3.2140346  0.95 percentile 2.73473684
#> 2    daySat -0.304165972 0.8272075  0.95 percentile 0.25836661
#> 3    daySun  0.030470180 1.0744534  0.95 percentile 0.52039474
#> 4   dayThur -0.480811080 0.6252964  0.95 percentile 0.03671477
#> 5     sigma  1.157816369 1.5729610  0.95 percentile 1.37793112
#> 6 r.squared  0.005899732 0.0758314  0.95 percentile 0.02047639
#> 7         F  0.474779633 6.5642919  0.95 percentile 1.67235520

Pluck the r-squared from the lm object

Here’s a way to pluck the R squared value from the lm object:

lm(tip ~ day, data = tips) %>% 
  summary() %>% 
  pluck("r.squared")
#> [1] 0.02047639

Plot the results

gf_histogram(~ r.squared, data = boot1) %>% 
  gf_vline( xintercept = ~ lm(tip ~ day, data = tips) %>% 
              summary() %>% 
              pluck("r.squared"))

Interestingly, the R squeared distribution is not normally distributed. In sum, the R squared are really small.

Test the null using lm and SBI

null2 <- do(1000) * lm(tip ~ shuffle(day), data = tips)

Let’s check the \(F\) statistic. The critical value of the \(F\) statistic is, given 3 and 240 degrees of freedom (as shown by the lm output):

crit_f <- qf(p = 0.95, 
             df1 = 3,
             df2 = 240)
crit_f
#> [1] 2.642213

Plot it:

null2 <- null2 %>% 
  mutate(top_5percent = percent_rank(F) > 0.95)

gf_histogram(~ F, data = null2,
             fill =~top_5percent) %>% 
  gf_vline(xintercept = ~emp_F_value[1]) %>% 
  gf_label(label = ~ paste0("F: ",
                            emp_F_value[1] %>% round(2)),  
           x = ~ emp_F_value[1], 
           y = 0,
           show.legend = FALSE) 

As can be seen, the F value is not significant.

Some summary statistics to this distribution:

favstats(~ F, data = null2)
#>          min        Q1    median       Q3      max      mean        sd    n
#>  0.006571185 0.3887514 0.7581193 1.293386 5.553417 0.9314424 0.7433274 1000
#>  missing
#>        0

Caution

Note that the ASA advises that the p-value should not be the (sole) basis of decision in any data analysis.

Bayesian statistics is a neat alternative.

ANOVA

As can be seen, the Kruskal Wallis and the F tests yield different results. This is probably due to the fact that we tested median differences (Kruskall) and later on mean differences (F test).

As a check, here’s a standard ANOVA:

aov1 <- aov(tip ~ day, data = tips)
summary(aov1)
#>              Df Sum Sq Mean Sq F value Pr(>F)
#> day           3    9.5   3.175   1.672  0.174
#> Residuals   240  455.7   1.899

Replicating the F value decision from the SBI analysis above.

Plotting medians vs means

As a further check, let’s plot median and mean differences for the sake of comparison.

gf_point(tip ~ day, 
         data = tips,
         stat = "summary", 
         fun = mean,
         shape = 3,
         size = 5,
         color = "blue") +
  stat_summary(geom = "point", 
               fun = median,
               color = "red",
               size = 5,
               shape = 5) +
  labs(caption = "Crosses: Mean, diamonds: Median")

As ca be seen, the median values are more spread out, hence it appears sensible that the test would get significant whereas the more clustered mean values do not show a significant difference in our test.