Modellannahmen grafisch überprüfen

1 Hintergrund

Diese Übung bezieht sich auf ISRS, Kap. 6.3.

2 Pakete

library(tidyverse)  # data wrangling
#library(broom)  # tidy Regressionsoutput
library(skimr)  # EDA
library(moderndive)  # Komfort

3 Daten laden

Auf dieser Seite sind die Daten zu finden.

d <- read_csv("https://www.openintro.org/data/csv/mariokart.csv")

(“d” wie Daten.)

Wir werfen einen Blick in die Daten:

glimpse(d)
#> Rows: 143
#> Columns: 12
#> $ id          <dbl> 150377422259, 260483376854, 320432342985, 280405224677, 1…
#> $ duration    <dbl> 3, 7, 3, 3, 1, 3, 1, 1, 3, 7, 1, 1, 1, 1, 7, 7, 3, 3, 1, …
#> $ n_bids      <dbl> 20, 13, 16, 18, 20, 19, 13, 15, 29, 8, 15, 15, 13, 16, 6,…
#> $ cond        <chr> "new", "used", "new", "new", "new", "new", "used", "new",…
#> $ start_pr    <dbl> 0.99, 0.99, 0.99, 0.99, 0.01, 0.99, 0.01, 1.00, 0.99, 19.…
#> $ ship_pr     <dbl> 4.00, 3.99, 3.50, 0.00, 0.00, 4.00, 0.00, 2.99, 4.00, 4.0…
#> $ total_pr    <dbl> 51.55, 37.04, 45.50, 44.00, 71.00, 45.00, 37.02, 53.99, 4…
#> $ ship_sp     <chr> "standard", "firstClass", "firstClass", "standard", "medi…
#> $ seller_rate <dbl> 1580, 365, 998, 7, 820, 270144, 7284, 4858, 27, 201, 4858…
#> $ stock_photo <chr> "yes", "yes", "no", "yes", "yes", "yes", "yes", "yes", "y…
#> $ wheels      <dbl> 1, 1, 1, 1, 2, 0, 0, 2, 1, 1, 2, 2, 2, 2, 1, 0, 1, 1, 2, …
#> $ title       <chr> "~~ Wii MARIO KART &amp; WHEEL ~ NINTENDO Wii ~ BRAND NEW…

Oder lieber so:

skim(d)
Table 3.1: Data summary
Name d
Number of rows 143
Number of columns 12
_______________________
Column type frequency:
character 4
numeric 8
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
cond 0 1.00 3 4 0 2 0
ship_sp 0 1.00 5 10 0 8 0
stock_photo 0 1.00 2 3 0 2 0
title 1 0.99 13 59 0 80 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1 2.235290e+11 8.809543e+10 1.104392e+11 1.403506e+11 2.204911e+11 2.953551e+11 4.000775e+11 ▇▃▅▅▃
duration 0 1 3.770000e+00 2.590000e+00 1.000000e+00 1.000000e+00 3.000000e+00 7.000000e+00 1.000000e+01 ▇▅▂▆▁
n_bids 0 1 1.354000e+01 5.880000e+00 1.000000e+00 1.000000e+01 1.400000e+01 1.700000e+01 2.900000e+01 ▂▅▇▃▁
start_pr 0 1 8.780000e+00 1.507000e+01 1.000000e-02 9.900000e-01 1.000000e+00 1.000000e+01 6.995000e+01 ▇▁▁▁▁
ship_pr 0 1 3.140000e+00 3.210000e+00 0.000000e+00 0.000000e+00 3.000000e+00 4.000000e+00 2.551000e+01 ▇▁▁▁▁
total_pr 0 1 4.988000e+01 2.569000e+01 2.898000e+01 4.117000e+01 4.650000e+01 5.399000e+01 3.265100e+02 ▇▁▁▁▁
seller_rate 0 1 1.589842e+04 5.184032e+04 0.000000e+00 1.090000e+02 8.200000e+02 4.858000e+03 2.701440e+05 ▇▁▁▁▁
wheels 0 1 1.150000e+00 8.500000e-01 0.000000e+00 0.000000e+00 1.000000e+00 2.000000e+00 4.000000e+00 ▆▇▇▁▁

4 Fehlende Werte

Fehlende Werte können Probleme bereiten. Entfernen wir einfach alle fehlenden Werte, es sind ja nicht so viele.

d_nona <- d %>%   # nona wie "no NA", keine fehlenden Werte
  drop_na()

5 Modell 1

Betrachten wir dieses Modell:

lm1 <- lm(total_pr ~ wheels, data = d_nona)
get_regression_summaries(lm1)
#> # A tibble: 1 x 9
#>   r_squared adj_r_squared   mse  rmse sigma statistic p_value    df  nobs
#>       <dbl>         <dbl> <dbl> <dbl> <dbl>     <dbl>   <dbl> <dbl> <dbl>
#> 1      0.11         0.103  587.  24.2  24.4      17.3       0     1   142
get_regression_table(lm1)
#> # A tibble: 2 x 7
#>   term      estimate std_error statistic p_value lower_ci upper_ci
#>   <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
#> 1 intercept     38.4      3.44     11.1        0    31.6      45.2
#> 2 wheels        10.1      2.43      4.15       0     5.28     14.9

6 Überprüfen der Annahmen

Die vorhergesagten Werte und die Residuen kann man sich so ausgeben lassen:

get_regression_points(lm1)
#> # A tibble: 142 x 5
#>       ID total_pr wheels total_pr_hat residual
#>    <int>    <dbl>  <dbl>        <dbl>    <dbl>
#>  1     1     51.6      1         48.5     3.09
#>  2     2     37.0      1         48.5   -11.4 
#>  3     3     45.5      1         48.5    -2.96
#>  4     4     44        1         48.5    -4.46
#>  5     5     71        2         58.5    12.5 
#>  6     6     45        0         38.4     6.62
#>  7     7     37.0      0         38.4    -1.36
#>  8     8     54.0      2         58.5    -4.56
#>  9     9     47        1         48.5    -1.46
#> 10    10     50        1         48.5     1.54
#> # … with 132 more rows

6.1 Linearität

get_regression_points(lm1) %>% 
  ggplot(aes(x = wheels, y = residual)) +
  geom_point()

Hier böte es sich an, zunächst auf Ausreißer hin zu kontrollieren.

6.2 Varianzgleichheit der Residuen

get_regression_points(lm1) %>% 
  ggplot(aes(x = total_pr_hat, y = abs(residual))) +
  geom_point()

6.3 Normalverteilung der Residuen

get_regression_points(lm1) %>% 
  ggplot(aes(x = residual)) +
  geom_density()

7 Reproducibility

#> ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 4.0.2 (2020-06-22)
#>  os       macOS Catalina 10.15.7      
#>  system   x86_64, darwin17.0          
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  ctype    en_US.UTF-8                 
#>  tz       Europe/Berlin               
#>  date     2020-12-10                  
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
#>  package     * version     date       lib source                             
#>  assertthat    0.2.1       2019-03-21 [1] CRAN (R 4.0.0)                     
#>  backports     1.2.0       2020-11-02 [1] CRAN (R 4.0.2)                     
#>  blogdown      0.21        2020-10-11 [1] CRAN (R 4.0.2)                     
#>  bookdown      0.21        2020-10-13 [1] CRAN (R 4.0.2)                     
#>  broom         0.7.2       2020-10-20 [1] CRAN (R 4.0.2)                     
#>  callr         3.5.1       2020-10-13 [1] CRAN (R 4.0.2)                     
#>  cellranger    1.1.0       2016-07-27 [1] CRAN (R 4.0.0)                     
#>  cli           2.2.0       2020-11-20 [1] CRAN (R 4.0.2)                     
#>  codetools     0.2-16      2018-12-24 [2] CRAN (R 4.0.2)                     
#>  colorspace    2.0-0       2020-11-11 [1] CRAN (R 4.0.2)                     
#>  crayon        1.3.4       2017-09-16 [1] CRAN (R 4.0.0)                     
#>  DBI           1.1.0       2019-12-15 [1] CRAN (R 4.0.0)                     
#>  dbplyr        2.0.0       2020-11-03 [1] CRAN (R 4.0.2)                     
#>  desc          1.2.0       2018-05-01 [1] CRAN (R 4.0.0)                     
#>  devtools      2.3.2       2020-09-18 [1] CRAN (R 4.0.2)                     
#>  digest        0.6.27      2020-10-24 [1] CRAN (R 4.0.2)                     
#>  dplyr       * 1.0.2       2020-08-18 [1] CRAN (R 4.0.2)                     
#>  ellipsis      0.3.1       2020-05-15 [1] CRAN (R 4.0.0)                     
#>  evaluate      0.14        2019-05-28 [1] CRAN (R 4.0.0)                     
#>  fansi         0.4.1       2020-01-08 [1] CRAN (R 4.0.0)                     
#>  forcats     * 0.5.0       2020-03-01 [1] CRAN (R 4.0.0)                     
#>  fs            1.5.0       2020-07-31 [1] CRAN (R 4.0.2)                     
#>  generics      0.1.0       2020-10-31 [1] CRAN (R 4.0.2)                     
#>  ggplot2     * 3.3.2       2020-06-19 [1] CRAN (R 4.0.0)                     
#>  glue          1.4.2       2020-08-27 [1] CRAN (R 4.0.2)                     
#>  gtable        0.3.0       2019-03-25 [1] CRAN (R 4.0.0)                     
#>  haven         2.3.1       2020-06-01 [1] CRAN (R 4.0.0)                     
#>  hms           0.5.3       2020-01-08 [1] CRAN (R 4.0.0)                     
#>  htmltools     0.5.0       2020-06-16 [1] CRAN (R 4.0.0)                     
#>  httr          1.4.2       2020-07-20 [1] CRAN (R 4.0.2)                     
#>  jsonlite      1.7.1       2020-09-07 [1] CRAN (R 4.0.2)                     
#>  knitr         1.30        2020-09-22 [1] CRAN (R 4.0.2)                     
#>  lifecycle     0.2.0       2020-03-06 [1] CRAN (R 4.0.0)                     
#>  lubridate     1.7.9.2     2020-11-13 [1] CRAN (R 4.0.2)                     
#>  magrittr      2.0.1       2020-11-17 [1] CRAN (R 4.0.2)                     
#>  memoise       1.1.0       2017-04-21 [1] CRAN (R 4.0.0)                     
#>  modelr        0.1.8       2020-05-19 [1] CRAN (R 4.0.0)                     
#>  munsell       0.5.0       2018-06-12 [1] CRAN (R 4.0.0)                     
#>  pillar        1.4.7       2020-11-20 [1] CRAN (R 4.0.2)                     
#>  pkgbuild      1.1.0       2020-07-13 [1] CRAN (R 4.0.2)                     
#>  pkgconfig     2.0.3       2019-09-22 [1] CRAN (R 4.0.0)                     
#>  pkgload       1.1.0       2020-05-29 [1] CRAN (R 4.0.0)                     
#>  prettyunits   1.1.1       2020-01-24 [1] CRAN (R 4.0.0)                     
#>  processx      3.4.5       2020-11-30 [1] CRAN (R 4.0.2)                     
#>  ps            1.4.0       2020-10-07 [1] CRAN (R 4.0.2)                     
#>  purrr       * 0.3.4       2020-04-17 [1] CRAN (R 4.0.0)                     
#>  R6            2.5.0       2020-10-28 [1] CRAN (R 4.0.2)                     
#>  Rcpp          1.0.5       2020-07-06 [1] CRAN (R 4.0.2)                     
#>  readr       * 1.4.0       2020-10-05 [1] CRAN (R 4.0.2)                     
#>  readxl        1.3.1       2019-03-13 [1] CRAN (R 4.0.0)                     
#>  remotes       2.2.0       2020-07-21 [1] CRAN (R 4.0.2)                     
#>  reprex        0.3.0       2019-05-16 [1] CRAN (R 4.0.0)                     
#>  rlang         0.4.9       2020-11-26 [1] CRAN (R 4.0.2)                     
#>  rmarkdown     2.5         2020-10-21 [1] CRAN (R 4.0.2)                     
#>  rprojroot     2.0.2       2020-11-15 [1] CRAN (R 4.0.2)                     
#>  rstudioapi    0.13.0-9000 2020-12-09 [1] Github (rstudio/rstudioapi@4baeb39)
#>  rvest         0.3.6       2020-07-25 [1] CRAN (R 4.0.2)                     
#>  scales        1.1.1       2020-05-11 [1] CRAN (R 4.0.0)                     
#>  sessioninfo   1.1.1       2018-11-05 [1] CRAN (R 4.0.0)                     
#>  stringi       1.5.3       2020-09-09 [1] CRAN (R 4.0.2)                     
#>  stringr     * 1.4.0       2019-02-10 [1] CRAN (R 4.0.0)                     
#>  testthat      3.0.0       2020-10-31 [1] CRAN (R 4.0.2)                     
#>  tibble      * 3.0.4       2020-10-12 [1] CRAN (R 4.0.2)                     
#>  tidyr       * 1.1.2       2020-08-27 [1] CRAN (R 4.0.2)                     
#>  tidyselect    1.1.0       2020-05-11 [1] CRAN (R 4.0.0)                     
#>  tidyverse   * 1.3.0       2019-11-21 [1] CRAN (R 4.0.0)                     
#>  usethis       1.6.3       2020-09-17 [1] CRAN (R 4.0.2)                     
#>  vctrs         0.3.5       2020-11-17 [1] CRAN (R 4.0.2)                     
#>  withr         2.3.0       2020-09-22 [1] CRAN (R 4.0.2)                     
#>  xfun          0.19        2020-10-30 [1] CRAN (R 4.0.2)                     
#>  xml2          1.3.2       2020-04-23 [1] CRAN (R 4.0.0)                     
#>  yaml          2.2.1       2020-02-01 [1] CRAN (R 4.0.0)                     
#> 
#> [1] /Users/sebastiansaueruser/Rlibs
#> [2] /Library/Frameworks/R.framework/Versions/4.0/Resources/library