Grading a prediction contest

1 Motivation

Being a teacher (in some part of my life), I conducted a prediction contest. Students had to predict a bunch of values as precisely as possible. That’s the sort of stuff data scientist do (or are said to do). As far as I am concerned, I was looking at a convenient way of grading the prediction data. Here’s an attempt.

I’m sorry that post is not fully reproducible. The reason is a privacy rights of my students and that I do not want to fully undisclosure the data set I used, because I might use it for upcoming student cohorts. Teachers: Feel free to contact me if you like to know more about the data set.

2 Setup

library(tidyverse)
library(testthat)
library(data.table)
library(glue)
library(here)
library(skimr)

Use ?data.table etc. for help on the packages.

3 Helper functions

3.1 Function to parse data

One trouble is that the CSV files I expected can be in different formats: standard CSV, CSV2 (semicolon delimiter, comma decimal). That’s why we need to come up with a more general data parsing function. Unfortunately, and to my suprise, I found no existing funtion that was able to deal with that matter (no, even fread did not).

try_readcsv <- function(file, verbose = FALSE) {
  
  # import csv file:
  x <- data.table::fread(file, header = TRUE)
  
  # if more than 2 columns, only select first and last one:
  if (ncol(x) > 2) {
    x <- x %>% 
      select(1, last_col())
  }
  
  # stop if only one column:
  if (ncol(x) == 1) stop("Only 1 column! There should be (at least) two: ID and predictions.")
  
  # set names:
  names(x) <- c("id", "pred")
  
  # replace commas with dots to deal with German locale:
  x <-
    x %>% 
    mutate(across(where(is.character), 
                  .fns = ~ str_replace_all(., 
                                           pattern = ",",
                                           replacement = ".")))
  
  # check how many columns where found in each CSV file:
  if (verbose == TRUE) {print("Ncol: "); print(ncol(x))}
  return(x)
  
} 

3.2 Function to compute \(R^2\)

r2 <- function(predicted, observed) {
  
  rss <- sum((predicted - observed) ^ 2)  ## residual sum of squares
  tss <- sum((observed - mean(observed)) ^ 2)  ## total sum of squares
  rsq <- 1 - rss/tss
  
  return(rsq)
  
}

3.3 Function to compute \(MSE\)

Mean Squared Error

mse <- function(predicted, observed) {
  
  mse <- mean((predicted - observed) ^ 2)  ## mean residual sum of squares
 
  return(mse)
  
}

3.4 Function to compute generalized error function

gen_error <- function(predicted, observed, degree = 1) {
  
  generr <- mean(abs(predicted - observed) ^ degree)  ## mean residual sum of absolute errors to the `degree` power

  return(generr)
  
}

4 Import solution (true) data (ie., solution)

Define solution file name, and check whether this file name exists:

solution_filename <- paste0("/Users/sebastiansaueruser/Google Drive/Lehre/Lehre_AKTUELL/2020-WiSe/WisMeth/Prognose-Wettbewerb/Prognose-Material/Material-nur-fuer-Lehrende/test_df_teacher.csv")

stopifnot(file.exists(solution_filename))

Import the solution data:

test_df_teacher <- read_csv(solution_filename)

test_df_teacher <-
  test_df_teacher %>% 
  mutate(id = row_number()) %>% 
  select(id, pay)

5 Parse the data

Get the list of existing files.

Here’s the project path; in your case it will be different.

proj_path <- "/Users/sebastiansaueruser/Google Drive/Lehre/Lehre_AKTUELL/2020-WiSe/WisMeth/Prognose-Wettbewerb/Prognose-Material"
subm_path <- paste0(proj_path, "/Submissions/")

submissions <- list.files(path = subm_path,
                          pattern = "\\.csv$",
                          recursive = TRUE,
                          full.names = TRUE)

Here is the list of CSV files:

submissions %>% 
  head()
#> [1] "/Users/sebastiansaueruser/Google Drive/Lehre/Lehre_AKTUELL/2020-WiSe/WisMeth/Prognose-Wettbewerb/Prognose-Material/Submissions//64d4faba5afe920efbf0001812299853.csv"
#> [2] "/Users/sebastiansaueruser/Google Drive/Lehre/Lehre_AKTUELL/2020-WiSe/WisMeth/Prognose-Wettbewerb/Prognose-Material/Submissions//9c31e89af1b4b1bc67c639708f5b5fa3.csv"

No worries. That are anonymized data files.

The length of this vector should match the number of students (or student teams) we expect:

length(submissions)
#> [1] 2

6 Build master data frame

We build a list data frame for a tidyverse style data manipulation.

6.1 List df where each submission is one row

subm_df <-
  # create column file submission names:
  tibble(filepath = submissions) %>% 
  
  # cut the path away, leave only the file name:
  mutate(filename = str_extract(filepath, "/[^/]*.csv$")) %>% 
  
  # create list column with submission data:
  mutate(subm_data = purrr::map(.x = filepath,
                                .f = ~ try_readcsv(.x))) %>% 
  
  # unnest the columns of the list column:
  unnest_wider(subm_data)

See:

subm_df
#> # A tibble: 2 x 4
#>   filepath                                filename              id       pred   
#>   <chr>                                   <chr>                 <list>   <list> 
#> 1 /Users/sebastiansaueruser/Google Drive… /64d4faba5afe920efbf… <int [3… <chr […
#> 2 /Users/sebastiansaueruser/Google Drive… /9c31e89af1b4b1bc67c… <int [3… <chr […

Check if all values of pred are of type character:

map_chr(subm_df$pred, class)
#> [1] "character" "character"

Which is the case. This is an artifact of data import; some CSV files had a German decimal delimiter (dot) whereas others used the standard (comma). So, a robust strategy is to parse those strange numbers all as characters. Then replace the delimiting commas with dots. Then transform to numeric. See the import function above for details.

6.2 Change character to numeric

We need to convert pred to numeric:

subm_df2 <-
  subm_df %>% 
  # render character into numeric (still a list column, hence we need the `map`):
  mutate(pred_num = map(pred, as.numeric))

Check if all values still are of type character:

subm_df2$pred_num %>% 
  map_chr(class)  %>% 
  unique() 
#> [1] "numeric"

Nope. Numeric, as it should be.

More programmatically:

subm_df2$pred_num %>% 
  map_chr(class)  %>% 
  unique() %>% 
  length() %>% 
  expect_equal(1)

6.3 Add observed (true) values

subm_df3 <-
  subm_df2 %>% 
  # add y-value (to be predicted), again a list column:
  mutate(observed = list(test_df_teacher$pay)) 

See:

head(subm_df3)
#> # A tibble: 2 x 6
#>   filepath                    filename         id      pred    pred_num observed
#>   <chr>                       <chr>            <list>  <list>  <list>   <list>  
#> 1 /Users/sebastiansaueruser/… /64d4faba5afe92… <int [… <chr [… <dbl [3… <dbl [3…
#> 2 /Users/sebastiansaueruser/… /9c31e89af1b4b1… <int [… <chr [… <dbl [3… <dbl [3…

6.4 Check lengths of submissions

subm_df3$id %>% 
  map_dbl(length) %>% 
  unique() 
#> [1] 300
subm_df3$id %>% 
  map_dbl(length) %>% 
  unique() %>% 
  length() %>% 
  expect_equal(1)

Each submission should consist of 300 entries in this example.

7 Compute accuracy (\(R^2\) etc.)

subm_df4 <-
  subm_df3 %>% 
  mutate(r2 = map2(.x = pred_num, 
                     .y = observed, 
                     .f = ~ r2(.x, .y)),
         mse = map2(.x = pred_num, 
                     .y = observed, 
                     .f = ~ mse(.x, .y)),
         mae = map2(.x = pred_num, 
                     .y = observed, 
                     .f = ~ gen_error(.x, .y))) %>% 
  unnest(r2)

See:

subm_df4 %>% 
  glimpse()
#> Rows: 2
#> Columns: 9
#> $ filepath <chr> "/Users/sebastiansaueruser/Google Drive/Lehre/Lehre_AKTUELL/…
#> $ filename <chr> "/64d4faba5afe920efbf0001812299853.csv", "/9c31e89af1b4b1bc6…
#> $ id       <list> [<1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17…
#> $ pred     <list> [<"102622.898614099", "110032.651526331", "97047.9425798611…
#> $ pred_num <list> [<102622.90, 110032.65, 97047.94, 60826.89, 168182.88, 1562…
#> $ observed <list> [<91336, 112286, 77536, 69872, 156896, 151509, 78267, 12535…
#> $ r2       <dbl> 0.8287524, 0.8280914
#> $ mse      <list> [107043353, 107456477]
#> $ mae      <list> [8392.041, 8453.271]

7.1 Check distribution of \(R^2\)

subm_df4 %>%  
  filter(between(r2,0, 1)) %>% 
  summarise(r2_mean = mean(r2),
            r2_sd = sd(r2),
            r2_med = median(r2),
            r2_iqr = IQR(r2),
            r2_min = min(r2),
            r2_max = max(r2))
#> # A tibble: 1 x 6
#>   r2_mean    r2_sd r2_med   r2_iqr r2_min r2_max
#>     <dbl>    <dbl>  <dbl>    <dbl>  <dbl>  <dbl>
#> 1   0.828 0.000467  0.828 0.000330  0.828  0.829
subm_df4 %>% 
  filter(between(r2, 0, 1)) %>% 
  ggplot(aes(x = r2)) +
  geom_histogram()

7.2 Number of distinct values for \(R^2\)

length(unique(subm_df4$r2))
#> [1] 2

8 Grading

8.1 Note-4.0 model

note4_r2 <- 0.01

8.2 Note-1.0 model

note1_r2 <- 0.83

8.3 Grades in sequence

There are 10 grades (from 4.0 to 1.0), plus the 5 (fail), plus the “supra-best” (only to define the maximum threshold), giving 12 grades in total.

grades_df <-
  tibble(thresholds = c(0, 
                        seq(from = 0.01, to = 0.83, 
                            length.out = 9), .85, 1),
         thresholds2 = c(0, 
                         seq(from = .51, to = 1, 
                             length.out = 11)),
         grades = c(5, 4, 3.7, 3.3, 3.0, 2.7, 2.3, 2, 1.7, 1.3, 1, .7)) %>% 
  mutate(id = nrow(.):1)
grades_df
#> # A tibble: 12 x 4
#>    thresholds thresholds2 grades    id
#>         <dbl>       <dbl>  <dbl> <int>
#>  1      0           0        5      12
#>  2      0.01        0.51     4      11
#>  3      0.112       0.559    3.7    10
#>  4      0.215       0.608    3.3     9
#>  5      0.318       0.657    3       8
#>  6      0.42        0.706    2.7     7
#>  7      0.522       0.755    2.3     6
#>  8      0.625       0.804    2       5
#>  9      0.727       0.853    1.7     4
#> 10      0.83        0.902    1.3     3
#> 11      0.85        0.951    1       2
#> 12      1           1        0.7     1

8.4 Map grades to individual \(R^2\) values of the students

subm_df5 <-
  subm_df4 %>% 
  mutate(grade_id = map_dbl(r2,
                         .f = ~ {`>`(grades_df$thresholds, .x) %>% sum()} )) %>% 
  left_join(grades_df %>% select(-c(thresholds, thresholds2)), 
            by = c("grade_id" = "id"))

8.5 Grade distribution

subm_df5 %>% 
  select(grades) %>% 
  skim()
Table 8.1: Data summary
Name Piped data
Number of rows 2
Number of columns 1
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
grades 0 1 1.3 0 1.3 1.3 1.3 1.3 1.3 ▁▁▇▁▁

9 Reproducibility

#> ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 4.0.2 (2020-06-22)
#>  os       macOS  10.16                
#>  system   x86_64, darwin17.0          
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  ctype    en_US.UTF-8                 
#>  tz       Europe/Berlin               
#>  date     2021-01-20                  
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
#>  package     * version     date       lib source                             
#>  assertthat    0.2.1       2019-03-21 [1] CRAN (R 4.0.0)                     
#>  backports     1.2.0       2020-11-02 [1] CRAN (R 4.0.2)                     
#>  blogdown      0.21        2020-10-11 [1] CRAN (R 4.0.2)                     
#>  bookdown      0.21        2020-10-13 [1] CRAN (R 4.0.2)                     
#>  broom         0.7.2       2020-10-20 [1] CRAN (R 4.0.2)                     
#>  callr         3.5.1       2020-10-13 [1] CRAN (R 4.0.2)                     
#>  cellranger    1.1.0       2016-07-27 [1] CRAN (R 4.0.0)                     
#>  cli           2.2.0       2020-11-20 [1] CRAN (R 4.0.2)                     
#>  codetools     0.2-16      2018-12-24 [2] CRAN (R 4.0.2)                     
#>  colorspace    2.0-0       2020-11-11 [1] CRAN (R 4.0.2)                     
#>  crayon        1.3.4       2017-09-16 [1] CRAN (R 4.0.0)                     
#>  DBI           1.1.0       2019-12-15 [1] CRAN (R 4.0.0)                     
#>  dbplyr        2.0.0       2020-11-03 [1] CRAN (R 4.0.2)                     
#>  desc          1.2.0       2018-05-01 [1] CRAN (R 4.0.0)                     
#>  devtools      2.3.2       2020-09-18 [1] CRAN (R 4.0.2)                     
#>  digest        0.6.27      2020-10-24 [1] CRAN (R 4.0.2)                     
#>  dplyr       * 1.0.2       2020-08-18 [1] CRAN (R 4.0.2)                     
#>  ellipsis      0.3.1       2020-05-15 [1] CRAN (R 4.0.0)                     
#>  evaluate      0.14        2019-05-28 [1] CRAN (R 4.0.0)                     
#>  fansi         0.4.1       2020-01-08 [1] CRAN (R 4.0.0)                     
#>  forcats     * 0.5.0       2020-03-01 [1] CRAN (R 4.0.0)                     
#>  fs            1.5.0       2020-07-31 [1] CRAN (R 4.0.2)                     
#>  generics      0.1.0       2020-10-31 [1] CRAN (R 4.0.2)                     
#>  ggplot2     * 3.3.2       2020-06-19 [1] CRAN (R 4.0.0)                     
#>  glue          1.4.2       2020-08-27 [1] CRAN (R 4.0.2)                     
#>  gtable        0.3.0       2019-03-25 [1] CRAN (R 4.0.0)                     
#>  haven         2.3.1       2020-06-01 [1] CRAN (R 4.0.0)                     
#>  hms           0.5.3       2020-01-08 [1] CRAN (R 4.0.0)                     
#>  htmltools     0.5.0       2020-06-16 [1] CRAN (R 4.0.0)                     
#>  httr          1.4.2       2020-07-20 [1] CRAN (R 4.0.2)                     
#>  jsonlite      1.7.1       2020-09-07 [1] CRAN (R 4.0.2)                     
#>  knitr         1.30        2020-09-22 [1] CRAN (R 4.0.2)                     
#>  lifecycle     0.2.0       2020-03-06 [1] CRAN (R 4.0.0)                     
#>  lubridate     1.7.9.2     2020-11-13 [1] CRAN (R 4.0.2)                     
#>  magrittr      2.0.1       2020-11-17 [1] CRAN (R 4.0.2)                     
#>  memoise       1.1.0       2017-04-21 [1] CRAN (R 4.0.0)                     
#>  modelr        0.1.8       2020-05-19 [1] CRAN (R 4.0.0)                     
#>  munsell       0.5.0       2018-06-12 [1] CRAN (R 4.0.0)                     
#>  pillar        1.4.7       2020-11-20 [1] CRAN (R 4.0.2)                     
#>  pkgbuild      1.1.0       2020-07-13 [1] CRAN (R 4.0.2)                     
#>  pkgconfig     2.0.3       2019-09-22 [1] CRAN (R 4.0.0)                     
#>  pkgload       1.1.0       2020-05-29 [1] CRAN (R 4.0.0)                     
#>  prettyunits   1.1.1       2020-01-24 [1] CRAN (R 4.0.0)                     
#>  processx      3.4.5       2020-11-30 [1] CRAN (R 4.0.2)                     
#>  ps            1.4.0       2020-10-07 [1] CRAN (R 4.0.2)                     
#>  purrr       * 0.3.4       2020-04-17 [1] CRAN (R 4.0.0)                     
#>  R6            2.5.0       2020-10-28 [1] CRAN (R 4.0.2)                     
#>  Rcpp          1.0.5       2020-07-06 [1] CRAN (R 4.0.2)                     
#>  readr       * 1.4.0       2020-10-05 [1] CRAN (R 4.0.2)                     
#>  readxl        1.3.1       2019-03-13 [1] CRAN (R 4.0.0)                     
#>  remotes       2.2.0       2020-07-21 [1] CRAN (R 4.0.2)                     
#>  reprex        0.3.0       2019-05-16 [1] CRAN (R 4.0.0)                     
#>  rlang         0.4.9       2020-11-26 [1] CRAN (R 4.0.2)                     
#>  rmarkdown     2.5         2020-10-21 [1] CRAN (R 4.0.2)                     
#>  rprojroot     2.0.2       2020-11-15 [1] CRAN (R 4.0.2)                     
#>  rstudioapi    0.13.0-9000 2020-12-09 [1] Github (rstudio/rstudioapi@4baeb39)
#>  rvest         0.3.6       2020-07-25 [1] CRAN (R 4.0.2)                     
#>  scales        1.1.1       2020-05-11 [1] CRAN (R 4.0.0)                     
#>  sessioninfo   1.1.1       2018-11-05 [1] CRAN (R 4.0.0)                     
#>  stringi       1.5.3       2020-09-09 [1] CRAN (R 4.0.2)                     
#>  stringr     * 1.4.0       2019-02-10 [1] CRAN (R 4.0.0)                     
#>  testthat      3.0.0       2020-10-31 [1] CRAN (R 4.0.2)                     
#>  tibble      * 3.0.4       2020-10-12 [1] CRAN (R 4.0.2)                     
#>  tidyr       * 1.1.2       2020-08-27 [1] CRAN (R 4.0.2)                     
#>  tidyselect    1.1.0       2020-05-11 [1] CRAN (R 4.0.0)                     
#>  tidyverse   * 1.3.0       2019-11-21 [1] CRAN (R 4.0.0)                     
#>  usethis       1.6.3       2020-09-17 [1] CRAN (R 4.0.2)                     
#>  vctrs         0.3.5       2020-11-17 [1] CRAN (R 4.0.2)                     
#>  withr         2.3.0       2020-09-22 [1] CRAN (R 4.0.2)                     
#>  xfun          0.19        2020-10-30 [1] CRAN (R 4.0.2)                     
#>  xml2          1.3.2       2020-04-23 [1] CRAN (R 4.0.0)                     
#>  yaml          2.2.1       2020-02-01 [1] CRAN (R 4.0.0)                     
#> 
#> [1] /Users/sebastiansaueruser/Rlibs
#> [2] /Library/Frameworks/R.framework/Versions/4.0/Resources/library