Load packages
library(tidyverse)
Background
Computing inter-rater reliability is a well-known, albeit maybe not very frequent task in data analysis. If there’s only one criteria and two raters, the proceeding is straigt forward; Cohen’s Kappa is the most widely used coefficient for that purpose. It is more challenging to compare multiple raters on one criterion; Fleiss’ Kappa is one way to get a coefficient. If there are multiple criteria, one way is to compute the mean of multiple Fleiss’ coefficients.
However, a different way, and the way presented in this post, consists of checking of all raters agree on one given item (and repeating that for all items). If rater A assigns two tags/criteria (tag1, tag2) to item A, then the other raters may not assign different tags (eg tag3, tag4) to that item, if a match should be scored. Note that this proceeding allows for different numbers of tags/criteria for the items (eg., item 1 with only 1 tag, but item 2 with 3 tags etc.). However, our grading should give some points, if, say, rater1 assigns tag1 and tag2, but raters 2 and 3 only assign tag1.
Sample data
Say, we would like to rate some tweets of policitians as to they express some form of populism, according to some widely accepted definition. Further assume, a number of raters (eg., 3) has hand-rated a number of tweets (eg., 250). Here are the results.
Get the data from here.
d <- read_csv("https://data-se.netlify.com/download/d_rating.csv")
d
#> # A tibble: 459 x 4
#> text tweet selection rater_name
#> <chr> <dbl> <dbl> <chr>
#> 1 Anbei 1 4 rater1
#> 2 WIE B 2 32 rater1
#> 3 WIE B 2 5 rater1
#> 4 ++ Nu 3 4 rater1
#> 5 Wahlk 4 4 rater1
#> 6 In fü 5 4 rater1
#> 7 Ansch 6 4 rater1
#> 8 Unser 7 32 rater1
#> 9 Unser 7 5 rater1
#> 10 Was w 8 32 rater1
#> # … with 449 more rows
In this example, there were 5 criteria (or “tags” for that matter) that might be applied to a given tweet. The raters were free to assign several tags to a tweet. For example, rater1
assigned tag 32
and tag 5
to tweet 7 (32
means “criteria 3, subcriteria 2”), and so on. Hence, the column selection
denotes the tags assigend to a tweet. Note that each row indicates one assignment (not one tweet).
Get into wide format
Now we spread the data so that each rater has his/her column:
d_spread <- d_l %>%
spread(key = rater_name, value = data)
d_spread
#> # A tibble: 200 x 4
#> tweet rater1 rater2 rater3
#> <dbl> <list> <list> <list>
#> 1 1 <tibble [1 × 1]> <tibble [1 × 1]> <tibble [1 × 1]>
#> 2 2 <tibble [2 × 1]> <tibble [1 × 1]> <tibble [1 × 1]>
#> 3 3 <tibble [1 × 1]> <tibble [1 × 1]> <tibble [1 × 1]>
#> 4 4 <tibble [1 × 1]> <tibble [1 × 1]> <tibble [1 × 1]>
#> 5 5 <tibble [1 × 1]> <tibble [1 × 1]> <tibble [1 × 1]>
#> 6 6 <tibble [1 × 1]> <tibble [1 × 1]> <tibble [1 × 1]>
#> 7 7 <tibble [2 × 1]> <tibble [1 × 1]> <tibble [1 × 1]>
#> 8 8 <tibble [1 × 1]> <tibble [1 × 1]> <tibble [1 × 1]>
#> 9 9 <tibble [1 × 1]> <tibble [1 × 1]> <tibble [1 × 1]>
#> 10 10 <tibble [2 × 1]> <tibble [2 × 1]> <tibble [2 × 1]>
#> # … with 190 more rows
Now we have the data in a form where we can (quite) easily check something like “is the selection in line 1 the same for all rater-columns?”, and so on for all lines. R preferes comparing rows instead of comparing lines, at least when it comes to comparing different variables. So, the data are now in a form that suits further comparison (ie., checking for equality).
Tidying 1: Vectors instead of tibbles
Let’s simplify the tibbles to vectors, for easier access:
d_tidy1 <- d_spread %>%
transmute_at(c("rater1", "rater2", "rater3"), list(~map(., "selection")))
Note that the column names must be quoted here.
So, what did we do? We said to R that the columns “at the names” rater1, rater2, rater3 should be transformed (“transmuted”). And how should they be tranformed? By pulling the column “selection” out of each selected column (ie., rater1, rater2, rater3). Why do we need map
? Because we have many data frames to which we want to apply a function (in each row of d_tidy1
there is one data frame in each of the selected columns!). map
repeats (or maps) a function to each dataframe. The “pulling” of the vector is performed as standard operation of map
if we do not further specify what map
should do.
d_tidy1
#> # A tibble: 200 x 3
#> rater1 rater2 rater3
#> <list> <list> <list>
#> 1 <dbl [1]> <dbl [1]> <dbl [1]>
#> 2 <dbl [2]> <dbl [1]> <dbl [1]>
#> 3 <dbl [1]> <dbl [1]> <dbl [1]>
#> 4 <dbl [1]> <dbl [1]> <dbl [1]>
#> 5 <dbl [1]> <dbl [1]> <dbl [1]>
#> 6 <dbl [1]> <dbl [1]> <dbl [1]>
#> 7 <dbl [2]> <dbl [1]> <dbl [1]>
#> 8 <dbl [1]> <dbl [1]> <dbl [1]>
#> 9 <dbl [1]> <dbl [1]> <dbl [1]>
#> 10 <dbl [2]> <dbl [2]> <dbl [2]>
#> # … with 190 more rows
Some checks:
d_tidy1$rater1[7]
#> [[1]]
#> [1] 32 5
d_tidy1$rater2[1]
#> [[1]]
#> [1] 4
d_tidy1$rater3[2]
#> [[1]]
#> [1] 23
Tidying 2: Replace NULL values by NA
If map
does not find anything to pull out, it returns NULL
, as is the case for rater2, see for example:
d_tidy1 %>%
slice(100:110)
#> # A tibble: 11 x 3
#> rater1 rater2 rater3
#> <list> <list> <list>
#> 1 <dbl [1]> <NULL> <dbl [1]>
#> 2 <dbl [1]> <NULL> <dbl [1]>
#> 3 <dbl [2]> <NULL> <dbl [1]>
#> 4 <dbl [1]> <NULL> <dbl [1]>
#> 5 <dbl [1]> <NULL> <dbl [1]>
#> 6 <dbl [1]> <NULL> <dbl [1]>
#> 7 <dbl [1]> <NULL> <dbl [1]>
#> 8 <dbl [1]> <NULL> <dbl [1]>
#> 9 <dbl [1]> <NULL> <dbl [1]>
#> 10 <dbl [1]> <NULL> <dbl [1]>
#> 11 <dbl [1]> <NULL> <dbl [1]>
That’s why we might want to replace ==
cannot handle NULL, it seems (not quite sure why).
d_tidy2 <-
d_tidy1 %>%
mutate(rater2 = map_if(rater2, is_empty, ~ NA))
d_tidy2 %>%
slice(100:110)
#> # A tibble: 11 x 3
#> rater1 rater2 rater3
#> <list> <list> <list>
#> 1 <dbl [1]> <lgl [1]> <dbl [1]>
#> 2 <dbl [1]> <lgl [1]> <dbl [1]>
#> 3 <dbl [2]> <lgl [1]> <dbl [1]>
#> 4 <dbl [1]> <lgl [1]> <dbl [1]>
#> 5 <dbl [1]> <lgl [1]> <dbl [1]>
#> 6 <dbl [1]> <lgl [1]> <dbl [1]>
#> 7 <dbl [1]> <lgl [1]> <dbl [1]>
#> 8 <dbl [1]> <lgl [1]> <dbl [1]>
#> 9 <dbl [1]> <lgl [1]> <dbl [1]>
#> 10 <dbl [1]> <lgl [1]> <dbl [1]>
#> 11 <dbl [1]> <lgl [1]> <dbl [1]>
Check:
d_tidy2$rater2[110]
#> [[1]]
#> [1] NA
OK.
Check for equality for one tweet
Now we are in the position to check if the values of one row are equal. If they are equal it means that the raters agree on their vote. This may well mean that they all asssigned multiple tags to this tweet under consideration. If so, it means that all of them assigned multiple tags. In each case, they must assign the same number of tags for (perfect) equality to hold (but we would want to give credit if say, they all assigned tag1
but disagreed about other tags).
Also notice that not only a subset of the raters must have equal ratings, but all pairs of raters must have the same tag/value for equality to hold. In practice, we just lookup (for a given tweet) what the first rater assigned. Then we check if the next rater has the same assignment (and so on for rater3 etc.).
Say, rater1
assigend tag1
, and the other raters assigned tag2
. Hence, we have a mismatch. We could check for that mismatch like this:
rater1 <- ("tag1")
assignments <- c("tag1", "tag2", "tag2")
comps <- map2(rater1, assignments, `==`)
comps
#> [[1]]
#> [1] TRUE
#>
#> [[2]]
#> [1] FALSE
#>
#> [[3]]
#> [1] FALSE
map2
recyclyes the value(s) of rater1
so that it has the same length as the second parameter, assignments
, which decodes the ratings of all three raters. Now map2
takes one element from rater1
and one from assignments
and checks for equality (==
), and repeats that for each element.
Finally, we can reduce that to one value:
comps %>%
reduce(`&`)
#> [1] FALSE
We reduce the vector by the logical “AND”. Hence, we get a mismatch for this tweet (FALSE
).
For example, let’s apply this idea to item 7:
i <- 7
map2(list(d_tidy2$rater1[[i]],
d_tidy2$rater2[[i]],
d_tidy2$rater3[[i]]),
list(d_tidy2$rater1[[i]]), `==`)
#> [[1]]
#> [1] TRUE TRUE
#>
#> [[2]]
#> [1] FALSE FALSE
#>
#> [[3]]
#> [1] FALSE FALSE
Note that rater 1 assigned two tags so that we have to check two times.
Now we reduce these three vectors to one:
map2(list(d_tidy2$rater1[[i]],
d_tidy2$rater2[[i]],
d_tidy2$rater3[[i]]),
list(d_tidy2$rater1[[i]]), `==`) %>%
reduce(`&`)
#> [1] FALSE FALSE
The three raters did not agree, neither for the first tag nor for the second.
Let’s double check that:
d %>%
filter(tweet == 7)
#> # A tibble: 4 x 4
#> text tweet selection rater_name
#> <chr> <dbl> <dbl> <chr>
#> 1 Unser 7 32 rater1
#> 2 Unser 7 5 rater1
#> 3 Unser 7 23 rater3
#> 4 Unser 7 23 rater2
Correct, they did not agree about the first tag (“32”), because the other raters assigned “23”. And they did not agree about “5” (as it was assigned only by rater 1).
Repeat that for all tweets
There are 200 tweets, hence we repeat the above idea accordingly:
accuracy <- NA
for (i in 1:200){ # for each tweet
accuracy[[i]] <- map2(list(d_tidy2$rater1[[i]],
d_tidy2$rater2[[i]],
d_tidy2$rater3[[i]]),
list(d_tidy2$rater1[[i]]), `==`) %>%
purrr::discard(function(x) all(is.na(x))) %>%
reduce(`&`) %>% # summarise over rater for this tweet
mean(na.rm = T) # summarise over all responses for this tweet
}
accuracy
#> [1] 1.0000000 0.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.0000000
#> [8] 0.0000000 1.0000000 0.0000000 0.0000000 1.0000000 0.0000000 0.0000000
#> [15] 0.0000000 0.0000000 1.0000000 1.0000000 0.0000000 0.0000000 0.3333333
#> [22] 0.0000000 0.0000000 1.0000000 0.5000000 0.0000000 0.5000000 0.5000000
#> [29] 1.0000000 0.0000000 0.0000000 1.0000000 0.0000000 1.0000000 1.0000000
#> [36] 0.0000000 0.0000000 0.0000000 1.0000000 0.0000000 0.0000000 1.0000000
#> [43] 1.0000000 0.0000000 1.0000000 0.0000000 1.0000000 1.0000000 1.0000000
#> [50] 1.0000000 1.0000000 0.0000000 1.0000000 1.0000000 1.0000000 0.5000000
#> [57] 1.0000000 0.0000000 1.0000000 1.0000000 1.0000000 0.0000000 0.3333333
#> [64] 1.0000000 0.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
#> [71] 0.0000000 0.0000000 1.0000000 0.0000000 1.0000000 0.5000000 0.5000000
#> [78] 1.0000000 1.0000000 1.0000000 0.0000000 1.0000000 0.0000000 0.0000000
#> [85] 0.0000000 0.0000000 0.0000000 0.0000000 1.0000000 1.0000000 1.0000000
#> [92] 1.0000000 1.0000000 1.0000000 0.0000000 1.0000000 0.0000000 0.0000000
#> [99] 1.0000000 1.0000000 1.0000000 0.5000000 1.0000000 1.0000000 1.0000000
#> [106] 1.0000000 1.0000000 0.0000000 0.0000000 1.0000000 1.0000000 1.0000000
#> [113] 1.0000000 1.0000000 1.0000000 1.0000000 0.0000000 0.0000000 0.0000000
#> [120] 1.0000000 1.0000000 1.0000000 0.0000000 1.0000000 1.0000000 0.0000000
#> [127] 0.0000000 1.0000000 0.0000000 1.0000000 1.0000000 1.0000000 0.0000000
#> [134] 0.0000000 0.0000000 0.0000000 1.0000000 1.0000000 1.0000000 1.0000000
#> [141] 0.0000000 0.5000000 0.0000000 0.0000000 1.0000000 1.0000000 1.0000000
#> [148] 0.5000000 1.0000000 0.0000000 0.0000000 1.0000000 1.0000000 1.0000000
#> [155] 1.0000000 0.0000000 1.0000000 1.0000000 1.0000000 0.0000000 0.0000000
#> [162] 1.0000000 1.0000000 1.0000000 0.0000000 1.0000000 1.0000000 1.0000000
#> [169] 0.0000000 0.0000000 1.0000000 1.0000000 0.0000000 0.0000000 1.0000000
#> [176] 0.0000000 1.0000000 1.0000000 1.0000000 0.0000000 0.0000000 0.0000000
#> [183] 1.0000000 1.0000000 0.0000000 1.0000000 0.5000000 1.0000000 1.0000000
#> [190] 0.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.0000000 0.5000000
#> [197] 1.0000000 0.0000000 0.0000000 1.0000000
What we got back is the accuracy per tweet.
We discard
ed NAs, because &
will handle NA
as FALSE, thereby lowering the interreater reliability.
Overall accuracy
To get the overall accuracy, just average over the accuracies per tweet:
(mean_accuracy <- mean(accuracy, na.rm = T))
#> [1] 0.5858333
Not too high. Sigh.
Debrief
The beauty of this approach is that it generalizes an existing concept - Cohen’s Kappa - in two regards. First, the restriction of \(n=2\) raters is relaxed. Second, the restriction to \(k=1\) rating criteria is relaxed. In both regards, an arbitrary number can be processed.
Granted, the approach above does not weight in expected agreement as it is considered in eg Cohens Kappa. It simply computes the number (or the proportion rather) of hits (cases in agreement). And, despite its generalization “beauty”, in some regards the approach may be cumbersome. To be honest, it took me a while to find the R idioms to get this going.