Load packages
library(tidyverse)
Basic testing for equality
Testing for equality is a very basic operation in computer (and data) science. R has a straightforward function to test for equality:
identical(1, 1)
#> [1] TRUE
identical("A", "A")
#> [1] TRUE
identical(1, 2)
#> [1] FALSE
identical(1, NA)
#> [1] FALSE
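As a small aside on the NA example above, the following sketch contrasts identical with the == operator. The variable names are only for illustration. The point is that identical always returns a single TRUE or FALSE and also compares types, whereas == propagates missing values.

```r
# identical() always returns exactly one TRUE or FALSE
na_identical <- identical(1, NA)   # FALSE

# `==` propagates the missing value instead
na_equal <- 1 == NA                # NA

# identical() also compares types: integer 1L versus double 1
type_identical <- identical(1L, 1) # FALSE
type_equal <- 1L == 1              # TRUE
```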
However, this gets more complicated if we want to compare more than two elements. One way to achieve this is to count the distinct items. If there is only one distinct value, all elements of the vector are the same. And, luckily, we can at least say that if two vectors contain different numbers of distinct values, they are not the same.
x <- c(1, 1, 1)
y <- c(1, 1, 99)
z <- c(1, 1, 1)
length(unique(x))
#> [1] 1
length(unique(y))
#> [1] 2
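The counting idea can be wrapped in a small helper. This is only a sketch: all_same is a hypothetical name, not a function from base R or the tidyverse. Note that under this definition an empty vector yields FALSE, and NA counts as a value of its own.

```r
# Hypothetical helper: TRUE if a vector holds exactly one distinct value
all_same <- function(v) {
  length(unique(v)) == 1
}

x <- c(1, 1, 1)
y <- c(1, 1, 99)

same_x <- all_same(x)  # TRUE
same_y <- all_same(y)  # FALSE
```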
This approach can be extended to comparing two vectors: if their numbers of distinct values differ, they cannot be identical.
length(unique(x)) == length(unique(y))
#> [1] FALSE
The vectors x and y are not identical.
But be aware that equal counts do not imply equal vectors:
x2 <- c(99, 99, 99)
length(unique(x)) == length(unique(x2))
#> [1] TRUE
The result is TRUE, but the vectors are different:
identical(x, x2) # different vectors
#> [1] FALSE
identical(x, z) # identical vectors
#> [1] TRUE
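If several vectors should be checked against one reference, identical can be applied to each of them in turn. The following is a minimal sketch in base R; the names candidates and same_as_x are made up for this example.

```r
x  <- c(1, 1, 1)
x2 <- c(99, 99, 99)
z  <- c(1, 1, 1)

# Collect the vectors to be checked in a named list
candidates <- list(x2 = x2, z = z)

# Compare each candidate with the reference vector x
same_as_x <- vapply(candidates, function(v) identical(v, x), logical(1))

# TRUE only if every candidate is identical to x
all_identical <- all(same_as_x)
```

Here same_as_x is FALSE for x2 and TRUE for z, so all_identical is FALSE.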
Testing columnwise in a data frame
Let’s take this method to a data frame.
d <- tribble(
~ colA, ~colB, ~colC,
1, 1, 1,
1, 1, 99,
1, NA, 1
)
First, we compute the number of different values per column:
d %>% summarise_all(list( ~ length(unique(.))))
#> # A tibble: 1 x 3
#> colA colB colC
#> <int> <int> <int>
#> 1 1 2 2
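In dplyr 1.0.0 and later, the same column summary can be written with across, which supersedes summarise_all. A sketch, assuming a recent dplyr is installed; n_distinct counts NA as a value by default, just like length(unique(.)).

```r
library(dplyr)

d <- tibble::tribble(
  ~colA, ~colB, ~colC,
  1,     1,     1,
  1,     1,     99,
  1,     NA,    1
)

# Number of distinct values per column, NA included
n_per_col <- d %>%
  summarise(across(everything(), n_distinct))
```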
Then we can check whether these counts are all identical:
d2 <- d %>%
summarise_all(list( ~ length(unique(.))))
Often, when working with rowwise operations, it is helpful to transpose the data frame, as columnwise operations are easier. gather (from tidyr) reshapes the data frame to long format with one key-value pair per row, which works like a transposition here.
d2 %>%
gather() %>%
summarise(length(unique(value)))
#> # A tibble: 1 x 1
#> `length(unique(value))`
#> <int>
#> 1 2
The number of distinct counts is greater than one (namely 2), which tells us that the columns do not all contain the same number of distinct values, so they cannot all be identical.
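In current tidyr, gather is superseded by pivot_longer. The reshaping step could be sketched as follows, assuming tidyr 1.0.0 or later; the column names key, value and n_values are chosen here for illustration.

```r
library(dplyr)
library(tidyr)

d <- tibble::tribble(
  ~colA, ~colB, ~colC,
  1,     1,     1,
  1,     1,     99,
  1,     NA,    1
)

# Distinct values per column (NA counts as a value)
d2 <- d %>%
  summarise(across(everything(), n_distinct))

# Reshape to long format, then count how many different counts occur
n_counts <- d2 %>%
  pivot_longer(everything(), names_to = "key", values_to = "value") %>%
  summarise(n_values = n_distinct(value))
```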
Testing rowwise in a data frame
Let’s say we want to know whether row 1 contains only identical elements, whether row 2 contains only identical elements, and so on. Put briefly, we test for equality rowwise in this data frame.
d3 <- d %>%
mutate(concatenated = pmap(., c)) %>%
mutate(length_unique = map_int(concatenated, ~ length(unique(.))))
d3
#> # A tibble: 3 x 5
#> colA colB colC concatenated length_unique
#> <dbl> <dbl> <dbl> <list> <int>
#> 1 1 1 1 <dbl [3]> 1
#> 2 1 1 99 <dbl [3]> 2
#> 3 1 NA 1 <dbl [3]> 2
Let’s deconstruct that to get a grip on it. The first mutate call simply constructs rowwise vectors of all columns. That is, 1, 1, 1 for the first row, 1, 1, 99 for the second row, and so on.
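As an alternative to building the list column by hand, dplyr 1.0.0 and later offer rowwise together with c_across, which collects the values of one row into a vector. This is a sketch of that variant, not the approach used in this post; d_rowwise is a name made up for the example.

```r
library(dplyr)

d <- tibble::tribble(
  ~colA, ~colB, ~colC,
  1,     1,     1,
  1,     1,     99,
  1,     NA,    1
)

d_rowwise <- d %>%
  rowwise() %>%
  # c_across() gathers colA, colB and colC of the current row into one vector
  mutate(length_unique = n_distinct(c_across(colA:colC))) %>%
  ungroup()
```

The column length_unique holds 1, 2 and 2, since NA counts as a distinct value by default.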
To access the list column concatenated, use list indexing:
d3[["concatenated"]]
#> [[1]]
#> colA colB colC
#> 1 1 1
#>
#> [[2]]
#> colA colB colC
#> 1 1 99
#>
#> [[3]]
#> colA colB colC
#> 1 NA 1
Each element is a simple (named) vector:
d3[["concatenated"]] %>% str()
#> List of 3
#> $ : Named num [1:3] 1 1 1
#> ..- attr(*, "names")= chr [1:3] "colA" "colB" "colC"
#> $ : Named num [1:3] 1 1 99
#> ..- attr(*, "names")= chr [1:3] "colA" "colB" "colC"
#> $ : Named num [1:3] 1 NA 1
#> ..- attr(*, "names")= chr [1:3] "colA" "colB" "colC"
Ignoring NAs rowwise
There is a function called discard (from purrr) which discards elements of a list or vector that match a condition such as is.na:
d$colB %>% discard(is.na)
#> [1] 1 1
Let’s apply that to our list column concatenated:
d4 <- d3 %>%
mutate(c_nona = map(concatenated, ~ discard(., is.na)))
d4
#> # A tibble: 3 x 6
#> colA colB colC concatenated length_unique c_nona
#> <dbl> <dbl> <dbl> <list> <int> <list>
#> 1 1 1 1 <dbl [3]> 1 <dbl [3]>
#> 2 1 1 99 <dbl [3]> 2 <dbl [3]>
#> 3 1 NA 1 <dbl [3]> 2 <dbl [2]>
In the third row of d there was one missing value, so the vector in row 3 should be shorter:
d4$c_nona[[3]]
#> colA colC
#> 1 1
Now we count the distinct values again, this time on the vectors without NAs:
d5 <- d4 %>%
mutate(lu2 = map_int(c_nona, ~ length(unique(.))))
d5
#> # A tibble: 3 x 7
#> colA colB colC concatenated length_unique c_nona lu2
#> <dbl> <dbl> <dbl> <list> <int> <list> <int>
#> 1 1 1 1 <dbl [3]> 1 <dbl [3]> 1
#> 2 1 1 99 <dbl [3]> 2 <dbl [3]> 2
#> 3 1 NA 1 <dbl [3]> 2 <dbl [2]> 1
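The intermediate list columns are useful for understanding the steps, but the NA-ignoring count can also be computed in one go. A minimal sketch, where n_unique_no_na is a hypothetical helper and d_short a made-up name:

```r
library(dplyr)
library(purrr)

d <- tibble::tribble(
  ~colA, ~colB, ~colC,
  1,     1,     1,
  1,     1,     99,
  1,     NA,    1
)

# Hypothetical helper: number of distinct non-missing values in a vector
n_unique_no_na <- function(v) {
  length(unique(v[!is.na(v)]))
}

d_short <- d %>%
  # pmap_int() walks over the three columns in parallel, row by row
  mutate(lu2 = pmap_int(list(colA, colB, colC),
                        function(...) n_unique_no_na(c(...))))
```

As in d5, lu2 is 1, 2 and 1.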
Note that the usual way of working with mutate in dplyr does not work with list columns, as they are not ordinary atomic columns:
d4 %>%
mutate(lu2 = length(unique(c_nona)))
#> # A tibble: 3 x 7
#> colA colB colC concatenated length_unique c_nona lu2
#> <dbl> <dbl> <dbl> <list> <int> <list> <int>
#> 1 1 1 1 <dbl [3]> 1 <dbl [3]> 3
#> 2 1 1 99 <dbl [3]> 2 <dbl [3]> 3
#> 3 1 NA 1 <dbl [3]> 2 <dbl [2]> 3
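Each entry of a list column can itself consist of multiple elements. Here, length(unique(c_nona)) counts the distinct entries of the whole column (3) rather than the distinct values within each row. That’s why we need map and friends.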
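To see where the 3 comes from, the list column can be mimicked with a plain list in base R. This is only an illustration with made-up names; the vectors are the NA-free rows of d without their names.

```r
# Stand-in for the list column c_nona
c_nona <- list(c(1, 1, 1), c(1, 1, 99), c(1, 1))

# Applied to the whole list: counts distinct list elements
whole_list <- length(unique(c_nona))

# Applied element by element: counts distinct values within each vector
per_element <- vapply(c_nona, function(v) length(unique(v)), integer(1))
```

whole_list is 3 because the three vectors differ from each other, while per_element is 1, 2, 1, which is what map_int returns above.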
Limitations
Note that this approach only works for testing “one-equality”, i.e., whether all values are the same. If you want to test, for example, whether a target vector [1,2] is identical to the reference vector [3,4], counting the distinct items will not work. In both cases there are 2 distinct values, but the vectors are not identical. In other words, differing counts prove inequality, but equal counts do not prove equality.
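The limitation is easy to reproduce. In this sketch the names target and reference are taken from the paragraph above; the remaining names are made up.

```r
target    <- c(1, 2)
reference <- c(3, 4)

# Same number of distinct values in both vectors ...
same_count <- length(unique(target)) == length(unique(reference))

# ... but the vectors themselves are not identical
same_vector <- identical(target, reference)
```

same_count is TRUE although same_vector is FALSE, so for a real test of equality identical remains the tool of choice.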
In addition, using this method, only two vectors can be compared at a time.