Load packages

library(tidyverse)

Basic testing for equality

Testing for equality is a very basic operation in computer (and data) science. R provides a straightforward function to test for equality:

identical(1, 1)
#> [1] TRUE
identical("A", "A")
#> [1] TRUE
identical(1, 2)
#> [1] FALSE
identical(1, NA)
#> [1] FALSE
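
A short sketch of how strict identical() is, using only base R. It treats an integer and a double as different, and it does not tolerate floating-point rounding. all.equal() compares with a numeric tolerance instead. Whether exact or tolerant comparison is appropriate depends on the data at hand.

```r
# identical() is strict: type and value must match exactly.
identical(1L, 1)           # integer vs. double
#> [1] FALSE
identical(0.1 + 0.2, 0.3)  # floating-point rounding
#> [1] FALSE

# all.equal() compares with a numeric tolerance. Wrap it in isTRUE(),
# because on a mismatch it returns a description rather than FALSE.
isTRUE(all.equal(0.1 + 0.2, 0.3))
#> [1] TRUE
isTRUE(all.equal(1L, 1))
#> [1] TRUE
```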

However, this gets more complicated if we want to compare more than two elements. One way to achieve this is to count the distinct values. If there’s only one distinct value, then all elements are the same. The count also gives us a partial check for comparing two vectors: if they contain different numbers of distinct values, they cannot be the same.

x <- c(1, 1, 1)
y <- c(1, 1, 99)
z <- c(1, 1, 1)

length(unique(x))
#> [1] 1

length(unique(y))
#> [1] 2
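
The counting idea can be wrapped in a small helper. This is a minimal sketch; the name all_same is made up for illustration and is not part of any package.

```r
# Hypothetical helper: TRUE if every element of v has the same value.
all_same <- function(v) {
  length(unique(v)) == 1
}

all_same(c(1, 1, 1))
#> [1] TRUE
all_same(c(1, 1, 99))
#> [1] FALSE
all_same(c(1, NA, 1))  # NA counts as its own distinct value
#> [1] FALSE
```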

This approach extends to comparing two vectors: if their counts of distinct values differ, the vectors cannot be identical.

length(unique(x)) == length(unique(y))
#> [1] FALSE

The vectors x and y are not identical.

Be aware that equal counts do not imply identical vectors:

x2 <- c(99, 99, 99)

length(unique(x)) == length(unique(x2))
#> [1] TRUE

This comparison returns TRUE, but the vectors are different. identical() gives the correct answer:

identical(x, x2)  # different vectors
#> [1] FALSE
identical(x, z)  # identical vectors
#> [1] TRUE

Testing columnwise in a data frame

Let’s take this method to a data frame.

d <- tribble(
  ~ colA, ~colB, ~colC,
  1, 1, 1,
  1, 1, 99,
  1, NA, 1
)

First, we compute the number of different values per column:

d %>% summarise_all(list( ~ length(unique(.))))
#> # A tibble: 1 x 3
#>    colA  colB  colC
#>   <int> <int> <int>
#> 1     1     2     2
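
In recent dplyr versions, summarise_all() is superseded by across(). The following sketch computes the same per-column counts, assuming dplyr 1.0.0 or later; n_distinct() is used here in place of length(unique(.)), and the name n_per_col is chosen only for illustration.

```r
library(tidyverse)

d <- tribble(
  ~colA, ~colB, ~colC,
  1, 1, 1,
  1, 1, 99,
  1, NA, 1
)

# across() applies the same summary to every selected column.
# n_distinct() counts NA as a distinct value by default.
n_per_col <- d %>%
  summarise(across(everything(), n_distinct))

unlist(n_per_col)  # colA = 1, colB = 2, colC = 2
```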

Then we can check whether all columns have the same number of distinct values. First, we store the counts:

d2 <- d %>% 
  summarise_all(list( ~ length(unique(.))))

Often, when working with rowwise operations, it is helpful to transpose the data frame, as columnwise operations are easier. gather reshapes the data frame into long format (one key column and one value column), which serves the same purpose here.

d2 %>% 
  gather() %>% 
  summarise(length(unique(value)))
#> # A tibble: 1 x 1
#>   `length(unique(value))`
#>                     <int>
#> 1                       2

The number of distinct values is greater than one (i.e., 2), which tells us that the columns do not all have the same count of distinct values, and hence they are not all identical.
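
gather() is superseded in tidyr by pivot_longer(). The same check can be sketched as follows, assuming tidyr 1.0.0 or later. The tibble d2 is rebuilt by hand here so that the block runs on its own, and the name n_counts is chosen for illustration.

```r
library(tidyverse)

# Per-column counts of distinct values, as computed in the text
d2 <- tibble(colA = 1L, colB = 2L, colC = 2L)

# Reshape to long format, then count the distinct counts
n_counts <- d2 %>%
  pivot_longer(everything(), names_to = "key", values_to = "value") %>%
  summarise(n = n_distinct(value)) %>%
  pull(n)

n_counts
#> [1] 2
```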

Testing rowwise in a data frame

Let’s say we want to know whether row 1 contains only identical elements, whether row 2 contains only identical elements, and so on. In short, we test for equality rowwise in this data frame.

d3 <- d %>% 
  mutate(concatenated = pmap(., c)) %>% 
  mutate(length_unique = map_int(concatenated, ~ length(unique(.))))
d3
#> # A tibble: 3 x 5
#>    colA  colB  colC concatenated length_unique
#>   <dbl> <dbl> <dbl> <list>               <int>
#> 1     1     1     1 <dbl [3]>                1
#> 2     1     1    99 <dbl [3]>                2
#> 3     1    NA     1 <dbl [3]>                2

Let’s deconstruct that to get a grip on it. The first mutate call simply builds one vector per row from the values of all columns. That is, 1, 1, 1 for the first row, 1, 1, 99 for the second row, and so on.

To access the list column concatenated, use list indexing:

d3[["concatenated"]]
#> [[1]]
#> colA colB colC 
#>    1    1    1 
#> 
#> [[2]]
#> colA colB colC 
#>    1    1   99 
#> 
#> [[3]]
#> colA colB colC 
#>    1   NA    1

Each element is a simple (named) vector:

d3[["concatenated"]] %>% str()
#> List of 3
#>  $ : Named num [1:3] 1 1 1
#>   ..- attr(*, "names")= chr [1:3] "colA" "colB" "colC"
#>  $ : Named num [1:3] 1 1 99
#>   ..- attr(*, "names")= chr [1:3] "colA" "colB" "colC"
#>  $ : Named num [1:3] 1 NA 1
#>   ..- attr(*, "names")= chr [1:3] "colA" "colB" "colC"
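
As an alternative to building a list column first, the rowwise count can also be sketched with rowwise() and c_across(), assuming dplyr 1.0.0 or later. The name d_rowwise is made up for illustration.

```r
library(tidyverse)

d <- tribble(
  ~colA, ~colB, ~colC,
  1, 1, 1,
  1, 1, 99,
  1, NA, 1
)

# rowwise() makes mutate() operate on one row at a time;
# c_across() collects the selected columns of that row into a vector.
d_rowwise <- d %>%
  rowwise() %>%
  mutate(length_unique = n_distinct(c_across(colA:colC))) %>%
  ungroup()

d_rowwise$length_unique
#> [1] 1 2 2
```

This avoids the intermediate list column, at the cost of being slower on large data frames, since the computation runs once per row.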

Ignoring NAs rowwise

purrr provides a function called discard, which drops the elements of a list or vector that match a condition such as is.na:

d$colB %>% discard(is.na)
#> [1] 1 1

Let’s apply that to our list column concatenated:

d4 <- d3 %>% 
  mutate(c_nona = map(concatenated, ~ discard(., is.na)))
d4
#> # A tibble: 3 x 6
#>    colA  colB  colC concatenated length_unique c_nona   
#>   <dbl> <dbl> <dbl> <list>               <int> <list>   
#> 1     1     1     1 <dbl [3]>                1 <dbl [3]>
#> 2     1     1    99 <dbl [3]>                2 <dbl [3]>
#> 3     1    NA     1 <dbl [3]>                2 <dbl [2]>

In the third line of d there was one missing value, so the length of the vector in line 3 should be shorter:

d4$c_nona[[3]]
#> colA colC 
#>    1    1

Now we count the distinct values again, this time on the vectors without NAs:

d5 <- d4 %>% 
  mutate(lu2 = map_int(c_nona, ~ length(unique(.))))
d5
#> # A tibble: 3 x 7
#>    colA  colB  colC concatenated length_unique c_nona      lu2
#>   <dbl> <dbl> <dbl> <list>               <int> <list>    <int>
#> 1     1     1     1 <dbl [3]>                1 <dbl [3]>     1
#> 2     1     1    99 <dbl [3]>                2 <dbl [3]>     2
#> 3     1    NA     1 <dbl [3]>                2 <dbl [2]>     1
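
The discard step and the counting step can also be combined. A minimal sketch, assuming that n_distinct() with na.rm = TRUE is an acceptable substitute for discard(is.na) followed by length(unique(.)); the name lu_nona is chosen for illustration.

```r
library(tidyverse)

d <- tribble(
  ~colA, ~colB, ~colC,
  1, 1, 1,
  1, 1, 99,
  1, NA, 1
)

# pmap_int() passes the values of each row as arguments;
# c(...) collects them into one vector per row.
lu_nona <- pmap_int(d, function(...) n_distinct(c(...), na.rm = TRUE))

lu_nona
#> [1] 1 2 1
```

Note that a row consisting only of NAs would yield 0 with this variant, so a check for "exactly one distinct value" would report such a row as not uniform.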

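Note that the usual way of working with mutate in dplyr does not give the intended result with list columns. Here, length(unique(c_nona)) is applied to the whole list column at once, so it counts the distinct list elements (3) instead of the distinct values within each row: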

d4 %>% 
  mutate(lu2 = length(unique(c_nona)))
#> # A tibble: 3 x 7
#>    colA  colB  colC concatenated length_unique c_nona      lu2
#>   <dbl> <dbl> <dbl> <list>               <int> <list>    <int>
#> 1     1     1     1 <dbl [3]>                1 <dbl [3]>     3
#> 2     1     1    99 <dbl [3]>                2 <dbl [3]>     3
#> 3     1    NA     1 <dbl [3]>                2 <dbl [2]>     3

Each cell of a list column holds a whole vector rather than a single value; that’s why we need map and friends to work on each element.
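
The difference can be seen on a plain list, outside of any data frame. This is a small sketch; the list lst is made up for illustration.

```r
library(purrr)

lst <- list(c(1, 1, 1), c(1, 1, 99), c(1, 1))

# Applied to the list as a whole: counts the distinct list elements
length(unique(lst))
#> [1] 3

# Applied element by element: counts the distinct values in each vector
map_int(lst, ~ length(unique(.x)))
#> [1] 1 2 1
```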

Limitations

Note that this approach only works for testing “one-equality”, i.e., whether all values are the same. If you want to test, for example, whether a target vector [1,2] is identical to the reference vector [3,4], counting the distinct values will not work. In both cases there are 2 distinct values, but the vectors are not identical. In other words, we can detect inequality but not confirm equality.
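
The limitation can be reproduced in a few lines of base R. The vectors a and b below correspond to the [1,2] and [3,4] example; the alternatives shown (identical, element-wise ==, setequal) are options to consider, not a recommendation for every case.

```r
a <- c(1, 2)
b <- c(3, 4)

# Same number of distinct values, yet different vectors
length(unique(a)) == length(unique(b))
#> [1] TRUE

# Exact comparison of the two vectors
identical(a, b)
#> [1] FALSE

# Element-wise comparison (requires equal lengths; yields NA if NAs are present)
all(a == b)
#> [1] FALSE

# Same set of values, ignoring order and duplicates
setequal(a, c(2, 1))
#> [1] TRUE
```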

In addition, using this method, only two vectors can be compared at a time.
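
If more than two vectors need to be compared, one option is to compare each of them to the first with identical(). A minimal base R sketch; the names vecs and same_as_first are made up for illustration.

```r
x <- c(1, 1, 1)
y <- c(1, 1, 99)
z <- c(1, 1, 1)

vecs <- list(x = x, y = y, z = z)

# Compare every vector to the first one
same_as_first <- vapply(vecs, identical, logical(1), vecs[[1]])
same_as_first  # x: TRUE, y: FALSE, z: TRUE

# TRUE only if all vectors are identical
all(same_as_first)
#> [1] FALSE
```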

Testing for equality rowwise