Compute all pairwise differences in matrix

A quite frequent task in many fields of applied math is to compute pairwise differences of elements in a matrix. Actually, it need not be a difference; a product is frequent, too. In this post, we explore some (base) R ways to achieve this.

library(mosaic)
library(gdata)
library(tidyverse)

Using outer()

An elegant approach, using base R, is applying outer(). That’s useful if one has two vectors, and wants to compute the outer product:

v1 <- c(1:3)
v2 <- c(2:4)

outer(v1, v2)
##      [,1] [,2] [,3]
## [1,]    2    3    4
## [2,]    4    6    8
## [3,]    6    9   12

As per default, outer() computes the product, because that’s the definition of the outer product in linear algebra.

Notice that v1 and v2 are just simple numeric vectors (albeit named):

str(v1)
##  int [1:3] 1 2 3

However, the function is quite flexible; we can give some other function as well, say, a pairwise difference:

o1 <- outer(v1, v2, FUN = "-")
o1 
##      [,1] [,2] [,3]
## [1,]   -1   -2   -3
## [2,]    0   -1   -2
## [3,]    1    0   -1

What type of object is returned?

class(o1)
## [1] "matrix"
str(o1)
##  int [1:3, 1:3] -1 0 1 -2 -1 0 -3 -2 -1

Add names to the margins (ie., to the vectors):

names(v1) <- v1
names(v2) <- v2
o1 <- outer(v1, v2, FUN = "-")
o1
##    2  3  4
## 1 -1 -2 -3
## 2  0 -1 -2
## 3  1  0 -1
str(o1)
##  int [1:3, 1:3] -1 0 1 -2 -1 0 -3 -2 -1
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:3] "1" "2" "3"
##   ..$ : chr [1:3] "2" "3" "4"
class(o1)
## [1] "matrix"

At times, we are interested rather in:

vec <- c(v1, v2)
vec
## 1 2 3 2 3 4 
## 1 2 3 2 3 4
str(vec)
##  Named int [1:6] 1 2 3 2 3 4
##  - attr(*, "names")= chr [1:6] "1" "2" "3" "2" ...
o2 <- outer(vec, vec, FUN = "-")
str(o2)
##  int [1:6, 1:6] 0 1 2 1 2 3 -1 0 1 0 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:6] "1" "2" "3" "2" ...
##   ..$ : chr [1:6] "1" "2" "3" "2" ...
o2
##   1  2  3  2  3  4
## 1 0 -1 -2 -1 -2 -3
## 2 1  0 -1  0 -1 -2
## 3 2  1  0  1  0 -1
## 2 1  0 -1  0 -1 -2
## 3 2  1  0  1  0 -1
## 4 3  2  1  2  1  0

What we have just computed, is the difference between all pairs of elements, even between elements within the same vector. Previously, we have computed the difference between pairs of elements, where the first element is not part of the second vector (and v.v.).

If the absolute differences are needed, apply the abs() function on the whole matrix:

o1_abs <- abs(o1)
o1_abs
##   2 3 4
## 1 1 2 3
## 2 0 1 2
## 3 1 0 1
o2_abs <- abs(o2)

If only the different combinations are needed (ie., without considering ordering), extract the upper triangle:

o1_different <- upper.tri(o1) * o1_abs
o1_different
##   2 3 4
## 1 0 2 3
## 2 0 0 2
## 3 0 0 0
o2_different <- upper.tri(o2) * o2_abs

Notice that we do not want to use regular matrix multiplication, but cellwise computation only. For matrix multiplication, use %*%.

To get the mean absolute pairwise difference, take the mean of these values:

mean(o1_different)
## [1] 0.7777778

Wait; that’s probably not what we expected. Let’s tell R only to consider the values of the upper triangle when computing the mean (or whatever function, for that purpose).

o1_diff_upptri <- ifelse(o1_different == 0, NA, o1_different)
o1_diff_upptri
##    2  3  4
## 1 NA  2  3
## 2 NA NA  2
## 3 NA NA NA

More elegant, or, well, more direct, and much faster (no for-loops under the hood):

o1_diff_upptri <- o1_different
o1_diff_upptri[o1_diff_upptri == 0] <- NA
o1_diff_upptri
##    2  3  4
## 1 NA  2  3
## 2 NA NA  2
## 3 NA NA NA

That’s a point where R shines: Not only vector arithmetic are applied automatically, also matrices can be digested.

Hang on, this one is nice, too:

!o1_different
##      2     3     4
## 1 TRUE FALSE FALSE
## 2 TRUE  TRUE FALSE
## 3 TRUE  TRUE  TRUE

The ! operator transform the matrix in a logical one (0 = FALSE, else TRUE), and negates the values.

Now:

o1_variant <- o1_different
is.na(o1_variant) <- !o1_variant
o1_variant
##    2  3  4
## 1 NA  2  3
## 2 NA NA  2
## 3 NA NA NA

However, those methods fail if there are zeros in the upper triangle:

o2_variant <- o2_different
is.na(o2_variant) <- !o2_variant
o2_variant
##    1  2  3  2  3  4
## 1 NA  1  2  1  2  3
## 2 NA NA  1 NA  1  2
## 3 NA NA NA  1 NA  1
## 2 NA NA NA NA  1  2
## 3 NA NA NA NA NA  1
## 4 NA NA NA NA NA NA

We would not want the NAs in the upper triangle.

That’s why this way comes handy:

o2_diff_upptri <- o2_different
gdata::lowerTriangle(o2_diff_upptri, diag = TRUE) <- NA
o2_diff_upptri
##    1  2  3  2  3  4
## 1 NA  1  2  1  2  3
## 2 NA NA  1  0  1  2
## 3 NA NA NA  1  0  1
## 2 NA NA NA NA  1  2
## 3 NA NA NA NA NA  1
## 4 NA NA NA NA NA NA

Notice, however, that this function is not from base R.

Now, again the mean:

mean(o1_diff_upptri, na.rm = TRUE)
## [1] 2.333333
sum(o1_diff_upptri, na.rm = TRUE)
## [1] 7

And for o2:

o2_mean <- mean(o2_diff_upptri, na.rm = TRUE)
o2_mean
## [1] 1.266667

Let’s also compute the sum of the absolute differences:

o2_sum <- sum(o2_diff_upptri, na.rm = TRUE)
o2_sum
## [1] 19

What’s the number of different elements?

o2_length <- length(o2_diff_upptri[!is.na(o2_diff_upptri)])
o2_length
## [1] 15

Divide sum by n to get the mean:

o2_sum / o2_length
## [1] 1.266667

Using MAD()

Incidentally, that’s similar to what the functions SAD() and MAD() from mosaic compute. That is, the sum (SAD()) is the same:

mosaic::SAD(v1, v2)
## [1] 19
mosaic::SAD(v1, v2) == o2_sum
## [1] TRUE

But the mean is different:

mosaic::MAD(v1, v2)
## [1] 4.75
mosaic::MAD(v1, v2)  == o2_mean
## [1] FALSE

Internally, MAD does the following:

x <- c(v1, v2)
x
## 1 2 3 2 3 4 
## 1 2 3 2 3 4
M <- outer(x, x, "-")
M
##   1  2  3  2  3  4
## 1 0 -1 -2 -1 -2 -3
## 2 1  0 -1  0 -1 -2
## 3 2  1  0  1  0 -1
## 2 1  0 -1  0 -1 -2
## 3 2  1  0  1  0 -1
## 4 3  2  1  2  1  0
M2 <- upper.tri(M) * abs(M)
M2
##   1 2 3 2 3 4
## 1 0 1 2 1 2 3
## 2 0 0 1 0 1 2
## 3 0 0 0 1 0 1
## 2 0 0 0 0 1 2
## 3 0 0 0 0 0 1
## 4 0 0 0 0 0 0
M3 <- sum(M2)
M3
## [1] 19
length_elements <-  length(v1) + length(list(v2))
length_elements
## [1] 4
M4 <- M3 / (length_elements)
M4
## [1] 4.75

So, we divided by the number of the upper triangle elements. That is, by all pairwise comparisons. MAD() divides by the length of the first vector + 1. Not quite sure when we would want that. Presumably, these functions are not meant to be given two vectors, only one.

However, if we only have one vector, MAD() makes sense:

MAD(vec)
## [1] 3.166667

Here, first the sum of the differences is computed:

SAD(vec)
## [1] 19
SAD(vec) == o2_sum
## [1] TRUE

Next, this sum is divided by the number of elements (of vec):

vec_length <- length(vec)

SAD(vec) / vec_length
## [1] 3.166667
SAD(vec) / vec_length == MAD(vec)
## [1] TRUE

A glimpse to the code of those two functions:

MAD_
## function (x, ..., na.rm = getOption("na.omit", FALSE)) 
## {
##     SAD_(x, ..., na.rm = na.rm)/(length(x) + length(list(...)))
## }
## <bytecode: 0x7fa6951cdcd8>
## <environment: namespace:mosaic>
SAD_
## function (x, ..., na.rm = getOption("na.omit", FALSE)) 
## {
##     x <- c(x, unlist(list(...)))
##     x <- na.omit(x)
##     M <- outer(x, x, "-")
##     base::sum(upper.tri(M) * abs(M))
## }
## <bytecode: 0x7fa694185380>
## <environment: namespace:mosaic>

Debrief

We have explored some ways to compute the pairwise differences of two vectors, with some thoughts walking astray. We have discussed two ways of computing the “mean” by dividing by the number of comparisons or by the number of elements.