Applying a function to each row of a data frame

A typical and quite straightforward operation in R and the tidyverse is to apply a function to each column of a data frame (or to each element of a list, which amounts to the same thing, since a data frame is a list of columns).

However, the orthogonal question of “how to apply a function to each row” is discussed much less often. In this post, we will explore some of the many answers to it.

A typical operation could be to compute the row sums of some variables.

Setup

data(mtcars)
library(tidyverse)
library(purrrlyr)

d <- mtcars %>% 
  select(1:3) %>% 
  head()

Using rowSums() and friends

For typical row-wise operations such as sums or means, base R offers some simple and well-known functions.

d %>% 
  mutate(sum_per_row = rowSums(.),
         avg_per_row = rowMeans(.))
#>    mpg cyl disp sum_per_row avg_per_row
#> 1 21.0   6  160       187.0    62.33333
#> 2 21.0   6  160       187.0    62.33333
#> 3 22.8   4  108       134.8    44.93333
#> 4 21.4   6  258       285.4    95.13333
#> 5 18.7   8  360       386.7   128.90000
#> 6 18.1   6  225       249.1    83.03333

The columns can be selected in the usual tidyverse way:

d %>% 
  mutate(avg_per_row = rowMeans(select(., 1,2,3)))
#>    mpg cyl disp avg_per_row
#> 1 21.0   6  160    62.33333
#> 2 21.0   6  160    62.33333
#> 3 22.8   4  108    44.93333
#> 4 21.4   6  258    95.13333
#> 5 18.7   8  360   128.90000
#> 6 18.1   6  225    83.03333

Using apply()

A quite general approach is to use apply() from base R with MARGIN = 1, which applies the function to each row.

d %>% 
  mutate(avg_per_row = apply(X = d, MARGIN = 1, FUN = mean))
#>    mpg cyl disp avg_per_row
#> 1 21.0   6  160    62.33333
#> 2 21.0   6  160    62.33333
#> 3 22.8   4  108    44.93333
#> 4 21.4   6  258    95.13333
#> 5 18.7   8  360   128.90000
#> 6 18.1   6  225    83.03333
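One caveat worth keeping in mind: apply() converts its input to a matrix first, so a single non-numeric column turns every value into a character string. The following is a minimal sketch of that behaviour and one way around it; the data frame `df` and its columns are made up for illustration.

```r
# Hypothetical toy data with one non-numeric column
df <- data.frame(a = c(1, 2), b = c(3, 4), label = c("x", "y"))

# Mixed column types are coerced to a character matrix
is.character(as.matrix(df))

# Restrict apply() to the numeric columns before computing row means
num_cols <- vapply(df, is.numeric, logical(1))
row_means <- apply(df[num_cols], 1, mean)
```

In the examples of this post all columns are numeric, so the coercion does no harm.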

Using rowwise() from dplyr

rowwise() is a function from dplyr that groups the data frame row-wise, that is, each row becomes its own group.

d %>% 
  rowwise() %>% 
  mutate(avg_per_row = mean(c(mpg, cyl, disp)))
#> Source: local data frame [6 x 4]
#> Groups: <by row>
#> 
#> # A tibble: 6 x 4
#>     mpg   cyl  disp avg_per_row
#>   <dbl> <dbl> <dbl>       <dbl>
#> 1  21       6   160        62.3
#> 2  21       6   160        62.3
#> 3  22.8     4   108        44.9
#> 4  21.4     6   258        95.1
#> 5  18.7     8   360       129. 
#> 6  18.1     6   225        83.0

One disadvantage of this approach is that the relevant variables (such as mpg, cyl, disp) cannot be (easily) referred to using the dot notation.
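In dplyr 1.0.0 and later, c_across() softens this limitation: it accepts tidyselect-style column ranges inside a row-wise mutate(). The following is a sketch under the assumption that such a dplyr version is installed; the column range `mpg:disp` is just one way to name the columns.

```r
library(dplyr)

d <- head(mtcars[1:3])

# c_across() gathers the selected columns of the current row into one vector
res <- d %>%
  rowwise() %>%
  mutate(avg_per_row = mean(c_across(mpg:disp))) %>%
  ungroup()  # drop the row-wise grouping again
```

Calling ungroup() at the end keeps later operations from silently running row by row.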

Using by_row() from purrrlyr

by_row() applies a function row-wise, hence “by row”. However, the function receives each row as a list (a one-row data frame) rather than as a plain vector.

d %>% 
  purrrlyr::by_row(lift_vl(mean), .collate = "cols")
#> # tibble [6 × 4]
#>     mpg   cyl  disp  .out
#>   <dbl> <dbl> <dbl> <dbl>
#> 1  21       6   160  62.3
#> 2  21       6   160  62.3
#> 3  22.8     4   108  44.9
#> 4  21.4     6   258  95.1
#> 5  18.7     8   360 129. 
#> 6  18.1     6   225  83.0

lift_vl() lifts the input of a function from a vector (“v”) to a list (“l”), hence “vl”. .collate determines the output format; the default is a list-column, whereas "cols" returns the results as regular columns.

Using pmap() from purrr

This is perhaps the most abstract approach, but it has a certain beauty.

d %>% 
  mutate(avg_per_row = pmap(d, lift_vd(mean)))
#>    mpg cyl disp avg_per_row
#> 1 21.0   6  160    62.33333
#> 2 21.0   6  160    62.33333
#> 3 22.8   4  108    44.93333
#> 4 21.4   6  258    95.13333
#> 5 18.7   8  360       128.9
#> 6 18.1   6  225    83.03333

pmap() cycles through p vectors in parallel (hence pmap), that is, it reads the first element of vector 1, the first element of vector 2, the first element of vector 3, and so on. These elements are then passed as separate arguments to the chosen function, here mean(). However, mean() is a function that expects a vector-valued input (v). That’s why we need to “lift” its domain to accept a variable number of inputs, i.e., dots (d). lift_vd() does this job.
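To see why the lifting matters, here is a small base R sketch. The helper name `mean_dots` is made up for illustration; it is a hand-written stand-in for what lift_vd(mean) provides, not purrr's actual implementation.

```r
# mean() treats only its first argument as data; the others are matched to
# further parameters (trim, na.rm), so this call silently returns 21
mean(21, 6, 160)

# Hand-written counterpart of a lifted mean(): gather the dots into one vector
mean_dots <- function(...) mean(c(...))
mean_dots(21, 6, 160)  # (21 + 6 + 160) / 3
```

Note that the lift_*() family has been deprecated since purrr 1.0.0, so a small wrapper like this one may be the more durable choice.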

From the help page of mean():

mean(x, ...)

mean() takes a vector as its primary data input. In contrast, sum() accepts dots:

sum(..., na.rm = FALSE)

That’s why sum() does not need to be lifted when used with pmap().

d %>% 
  mutate(sum_per_row = pmap(d, sum))
#>    mpg cyl disp sum_per_row
#> 1 21.0   6  160         187
#> 2 21.0   6  160         187
#> 3 22.8   4  108       134.8
#> 4 21.4   6  258       285.4
#> 5 18.7   8  360       386.7
#> 6 18.1   6  225       249.1
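pmap() always returns a list, so the new column above is a list-column. If a plain numeric column is preferred, the typed variant pmap_dbl() can be used instead. A minimal sketch, in which the anonymous function stands in for lift_vd(mean):

```r
library(purrr)

d <- head(mtcars[1:3])

# pmap_dbl() returns a numeric vector instead of a list
sum_per_row <- pmap_dbl(d, sum)

# Collecting the dots into a vector by hand makes lifting unnecessary
avg_per_row <- pmap_dbl(d, function(...) mean(c(...)))
```

Both results can be assigned inside mutate() just like the list-valued versions.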

And the winner is … pmap()

To me, pmap() is the most elegant approach, because it is the most general, and it avoids strange acrobatics.

It’s quite cool that the columns (list elements) to be passed to pmap() can be selected quite easily:

d %>% 
  mutate(avg_per_row = pmap(list(mpg, cyl), lift_vd(mean)))
#>    mpg cyl disp avg_per_row
#> 1 21.0   6  160        13.5
#> 2 21.0   6  160        13.5
#> 3 22.8   4  108        13.4
#> 4 21.4   6  258        13.7
#> 5 18.7   8  360       13.35
#> 6 18.1   6  225       12.05

Equivalently:

d %>% 
  mutate(avg_per_row = pmap(select(., mpg, cyl), lift_vd(mean)))
#>    mpg cyl disp avg_per_row
#> 1 21.0   6  160        13.5
#> 2 21.0   6  160        13.5
#> 3 22.8   4  108        13.4
#> 4 21.4   6  258        13.7
#> 5 18.7   8  360       13.35
#> 6 18.1   6  225       12.05

Debrief

Of course, a data frame could be transposed with t(), and then the typical column-oriented functions could be applied. However, this procedure seems to exert too much brute force on the data frame. It seems (to me) much more natural to tell the computer to “apply the function to each row”, in parallel to the (much more common) idiom of applying a function to each column. Sure enough, the technical fabric of the data frame constrains the lower-level functions under the hood. From a higher-level perspective, though, it is desirable to state clearly what the machine should work out, without gymnastics imposed by its technical setup.
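For completeness, the transposition route can be sketched in base R as follows, assuming all columns are numeric (t() returns a matrix, so mixed types would be coerced to character):

```r
d <- head(mtcars[1:3])

# t() turns the data frame into a matrix whose columns are the former rows
td <- t(d)

# A column-oriented function now yields one value per original row
avg_per_row <- colMeans(td)
```

The result matches rowMeans(d), but the detour through a transposed matrix hides the intent.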

There are a number of great resources on this topic. Check out Jenny Bryan’s talk. Jenny Bryan also has a great tutorial on row-oriented work using tidyverse techniques.