A typical and quite straightforward operation in R and the tidyverse is to apply a function to each column of a data frame (or to each element of a list, which amounts to the same thing).
However, the orthogonal question of “how to apply a function to each row” is discussed much less often. In this post we will look at this question and explore some (of many) answers to it.
A typical operation could be to compute the row sums of some variables.
Setup
data(mtcars)
library(tidyverse)
library(purrrlyr)
d <- mtcars %>%
  select(1:3) %>%
  head()
Using rowSums() and friends
For typical row-wise operations such as means or sums, there are some simple and well-known functions in base R.
d %>%
  mutate(sum_per_row = rowSums(.),
         avg_per_row = rowMeans(.))
#> mpg cyl disp sum_per_row avg_per_row
#> 1 21.0 6 160 187.0 62.33333
#> 2 21.0 6 160 187.0 62.33333
#> 3 22.8 4 108 134.8 44.93333
#> 4 21.4 6 258 285.4 95.13333
#> 5 18.7 8 360 386.7 128.90000
#> 6 18.1 6 225 249.1 83.03333
The choice of columns could be curated in the typical tidyverse way:
d %>%
  mutate(avg_per_row = rowMeans(select(., 1, 2, 3)))
#> mpg cyl disp avg_per_row
#> 1 21.0 6 160 62.33333
#> 2 21.0 6 160 62.33333
#> 3 22.8 4 108 44.93333
#> 4 21.4 6 258 95.13333
#> 5 18.7 8 360 128.90000
#> 6 18.1 6 225 83.03333
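To make the column selection a bit more concrete, here is a minimal sketch that picks columns by name and skips missing values. The object names `res`, `d_na`, and `res_na` and the artificially inserted `NA` are only there for illustration; they are not part of the original example.

```r
library(dplyr)

# Same six rows and three columns as in the setup
d <- head(mtcars[, 1:3])

# Select columns by name instead of by position
res <- d %>%
  mutate(avg_mpg_cyl = rowMeans(select(., mpg, cyl)))

# Illustrative copy with one missing value (an assumption for this sketch)
d_na <- d
d_na$cyl[1] <- NA

# na.rm = TRUE drops missing values within each row before averaging
res_na <- d_na %>%
  mutate(avg_mpg_cyl = rowMeans(select(., mpg, cyl), na.rm = TRUE))
```

In the first row of `res_na`, only `mpg` is left, so the row mean is simply that value.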
Using apply()
A quite general approach is to use the apply() family of functions from base R.
d %>%
  mutate(avg_per_row = apply(X = d, MARGIN = 1, FUN = mean))
#> mpg cyl disp avg_per_row
#> 1 21.0 6 160 62.33333
#> 2 21.0 6 160 62.33333
#> 3 22.8 4 108 44.93333
#> 4 21.4 6 258 95.13333
#> 5 18.7 8 360 128.90000
#> 6 18.1 6 225 83.03333
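What makes apply() general is that FUN can be any function of a row, not just mean(). The following sketch uses a made-up example (the range of each row) and also shows one caveat worth knowing: apply() converts the data frame to a matrix first, so a data frame with mixed column types ends up as character. The names `range_per_row`, `mixed`, and `row_classes` are illustrative.

```r
# Same six rows and three columns as in the setup
d <- head(mtcars[, 1:3])

# Any function of a row works; here: largest minus smallest value per row
d$range_per_row <- apply(d, 1, function(row) max(row) - min(row))

# Caveat: apply() coerces its input to a matrix, so one character column
# turns every value in the row into a character string
mixed <- data.frame(x = 1:2, y = c("a", "b"))
row_classes <- apply(mixed, 1, class)
```

For purely numeric columns, as in `d`, the coercion is harmless.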
Using rowwise() from dplyr
rowwise() is a function from dplyr that groups the data frame row-wise; that is, each row becomes its own group.
d %>%
  rowwise() %>%
  mutate(avg_per_row = mean(c(mpg, cyl, disp)))
#> Source: local data frame [6 x 4]
#> Groups: <by row>
#>
#> # A tibble: 6 x 4
#> mpg cyl disp avg_per_row
#> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 62.3
#> 2 21 6 160 62.3
#> 3 22.8 4 108 44.9
#> 4 21.4 6 258 95.1
#> 5 18.7 8 360 129.
#> 6 18.1 6 225 83.0
One disadvantage of this approach is that the relevant variables (such as mpg, cyl, disp) cannot be (easily) referred to using the dot notation.
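If a reasonably recent dplyr is available (I am assuming version 1.0.0 or later here), c_across() offers a way to select the columns with tidyselect syntax inside a row-wise mutate(), which avoids typing every variable name. A minimal sketch, with `res_rowwise` as an illustrative name:

```r
library(dplyr)

# Same six rows and three columns as in the setup
d <- head(mtcars[, 1:3])

res_rowwise <- d %>%
  rowwise() %>%
  # c_across() combines the selected columns of the current row into one vector
  mutate(avg_per_row = mean(c_across(mpg:disp))) %>%
  # drop the row-wise grouping so later verbs work on the whole table again
  ungroup()
```

The ungroup() at the end is optional here, but it prevents later steps from silently running row by row.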
Using by_row from purrrlyr
by_row() applies a function row-wise, hence “by row”. However, the function applied to each row is expected to take a list (or data frame) as input.
d %>%
  purrrlyr::by_row(lift_vl(mean), .collate = "cols")
#> # tibble [6 × 4]
#> mpg cyl disp .out
#> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 62.3
#> 2 21 6 160 62.3
#> 3 22.8 4 108 44.9
#> 4 21.4 6 258 95.1
#> 5 18.7 8 360 129.
#> 6 18.1 6 225 83.0
lift_vl() lifts the input of a function from a vector (“v”) to a list (“l”), hence “vl”. .collate determines the output format; the default is a list-column, whereas "cols" returns regular columns.
Using pmap() from purrr
This is perhaps the most abstract approach, but it has a certain beauty.
d %>%
  mutate(avg_per_row = pmap(d, lift_vd(mean)))
#> mpg cyl disp avg_per_row
#> 1 21.0 6 160 62.33333
#> 2 21.0 6 160 62.33333
#> 3 22.8 4 108 44.93333
#> 4 21.4 6 258 95.13333
#> 5 18.7 8 360 128.9
#> 6 18.1 6 225 83.03333
pmap() cycles through p vectors in parallel (hence pmap); that is, it reads the first element of vector 1, the first element of vector 2, the first element of vector 3, and so on. These elements are then passed as separate arguments to the chosen function, here mean(). However, mean() is a function that expects a vector-valued input (v). That’s why we need to “lift” its domain to accept a variable number of inputs, i.e., dots (d). lift_vd() does this job.
From the help page of mean():
mean(x, ...)
mean() takes a vector as its primary data input. In contrast, sum() accepts dots:
sum(..., na.rm = FALSE)
That’s why sum() does not need to be lifted when used with pmap().
d %>%
  mutate(sum_per_row = pmap(d, sum))
#> mpg cyl disp sum_per_row
#> 1 21.0 6 160 187
#> 2 21.0 6 160 187
#> 3 22.8 4 108 134.8
#> 4 21.4 6 258 285.4
#> 5 18.7 8 360 386.7
#> 6 18.1 6 225 249.1
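The lifting step can also be spelled out by hand: an anonymous function collects the dots into a vector before calling mean(). As far as I know, the lift_*() helpers were deprecated in purrr 1.0.0, so this explicit form may be the safer choice in newer setups. The sketch below also uses pmap_dbl(), which returns a plain numeric vector instead of a list-column. `res_pmap` is an illustrative name.

```r
library(dplyr)
library(purrr)

# Same six rows and three columns as in the setup
d <- head(mtcars[, 1:3])

res_pmap <- d %>%
  mutate(
    # c(...) gathers the row's values into one vector, which mean() expects
    avg_per_row = pmap_dbl(., function(...) mean(c(...))),
    # sum() already accepts dots, so it can be passed as is
    sum_per_row = pmap_dbl(., sum)
  )
```

Because both new columns are ordinary double vectors, they print and behave like the rowMeans() and rowSums() results.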
And the winner is … pmap()
To me, pmap() is the most elegant approach, because it is the most general, and it avoids strange acrobatics.
It’s quite cool that the columns (list elements) to be passed to pmap() can be selected quite easily:
d %>%
  mutate(avg_per_row = pmap(list(mpg, cyl), lift_vd(mean)))
#> mpg cyl disp avg_per_row
#> 1 21.0 6 160 13.5
#> 2 21.0 6 160 13.5
#> 3 22.8 4 108 13.4
#> 4 21.4 6 258 13.7
#> 5 18.7 8 360 13.35
#> 6 18.1 6 225 12.05
Equivalently:
d %>%
  mutate(avg_per_row = pmap(select(., mpg, cyl), lift_vd(mean)))
#> mpg cyl disp avg_per_row
#> 1 21.0 6 160 13.5
#> 2 21.0 6 160 13.5
#> 3 22.8 4 108 13.4
#> 4 21.4 6 258 13.7
#> 5 18.7 8 360 13.35
#> 6 18.1 6 225 12.05
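Part of what makes pmap() general is that it matches the names of the list elements (here: the column names) to the argument names of the function. So a row-wise function can refer to the columns directly. The ratio computed below is just a made-up example, and `mpg_per_cyl` and `mpg_per_cyl_all` are illustrative names.

```r
library(purrr)

# Same six rows and three columns as in the setup
d <- head(mtcars[, 1:3])

# Column names are matched to the function's argument names
mpg_per_cyl <- pmap_dbl(d[, c("mpg", "cyl")], function(mpg, cyl) mpg / cyl)

# With all columns passed in, the dots absorb the ones not needed (disp)
mpg_per_cyl_all <- pmap_dbl(d, function(mpg, cyl, ...) mpg / cyl)
```

Both calls give the same result; the second form is handy when the data frame has many columns and only a few are used.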
Debrief
Of course, a data frame could be transposed with the t() function, and then the typical column-oriented functions could be applied. However, this procedure seems to exert too much brute force on the data frame. It seems (to me) much more natural to tell the computer to “apply the function to each row”, essentially parallel to the (much more common) idiom of applying a function to each column. Under the hood, the technical fabric of the data frame limits the (lower-level) functions, sure enough. But from a higher-level perspective, it is desirable to state clearly what the machine should work out, without gymnastics imposed by the technical setup of the machine.
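For completeness, here is a small sketch of what the transposing route looks like. It works for this all-numeric example, but note that t() returns a matrix, not a data frame. `td` and `avg_per_row` are illustrative names.

```r
# Same six rows and three columns as in the setup
d <- head(mtcars[, 1:3])

# t() turns the 6 x 3 data frame into a 3 x 6 numeric matrix:
# each former row is now a column
td <- t(d)

# A column-oriented function now yields one value per original row
avg_per_row <- colMeans(td)
```

The result matches rowMeans(d), which rather supports the point that the detour is not needed.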
There are a number of great resources on this topic out there. Check out Jenny Bryan’s talk. She also has a great tutorial on row-y work using tidyverse techniques.