UPDATE 2018-12-11 - I’m talking about the package DPLYR, not PURRR, as I had mistakenly written.
There are many ways to implement what is called the “split-apply-combine” strategy (see this paper by Hadley Wickham).
I recently thought about the best way to do split-apply-combine in R (see tweet, and this post).
I also retweeted some criticism of the “present era” tidyverse approach (see this tweet); do check out the mentioned post by @coolbutuseless.
Nice wrap-up on different Split-Apply-Combine methods in #rstats. Spoiler: #tidyverse way loses. https://t.co/QGZOxxcTwJ
— Sebastian Sauer 🇺🇦 (@sauer_sebastian) December 8, 2018
Then, Erich Neuwirth (@neuwirthe) informed me on Twitter that there’s a new idiom in dplyr [1] (as of version 0.8.0) that might come as a remedy: group_split(). This post explores some of the uses of this idiom.
dplyr 0.8.0 (github only) has group_split. Does that offer a reasonable solution?
— Erich Neuwirth (@neuwirthe) December 8, 2018
Thanks, Erich!
👍
First, load the tidyverse packages (that is, dplyr, for our purposes):
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0 ✔ purrr 0.2.5
## ✔ tibble 1.4.2 ✔ dplyr 0.7.99.9000
## ✔ tidyr 0.8.2 ✔ stringr 1.3.1
## ✔ readr 1.2.1 ✔ forcats 0.3.0
## ── Conflicts ─────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(broom)
Note that I am running this version of dplyr (installed from GitHub on 2018-12-10):
packageVersion("dplyr")
## [1] '0.7.99.9000'
The typical tidyverse approach for split-apply-combine
The typical tidyverse approach is the following:
mtcars %>%
group_by(cyl) %>%
summarise(hp_mean = mean(hp))
## # A tibble: 3 x 2
## cyl hp_mean
## <dbl> <dbl>
## 1 4 82.6
## 2 6 122.
## 3 8 209.
Contrast this with the overall mean:
mtcars %>%
summarise(mean(hp))
## mean(hp)
## 1 146.6875
Using list() in summarise() does NOT convey the grouping
This approach works fine unless one wants to apply a more complex function to each group. “Complex” refers to a function that returns more than a single number, such as lm().
One might hope that the following works:
d <- mtcars %>%
group_by(cyl) %>%
summarise(hp_mean = list(tidy(lm(hp ~ 1, data = .))))
d
## # A tibble: 3 x 2
## cyl hp_mean
## <dbl> <list>
## 1 4 <tibble [1 × 5]>
## 2 6 <tibble [1 × 5]>
## 3 8 <tibble [1 × 5]>
This gives what is called a “list column” in tidyverse parlance. To “unpack” or “unnest” this list column, use unnest(). Notice that this only works if the list column is “tidy”, that is, if it can be unpacked to a data-frame-like structure.
d %>%
unnest()
## # A tibble: 3 x 6
## cyl term estimate std.error statistic p.value
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 4 (Intercept) 147. 12.1 12.1 2.79e-13
## 2 6 (Intercept) 147. 12.1 12.1 2.79e-13
## 3 8 (Intercept) 147. 12.1 12.1 2.79e-13
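However, summarise() apparently did not take the grouping into account: all three rows show the overall mean (147), not the group means. The reason is that the dot in data = . refers to the whole data frame piped into summarise(), not to the current group.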
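To see where the repeated 147 comes from, here is a minimal base R sketch. It assumes that data = . handed lm() all 32 rows, and it compares the intercept of that model with the overall mean and the per-group means. The object names (fit_all, overall_mean, group_means) are only illustrative.

```r
# Intercept-only model on the full data, which is what `data = .`
# effectively passed to lm() inside summarise()
fit_all <- lm(hp ~ 1, data = mtcars)

# The intercept of an intercept-only model is the mean of the outcome
overall_mean <- mean(mtcars$hp)

# Per-group means, for comparison with the group_by()/summarise() result
group_means <- tapply(mtcars$hp, mtcars$cyl, mean)

print(coef(fit_all)[["(Intercept)"]])
print(overall_mean)
print(group_means)
```

The first two numbers coincide, which matches the identical estimates in all three rows above.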
Using split() and map()
A more “purrr-ish” approach is this:
mtcars %>%
split(.$cyl) %>%
map( ~ tidy(lm(hp ~ 1, data = .))) %>%
map_dfc("estimate")
## # A tibble: 1 x 3
## `4` `6` `8`
## <dbl> <dbl> <dbl>
## 1 82.6 122. 209.
Works. But.
This approach differs from the dplyr approach, mainly because the grouping idiom of dplyr does not work here (and split(.$cyl) does not look very consistent with the rest of the tidyverse code).
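A variant of the split-and-map idea keeps the results in long format, with one row per group and the group label as a column. This is only a sketch: it assumes purrr, broom and dplyr are installed (map_dfr() relies on dplyr for the row-binding), and fits_long is an illustrative name.

```r
library(purrr)
library(broom)

# Split with base R, fit one intercept-only model per group,
# and row-bind the tidy results; `.id` stores the names of the
# split list ("4", "6", "8") in a new column called cyl
fits_long <- map_dfr(
  split(mtcars, mtcars$cyl),
  ~ tidy(lm(hp ~ 1, data = .x)),
  .id = "cyl"
)

fits_long
```

Note that the cyl column created this way is a character vector, since it is taken from the list names.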
Using group_by() and nest()ed list-columns
List columns do work when used with group_by(), nest(), and mutate():
apply_lm <- function(df){
tidy(lm(data = df, hp ~ 1))
}
d <- mtcars %>%
group_by(cyl) %>%
nest() %>%
mutate(hp_mean = purrr::map(.x = data, .f = apply_lm))
d
## # A tibble: 3 x 3
## cyl data hp_mean
## <dbl> <list> <list>
## 1 6 <tibble [7 × 10]> <tibble [1 × 5]>
## 2 4 <tibble [11 × 10]> <tibble [1 × 5]>
## 3 8 <tibble [14 × 10]> <tibble [1 × 5]>
Let’s see what’s in the list columns:
d %>%
unnest(hp_mean)
## # A tibble: 3 x 7
## cyl data term estimate std.error statistic p.value
## <dbl> <list> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 6 <tibble [7 × 1… (Interce… 122. 9.17 13.3 1.10e-5
## 2 4 <tibble [11 × … (Interce… 82.6 6.31 13.1 1.28e-7
## 3 8 <tibble [14 × … (Interce… 209. 13.6 15.4 1.03e-9
Using group_split()
dplyr has recently gained group_split() (still experimental as of this writing). The idea of this function is to convey the grouping information from group_by() to split().
mtcars %>%
group_by(cyl) %>%
group_split() %>%
map_dfr(~lm(hp ~ 1, data = .) %>% tidy())
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 82.6 6.31 13.1 0.000000128
## 2 (Intercept) 122. 9.17 13.3 0.0000110
## 3 (Intercept) 209. 13.6 15.4 0.00000000103
When the data frame is ungrouped, group_split() can be used to group it:
mtcars %>%
group_split(cyl) %>%
map_dfr(~lm(hp ~ 1, data = .) %>% tidy())
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 82.6 6.31 13.1 0.000000128
## 2 (Intercept) 122. 9.17 13.3 0.0000110
## 3 (Intercept) 209. 13.6 15.4 0.00000000103
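The results above no longer show which row belongs to which value of cyl. One way to bring the labels back is group_keys(), which returns the grouping keys in the same order as the pieces from group_split(). The following is a hedged sketch, assuming dplyr 0.8.0 or later; by_cyl, keys, fits and res are illustrative names.

```r
library(dplyr)
library(purrr)
library(broom)

by_cyl <- group_by(mtcars, cyl)

# One row per group, in the same order as the pieces of group_split()
keys <- group_keys(by_cyl)

# One tidy model summary per group
fits <- by_cyl %>%
  group_split() %>%
  map(~ tidy(lm(hp ~ 1, data = .x)))

# Put the keys next to the stacked model summaries
res <- bind_cols(keys, bind_rows(fits))
res
```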
Join model results back to full (raw data) data frame
This dataset could now be joined with the initial data:
mtcars %>%
group_split(cyl) %>%
map_dfr(~lm(hp ~ 1, data = .) %>% tidy()) %>%
mutate(group = mtcars %>% group_keys(cyl) %>% pull(cyl)) %>%
full_join(mtcars, by = c("group" = "cyl"))
## # A tibble: 32 x 16
## term estimate std.error statistic p.value group mpg disp hp drat
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Int… 82.6 6.31 13.1 1.28e-7 4 22.8 108 93 3.85
## 2 (Int… 82.6 6.31 13.1 1.28e-7 4 24.4 147. 62 3.69
## 3 (Int… 82.6 6.31 13.1 1.28e-7 4 22.8 141. 95 3.92
## 4 (Int… 82.6 6.31 13.1 1.28e-7 4 32.4 78.7 66 4.08
## 5 (Int… 82.6 6.31 13.1 1.28e-7 4 30.4 75.7 52 4.93
## 6 (Int… 82.6 6.31 13.1 1.28e-7 4 33.9 71.1 65 4.22
## 7 (Int… 82.6 6.31 13.1 1.28e-7 4 21.5 120. 97 3.7
## 8 (Int… 82.6 6.31 13.1 1.28e-7 4 27.3 79 66 4.08
## 9 (Int… 82.6 6.31 13.1 1.28e-7 4 26 120. 91 4.43
## 10 (Int… 82.6 6.31 13.1 1.28e-7 4 30.4 95.1 113 3.77
## # ... with 22 more rows, and 6 more variables: wt <dbl>, qsec <dbl>,
## # vs <dbl>, am <dbl>, gear <dbl>, carb <dbl>
Of course, the join duplicates a lot of data, which can be wasteful.
One wish remains open - Adding columns using group_split()
However, at the moment at least, I do not see a convenient way to add columns to an existing data frame (e.g., add the estimate) via this approach:
mtcars %>%
mutate(hp_mean = group_split(., cyl) %>% map(~lm(hp ~ 1, data = .) %>% tidy() %>% map("estimate")))
## Error: Column `hp_mean` must be length 32 (the number of rows) or one, not 3
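The error arises because mutate() expects one value per row (32) or a single value, whereas group_split() yields one element per group (3). The nested version therefore seems more reasonable: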
mtcars %>%
group_by(cyl) %>%
nest() %>%
mutate(hp_mean = map(.f = apply_lm, .x = data)) %>%
unnest(hp_mean)
## # A tibble: 3 x 7
## cyl data term estimate std.error statistic p.value
## <dbl> <list> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 6 <tibble [7 × 1… (Interce… 122. 9.17 13.3 1.10e-5
## 2 4 <tibble [11 × … (Interce… 82.6 6.31 13.1 1.28e-7
## 3 8 <tibble [14 × … (Interce… 209. 13.6 15.4 1.03e-9
Individual columns can also be extracted by passing the name of the element to map() (here, map_dbl()):
mtcars %>%
group_by(cyl) %>%
nest() %>%
mutate(hp_mean = map(apply_lm, .x = data)) %>%
mutate(hp_est = map_dbl("estimate", .x = hp_mean))
## # A tibble: 3 x 4
## cyl data hp_mean hp_est
## <dbl> <list> <list> <dbl>
## 1 6 <tibble [7 × 10]> <tibble [1 × 5]> 122.
## 2 4 <tibble [11 × 10]> <tibble [1 × 5]> 82.6
## 3 8 <tibble [14 × 10]> <tibble [1 × 5]> 209.
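The name-based extraction works because purrr turns a character string passed as the function argument into an extractor. Here is a small sketch with made-up toy data (the two data frames and their estimates are invented for illustration, not model output):

```r
library(purrr)

# Toy stand-ins for tidy() results; the values are made up
fits <- list(
  data.frame(term = "(Intercept)", estimate = 1.5),
  data.frame(term = "(Intercept)", estimate = 2.5)
)

# A string is shorthand for "pull out the element with this name"
by_name <- map_dbl(fits, "estimate")

# The same thing written as an explicit function
by_function <- map_dbl(fits, function(df) df[["estimate"]])

identical(by_name, by_function)
```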
The same without externally defined function
It may be more direct to define the function for map() right within mutate():
mtcars %>%
group_by(cyl) %>%
nest() %>%
mutate(hp_lm = map( ~ lm(hp ~ 1, data = .) %>% tidy(), .x = data)) %>%
mutate(hp_mean = map_dbl("estimate", .x = hp_lm))
## # A tibble: 3 x 4
## cyl data hp_lm hp_mean
## <dbl> <list> <list> <dbl>
## 1 6 <tibble [7 × 10]> <tibble [1 × 5]> 122.
## 2 4 <tibble [11 × 10]> <tibble [1 × 5]> 82.6
## 3 8 <tibble [14 × 10]> <tibble [1 × 5]> 209.
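Coming back to the open wish of adding a per-group estimate as a column of the full data frame: one possible workaround sidesteps group_split() altogether and uses a grouped mutate() with a small helper that takes a vector. This is a sketch, not a solution within the group_split() idiom; intercept_of() and cars_with_mean are hypothetical names introduced here.

```r
library(dplyr)

# Hypothetical helper: intercept of an intercept-only model for a vector
intercept_of <- function(x) {
  coef(lm(x ~ 1))[[1]]
}

# Within each group, the single estimate is recycled to every row
cars_with_mean <- mtcars %>%
  group_by(cyl) %>%
  mutate(hp_mean = intercept_of(hp)) %>%
  ungroup()

cars_with_mean %>%
  select(cyl, hp, hp_mean) %>%
  head()
```

This keeps all 32 rows and avoids the join, but it only suits statistics that reduce to a single value per group.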
Debrief
group_split() is a new function (as of this writing) which provides some remedy for split-apply-combine actions within the tidyverse.
[1] Thanks, @romain_francois, for pointing this out.