New split-apply-combine variant in dplyr: group_split()


UPDATE 2018-12-11 - I’m talking about the package DPLYR, not PURRR, as I had mistakenly written.


There are many ways to implement what is called the “split-apply-combine” strategy (see this paper by Hadley Wickham).

I recently thought about the best way to do split-apply-combine in R (see tweet, and this post).

I also retweeted some criticism of the “present era” tidyverse approach (see this tweet); check out the mentioned post by @coolbutuseless, too.

Then Erich Neuwirth (@neuwirthe) informed me on Twitter that there’s a new idiom in dplyr [1] (as of version 0.8.0) that might come as a remedy: group_split(). This post explores some of the uses of this idiom.


Thanks, Erich!

👍


First, load the tidyverse packages (that is, dplyr, for our purposes):

library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0           ✔ purrr   0.2.5      
## ✔ tibble  1.4.2           ✔ dplyr   0.7.99.9000
## ✔ tidyr   0.8.2           ✔ stringr 1.3.1      
## ✔ readr   1.2.1           ✔ forcats 0.3.0
## ── Conflicts ─────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(broom)

Note that I am running this development version of dplyr (installed from GitHub on 2018-12-10):

packageVersion("dplyr")
## [1] '0.7.99.9000'

The typical tidyverse approach for split-apply-combine

The typical tidyverse approach is the following:

mtcars %>% 
  group_by(cyl) %>% 
  summarise(hp_mean = mean(hp))
## # A tibble: 3 x 2
##     cyl hp_mean
##   <dbl>   <dbl>
## 1     4    82.6
## 2     6   122. 
## 3     8   209.

Contrast this with the overall mean:

mtcars %>% 
  summarise(mean(hp))
##   mean(hp)
## 1 146.6875
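To make the three steps behind the grouped summary explicit, here is a minimal base R sketch of the same computation. The object names (hp_by_cyl, hp_means, result) are made up for illustration:

```r
# Split: one vector of hp values per cylinder count
hp_by_cyl <- split(mtcars$hp, mtcars$cyl)

# Apply: compute the mean of each piece
hp_means <- vapply(hp_by_cyl, mean, numeric(1))

# Combine: put the group labels and the results back into one data frame
result <- data.frame(
  cyl     = as.numeric(names(hp_means)),
  hp_mean = unname(hp_means)
)
result
```

group_by() plus summarise() bundles these three steps into one pipeline and keeps the group label as a regular column.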

Using list() in summarise() does NOT convey the grouping

This approach works fine unless one wants to apply a more complex function to each group. “Complex” refers to a function that returns more than a single number, such as lm().

One might hope that the following works:

d <- mtcars %>% 
  group_by(cyl) %>% 
  summarise(hp_mean = list(tidy(lm(hp ~ 1, data  = .))))

d
## # A tibble: 3 x 2
##     cyl hp_mean         
##   <dbl> <list>          
## 1     4 <tibble [1 × 5]>
## 2     6 <tibble [1 × 5]>
## 3     8 <tibble [1 × 5]>

This gives what is called a “list column” in tidyverse parlance. To “unpack” or “unnest” this list column, use unnest(). Notice that this only works if the list column is “tidy”, that is, if it can be unpacked into a data-frame-like structure.

d %>% 
  unnest()
## # A tibble: 3 x 6
##     cyl term        estimate std.error statistic  p.value
##   <dbl> <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1     4 (Intercept)     147.      12.1      12.1 2.79e-13
## 2     6 (Intercept)     147.      12.1      12.1 2.79e-13
## 3     8 (Intercept)     147.      12.1      12.1 2.79e-13

However, summarise() apparently did not take the grouping into account: all three groups get the same estimate, which is the overall mean of hp.
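A small diagnostic sketch may help to see why. With the magrittr pipe, the dot stands for the whole data frame that was piped in, not for the current group, so anything computed from the dot ignores the grouping. The column names n_dot and n_group are made up for this illustration:

```r
library(dplyr)

res <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    n_dot   = nrow(.),  # rows of the whole piped data frame
    n_group = n()       # rows of the current group
  )

res$n_dot    # 32 32 32
res$n_group  # 11 7 14
```

The same holds for data = . inside lm(): the model is fitted to all 32 rows, once per group.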

Using split() and map()

A more “purrr-ish” approach is this:

mtcars %>% 
  split(.$cyl) %>% 
  map( ~ tidy(lm(hp ~ 1, data = .))) %>% 
  map_dfc("estimate")
## # A tibble: 1 x 3
##     `4`   `6`   `8`
##   <dbl> <dbl> <dbl>
## 1  82.6  122.  209.

Works. But.

This approach differs from the dplyr approach, mainly because the grouping idiom of dplyr does not work here (and split(.$cyl) does not look very consistent with the rest of the tidyverse code).
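Note that split() stores the group labels as names of the resulting list, not as a column. As a sketch of one way to get them back into a column, the following uses tibble::enframe(), which is my addition and not part of the approach above; to keep it short, a plain mean() stands in for the model:

```r
library(dplyr)   # for the pipe
library(purrr)
library(tibble)

hp_means <- mtcars %>%
  split(.$cyl) %>%              # named list: "4", "6", "8"
  map_dbl(~ mean(.x$hp)) %>%    # named numeric vector, names are kept
  enframe(name = "cyl", value = "hp_mean")

hp_means
```

The group label comes back as a character column, so it would need converting before joining it to the numeric cyl column of mtcars.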

Using group_by() and nest()ed list-columns

List columns do work when combined with group_by(), nest(), and mutate():

apply_lm <- function(df){
      tidy(lm(data = df, hp ~ 1))      
}

d <- mtcars %>% 
  group_by(cyl) %>% 
  nest() %>% 
  mutate(hp_mean = purrr::map(.f = apply_lm, .x = data)) 
d
## # A tibble: 3 x 3
##     cyl data               hp_mean         
##   <dbl> <list>             <list>          
## 1     6 <tibble [7 × 10]>  <tibble [1 × 5]>
## 2     4 <tibble [11 × 10]> <tibble [1 × 5]>
## 3     8 <tibble [14 × 10]> <tibble [1 × 5]>

Let’s see what’s in the list columns:

d %>% 
  unnest(hp_mean)
## # A tibble: 3 x 7
##     cyl data            term      estimate std.error statistic     p.value
##   <dbl> <list>          <chr>        <dbl>     <dbl>     <dbl>       <dbl>
## 1     6 <tibble [7 × 1… (Interce…    122.       9.17      13.3     1.10e-5
## 2     4 <tibble [11 × … (Interce…     82.6      6.31      13.1     1.28e-7
## 3     8 <tibble [14 × … (Interce…    209.      13.6       15.4     1.03e-9

Using group_split()

dplyr has recently gained group_split() (still experimental as of this writing).

The idea of this function is to carry the grouping information from group_by() over to a split()-like operation.

mtcars %>% 
  group_by(cyl) %>% 
  group_split() %>% 
  map_dfr(~lm(hp ~ 1, data = .) %>% tidy())
## # A tibble: 3 x 5
##   term        estimate std.error statistic       p.value
##   <chr>          <dbl>     <dbl>     <dbl>         <dbl>
## 1 (Intercept)     82.6      6.31      13.1 0.000000128  
## 2 (Intercept)    122.       9.17      13.3 0.0000110    
## 3 (Intercept)    209.      13.6       15.4 0.00000000103

When the data frame is ungrouped, the grouping variable can be passed directly to group_split():

mtcars %>% 
  group_split(cyl) %>% 
  map_dfr(~lm(hp ~ 1, data = .) %>% tidy())
## # A tibble: 3 x 5
##   term        estimate std.error statistic       p.value
##   <chr>          <dbl>     <dbl>     <dbl>         <dbl>
## 1 (Intercept)     82.6      6.31      13.1 0.000000128  
## 2 (Intercept)    122.       9.17      13.3 0.0000110    
## 3 (Intercept)    209.      13.6       15.4 0.00000000103
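Notice that the group labels are missing from the results above. As far as I can tell, group_split() returns an unnamed list of tibbles, and the matching labels can be retrieved with group_keys(), in the same order. A minimal sketch (the object names by_cyl, pieces, and keys are made up):

```r
library(dplyr)

by_cyl <- mtcars %>% group_by(cyl)

pieces <- group_split(by_cyl)  # list with one tibble per group
keys   <- group_keys(by_cyl)   # tibble with one row per group

length(pieces)                    # 3
names(pieces)                     # NULL: the labels are not stored as names
keys$cyl                          # 4 6 8
vapply(pieces, nrow, integer(1))  # 11 7 14
```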

Join model results back to full (raw data) data frame

This dataset can now be joined with the initial data:

mtcars %>% 
  group_split(cyl) %>% 
  map_dfr(~lm(hp ~ 1, data = .) %>% tidy()) %>%
  mutate(group = mtcars %>% group_keys(cyl) %>% pull(cyl)) %>% 
  full_join(mtcars, by = c("group" = "cyl"))
## # A tibble: 32 x 16
##    term  estimate std.error statistic p.value group   mpg  disp    hp  drat
##    <chr>    <dbl>     <dbl>     <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 (Int…     82.6      6.31      13.1 1.28e-7     4  22.8 108      93  3.85
##  2 (Int…     82.6      6.31      13.1 1.28e-7     4  24.4 147.     62  3.69
##  3 (Int…     82.6      6.31      13.1 1.28e-7     4  22.8 141.     95  3.92
##  4 (Int…     82.6      6.31      13.1 1.28e-7     4  32.4  78.7    66  4.08
##  5 (Int…     82.6      6.31      13.1 1.28e-7     4  30.4  75.7    52  4.93
##  6 (Int…     82.6      6.31      13.1 1.28e-7     4  33.9  71.1    65  4.22
##  7 (Int…     82.6      6.31      13.1 1.28e-7     4  21.5 120.     97  3.7 
##  8 (Int…     82.6      6.31      13.1 1.28e-7     4  27.3  79      66  4.08
##  9 (Int…     82.6      6.31      13.1 1.28e-7     4  26   120.     91  4.43
## 10 (Int…     82.6      6.31      13.1 1.28e-7     4  30.4  95.1   113  3.77
## # ... with 22 more rows, and 6 more variables: wt <dbl>, qsec <dbl>,
## #   vs <dbl>, am <dbl>, gear <dbl>, carb <dbl>

Of course, the join duplicates a lot of data, which can be wasteful.

One wish remains open - Adding columns using group_split()

However, at the moment at least, I do not see a convenient way to add columns to an existing data frame (e.g., add the estimate) via this approach:
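One way to limit the duplication is to join only the group key and the single number of interest. The following is a sketch under my own assumptions, not a recommended idiom: intercept_only() is a hypothetical helper, and I group first and then call group_keys() and group_split() on the grouped data so that both refer to the same groups.

```r
library(dplyr)
library(purrr)

# Hypothetical helper: intercept of hp ~ 1, i.e. the mean of hp in df
intercept_only <- function(df) {
  coef(lm(hp ~ 1, data = df))[["(Intercept)"]]
}

by_cyl <- mtcars %>% group_by(cyl)

# One row per group: the key plus the estimate
estimates <- group_keys(by_cyl) %>%
  mutate(hp_mean = map_dbl(group_split(by_cyl), intercept_only))

# Add the estimate to every row of the raw data
augmented <- mtcars %>%
  left_join(estimates, by = "cyl")

dim(augmented)  # 32 rows, 12 columns
```

Only one extra column is added here, at the price of dropping the other model statistics.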

mtcars %>%
  mutate(hp_mean = group_split(., cyl) %>% map(~lm(hp ~ 1, data = .) %>% tidy() %>% map("estimate")))
## Error: Column `hp_mean` must be length 32 (the number of rows) or one, not 3

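The error arises because group_split() returns one element per group (3), whereas mutate() expects one value per row (32) or a single value. So the nested version seems more reasonable: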

mtcars %>% 
  group_by(cyl) %>%
  nest() %>% 
  mutate(hp_mean = map(.f = apply_lm, .x = data)) %>% 
  unnest(hp_mean)
## # A tibble: 3 x 7
##     cyl data            term      estimate std.error statistic     p.value
##   <dbl> <list>          <chr>        <dbl>     <dbl>     <dbl>       <dbl>
## 1     6 <tibble [7 × 1… (Interce…    122.       9.17      13.3     1.10e-5
## 2     4 <tibble [11 × … (Interce…     82.6      6.31      13.1     1.28e-7
## 3     8 <tibble [14 × … (Interce…    209.      13.6       15.4     1.03e-9

Individual columns can also be extracted by passing the name of the element to map_dbl():

mtcars %>% 
  group_by(cyl) %>%
  nest() %>% 
  mutate(hp_mean = map(apply_lm, .x = data)) %>% 
  mutate(hp_est = map_dbl("estimate", .x = hp_mean))
## # A tibble: 3 x 4
##     cyl data               hp_mean          hp_est
##   <dbl> <list>             <list>            <dbl>
## 1     6 <tibble [7 × 10]>  <tibble [1 × 5]>  122. 
## 2     4 <tibble [11 × 10]> <tibble [1 × 5]>   82.6
## 3     8 <tibble [14 × 10]> <tibble [1 × 5]>  209.
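The string passed to map_dbl() works as a shorthand: purrr treats a character string in place of a function as "extract the element with this name". A minimal sketch with two toy data frames (fits is made up for illustration) shows that it matches the explicit formula version:

```r
library(purrr)

# Toy stand-ins for tidy() results, one row each
fits <- list(
  data.frame(term = "a", estimate = 1.5),
  data.frame(term = "b", estimate = 2.5)
)

by_name    <- map_dbl(fits, "estimate")      # string shorthand
by_formula <- map_dbl(fits, ~ .x$estimate)   # explicit extraction

identical(by_name, by_formula)  # TRUE
```

map_dbl() insists on exactly one number per element, so this only works because each tidy() result here has a single row.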

The same without externally defined function

It may be more direct to define the function for map() right within mutate():

mtcars %>% 
  group_by(cyl) %>%
  nest() %>% 
  mutate(hp_lm = map( ~ lm(hp ~ 1, data = .) %>% tidy(), .x = data)) %>% 
  mutate(hp_mean = map_dbl("estimate", .x = hp_lm))
## # A tibble: 3 x 4
##     cyl data               hp_lm            hp_mean
##   <dbl> <list>             <list>             <dbl>
## 1     6 <tibble [7 × 10]>  <tibble [1 × 5]>   122. 
## 2     4 <tibble [11 × 10]> <tibble [1 × 5]>    82.6
## 3     8 <tibble [14 × 10]> <tibble [1 × 5]>   209.

Debrief

group_split() is a new function (as of this writing) that provides some remedy for split-apply-combine tasks within the tidyverse.


  1. Thanks @romain_francois for pointing out