Recently, some one asked me in a workshop this question: “What are the names of the cars with 4 (6,8) cylinders?” (he referred to the mtcars data set). That was a workshop on the tidyverse, so the question is how to answer this question using tidyverse techniques.

First, let’s load the usual culprits.

library(tidyverse)
library(purrrlyr)
library(knitr)
library(stringr)

data(mtcars)

d <- as_tibble(mtcars) %>% 
  rownames_to_column(var = "car_names")

d %>% 
  head() %>% 
  kable()

car_names	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
Mazda RX4	21.0	6	160	110	3.90	2.620	16.46	0	1	4	4
Mazda RX4 Wag	21.0	6	160	110	3.90	2.875	17.02	0	1	4	4
Datsun 710	22.8	4	108	93	3.85	2.320	18.61	1	1	4	1
Hornet 4 Drive	21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
Hornet Sportabout	18.7	8	360	175	3.15	3.440	17.02	0	0	3	2
Valiant	18.1	6	225	105	2.76	3.460	20.22	1	0	3	1

Let’s explore several ways.

Way 1 - using `paste()` aund `pull()`

First, it makes sense to group the data, as our question refers to the car names of each cylinder group. Next, we can summarize the data using paste() to collate the names into one string:

d2 <- 
  d %>% 
  group_by(cyl) %>% 
  summarise(car_names_coll = paste(car_names, collapse = " "),
            hp_mean_per_cyl = mean(hp))

d2
#> # A tibble: 3 x 3
#>     cyl car_names_coll                                     hp_mean_per_cyl
#>   <dbl> <chr>                                                        <dbl>
#> 1     4 Datsun 710 Merc 240D Merc 230 Fiat 128 Honda Civi…            82.6
#> 2     6 Mazda RX4 Mazda RX4 Wag Hornet 4 Drive Valiant Me…           122. 
#> 3     8 Hornet Sportabout Duster 360 Merc 450SE Merc 450S…           209.

d2 %>% 
  pull(car_names_coll)
#> [1] "Datsun 710 Merc 240D Merc 230 Fiat 128 Honda Civic Toyota Corolla Toyota Corona Fiat X1-9 Porsche 914-2 Lotus Europa Volvo 142E"                                                                              
#> [2] "Mazda RX4 Mazda RX4 Wag Hornet 4 Drive Valiant Merc 280 Merc 280C Ferrari Dino"                                                                                                                               
#> [3] "Hornet Sportabout Duster 360 Merc 450SE Merc 450SL Merc 450SLC Cadillac Fleetwood Lincoln Continental Chrysler Imperial Dodge Challenger AMC Javelin Camaro Z28 Pontiac Firebird Ford Pantera L Maserati Bora"

Way 2 - using `nest()`

Of interest, list columns provide a more data frame type answer. We “fold” or “nest” a vector, list, or data frame in one cell:

d %>% 
  group_by(cyl) %>% 
  nest(car_names)
#> # A tibble: 3 x 2
#>     cyl data             
#>   <dbl> <list>           
#> 1     6 <tibble [7 × 1]> 
#> 2     4 <tibble [11 × 1]>
#> 3     8 <tibble [14 × 1]>

How to access it?

d_nest <- d %>% 
  group_by(cyl) %>% 
  nest(car_names)


d_nest$data[[1]]
#> # A tibble: 7 x 1
#>   car_names     
#>   <chr>         
#> 1 Mazda RX4     
#> 2 Mazda RX4 Wag 
#> 3 Hornet 4 Drive
#> 4 Valiant       
#> 5 Merc 280      
#> 6 Merc 280C     
#> 7 Ferrari Dino
d_nest[[1, "data"]]
#> # A tibble: 7 x 1
#>   car_names     
#>   <chr>         
#> 1 Mazda RX4     
#> 2 Mazda RX4 Wag 
#> 3 Hornet 4 Drive
#> 4 Valiant       
#> 5 Merc 280      
#> 6 Merc 280C     
#> 7 Ferrari Dino

How to unnest?

Simple:

d_nest %>% 
  unnest()
#> # A tibble: 32 x 2
#>      cyl car_names     
#>    <dbl> <chr>         
#>  1     6 Mazda RX4     
#>  2     6 Mazda RX4 Wag 
#>  3     6 Hornet 4 Drive
#>  4     6 Valiant       
#>  5     6 Merc 280      
#>  6     6 Merc 280C     
#>  7     6 Ferrari Dino  
#>  8     4 Datsun 710    
#>  9     4 Merc 240D     
#> 10     4 Merc 230      
#> # ... with 22 more rows

Difference between `[` and `[[`

By the way, that’s the difference between [ and [[? Or is there none?

identical(d_nest$data[[1]] , d_nest$data[1])
#> [1] FALSE

There is. See the difference between the classes:

d_nest$data[[1]] %>% class()
#> [1] "tbl_df"     "tbl"        "data.frame"
d_nest$data[1] %>% class()
#> [1] "list"

[[ extracts the actual object, a data frame.

[ extracts a list (containing the actual object).

Processing list columns

How to process list column data further?

Say we would like to know if there ist least one Mercedes in each group:

d_nest %>% 
  mutate(mercs_lgl = str_detect(data, "Merc"))
#> # A tibble: 3 x 3
#>     cyl data              mercs_lgl
#>   <dbl> <list>            <lgl>    
#> 1     6 <tibble [7 × 1]>  TRUE     
#> 2     4 <tibble [11 × 1]> TRUE     
#> 3     8 <tibble [14 × 1]> TRUE

We get a warning because data is not vector, but a data frame (with one column, ie., the car names).

And how many Mercs are there in each group?

To start, consider this way of counting instances of Mercs:

str_detect(d_nest$data[[1]]$car_names, "Merc") %>% sum()
#> [1] 2

Next, we build that into our dplyr pipeline:

d_nest %>% 
  mutate(names_list = map(data, "car_names")) %>% 
  mutate(mercs_n = map(names_list, ~{str_detect(., pattern = "Merc") %>% sum()})) %>% 
  unnest(mercs_n)
#> # A tibble: 3 x 4
#>     cyl data              names_list mercs_n
#>   <dbl> <list>            <list>       <int>
#> 1     6 <tibble [7 × 1]>  <chr [7]>        2
#> 2     4 <tibble [11 × 1]> <chr [11]>       2
#> 3     8 <tibble [14 × 1]> <chr [14]>       3

Phish, that looks frightening. Let’s go throught it:

First, we pull out the names of the cars, because str_detect() is not happy to work with a data frame as input. Now we have the column names_list which is of type character
Second, we map str_detect to each row of this new column names_list. This fuctions looks for Merc instances. Wait, we want to sum up the Marc instances, that’s way we also use sum(). The curly braces make sure that only the last evaluated expression is handed back.
Third, we have to unnest the mercs_n column, because it is still in nested format even it consists of one value only.

Way 3 - like using `nest` but without a special idiom

Again, we fall back to the classical dplyr way of summarising data to a single (scalar) value:

d3 <-
  d %>% 
  group_by(cyl) %>% 
  summarise(names_per_cyl = list(car_names),
            hp_mean_per_cyl = mean(hp))

d3$names_per_cyl[[1]]
#>  [1] "Datsun 710"     "Merc 240D"      "Merc 230"       "Fiat 128"      
#>  [5] "Honda Civic"    "Toyota Corolla" "Toyota Corona"  "Fiat X1-9"     
#>  [9] "Porsche 914-2"  "Lotus Europa"   "Volvo 142E"

Fold ’em all in a list data frame

Let’s take the nesting into list data frames to its extreme:

d %>% 
  group_by(cyl) %>% 
  summarise_all("list")
#> # A tibble: 3 x 12
#>     cyl car_names mpg   disp  hp    drat  wt    qsec  vs    am    gear 
#>   <dbl> <list>    <lis> <lis> <lis> <lis> <lis> <lis> <lis> <lis> <lis>
#> 1     4 <chr [11… <dbl… <dbl… <dbl… <dbl… <dbl… <dbl… <dbl… <dbl… <dbl…
#> 2     6 <chr [7]> <dbl… <dbl… <dbl… <dbl… <dbl… <dbl… <dbl… <dbl… <dbl…
#> 3     8 <chr [14… <dbl… <dbl… <dbl… <dbl… <dbl… <dbl… <dbl… <dbl… <dbl…
#> # ... with 1 more variable: carb <list>

Way 4 - using `map()`:

d %>% 
  split(.$cyl) %>% 
  map("car_names")
#> $`4`
#>  [1] "Datsun 710"     "Merc 240D"      "Merc 230"       "Fiat 128"      
#>  [5] "Honda Civic"    "Toyota Corolla" "Toyota Corona"  "Fiat X1-9"     
#>  [9] "Porsche 914-2"  "Lotus Europa"   "Volvo 142E"    
#> 
#> $`6`
#> [1] "Mazda RX4"      "Mazda RX4 Wag"  "Hornet 4 Drive" "Valiant"       
#> [5] "Merc 280"       "Merc 280C"      "Ferrari Dino"  
#> 
#> $`8`
#>  [1] "Hornet Sportabout"   "Duster 360"          "Merc 450SE"         
#>  [4] "Merc 450SL"          "Merc 450SLC"         "Cadillac Fleetwood" 
#>  [7] "Lincoln Continental" "Chrysler Imperial"   "Dodge Challenger"   
#> [10] "AMC Javelin"         "Camaro Z28"          "Pontiac Firebird"   
#> [13] "Ford Pantera L"      "Maserati Bora"

Debrief

summarise() in dplyr, summarizes a (column) vector to a scalar (single value). This is often handy, but sometimes limiting, as you are only allowed to apply functions that return a scalar. For more complex functions, such as lm() different approaches need be chosen. One way is to work with list columns as they provide a neat way to plug stuff into one cell of a data frame. More flexible approaches can be built upon the family of apply or map() and alike.

What are the names of the cars with 4 cylinders?

Way 1 - using paste() aund pull()

Way 2 - using nest()

Difference between [ and [[