What are the names of the cars with 4 cylinders?

Recently, some one asked me in a workshop this question: “What are the names of the cars with 4 (6,8) cylinders?” (he referred to the mtcars data set). That was a workshop on the tidyverse, so the question is how to answer this question using tidyverse techniques.

First, let’s load the usual culprits.

library(tidyverse)
library(purrrlyr)
library(knitr)
library(stringr)
data(mtcars)

d <- as_tibble(mtcars) %>% 
  rownames_to_column(var = "car_names")

d %>% 
  head() %>% 
  kable()
car_names mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1

Let’s explore several ways.

Way 1 - using paste() aund pull()

First, it makes sense to group the data, as our question refers to the car names of each cylinder group. Next, we can summarize the data using paste() to collate the names into one string:

d2 <- 
  d %>% 
  group_by(cyl) %>% 
  summarise(car_names_coll = paste(car_names, collapse = " "),
            hp_mean_per_cyl = mean(hp))

d2
#> # A tibble: 3 x 3
#>     cyl car_names_coll                                     hp_mean_per_cyl
#>   <dbl> <chr>                                                        <dbl>
#> 1     4 Datsun 710 Merc 240D Merc 230 Fiat 128 Honda Civi…            82.6
#> 2     6 Mazda RX4 Mazda RX4 Wag Hornet 4 Drive Valiant Me…           122. 
#> 3     8 Hornet Sportabout Duster 360 Merc 450SE Merc 450S…           209.

d2 %>% 
  pull(car_names_coll)
#> [1] "Datsun 710 Merc 240D Merc 230 Fiat 128 Honda Civic Toyota Corolla Toyota Corona Fiat X1-9 Porsche 914-2 Lotus Europa Volvo 142E"                                                                              
#> [2] "Mazda RX4 Mazda RX4 Wag Hornet 4 Drive Valiant Merc 280 Merc 280C Ferrari Dino"                                                                                                                               
#> [3] "Hornet Sportabout Duster 360 Merc 450SE Merc 450SL Merc 450SLC Cadillac Fleetwood Lincoln Continental Chrysler Imperial Dodge Challenger AMC Javelin Camaro Z28 Pontiac Firebird Ford Pantera L Maserati Bora"

Way 2 - using nest()

Of interest, list columns provide a more data frame type answer. We “fold” or “nest” a vector, list, or data frame in one cell:

d %>% 
  group_by(cyl) %>% 
  nest(car_names)
#> # A tibble: 3 x 2
#>     cyl data             
#>   <dbl> <list>           
#> 1     6 <tibble [7 × 1]> 
#> 2     4 <tibble [11 × 1]>
#> 3     8 <tibble [14 × 1]>

How to access it?

d_nest <- d %>% 
  group_by(cyl) %>% 
  nest(car_names)


d_nest$data[[1]]
#> # A tibble: 7 x 1
#>   car_names     
#>   <chr>         
#> 1 Mazda RX4     
#> 2 Mazda RX4 Wag 
#> 3 Hornet 4 Drive
#> 4 Valiant       
#> 5 Merc 280      
#> 6 Merc 280C     
#> 7 Ferrari Dino
d_nest[[1, "data"]]
#> # A tibble: 7 x 1
#>   car_names     
#>   <chr>         
#> 1 Mazda RX4     
#> 2 Mazda RX4 Wag 
#> 3 Hornet 4 Drive
#> 4 Valiant       
#> 5 Merc 280      
#> 6 Merc 280C     
#> 7 Ferrari Dino

How to unnest?

Simple:

d_nest %>% 
  unnest()
#> # A tibble: 32 x 2
#>      cyl car_names     
#>    <dbl> <chr>         
#>  1     6 Mazda RX4     
#>  2     6 Mazda RX4 Wag 
#>  3     6 Hornet 4 Drive
#>  4     6 Valiant       
#>  5     6 Merc 280      
#>  6     6 Merc 280C     
#>  7     6 Ferrari Dino  
#>  8     4 Datsun 710    
#>  9     4 Merc 240D     
#> 10     4 Merc 230      
#> # ... with 22 more rows

Difference between [ and [[

By the way, that’s the difference between [ and [[? Or is there none?

identical(d_nest$data[[1]] , d_nest$data[1])
#> [1] FALSE

There is. See the difference between the classes:

d_nest$data[[1]] %>% class()
#> [1] "tbl_df"     "tbl"        "data.frame"
d_nest$data[1] %>% class()
#> [1] "list"

[[ extracts the actual object, a data frame.

[ extracts a list (containing the actual object).

Processing list columns

How to process list column data further?

Say we would like to know if there ist least one Mercedes in each group:

d_nest %>% 
  mutate(mercs_lgl = str_detect(data, "Merc"))
#> # A tibble: 3 x 3
#>     cyl data              mercs_lgl
#>   <dbl> <list>            <lgl>    
#> 1     6 <tibble [7 × 1]>  TRUE     
#> 2     4 <tibble [11 × 1]> TRUE     
#> 3     8 <tibble [14 × 1]> TRUE

We get a warning because data is not vector, but a data frame (with one column, ie., the car names).

And how many Mercs are there in each group?

To start, consider this way of counting instances of Mercs:

str_detect(d_nest$data[[1]]$car_names, "Merc") %>% sum()
#> [1] 2

Next, we build that into our dplyr pipeline:

d_nest %>% 
  mutate(names_list = map(data, "car_names")) %>% 
  mutate(mercs_n = map(names_list, ~{str_detect(., pattern = "Merc") %>% sum()})) %>% 
  unnest(mercs_n)
#> # A tibble: 3 x 4
#>     cyl data              names_list mercs_n
#>   <dbl> <list>            <list>       <int>
#> 1     6 <tibble [7 × 1]>  <chr [7]>        2
#> 2     4 <tibble [11 × 1]> <chr [11]>       2
#> 3     8 <tibble [14 × 1]> <chr [14]>       3

Phish, that looks frightening. Let’s go throught it:

  • First, we pull out the names of the cars, because str_detect() is not happy to work with a data frame as input. Now we have the column names_list which is of type character
  • Second, we map str_detect to each row of this new column names_list. This fuctions looks for Merc instances. Wait, we want to sum up the Marc instances, that’s way we also use sum(). The curly braces make sure that only the last evaluated expression is handed back.
  • Third, we have to unnest the mercs_n column, because it is still in nested format even it consists of one value only.

Way 3 - like using nest but without a special idiom

Again, we fall back to the classical dplyr way of summarising data to a single (scalar) value:

d3 <-
  d %>% 
  group_by(cyl) %>% 
  summarise(names_per_cyl = list(car_names),
            hp_mean_per_cyl = mean(hp))

d3$names_per_cyl[[1]]
#>  [1] "Datsun 710"     "Merc 240D"      "Merc 230"       "Fiat 128"      
#>  [5] "Honda Civic"    "Toyota Corolla" "Toyota Corona"  "Fiat X1-9"     
#>  [9] "Porsche 914-2"  "Lotus Europa"   "Volvo 142E"

Fold ’em all in a list data frame

Let’s take the nesting into list data frames to its extreme:

d %>% 
  group_by(cyl) %>% 
  summarise_all("list")
#> # A tibble: 3 x 12
#>     cyl car_names mpg   disp  hp    drat  wt    qsec  vs    am    gear 
#>   <dbl> <list>    <lis> <lis> <lis> <lis> <lis> <lis> <lis> <lis> <lis>
#> 1     4 <chr [11… <dbl… <dbl… <dbl… <dbl… <dbl… <dbl… <dbl… <dbl… <dbl…
#> 2     6 <chr [7]> <dbl… <dbl… <dbl… <dbl… <dbl… <dbl… <dbl… <dbl… <dbl…
#> 3     8 <chr [14… <dbl… <dbl… <dbl… <dbl… <dbl… <dbl… <dbl… <dbl… <dbl…
#> # ... with 1 more variable: carb <list>

Way 4 - using map():

d %>% 
  split(.$cyl) %>% 
  map("car_names")
#> $`4`
#>  [1] "Datsun 710"     "Merc 240D"      "Merc 230"       "Fiat 128"      
#>  [5] "Honda Civic"    "Toyota Corolla" "Toyota Corona"  "Fiat X1-9"     
#>  [9] "Porsche 914-2"  "Lotus Europa"   "Volvo 142E"    
#> 
#> $`6`
#> [1] "Mazda RX4"      "Mazda RX4 Wag"  "Hornet 4 Drive" "Valiant"       
#> [5] "Merc 280"       "Merc 280C"      "Ferrari Dino"  
#> 
#> $`8`
#>  [1] "Hornet Sportabout"   "Duster 360"          "Merc 450SE"         
#>  [4] "Merc 450SL"          "Merc 450SLC"         "Cadillac Fleetwood" 
#>  [7] "Lincoln Continental" "Chrysler Imperial"   "Dodge Challenger"   
#> [10] "AMC Javelin"         "Camaro Z28"          "Pontiac Firebird"   
#> [13] "Ford Pantera L"      "Maserati Bora"

Debrief

summarise() in dplyr, summarizes a (column) vector to a scalar (single value). This is often handy, but sometimes limiting, as you are only allowed to apply functions that return a scalar. For more complex functions, such as lm() different approaches need be chosen. One way is to work with list columns as they provide a neat way to plug stuff into one cell of a data frame. More flexible approaches can be built upon the family of apply or map() and alike.