Recently, some one asked me in a workshop this question: “What are the names of the cars with 4 (6,8) cylinders?” (he referred to the mtcars
data set). That was a workshop on the tidyverse, so the question is how to answer this question using tidyverse techniques.
First, let’s load the usual culprits.
library(tidyverse)
library(purrrlyr)
library(knitr)
library(stringr)
data(mtcars)
d <- as_tibble(mtcars) %>%
rownames_to_column(var = "car_names")
d %>%
head() %>%
kable()
car_names | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
---|---|---|---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
Let’s explore several ways.
Way 1 - using paste()
aund pull()
First, it makes sense to group the data, as our question refers to the car names of each cylinder group. Next, we can summarize the data using paste()
to collate the names into one string:
d2 <-
d %>%
group_by(cyl) %>%
summarise(car_names_coll = paste(car_names, collapse = " "),
hp_mean_per_cyl = mean(hp))
d2
#> # A tibble: 3 x 3
#> cyl car_names_coll hp_mean_per_cyl
#> <dbl> <chr> <dbl>
#> 1 4 Datsun 710 Merc 240D Merc 230 Fiat 128 Honda Civi… 82.6
#> 2 6 Mazda RX4 Mazda RX4 Wag Hornet 4 Drive Valiant Me… 122.
#> 3 8 Hornet Sportabout Duster 360 Merc 450SE Merc 450S… 209.
d2 %>%
pull(car_names_coll)
#> [1] "Datsun 710 Merc 240D Merc 230 Fiat 128 Honda Civic Toyota Corolla Toyota Corona Fiat X1-9 Porsche 914-2 Lotus Europa Volvo 142E"
#> [2] "Mazda RX4 Mazda RX4 Wag Hornet 4 Drive Valiant Merc 280 Merc 280C Ferrari Dino"
#> [3] "Hornet Sportabout Duster 360 Merc 450SE Merc 450SL Merc 450SLC Cadillac Fleetwood Lincoln Continental Chrysler Imperial Dodge Challenger AMC Javelin Camaro Z28 Pontiac Firebird Ford Pantera L Maserati Bora"
Way 2 - using nest()
Of interest, list columns provide a more data frame type answer. We “fold” or “nest” a vector, list, or data frame in one cell:
d %>%
group_by(cyl) %>%
nest(car_names)
#> # A tibble: 3 x 2
#> cyl data
#> <dbl> <list>
#> 1 6 <tibble [7 × 1]>
#> 2 4 <tibble [11 × 1]>
#> 3 8 <tibble [14 × 1]>
How to access it?
d_nest <- d %>%
group_by(cyl) %>%
nest(car_names)
d_nest$data[[1]]
#> # A tibble: 7 x 1
#> car_names
#> <chr>
#> 1 Mazda RX4
#> 2 Mazda RX4 Wag
#> 3 Hornet 4 Drive
#> 4 Valiant
#> 5 Merc 280
#> 6 Merc 280C
#> 7 Ferrari Dino
d_nest[[1, "data"]]
#> # A tibble: 7 x 1
#> car_names
#> <chr>
#> 1 Mazda RX4
#> 2 Mazda RX4 Wag
#> 3 Hornet 4 Drive
#> 4 Valiant
#> 5 Merc 280
#> 6 Merc 280C
#> 7 Ferrari Dino
How to unnest
?
Simple:
d_nest %>%
unnest()
#> # A tibble: 32 x 2
#> cyl car_names
#> <dbl> <chr>
#> 1 6 Mazda RX4
#> 2 6 Mazda RX4 Wag
#> 3 6 Hornet 4 Drive
#> 4 6 Valiant
#> 5 6 Merc 280
#> 6 6 Merc 280C
#> 7 6 Ferrari Dino
#> 8 4 Datsun 710
#> 9 4 Merc 240D
#> 10 4 Merc 230
#> # ... with 22 more rows
Difference between [
and [[
By the way, that’s the difference between [
and [[
? Or is there none?
identical(d_nest$data[[1]] , d_nest$data[1])
#> [1] FALSE
There is. See the difference between the class
es:
d_nest$data[[1]] %>% class()
#> [1] "tbl_df" "tbl" "data.frame"
d_nest$data[1] %>% class()
#> [1] "list"
[[
extracts the actual object, a data frame.
[
extracts a list (containing the actual object).
Processing list columns
How to process list column data further?
Say we would like to know if there ist least one Mercedes in each group:
d_nest %>%
mutate(mercs_lgl = str_detect(data, "Merc"))
#> # A tibble: 3 x 3
#> cyl data mercs_lgl
#> <dbl> <list> <lgl>
#> 1 6 <tibble [7 × 1]> TRUE
#> 2 4 <tibble [11 × 1]> TRUE
#> 3 8 <tibble [14 × 1]> TRUE
We get a warning because data
is not vector, but a data frame (with one column, ie., the car names).
And how many Mercs are there in each group?
To start, consider this way of counting instances of Mercs:
str_detect(d_nest$data[[1]]$car_names, "Merc") %>% sum()
#> [1] 2
Next, we build that into our dplyr pipeline:
d_nest %>%
mutate(names_list = map(data, "car_names")) %>%
mutate(mercs_n = map(names_list, ~{str_detect(., pattern = "Merc") %>% sum()})) %>%
unnest(mercs_n)
#> # A tibble: 3 x 4
#> cyl data names_list mercs_n
#> <dbl> <list> <list> <int>
#> 1 6 <tibble [7 × 1]> <chr [7]> 2
#> 2 4 <tibble [11 × 1]> <chr [11]> 2
#> 3 8 <tibble [14 × 1]> <chr [14]> 3
Phish, that looks frightening. Let’s go throught it:
- First, we pull out the names of the cars, because
str_detect()
is not happy to work with a data frame as input. Now we have the columnnames_list
which is of type character - Second, we map
str_detect
to each row of this new columnnames_list
. This fuctions looks for Merc instances. Wait, we want to sum up the Marc instances, that’s way we also usesum()
. The curly braces make sure that only the last evaluated expression is handed back. - Third, we have to unnest the
mercs_n
column, because it is still in nested format even it consists of one value only.
Way 3 - like using nest
but without a special idiom
Again, we fall back to the classical dplyr way of summarising data to a single (scalar) value:
d3 <-
d %>%
group_by(cyl) %>%
summarise(names_per_cyl = list(car_names),
hp_mean_per_cyl = mean(hp))
d3$names_per_cyl[[1]]
#> [1] "Datsun 710" "Merc 240D" "Merc 230" "Fiat 128"
#> [5] "Honda Civic" "Toyota Corolla" "Toyota Corona" "Fiat X1-9"
#> [9] "Porsche 914-2" "Lotus Europa" "Volvo 142E"
Fold ’em all in a list data frame
Let’s take the nesting into list data frames to its extreme:
d %>%
group_by(cyl) %>%
summarise_all("list")
#> # A tibble: 3 x 12
#> cyl car_names mpg disp hp drat wt qsec vs am gear
#> <dbl> <list> <lis> <lis> <lis> <lis> <lis> <lis> <lis> <lis> <lis>
#> 1 4 <chr [11… <dbl… <dbl… <dbl… <dbl… <dbl… <dbl… <dbl… <dbl… <dbl…
#> 2 6 <chr [7]> <dbl… <dbl… <dbl… <dbl… <dbl… <dbl… <dbl… <dbl… <dbl…
#> 3 8 <chr [14… <dbl… <dbl… <dbl… <dbl… <dbl… <dbl… <dbl… <dbl… <dbl…
#> # ... with 1 more variable: carb <list>
Way 4 - using map()
:
d %>%
split(.$cyl) %>%
map("car_names")
#> $`4`
#> [1] "Datsun 710" "Merc 240D" "Merc 230" "Fiat 128"
#> [5] "Honda Civic" "Toyota Corolla" "Toyota Corona" "Fiat X1-9"
#> [9] "Porsche 914-2" "Lotus Europa" "Volvo 142E"
#>
#> $`6`
#> [1] "Mazda RX4" "Mazda RX4 Wag" "Hornet 4 Drive" "Valiant"
#> [5] "Merc 280" "Merc 280C" "Ferrari Dino"
#>
#> $`8`
#> [1] "Hornet Sportabout" "Duster 360" "Merc 450SE"
#> [4] "Merc 450SL" "Merc 450SLC" "Cadillac Fleetwood"
#> [7] "Lincoln Continental" "Chrysler Imperial" "Dodge Challenger"
#> [10] "AMC Javelin" "Camaro Z28" "Pontiac Firebird"
#> [13] "Ford Pantera L" "Maserati Bora"
Debrief
summarise()
in dplyr, summarizes a (column) vector to a scalar (single value). This is often handy, but sometimes limiting, as you are only allowed to apply functions that return a scalar. For more complex functions, such as lm()
different approaches need be chosen. One way is to work with list columns as they provide a neat way to plug stuff into one cell of a data frame. More flexible approaches can be built upon the family of apply
or map()
and alike.