# Prevent dropping of non-occurring levels using dplyr

``library(tidyverse)  # data wrangling``

# 2 Problem

Consider the following situation:

``````mtcars |>
group_by(high_hp = hp > 1000) |>
count(high_hp)
#> # A tibble: 1 × 2
#> # Groups:   high_hp [1]
#>   high_hp     n
#>   <lgl>   <int>
#> 1 FALSE      32``````

The summary table does not show the level `TRUE` because it does not occur in the data. This can be problematic if the data are unknown before summarizing and you expect all levels (`TRUE`, `FALSE`) to occur. Imagine that a subsequent function reads the count for `TRUE` and the count for `FALSE`: if one level is missing, your pipeline may break down.
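To make the risk concrete, here is a minimal sketch of a downstream step that silently goes wrong when a level is missing. The names `counts`, `n_true`, `n_false`, and `share_high` are made up for illustration:

```r
library(dplyr)

# Count cars by whether hp exceeds 1000 (no car in mtcars does)
counts <- mtcars |>
  count(high_hp = hp > 1000)

# A later step that assumes both levels are present
n_true  <- counts$n[counts$high_hp]   # zero-length vector, not 0
n_false <- counts$n[!counts$high_hp]  # the count for FALSE

# The share of high-hp cars is also zero-length instead of 0
share_high <- n_true / (n_true + n_false)
length(share_high)
```

A zero-length value like `share_high` tends to cause trouble later, for instance when it is used as the condition of an `if` statement.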

# 3 Solution

With dplyr's `count` or `summarise`, one solution is to turn the grouping variable into a factor with all expected levels and then set `.drop = FALSE`, either in `count` directly or in the `group_by` call that precedes `summarise`.

Here’s an example for `count`:

``````mtcars |>
group_by(high_hp = hp > 1000) |>
mutate(high_hp = factor(high_hp, levels = c(FALSE, TRUE))) |>
count(high_hp, .drop = FALSE)
#> # A tibble: 2 × 2
#> # Groups:   high_hp [2]
#>   high_hp     n
#>   <fct>   <int>
#> 1 FALSE      32
#> 2 TRUE        0``````
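As a variation, the same counts can probably be obtained without `group_by`, since `count` accepts `.drop` on ungrouped data as well. This is a sketch, assuming a reasonably recent dplyr; the object name `counts` is chosen only for illustration:

```r
library(dplyr)

counts <- mtcars |>
  # Build the factor with both levels stated explicitly
  mutate(high_hp = factor(hp > 1000, levels = c(FALSE, TRUE))) |>
  # Keep levels that do not occur in the data
  count(high_hp, .drop = FALSE)

counts
```

The result is ungrouped, which can be convenient if further steps should operate on the whole table.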

And here is a similar one for `summarise`, where `.drop = FALSE` is passed to `group_by`:

``````mtcars %>%
group_by(hp_over_1000 = factor(hp > 1000, levels = c(FALSE, TRUE)), .drop = FALSE) %>%
summarise(mean_hp = mean(hp))
#> # A tibble: 2 × 2
#>   hp_over_1000 mean_hp
#>   <fct>          <dbl>
#> 1 FALSE           147.
#> 2 TRUE            NaN``````

The mean of an empty group is `NaN`. If we would rather have `NA` for the empty level, we can use `complete` with its `fill` argument:

``````mtcars %>%
group_by(hp_over_1000 = factor(hp > 1000, levels = c(FALSE, TRUE)), .drop = FALSE) %>%
summarise(mean_hp = mean(hp)) %>%
complete(hp_over_1000, fill = list(mean_hp = NA))
#> # A tibble: 2 × 2
#>   hp_over_1000 mean_hp
#>   <fct>          <dbl>
#> 1 FALSE           147.
#> 2 TRUE             NA``````
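Another option, shown here only as a sketch, is to avoid the `NaN` at the source by checking the group size inside `summarise`. The guard `if (n() > 0) ... else NA_real_` is an illustrative choice, not the only way to do this:

```r
library(dplyr)

res <- mtcars |>
  group_by(hp_over_1000 = factor(hp > 1000, levels = c(FALSE, TRUE)),
           .drop = FALSE) |>
  # Return NA for empty groups instead of letting mean() yield NaN
  summarise(mean_hp = if (n() > 0) mean(hp) else NA_real_)

res
```

This keeps everything within dplyr, whereas `complete` comes from tidyr; which variant reads better is largely a matter of taste.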

# 4 Reproducibility

