Load packages
library(tidyverse)
Starters
Assume you have this data frame:
x <- tribble(
~ colA, ~colB, ~colC,
NA, 1, NA,
1, NA, 1
)
x
#> # A tibble: 2 x 3
#> colA colB colC
#> <dbl> <dbl> <dbl>
#> 1 NA 1 NA
#> 2 1 NA 1
But you want this one:
y <- tribble(
~ colA, ~colB, ~colC,
1, 1, 1
)
y
#> # A tibble: 1 x 3
#> colA colB colC
#> <dbl> <dbl> <dbl>
#> 1 1 1 1
That is, you’d like to collapse rows so that if there’s a NA in a column it is replaced by the value found in some other line. We assume there’s only one value to be found, so no need to worry to decide which value to take.
A simple, tidy way is to summarise each column using sum
, where NAs are ignored. This amounts to nothing more than to collapse the rows, eliminating the NAs (assuming there’s only one value, so there’s nothing really to sum up).
x %>%
summarise_all(list(~ sum(., na.rm = TRUE)))
#> # A tibble: 1 x 3
#> colA colB colC
#> <dbl> <dbl> <dbl>
#> 1 1 1 1
More realistic example
A typical use case would be that you have multiple rows per (say) observation with a lot of NAs, and you’d like to collapse rows per observation. Again, we assume that there will be only one value per observation per column (and the rest of elements are NAs).
z <- tribble(
~id, ~ colA, ~colB, ~colC,
1, NA, 1, NA,
1, 1, NA, 1,
2, NA, NA, 2,
2, NA, 2, NA
)
z
#> # A tibble: 4 x 4
#> id colA colB colC
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 NA 1 NA
#> 2 1 1 NA 1
#> 3 2 NA NA 2
#> 4 2 NA 2 NA
The Lego structure of the Tidyverse helps: All we need to add is water a grouping command, ie., group_by()
.
z %>%
group_by(id) %>%
summarise_all(list(~ sum(., na.rm = T)))
#> # A tibble: 2 x 4
#> id colA colB colC
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 1 1
#> 2 2 0 2 2
See what happened? For colA
for id
2, there where only NAs. This fact has been documented by the zero value.
Here’s a way to get a NA for that case:
z %>%
group_by(id) %>%
summarise_all(list(~ sum(., na.rm = T))) %>%
transmute_all(list(~ na_if(., 0)))
#> # A tibble: 2 x 4
#> id colA colB colC
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 1 1
#> 2 2 NA 2 2
The Purrr(er) way
As each column is a list, we can make use of list oriented functions from purrr
:
x %>%
summarise_all(list(~ discard(., is.na)))
#> # A tibble: 1 x 3
#> colA colB colC
#> <dbl> <dbl> <dbl>
#> 1 1 1 1
x2 <- tribble(
~ colA, ~colB, ~colC,
NA, "A", NA,
"A", NA, "A"
)
x
#> # A tibble: 2 x 3
#> colA colB colC
#> <dbl> <dbl> <dbl>
#> 1 NA 1 NA
#> 2 1 NA 1
The beauty is that discard
is more general than sum
. discard
only discards some elements, and keeps others. sum
with na.rm=T
is a special case where the elements are summed up and NA elements are dropped.
x2 %>%
summarise_all(list(~ discard(., is.na)))
#> # A tibble: 1 x 3
#> colA colB colC
#> <chr> <chr> <chr>
#> 1 A A A
Of course, grouping operations are possible too.