3 min read

Collapse rows to eliminate NAs

Load packages

library(tidyverse)

Starters

Assume you have this data frame:

x <- tribble(
  ~ colA, ~colB, ~colC,
  NA, 1, NA,
  1, NA, 1
)

x
#> # A tibble: 2 x 3
#>    colA  colB  colC
#>   <dbl> <dbl> <dbl>
#> 1    NA     1    NA
#> 2     1    NA     1

But you want this one:

y <- tribble(
  ~ colA, ~colB, ~colC,
  1, 1, 1
)
y
#> # A tibble: 1 x 3
#>    colA  colB  colC
#>   <dbl> <dbl> <dbl>
#> 1     1     1     1

That is, you’d like to collapse rows so that if there’s a NA in a column it is replaced by the value found in some other line. We assume there’s only one value to be found, so no need to worry to decide which value to take.

A simple, tidy way is to summarise each column using sum, where NAs are ignored. This amounts to nothing more than to collapse the rows, eliminating the NAs (assuming there’s only one value, so there’s nothing really to sum up).

x %>% 
  summarise_all(list(~ sum(., na.rm = TRUE)))
#> # A tibble: 1 x 3
#>    colA  colB  colC
#>   <dbl> <dbl> <dbl>
#> 1     1     1     1

More realistic example

A typical use case would be that you have multiple rows per (say) observation with a lot of NAs, and you’d like to collapse rows per observation. Again, we assume that there will be only one value per observation per column (and the rest of elements are NAs).

z <- tribble(
  ~id, ~ colA, ~colB, ~colC,
  1, NA, 1, NA,
  1, 1, NA, 1,
  2, NA, NA, 2,
  2, NA, 2, NA
)

z
#> # A tibble: 4 x 4
#>      id  colA  colB  colC
#>   <dbl> <dbl> <dbl> <dbl>
#> 1     1    NA     1    NA
#> 2     1     1    NA     1
#> 3     2    NA    NA     2
#> 4     2    NA     2    NA

The Lego structure of the Tidyverse helps: All we need to add is water a grouping command, ie., group_by().

z %>% 
  group_by(id) %>% 
  summarise_all(list(~ sum(., na.rm = T)))
#> # A tibble: 2 x 4
#>      id  colA  colB  colC
#>   <dbl> <dbl> <dbl> <dbl>
#> 1     1     1     1     1
#> 2     2     0     2     2

See what happened? For colA for id 2, there where only NAs. This fact has been documented by the zero value.

Here’s a way to get a NA for that case:

z %>% 
  group_by(id) %>% 
  summarise_all(list(~ sum(., na.rm = T))) %>% 
  transmute_all(list(~ na_if(., 0)))
#> # A tibble: 2 x 4
#>      id  colA  colB  colC
#>   <dbl> <dbl> <dbl> <dbl>
#> 1     1     1     1     1
#> 2     2    NA     2     2

The Purrr(er) way

As each column is a list, we can make use of list oriented functions from purrr:

x %>% 
  summarise_all(list(~ discard(., is.na)))
#> # A tibble: 1 x 3
#>    colA  colB  colC
#>   <dbl> <dbl> <dbl>
#> 1     1     1     1
x2 <- tribble(
  ~ colA, ~colB, ~colC,
  NA, "A", NA,
  "A", NA, "A"
)

x
#> # A tibble: 2 x 3
#>    colA  colB  colC
#>   <dbl> <dbl> <dbl>
#> 1    NA     1    NA
#> 2     1    NA     1

The beauty is that discard is more general than sum. discard only discards some elements, and keeps others. sum with na.rm=T is a special case where the elements are summed up and NA elements are dropped.

x2 %>% 
  summarise_all(list(~ discard(., is.na)))
#> # A tibble: 1 x 3
#>   colA  colB  colC 
#>   <chr> <chr> <chr>
#> 1 A     A     A

Of course, grouping operations are possible too.