Suppose you want to check for missing data, not just in one column but in several columns at once.
First, data and some packages:
data(mtcars)
library(tidyverse)
Then, let’s introduce some missing data:
mtcars[c(1,2), 1] <- NA
mtcars[c(1, 3:4), 2] <- NA
Don’t check columns individually
Of course, you do not want to repeat yourself and check each column individually, like this:
sum(is.na(mtcars[[1]]))
#> [1] 2
sum(is.na(mtcars[, 1])) # same
#> [1] 2
Nor would you want to check each row individually:
sum(is.na(mtcars[1, ]))
#> [1] 2
Apply a function to each column
We need to apply() the function above to each column (or row). map() works similarly to apply() but has some niceties included (map() comes from the R package purrr).
mtcars %>%
map(~sum(is.na(.)))
#> $mpg
#> [1] 2
#>
#> $cyl
#> [1] 3
#>
#> $disp
#> [1] 0
#>
#> $hp
#> [1] 0
#>
#> $drat
#> [1] 0
#>
#> $wt
#> [1] 0
#>
#> $qsec
#> [1] 0
#>
#> $vs
#> [1] 0
#>
#> $am
#> [1] 0
#>
#> $gear
#> [1] 0
#>
#> $carb
#> [1] 0
Note that ~ is purrr's shorthand for an anonymous function such as function(x), just less verbose. The dot . refers to each element of mtcars, i.e. each column.
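As a small sketch of this shorthand, the two calls below should give the same result. It assumes the NA pattern introduced above and swaps in map_int() (also from purrr) for map(), so that the counts come back as a named integer vector rather than a list; the object names na_formula and na_function are made up for illustration.

```r
library(purrr)

data(mtcars)
mtcars[c(1, 2), 1] <- NA
mtcars[c(1, 3:4), 2] <- NA

# Formula shorthand: the dot stands for the current column
na_formula <- map_int(mtcars, ~ sum(is.na(.)))

# The same computation with an explicit anonymous function
na_function <- map_int(mtcars, function(x) sum(is.na(x)))

# Both versions return identical named integer vectors
identical(na_formula, na_function)

# Counts for the first three columns
na_formula[c("mpg", "cyl", "disp")]
```

A vector is often easier to work with than the list returned by map(), for instance when you want to sort or filter the counts.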
Count NAs per row
Now assume you want to know the number of missing values per case (i.e. row). One way is this:
mtcars %>%
mutate(NA_count = rowSums(is.na(.))) %>%
head()
#> mpg cyl disp hp drat wt qsec vs am gear carb NA_count
#> 1 NA NA 160 110 3.90 2.620 16.46 0 1 4 4 2
#> 2 NA 6 160 110 3.90 2.875 17.02 0 1 4 4 1
#> 3 22.8 NA 108 93 3.85 2.320 18.61 1 1 4 1 1
#> 4 21.4 NA 258 110 3.08 3.215 19.44 1 0 3 1 1
#> 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 0
#> 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 0
Here, the dot . refers to the data frame as of the last pipe step. In this case, it’s just the plain data frame mtcars.
Of course, the pipe is not needed:
rowSums(is.na(mtcars))
#> Mazda RX4 Mazda RX4 Wag Datsun 710
#> 2 1 1
#> Hornet 4 Drive Hornet Sportabout Valiant
#> 1 0 0
#> Duster 360 Merc 240D Merc 230
#> 0 0 0
#> Merc 280 Merc 280C Merc 450SE
#> 0 0 0
#> Merc 450SL Merc 450SLC Cadillac Fleetwood
#> 0 0 0
#> Lincoln Continental Chrysler Imperial Fiat 128
#> 0 0 0
#> Honda Civic Toyota Corolla Toyota Corona
#> 0 0 0
#> Dodge Challenger AMC Javelin Camaro Z28
#> 0 0 0
#> Pontiac Firebird Fiat X1-9 Porsche 914-2
#> 0 0 0
#> Lotus Europa Ford Pantera L Ferrari Dino
#> 0 0 0
#> Maserati Bora Volvo 142E
#> 0 0
A more classical R way would consist of the following:
- Apply a function over each row of the data frame
- This function would be sum(is.na(x)) in this case, where x refers to each row
apply(mtcars, MARGIN = 1, FUN = function(x) sum(is.na(x)))
#> Mazda RX4 Mazda RX4 Wag Datsun 710
#> 2 1 1
#> Hornet 4 Drive Hornet Sportabout Valiant
#> 1 0 0
#> Duster 360 Merc 240D Merc 230
#> 0 0 0
#> Merc 280 Merc 280C Merc 450SE
#> 0 0 0
#> Merc 450SL Merc 450SLC Cadillac Fleetwood
#> 0 0 0
#> Lincoln Continental Chrysler Imperial Fiat 128
#> 0 0 0
#> Honda Civic Toyota Corolla Toyota Corona
#> 0 0 0
#> Dodge Challenger AMC Javelin Camaro Z28
#> 0 0 0
#> Pontiac Firebird Fiat X1-9 Porsche 914-2
#> 0 0 0
#> Lotus Europa Ford Pantera L Ferrari Dino
#> 0 0 0
#> Maserati Bora Volvo 142E
#> 0 0
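The MARGIN argument decides which dimension apply() walks over: 1 for rows, 2 for columns. The following sketch, again assuming the NA pattern introduced at the top, compares both directions with rowSums() and colSums(); the names na_per_row and na_per_col are only illustrative.

```r
data(mtcars)
mtcars[c(1, 2), 1] <- NA
mtcars[c(1, 3:4), 2] <- NA

# MARGIN = 1 applies the function to each row, MARGIN = 2 to each column
na_per_row <- apply(mtcars, MARGIN = 1, FUN = function(x) sum(is.na(x)))
na_per_col <- apply(mtcars, MARGIN = 2, FUN = function(x) sum(is.na(x)))

# rowSums() and colSums() on the logical matrix give the same counts
all(na_per_row == rowSums(is.na(mtcars)))
all(na_per_col == colSums(is.na(mtcars)))
```

For a plain count, rowSums() and colSums() are usually the shorter choice; apply() becomes useful once the per-row function is something other than a sum.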
Count NAs of whole dataframe
Note that is.na() also accepts a data.frame as input; it then returns a logical matrix of the same dimensions:
is.na(mtcars) %>%
head()
#> mpg cyl disp hp drat wt qsec vs am
#> Mazda RX4 TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> Mazda RX4 Wag TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> Datsun 710 FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> Hornet 4 Drive FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> Hornet Sportabout FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> Valiant FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> gear carb
#> Mazda RX4 FALSE FALSE
#> Mazda RX4 Wag FALSE FALSE
#> Datsun 710 FALSE FALSE
#> Hornet 4 Drive FALSE FALSE
#> Hornet Sportabout FALSE FALSE
#> Valiant FALSE FALSE
Note that sum() accepts the resulting logical matrix as input and counts its TRUE values, which gives the total number of NAs:
is.na(mtcars) %>%
sum()
#> [1] 5
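Because TRUE counts as 1 and FALSE as 0, the same logical matrix can also give the share of missing cells via mean(). A minimal sketch, assuming the NA pattern from the top of the post; na_matrix, total_na and share_na are illustrative names:

```r
data(mtcars)
mtcars[c(1, 2), 1] <- NA
mtcars[c(1, 3:4), 2] <- NA

# is.na() on a data frame returns a logical matrix of the same shape
na_matrix <- is.na(mtcars)
is.matrix(na_matrix)
dim(na_matrix)

# Total number of missing cells
total_na <- sum(na_matrix)

# Proportion of missing cells: total_na divided by the number of cells
share_na <- mean(na_matrix)
share_na
```

With 5 missing values in 32 x 11 = 352 cells, the share here is 5/352, or roughly 1.4 percent.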
Some musings
To inspect missing values (or cases with NAs), filter() may be a solution:
mtcars %>%
filter(is.na(mpg) | is.na(cyl) | is.na(disp))
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 1 NA NA 160 110 3.90 2.620 16.46 0 1 4 4
#> 2 NA 6 160 110 3.90 2.875 17.02 0 1 4 4
#> 3 22.8 NA 108 93 3.85 2.320 18.61 1 1 4 1
#> 4 21.4 NA 258 110 3.08 3.215 19.44 1 0 3 1
More conveniently, complete.cases() flags the rows that contain no NA at all:
complete.cases(mtcars)
#> [1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [12] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [23] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
And the complement, i.e. the rows with at least one NA:
mtcars %>%
filter(!complete.cases(mtcars))
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 1 NA NA 160 110 3.90 2.620 16.46 0 1 4 4
#> 2 NA 6 160 110 3.90 2.875 17.02 0 1 4 4
#> 3 22.8 NA 108 93 3.85 2.320 18.61 1 1 4 1
#> 4 21.4 NA 258 110 3.08 3.215 19.44 1 0 3 1
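The same selection can be sketched in base R with logical indexing. This is one possible variant, not the only way: it assumes the NA pattern from the top of the post, and the object names (incomplete, mpg_missing, complete_only) are made up for illustration. Base indexing keeps the car names as row names.

```r
data(mtcars)
mtcars[c(1, 2), 1] <- NA
mtcars[c(1, 3:4), 2] <- NA

# Rows with at least one NA in any column
incomplete <- mtcars[!complete.cases(mtcars), ]
rownames(incomplete)

# Restrict the check to selected columns (here only mpg)
mpg_missing <- mtcars[!complete.cases(mtcars["mpg"]), ]
nrow(mpg_missing)

# Drop all incomplete rows
complete_only <- na.omit(mtcars)
nrow(complete_only)
```

Passing a subset of columns to complete.cases() avoids chaining several is.na() conditions with | when only a few variables matter.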