Recoding means changing the levels of a variable, for instance changing “1” to “woman” and “2” to “man”. Binning means aggregating several variable levels into one, for instance aggregating the values from “1.00 meter” to “1.60 meter” into “small_size”.
Both operations are frequently necessary in practical data analysis. In this post, we review some methods to accomplish these two tasks.
Let’s load some example data:
data(tips, package = "reshape2")
Some packages:
library(dplyr)   # %>%, mutate(), case_when(), glimpse()
library(mosaic)  # xchisq.test()
One nice way is the function case_when() from the tidyverse (it lives in the dplyr package). Consider this example:
tips$tip_gruppe <- case_when(
  tips$tip < 2 ~ "scrooge",
  tips$tip < 4 ~ "ok",
  tips$tip < 8 ~ "generous",
  TRUE ~ "in love"
)
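The order of the conditions matters here: case_when() checks them from top to bottom and uses the first one that is TRUE, which is why `tip < 4` effectively means "at least 2 but below 4". A minimal sketch with a made-up vector (the name x and its values are illustrative assumptions, not part of the tips data):

```r
library(dplyr)

# Hypothetical tip amounts, chosen so that each branch is hit once
x <- c(1.5, 3, 5, 9)

# Conditions are tried top to bottom; the first TRUE one wins
in_order <- case_when(
  x < 2 ~ "scrooge",
  x < 4 ~ "ok",
  x < 8 ~ "generous",
  TRUE  ~ "in love"
)

# With the widest condition first, the narrower ones are never reached
wrong_order <- case_when(
  x < 8 ~ "generous",
  x < 4 ~ "ok",
  x < 2 ~ "scrooge",
  TRUE  ~ "in love"
)

print(in_order)
print(wrong_order)
```

So when binning a numeric variable this way, it seems safest to list the cut points in ascending order.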
case_when() is also pipe-friendly, so it works inside mutate():
tips <- tips %>%
  mutate(tip_gruppe = case_when(
    tip < 2 ~ "scrooge",
    tip < 4 ~ "ok",
    tip < 8 ~ "generous",
    TRUE ~ "in love"
  ))
One subsequent step could be to use the new variable in a \(\chi^2\) test:
xchisq.test(tip_gruppe ~ sex, data = tips)
#>
#> Pearson's Chi-squared test
#>
#> data: tally(x, data = data)
#> X-squared = 1.7171, df = 3, p-value = 0.6331
#>
#> 16 35
#> (18.18) (32.82)
#> [0.262] [0.145]
#> <-0.51> < 0.38>
#>
#> 0 2
#> ( 0.71) ( 1.29)
#> [0.713] [0.395]
#> <-0.84> < 0.63>
#>
#> 54 92
#> (52.06) (93.94)
#> [0.072] [0.040]
#> < 0.27> <-0.20>
#>
#> 17 28
#> (16.05) (28.95)
#> [0.057] [0.031]
#> < 0.24> <-0.18>
#>
#> key:
#> observed
#> (expected)
#> [contribution to X-squared]
#> <Pearson residual>
Similarly, use case_when() for nominal variables:
tips <- tips %>%
  mutate(weekend = case_when(
    day == "Fri" ~ "weekend",
    day == "Sat" ~ "weekend",
    TRUE ~ "keep on working"
  ))
Note that TRUE acts as the “else” branch: any row not matched by an earlier condition gets this value. Here, read it as “else, ‘weekend’ is ‘keep on working’”.
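One consequence worth knowing: a condition that evaluates to NA counts as not matched, so missing values also land in the TRUE branch unless they are handled first. The following sketch uses a hypothetical vector day_demo (an assumption for illustration, not a column of tips):

```r
library(dplyr)

# Hypothetical day values, including one missing entry
day_demo <- c("Fri", "Thur", NA)

# The NA is not matched by the first condition, so it falls through to TRUE
naive <- case_when(
  day_demo == "Fri" ~ "weekend",
  TRUE ~ "keep on working"
)

# Handling NA explicitly, before the other conditions, keeps it missing
explicit <- case_when(
  is.na(day_demo) ~ NA_character_,
  day_demo == "Fri" ~ "weekend",
  TRUE ~ "keep on working"
)

print(naive)
print(explicit)
```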
A convenient way to bin several values (such as “Fri”, “Sat”) into one (such as “weekend”) is the %in% operator:
tips <- tips %>%
  mutate(weekend = case_when(
    day %in% c("Fri", "Sat") ~ "weekend",
    TRUE ~ "keep on working"
  ))
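To see why this works, it may help to look at %in% on its own: it returns one TRUE/FALSE per element, just like chaining `==` comparisons with `|`. A small base R sketch (day_demo is a made-up vector for illustration):

```r
# Hypothetical day values
day_demo <- c("Thur", "Fri", "Sat", "Sun")

# One logical value per element: is it one of the listed days?
via_in <- day_demo %in% c("Fri", "Sat")

# The same test written out with == and |
via_or <- day_demo == "Fri" | day_demo == "Sat"

print(via_in)
print(identical(via_in, via_or))
```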
Another convenient way is rec() from the R package sjmisc:
library(sjmisc)
tips <- rec(tips, day,
            rec = "Fri=Weekend; Sat=Weekend; else = keep_working")
count(tips, day_r)
#> # A tibble: 2 x 2
#> day_r n
#> <fct> <int>
#> 1 keep_working 138
#> 2 Weekend 106
Note that the new, recoded variable is appended with the suffix _r. See:
glimpse(tips)
#> Observations: 244
#> Variables: 10
#> $ total_bill <dbl> 16.99, 10.34, 21.01, 23.68, 24.59, 25.29, 8.77, 26....
#> $ tip <dbl> 1.01, 1.66, 3.50, 3.31, 3.61, 4.71, 2.00, 3.12, 1.9...
#> $ sex <fct> Female, Male, Male, Male, Female, Male, Male, Male,...
#> $ smoker <fct> No, No, No, No, No, No, No, No, No, No, No, No, No,...
#> $ day <fct> Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun, S...
#> $ time <fct> Dinner, Dinner, Dinner, Dinner, Dinner, Dinner, Din...
#> $ size <int> 2, 3, 3, 2, 4, 4, 2, 4, 2, 2, 2, 4, 2, 4, 2, 2, 3, ...
#> $ tip_gruppe <chr> "scrooge", "scrooge", "ok", "ok", "ok", "generous",...
#> $ weekend <chr> "keep on working", "keep on working", "keep on work...
#> $ day_r <fct> keep_working, keep_working, keep_working, keep_work...
Note that the pipe will work too:
tips <- tips %>%
  rec(day,
      rec = "Fri=Weekend; Sat=Weekend; else = keep_working")
rec() is convenient because one does not need to wrap it in mutate().
See ?rec for more information.
The good thing about both approaches (case_when() and rec()) is that each can be used for recoding as well as for some binning tasks.
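As a closing illustration, here is a minimal sketch that applies case_when() to the two examples from the introduction, once for recoding and once for binning. The data frame people, its column names, and the label "other" are assumptions made up for this example:

```r
library(dplyr)

# Hypothetical data: sex coded as 1/2, height in meters
people <- data.frame(
  sex_code = c(1, 2, 2, 1),
  height_m = c(1.55, 1.82, 1.60, 1.71)
)

people <- people %>%
  mutate(
    # Recoding: each level is mapped to exactly one new level
    sex = case_when(
      sex_code == 1 ~ "woman",
      sex_code == 2 ~ "man"
    ),
    # Binning: a whole range of values is collapsed into one level
    size = case_when(
      height_m >= 1.00 & height_m <= 1.60 ~ "small_size",
      TRUE ~ "other"
    )
  )

print(people)
```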