Converting factors to numbers in R can be frustrating. Consider the following situation: we have some data and try to convert a factor (sex in tips, see below) to a numeric variable:
library(tidyverse)
library(sjmisc) # for recoding
data(tips, package = "reshape2")
glimpse(tips)
#> Observations: 244
#> Variables: 7
#> $ total_bill <dbl> 16.99, 10.34, 21.01, 23.68, 24.59, 25.29, 8.77, 26....
#> $ tip <dbl> 1.01, 1.66, 3.50, 3.31, 3.61, 4.71, 2.00, 3.12, 1.9...
#> $ sex <fct> Female, Male, Male, Male, Female, Male, Male, Male,...
#> $ smoker <fct> No, No, No, No, No, No, No, No, No, No, No, No, No,...
#> $ day <fct> Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun, S...
#> $ time <fct> Dinner, Dinner, Dinner, Dinner, Dinner, Dinner, Din...
#> $ size <int> 2, 3, 3, 2, 4, 4, 2, 4, 2, 2, 2, 4, 2, 4, 2, 2, 3, ...
Ok, here we go. Say we want "Female" = 1 and "Male" = 0.
tips <- sjmisc::rec(tips, sex, rec = "Female = 1; Male = 0")
glimpse(tips)
#> Observations: 244
#> Variables: 8
#> $ total_bill <dbl> 16.99, 10.34, 21.01, 23.68, 24.59, 25.29, 8.77, 26....
#> $ tip <dbl> 1.01, 1.66, 3.50, 3.31, 3.61, 4.71, 2.00, 3.12, 1.9...
#> $ sex <fct> Female, Male, Male, Male, Female, Male, Male, Male,...
#> $ smoker <fct> No, No, No, No, No, No, No, No, No, No, No, No, No,...
#> $ day <fct> Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun, S...
#> $ time <fct> Dinner, Dinner, Dinner, Dinner, Dinner, Dinner, Din...
#> $ size <int> 2, 3, 3, 2, 4, 4, 2, 4, 2, 2, 2, 4, 2, 4, 2, 2, 3, ...
#> $ sex_r <fct> 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, ...
Seems to have worked so far. Wait, sex_r is still a factor, not numeric. So let’s convert it using as.numeric():
tips$sex_num <- as.numeric(tips$sex_r)
glimpse(tips)
#> Observations: 244
#> Variables: 9
#> $ total_bill <dbl> 16.99, 10.34, 21.01, 23.68, 24.59, 25.29, 8.77, 26....
#> $ tip <dbl> 1.01, 1.66, 3.50, 3.31, 3.61, 4.71, 2.00, 3.12, 1.9...
#> $ sex <fct> Female, Male, Male, Male, Female, Male, Male, Male,...
#> $ smoker <fct> No, No, No, No, No, No, No, No, No, No, No, No, No,...
#> $ day <fct> Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun, S...
#> $ time <fct> Dinner, Dinner, Dinner, Dinner, Dinner, Dinner, Din...
#> $ size <int> 2, 3, 3, 2, 4, 4, 2, 4, 2, 2, 2, 4, 2, 4, 2, 2, 3, ...
#> $ sex_r <fct> 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, ...
#> $ sex_num <dbl> 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 2, ...
Oh no! That’s not what we wanted! Has R messed things up? Not really. R stores a factor internally as integer codes: the first factor level is stored as 1, the second level as 2, and so on. as.numeric() returns these codes, not the labels we see printed. What’s the first factor level in our case? Let’s see:
factor(tips$sex) %>% head()
#> [1] Female Male Male Male Female Male
#> Levels: Female Male
factor(tips$sex_r) %>% head()
#> [1] 1 0 0 0 1 0
#> Levels: 0 1
That’s confusing: “0” is the first level of sex_r, which R represents internally as 1. The second level of sex_r is “1”, represented internally as 2. That’s why we get these numbers:
head(tips$sex_num)
#> [1] 2 1 1 1 2 1
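To see the gap between labels and internal codes in isolation, here is a minimal sketch on a toy factor. The vector f is made up for illustration and merely stands in for tips$sex_r; only base R is used.

```r
# Toy factor standing in for tips$sex_r (hypothetical values, not the tips data)
f <- factor(c("1", "0", "0", "1"))

lev   <- levels(f)        # the labels, sorted alphabetically: "0" comes first
codes <- as.integer(f)    # the internal codes: position of each value in levels(f)
labs  <- as.character(f)  # the labels as strings, in the original order

print(lev)
print(codes)
print(labs)
```

The codes are just positions in levels(f), which is why a label of "0" ends up as code 1.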
Solution
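One solution is to use readr::parse_number(). Its documentation asks for a character vector as input, so to be on the safe side we convert the factor to character first:
tips$sex_num <- parse_number(as.character(tips$sex_r))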
head(tips$sex_num)
#> [1] 1 0 0 0 1 0
head(tips$sex_r)
#> [1] 1 0 0 0 1 0
#> Levels: 0 1
Worked!
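If you would rather not depend on readr for this, base R offers a few routes as well. The sketch below uses toy vectors (f and sex are hypothetical stand-ins for tips$sex_r and tips$sex) and shows three options: going through the labels with as.character(), indexing the numeric levels by the factor, and skipping the recode step with a logical comparison.

```r
# Toy stand-ins (hypothetical values, not the tips data)
f   <- factor(c("1", "0", "0", "1"))
sex <- factor(c("Female", "Male", "Male", "Female"))

# Option 1: convert the labels to character, then to numeric
num_a <- as.numeric(as.character(f))

# Option 2: convert the levels once, then index them by the factor's codes
num_b <- as.numeric(levels(f))[f]

# Option 3: build the 0/1 variable directly from the original factor
num_c <- as.integer(sex == "Female")

print(num_a)
print(num_b)
print(num_c)
```

Options 1 and 2 only make sense when the factor labels look like numbers; option 3 sidesteps the intermediate factor altogether, which may be the simplest choice for a two-level variable like sex.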