Assume we have a character vector holding, say, countries, names, or products. Each value may appear multiple times, and there may be a rather large number of distinct values. We would like to assign each distinct value a unique ID, where the IDs range from 1 to k (k being the number of distinct values).
Of course, there are different ways to achieve this goal; we’ll explore two.
library(tidyverse)
data(tips, package = "reshape2")
head(tips) %>%
knitr::kable()
total_bill | tip | sex | smoker | day | time | size |
---|---|---|---|---|---|---|
16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
25.29 | 4.71 | Male | No | Sun | Dinner | 4 |
Say, day is the vector of interest. How many different values are there?
unique(tips$day) %>% length()
#> [1] 4
Not too many, but for the sake of the example that’s okay.
Way 1 - via factors
What’s the type of this variable?
tips$day %>% class()
#> [1] "factor"
A factor. That’s good. If it were not, we could convert it to a factor:
tips$day <- factor(tips$day)
The trick is simple: convert the factor to numeric. Under the hood, a factor is stored as a vector of integer codes, one per element, plus a lookup vector of labels (the levels). Each code is the position of that element’s value among the levels, so the conversion directly yields an ID from 1 to k.
tips$day_ix <- as.numeric(tips$day)
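To make the storage argument concrete, here is a minimal sketch on a small toy vector (the vector x is made up for illustration and is not part of the tips data). It assumes the default behaviour of factor(), which sorts the levels alphabetically.

```r
# Toy character vector (hypothetical data, not from the tips dataset)
x <- c("Sun", "Sat", "Sun", "Thur", "Fri", "Sat")

f <- factor(x)         # by default, levels are the sorted unique values

typeof(f)              # a factor is stored as integers
levels(f)              # the lookup table of labels
ids <- as.integer(f)   # the integer codes, usable as IDs from 1 to k

# Round trip: indexing the levels by the codes recovers the original values
identical(levels(f)[ids], x)
```

as.integer() returns the same codes as as.numeric(), just stored as integers rather than doubles, which may be preferable for an ID column.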
Let’s check:
tips$day
#> [1] Sun Sun Sun Sun Sun Sun Sun Sun Sun Sun Sun Sun Sun Sun
#> [15] Sun Sun Sun Sun Sun Sat Sat Sat Sat Sat Sat Sat Sat Sat
#> [29] Sat Sat Sat Sat Sat Sat Sat Sat Sat Sat Sat Sat Sat Sun
#> [43] Sun Sun Sun Sun Sun Sun Sun Sun Sun Sun Sun Sun Sun Sun
#> [57] Sat Sat Sat Sat Sat Sat Sat Sat Sat Sat Sat Sat Sat Sat
#> [71] Sat Sat Sat Sat Sat Sat Sat Thur Thur Thur Thur Thur Thur Thur
#> [85] Thur Thur Thur Thur Thur Thur Fri Fri Fri Fri Fri Fri Fri Fri
#> [99] Fri Fri Fri Fri Sat Sat Sat Sat Sat Sat Sat Sat Sat Sat
#> [113] Sun Sun Sun Sun Sun Thur Thur Thur Thur Thur Thur Thur Thur Thur
#> [127] Thur Thur Thur Thur Thur Thur Thur Thur Thur Thur Thur Thur Thur Thur
#> [141] Thur Thur Thur Thur Thur Thur Thur Thur Thur Thur Sun Sun Sun Sun
#> [155] Sun Sun Sun Sun Sun Sun Sun Sun Sun Sun Sun Sun Sun Sun
#> [169] Sat Sat Sat Sat Sun Sun Sun Sun Sun Sun Sun Sun Sun Sun
#> [183] Sun Sun Sun Sun Sun Sun Sun Sun Sun Thur Thur Thur Thur Thur
#> [197] Thur Thur Thur Thur Thur Thur Thur Thur Thur Thur Sat Sat Sat Sat
#> [211] Sat Sat Sat Sat Sat Sat Sat Sat Sat Sat Fri Fri Fri Fri
#> [225] Fri Fri Fri Sat Sat Sat Sat Sat Sat Sat Sat Sat Sat Sat
#> [239] Sat Sat Sat Sat Sat Thur
#> Levels: Fri Sat Sun Thur
tips$day_ix
#> [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
#> [36] 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2
#> [71] 2 2 2 2 2 2 2 4 4 4 4 4 4 4 4 4 4 4 4 4 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2
#> [106] 2 2 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
#> [141] 4 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 3 3 3
#> [176] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 2 2 2 2
#> [211] 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 4
Looks good.
Way 2 - via a join
tips$day %>% unique()
#> [1] Sun Sat Thur Fri
#> Levels: Fri Sat Sun Thur
Get the levels as a data frame:
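# tibble() replaces the deprecated data_frame()
levels_df <- tibble(
  id = seq_along(unique(tips$day)),  # one ID per distinct value
  levels = unique(tips$day)          # distinct values, in order of first appearance
)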
levels_df
#> # A tibble: 4 x 2
#> id levels
#> <int> <fct>
#> 1 1 Sun
#> 2 2 Sat
#> 3 3 Thur
#> 4 4 Fri
Now join:
tips2 <- tips %>%
  full_join(levels_df, by = c("day" = "levels"))
tips2$day
#> [1] Sun Sun Sun Sun Sun Sun Sun Sun Sun Sun Sun Sun Sun Sun
#> [15] Sun Sun Sun Sun Sun Sat Sat Sat Sat Sat Sat Sat Sat Sat
#> [29] Sat Sat Sat Sat Sat Sat Sat Sat Sat Sat Sat Sat Sat Sun
#> [43] Sun Sun Sun Sun Sun Sun Sun Sun Sun Sun Sun Sun Sun Sun
#> [57] Sat Sat Sat Sat Sat Sat Sat Sat Sat Sat Sat Sat Sat Sat
#> [71] Sat Sat Sat Sat Sat Sat Sat Thur Thur Thur Thur Thur Thur Thur
#> [85] Thur Thur Thur Thur Thur Thur Fri Fri Fri Fri Fri Fri Fri Fri
#> [99] Fri Fri Fri Fri Sat Sat Sat Sat Sat Sat Sat Sat Sat Sat
#> [113] Sun Sun Sun Sun Sun Thur Thur Thur Thur Thur Thur Thur Thur Thur
#> [127] Thur Thur Thur Thur Thur Thur Thur Thur Thur Thur Thur Thur Thur Thur
#> [141] Thur Thur Thur Thur Thur Thur Thur Thur Thur Thur Sun Sun Sun Sun
#> [155] Sun Sun Sun Sun Sun Sun Sun Sun Sun Sun Sun Sun Sun Sun
#> [169] Sat Sat Sat Sat Sun Sun Sun Sun Sun Sun Sun Sun Sun Sun
#> [183] Sun Sun Sun Sun Sun Sun Sun Sun Sun Thur Thur Thur Thur Thur
#> [197] Thur Thur Thur Thur Thur Thur Thur Thur Thur Thur Sat Sat Sat Sat
#> [211] Sat Sat Sat Sat Sat Sat Sat Sat Sat Sat Fri Fri Fri Fri
#> [225] Fri Fri Fri Sat Sat Sat Sat Sat Sat Sat Sat Sat Sat Sat
#> [239] Sat Sat Sat Sat Sat Thur
#> Levels: Fri Sat Sun Thur
tips2$id
#> [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
#> [36] 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2
#> [71] 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 2 2 2
#> [106] 2 2 2 2 2 2 2 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
#> [141] 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 1 1 1
#> [176] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2
#> [211] 2 2 2 2 2 2 2 2 2 2 4 4 4 4 4 4 4 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3
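The IDs differ from those in Way 1: unique() returns the values in order of first appearance (Sun, Sat, Thur, Fri), whereas the factor levels are sorted alphabetically (Fri, Sat, Sun, Thur). This could easily be fixed, and the join approach is useful in its own right.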
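One possible way to reconcile the two orderings, sketched here with base R’s match() on a made-up toy vector (x is illustrative, not part of the tips data): matching against the unique values gives IDs by first appearance, while matching against the sorted unique values gives the same IDs as the factor approach.

```r
# Toy character vector (hypothetical data, not from the tips dataset)
x <- c("Sun", "Sat", "Sun", "Thur", "Fri", "Sat")

# IDs in order of first appearance (the ordering the join on unique() produces)
id_appearance <- match(x, unique(x))

# IDs in sorted order (the ordering the factor approach produces)
id_sorted <- match(x, sort(unique(x)))

# Check that the sorted variant agrees with the factor codes
identical(id_sorted, as.integer(factor(x)))
```

Which ordering is preferable depends on the use case; the point is only that either one can be obtained from the same set of distinct values.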
Debrief
Creating an index for a character or factor vector is quite straightforward and may be useful in some situations.