Coercing an index over a character vector

Assume we have a character vector, such as countries, names, or products. Each element may show up multiple times, and there may be a rather large number of unique (different) elements. What we would like to achieve is to give each distinct element a unique ID, where the ID ranges from 1 to k (k is the number of different elements).

There are different ways to achieve this goal; we’ll explore two.

library(tidyverse)
data(tips, package = "reshape2")
head(tips) %>% 
  knitr::kable()
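| total_bill|  tip|sex    |smoker |day |time   | size|
|----------:|----:|:------|:------|:---|:------|----:|
|      16.99| 1.01|Female |No     |Sun |Dinner |    2|
|      10.34| 1.66|Male   |No     |Sun |Dinner |    3|
|      21.01| 3.50|Male   |No     |Sun |Dinner |    3|
|      23.68| 3.31|Male   |No     |Sun |Dinner |    2|
|      24.59| 3.61|Female |No     |Sun |Dinner |    4|
|      25.29| 4.71|Male   |No     |Sun |Dinner |    4|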

Say, day is the vector of interest. How many different values are there?

unique(tips$day) %>% length()
#> [1] 4

Not too many, but for the sake of the example that’s okay.

Way 1 - via factors

What’s the type of this variable?

tips$day %>% class()
#> [1] "factor"

It’s a factor. That’s good. If it were not, we would convert it to a factor:

tips$day <- factor(tips$day)
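
As a small sketch of that conversion, assume a hypothetical character vector `countries` (it is not part of the tips data). By default, `factor()` uses the sorted unique values as its levels:

```r
# Hypothetical toy data, used only for illustration
countries <- c("Peru", "Chile", "Peru", "Brazil", "Chile")

# Convert the character vector to a factor
countries_f <- factor(countries)

class(countries_f)    # "factor"
levels(countries_f)   # the sorted unique values
nlevels(countries_f)  # k, the number of different elements
```

If a different ordering is wanted, the `levels` argument of `factor()` can be used to set it explicitly.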

The trick is simple: convert the factor to numeric. Under the hood, a factor is stored as a vector of integer codes, one per element, where each code points to one of the factor’s levels (the labels). Converting to numeric returns exactly those codes, which is the index we are after. Note that the codes follow the order of the levels, which is sorted (alphabetical) by default.

tips$day_ix <- as.numeric(tips$day)

Let’s check:

tips$day
#>   [1] Sun  Sun  Sun  Sun  Sun  Sun  Sun  Sun  Sun  Sun  Sun  Sun  Sun  Sun 
#>  [15] Sun  Sun  Sun  Sun  Sun  Sat  Sat  Sat  Sat  Sat  Sat  Sat  Sat  Sat 
#>  [29] Sat  Sat  Sat  Sat  Sat  Sat  Sat  Sat  Sat  Sat  Sat  Sat  Sat  Sun 
#>  [43] Sun  Sun  Sun  Sun  Sun  Sun  Sun  Sun  Sun  Sun  Sun  Sun  Sun  Sun 
#>  [57] Sat  Sat  Sat  Sat  Sat  Sat  Sat  Sat  Sat  Sat  Sat  Sat  Sat  Sat 
#>  [71] Sat  Sat  Sat  Sat  Sat  Sat  Sat  Thur Thur Thur Thur Thur Thur Thur
#>  [85] Thur Thur Thur Thur Thur Thur Fri  Fri  Fri  Fri  Fri  Fri  Fri  Fri 
#>  [99] Fri  Fri  Fri  Fri  Sat  Sat  Sat  Sat  Sat  Sat  Sat  Sat  Sat  Sat 
#> [113] Sun  Sun  Sun  Sun  Sun  Thur Thur Thur Thur Thur Thur Thur Thur Thur
#> [127] Thur Thur Thur Thur Thur Thur Thur Thur Thur Thur Thur Thur Thur Thur
#> [141] Thur Thur Thur Thur Thur Thur Thur Thur Thur Thur Sun  Sun  Sun  Sun 
#> [155] Sun  Sun  Sun  Sun  Sun  Sun  Sun  Sun  Sun  Sun  Sun  Sun  Sun  Sun 
#> [169] Sat  Sat  Sat  Sat  Sun  Sun  Sun  Sun  Sun  Sun  Sun  Sun  Sun  Sun 
#> [183] Sun  Sun  Sun  Sun  Sun  Sun  Sun  Sun  Sun  Thur Thur Thur Thur Thur
#> [197] Thur Thur Thur Thur Thur Thur Thur Thur Thur Thur Sat  Sat  Sat  Sat 
#> [211] Sat  Sat  Sat  Sat  Sat  Sat  Sat  Sat  Sat  Sat  Fri  Fri  Fri  Fri 
#> [225] Fri  Fri  Fri  Sat  Sat  Sat  Sat  Sat  Sat  Sat  Sat  Sat  Sat  Sat 
#> [239] Sat  Sat  Sat  Sat  Sat  Thur
#> Levels: Fri Sat Sun Thur
tips$day_ix
#>   [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
#>  [36] 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2
#>  [71] 2 2 2 2 2 2 2 4 4 4 4 4 4 4 4 4 4 4 4 4 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2
#> [106] 2 2 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
#> [141] 4 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 3 3 3
#> [176] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 2 2 2 2
#> [211] 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 4

Looks good.
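
To see the storage described above, here is a minimal sketch on a hypothetical toy factor `day` (an assumption for illustration, not the tips column itself). It also uses `as.integer()`, which returns the same values as `as.numeric()` but keeps the integer type:

```r
# Hypothetical toy factor, used only for illustration
day <- factor(c("Sun", "Sat", "Thur", "Fri", "Sat"))

typeof(day)    # "integer": the codes are stored as integers
levels(day)    # the labels, in sorted order
unclass(day)   # the raw integer codes plus the levels attribute

# The index: one integer code per element
day_ix <- as.integer(day)
day_ix

# The labels can be recovered from the index
levels(day)[day_ix]
```

Because the mapping from code to label is kept in `levels(day)`, the index can always be translated back to the original values.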

Way 2 - via a join

tips$day %>% unique()
#> [1] Sun  Sat  Thur Fri 
#> Levels: Fri Sat Sun Thur

Get the levels as a data frame:

# tibble() replaces the deprecated data_frame()
levels_df <- tibble(
  id = seq_along(unique(tips$day)),
  levels = unique(tips$day)
)
levels_df
#> # A tibble: 4 x 2
#>      id levels
#>   <int> <fct> 
#> 1     1 Sun   
#> 2     2 Sat   
#> 3     3 Thur  
#> 4     4 Fri

Now join:

# A left join keeps every row of tips and adds the matching id
tips2 <- tips %>% 
  left_join(levels_df, by = c("day" = "levels"))
tips2$day
#>   [1] Sun  Sun  Sun  Sun  Sun  Sun  Sun  Sun  Sun  Sun  Sun  Sun  Sun  Sun 
#>  [15] Sun  Sun  Sun  Sun  Sun  Sat  Sat  Sat  Sat  Sat  Sat  Sat  Sat  Sat 
#>  [29] Sat  Sat  Sat  Sat  Sat  Sat  Sat  Sat  Sat  Sat  Sat  Sat  Sat  Sun 
#>  [43] Sun  Sun  Sun  Sun  Sun  Sun  Sun  Sun  Sun  Sun  Sun  Sun  Sun  Sun 
#>  [57] Sat  Sat  Sat  Sat  Sat  Sat  Sat  Sat  Sat  Sat  Sat  Sat  Sat  Sat 
#>  [71] Sat  Sat  Sat  Sat  Sat  Sat  Sat  Thur Thur Thur Thur Thur Thur Thur
#>  [85] Thur Thur Thur Thur Thur Thur Fri  Fri  Fri  Fri  Fri  Fri  Fri  Fri 
#>  [99] Fri  Fri  Fri  Fri  Sat  Sat  Sat  Sat  Sat  Sat  Sat  Sat  Sat  Sat 
#> [113] Sun  Sun  Sun  Sun  Sun  Thur Thur Thur Thur Thur Thur Thur Thur Thur
#> [127] Thur Thur Thur Thur Thur Thur Thur Thur Thur Thur Thur Thur Thur Thur
#> [141] Thur Thur Thur Thur Thur Thur Thur Thur Thur Thur Sun  Sun  Sun  Sun 
#> [155] Sun  Sun  Sun  Sun  Sun  Sun  Sun  Sun  Sun  Sun  Sun  Sun  Sun  Sun 
#> [169] Sat  Sat  Sat  Sat  Sun  Sun  Sun  Sun  Sun  Sun  Sun  Sun  Sun  Sun 
#> [183] Sun  Sun  Sun  Sun  Sun  Sun  Sun  Sun  Sun  Thur Thur Thur Thur Thur
#> [197] Thur Thur Thur Thur Thur Thur Thur Thur Thur Thur Sat  Sat  Sat  Sat 
#> [211] Sat  Sat  Sat  Sat  Sat  Sat  Sat  Sat  Sat  Sat  Fri  Fri  Fri  Fri 
#> [225] Fri  Fri  Fri  Sat  Sat  Sat  Sat  Sat  Sat  Sat  Sat  Sat  Sat  Sat 
#> [239] Sat  Sat  Sat  Sat  Sat  Thur
#> Levels: Fri Sat Sun Thur
tips2$id
#>   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
#>  [36] 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2
#>  [71] 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 2 2 2
#> [106] 2 2 2 2 2 2 2 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
#> [141] 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 1 1 1
#> [176] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2
#> [211] 2 2 2 2 2 2 2 2 2 2 4 4 4 4 4 4 4 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3

The ordering is different here: the IDs follow the order in which the values first appear, not the order of the factor levels. That could easily be fixed, and this way is also useful.
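
The difference between the two orderings can be sketched in base R with `match()`. This is an alternative illustration rather than the join used above, and the toy factor `day` is a hypothetical stand-in for `tips$day`:

```r
# Hypothetical toy factor, used only for illustration
day <- factor(c("Sun", "Sat", "Thur", "Fri", "Sat"))

# IDs in order of first appearance (the ordering obtained via unique())
id_appearance <- match(day, unique(day))
id_appearance

# IDs in level order (the ordering obtained by converting the factor)
id_levels <- match(day, levels(day))
id_levels

# The level-ordered IDs agree with the factor's integer codes
identical(id_levels, as.integer(day))
```

So building the lookup from `levels()` instead of `unique()` is one way to make both approaches yield the same IDs.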

Debrief

Coercing an index over a character or factor vector is quite straightforward and may be useful in some situations.