
How to create columns in a dataframe in R

Note that we will use this library for this post:

library(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.1
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

By the way, loading mosaic will load dplyr too.

One of the major data wrangling activities (in R and elsewhere) is creating a new column in a data frame. For example, assume some students have completed some exercises. Each row of the data frame holds one student, and each column holds one exercise (called an item). The data frame might then look like this:

df <- data.frame(name = c("John", "Joan", "Jeanne"),
                 item1 = c(TRUE, TRUE, FALSE),
                 item2 = c(TRUE, FALSE, TRUE))
df
##     name item1 item2
## 1   John  TRUE  TRUE
## 2   Joan  TRUE FALSE
## 3 Jeanne FALSE  TRUE

Note that TRUE indicates that the exercise was solved correctly; otherwise, we note FALSE.

Now assume you would like to sum up the number of correct items: How many items were solved by John, Joan, and Jeanne, respectively?

R provides a number of ways to compute this.

Sum as a vector

df$item1 + df$item2
## [1] 2 1 1

So John solved 2 items, and the other students solved 1 item each. It comes in handy that R interprets TRUE as 1 and FALSE as 0. As a consequence, we can add up logical variables with no hassle.
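The coercion of logicals to numbers can be checked directly. The following is a minimal sketch in base R; the vector name solved is made up for illustration.

```r
# Hypothetical logical vector: did one student solve each of three items?
solved <- c(TRUE, FALSE, TRUE)

# Arithmetic coerces logicals to numbers: TRUE becomes 1, FALSE becomes 0
n_correct <- sum(solved)       # number of TRUE values
share_correct <- mean(solved)  # proportion of TRUE values

as.integer(solved)
n_correct
```

Because mean works on the same coercion, it gives the share of solved items for free.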

But we don’t want the sum as a free-floating vector. It should be attached to our data frame: there should be an items_sum column in our data frame.

Base R way to create a column

The Base R way is this:

df$items_sum <- df$item1 + df$item2

Let’s check the dataframe:

df
##     name item1 item2 items_sum
## 1   John  TRUE  TRUE         2
## 2   Joan  TRUE FALSE         1
## 3 Jeanne FALSE  TRUE         1

Worked. But there are other ways too.
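With many items, typing out every column gets tedious. One possible alternative, sketched here under the assumption that all item columns are named item followed by a number, is rowSums from base R. The data frame scores and the column name items_total are made up so that the example is self-contained.

```r
# Stand-alone copy of the example data (the name `scores` is hypothetical)
scores <- data.frame(name = c("John", "Joan", "Jeanne"),
                     item1 = c(TRUE, TRUE, FALSE),
                     item2 = c(TRUE, FALSE, TRUE))

# Pick all columns whose names look like item1, item2, ...
item_cols <- grep("^item[0-9]+$", names(scores), value = TRUE)

# rowSums treats TRUE as 1 and FALSE as 0, giving one sum per student;
# drop = FALSE keeps a data frame even if only one column matches
scores$items_total <- rowSums(scores[, item_cols, drop = FALSE])
scores
```

The pattern-based selection means that adding an item3 column later would not require changing the summing code.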

Using dplyr

df <- mutate(df, items_sum2 = item1 + item2)
df
##     name item1 item2 items_sum items_sum2
## 1   John  TRUE  TRUE         2          2
## 2   Joan  TRUE FALSE         1          1
## 3 Jeanne FALSE  TRUE         1          1

Worked. Now let’s use the pipe.

Using the pipe

df <- df %>% 
  mutate(items_sum3 = item1 + item2) 
df
##     name item1 item2 items_sum items_sum2 items_sum3
## 1   John  TRUE  TRUE         2          2          2
## 2   Joan  TRUE FALSE         1          1          1
## 3 Jeanne FALSE  TRUE         1          1          1

Note that the assignment arrow can point in either direction, i.e., left or right:

df %>% 
  mutate(items_sum3 = item1 + item2) -> df
df
##     name item1 item2 items_sum items_sum2 items_sum3
## 1   John  TRUE  TRUE         2          2          2
## 2   Joan  TRUE FALSE         1          1          1
## 3 Jeanne FALSE  TRUE         1          1          1

Recode variable with car::Recode

In R parlance, recoding a variable is different from creating one: recoding changes the values of an existing variable. Again, different ways exist for recoding a variable. Let’s assume that we would like to recode TRUE to 1 and FALSE to 0. If we want to save the recoded values as a new variable, we create that variable with the same machinery as above. For that purpose, the package car comes in handy:

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
df %>% 
  # recode specifications read "old value = new value"
  mutate(item1r = car::Recode(item1, 
                              "TRUE = 1; FALSE = 0")) -> df
df
##     name item1 item2 items_sum items_sum2 items_sum3 item1r
## 1   John  TRUE  TRUE         2          2          2      1
## 2   Joan  TRUE FALSE         1          1          1      1
## 3 Jeanne FALSE  TRUE         1          1          1      0

car::recode can be used for more complex recoding schemes, too:

df <- df %>% 
  mutate(items_sum3 = car::recode(items_sum, 
                                 "lo:1 = 'failed';
                                  2 = 'passed';
                                  3:hi = 'cant believe'")) 
df
##     name item1 item2 items_sum items_sum2 items_sum3 item1r
## 1   John  TRUE  TRUE         2          2     passed      1
## 2   Joan  TRUE FALSE         1          1     failed      1
## 3 Jeanne FALSE  TRUE         1          1     failed      0

See ?Recode for details of the syntax.

Recode variable with ifelse

Another frequently used way is using ifelse:

df %>% 
  mutate(item2r = ifelse(item2 == TRUE, 1L, 0L)) -> df
df
##     name item1 item2 items_sum items_sum2 items_sum3 item1r item2r
## 1   John  TRUE  TRUE         2          2     passed      1      1
## 2   Joan  TRUE FALSE         1          1     failed      1      0
## 3 Jeanne FALSE  TRUE         1          1     failed      0      1

The general form of ifelse is ifelse(condition, what_if_true, what_if_not).
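One detail worth knowing: where the condition is NA, ifelse returns NA rather than either branch. A small sketch in base R, with a made-up vector solved:

```r
# Hypothetical vector with one missing value
solved <- c(TRUE, FALSE, NA)

# TRUE gives the second argument, FALSE the third, NA stays NA
recoded <- ifelse(solved, 1L, 0L)
recoded
```

If missing values should count as not solved, they have to be handled explicitly, for example by testing is.na(solved) first.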

By the way, the L suffix makes a number an integer literal: 1L is stored as an integer, whereas a plain 1 is stored as a double.
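The storage type of a value can be inspected with typeof. This is a small sketch in base R; the variable names are made up.

```r
int_one <- 1L   # integer literal
dbl_one <- 1    # plain numbers are doubles by default

typeof(int_one)
typeof(dbl_one)

# An existing numeric vector can be coerced with as.integer()
coerced <- as.integer(c(1, 0, 1))
typeof(coerced)
```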

Recode using case_when

case_when is the more general form of ifelse. It generalizes ifelse to more than two cases:

df %>% 
  mutate(items_sum4 = case_when(
    items_sum < 2 ~ "failed",
    items_sum == 2 ~ "passed",
    items_sum > 2 ~ "awesome")) -> df
df
##     name item1 item2 items_sum items_sum2 items_sum3 item1r item2r
## 1   John  TRUE  TRUE         2          2     passed      1      1
## 2   Joan  TRUE FALSE         1          1     failed      1      0
## 3 Jeanne FALSE  TRUE         1          1     failed      0      1
##   items_sum4
## 1     passed
## 2     failed
## 3     failed
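Two properties of case_when are worth keeping in mind: the conditions are checked in order and the first match wins, and values that match no condition become NA. A common idiom is a final TRUE ~ ... line as a catch-all. The following is a sketch with made-up data (the names scores, points, and grade are hypothetical):

```r
library(dplyr)

# Hypothetical scores, including one missing value
scores <- data.frame(points = c(0, 1, 2, 3, NA))

scores <- scores %>%
  mutate(grade = case_when(
    is.na(points) ~ "missing",   # handle NA explicitly, checked first
    points < 2    ~ "failed",
    points == 2   ~ "passed",
    TRUE          ~ "awesome"    # catch-all for everything else
  ))
scores
```

Putting the is.na check first matters here: otherwise the missing value would fall through to the catch-all and be labelled "awesome".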