This post uses the following package:
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.1
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
By the way, loading mosaic will load dplyr, too.
One of the major data wrangling activities (in R and elsewhere) is creating a new column in a data frame. For example, assume some students have completed some exercises. Each row of the data frame holds one student, and each column holds one exercise (called an item). The data frame might then look like this:
df <- data.frame(name = c("John", "Joan", "Jeanne"),
                 item1 = c(TRUE, TRUE, FALSE),
                 item2 = c(TRUE, FALSE, TRUE))
df
## name item1 item2
## 1 John TRUE TRUE
## 2 Joan TRUE FALSE
## 3 Jeanne FALSE TRUE
Note that TRUE indicates that the exercise was solved correctly; otherwise, we note FALSE.
Now assume you would like to sum up the number of correct items: How many items were solved by John, Joan, and Jeanne, respectively?
R provides a number of ways to compute this.
Sum as a vector
df$item1 + df$item2
## [1] 2 1 1
So John solved 2 items, and the other students solved 1 item each. Conveniently, R interprets TRUE as 1 and FALSE as 0. As a consequence, we can add up logical variables with no hassle.
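To see this coercion in isolation, here is a minimal sketch. The vector solved is made up for illustration and is not part of the data frame above:

```r
# Hypothetical answers of a single student (assumption, not from the post's data)
solved <- c(TRUE, FALSE, TRUE)

as.integer(solved)  # explicit coercion: TRUE becomes 1, FALSE becomes 0
sum(solved)         # counts the TRUE values
TRUE + TRUE         # arithmetic on logicals coerces them first
```

The same mechanism is what makes adding two logical columns work.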
But we don’t want that as a free-floating vector. The sum should be attached to our data frame: there should be an items_sum column in our data frame.
Base R way to create a column
The Base R way is this:
df$items_sum <- df$item1 + df$item2
Let’s check the dataframe:
df
## name item1 item2 items_sum
## 1 John TRUE TRUE 2
## 2 Joan TRUE FALSE 1
## 3 Jeanne FALSE TRUE 1
Worked. But there are other ways too.
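Staying in Base R for a moment: with many item columns, typing out every summand gets tedious. One possible alternative is rowSums, sketched below. The names df_items and item_cols are made up for this illustration:

```r
# Rebuild the example data so the sketch is self-contained
df_items <- data.frame(name = c("John", "Joan", "Jeanne"),
                       item1 = c(TRUE, TRUE, FALSE),
                       item2 = c(TRUE, FALSE, TRUE))

# Names of the columns to add up (hypothetical helper vector)
item_cols <- c("item1", "item2")

# rowSums adds up each row across the selected columns;
# unname drops the row names that rowSums attaches to its result
df_items$items_sum <- unname(rowSums(df_items[item_cols]))
df_items$items_sum
```

This should give the same sums as adding the columns by hand, and it scales to more items by extending item_cols.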
Using dplyr
df <- mutate(df, items_sum2 = item1 + item2)
df
## name item1 item2 items_sum items_sum2
## 1 John TRUE TRUE 2 2
## 2 Joan TRUE FALSE 1 1
## 3 Jeanne FALSE TRUE 1 1
Worked. Now let’s use the pipe.
Using the pipe
df <- df %>%
  mutate(items_sum3 = item1 + item2)
df
## name item1 item2 items_sum items_sum2 items_sum3
## 1 John TRUE TRUE 2 2 2
## 2 Joan TRUE FALSE 1 1 1
## 3 Jeanne FALSE TRUE 1 1 1
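In case the pipe is new to you: it hands the object on its left to the function on its right as the first argument. A small sketch, assuming dplyr is installed and using a made-up data frame called scores:

```r
library(dplyr)

# Hypothetical toy data (not the data frame from this post)
scores <- data.frame(item1 = c(TRUE, FALSE),
                     item2 = c(TRUE, TRUE))

# Ordinary function call
nested <- mutate(scores, items_sum = item1 + item2)

# Same call written with the pipe: scores becomes the first argument of mutate
piped <- scores %>% mutate(items_sum = item1 + item2)

identical(nested, piped)  # both spellings should produce the same data frame
```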
Note that the assignment arrow can point in either direction, i.e., left (<-) or right (->):
df %>%
  mutate(items_sum3 = item1 + item2) -> df
df
## name item1 item2 items_sum items_sum2 items_sum3
## 1 John TRUE TRUE 2 2 2
## 2 Joan TRUE FALSE 1 1 1
## 3 Jeanne FALSE TRUE 1 1 1
Recode variable with car::Recode
In R parlance, recoding a variable is different from creating one: recoding maps existing values to new values. Again, there are different ways to recode a variable. Let’s assume that we would like to recode TRUE to 1 and FALSE to 0. If we want to save the recoded values as a new variable, we create that variable with the same machinery as above. For that purpose, the package car comes in handy:
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
df %>%
  mutate(item1r = car::Recode(item1,
                              "TRUE = 1; FALSE = 0")) -> df  # each term reads "old value = new value"
df
## name item1 item2 items_sum items_sum2 items_sum3 item1r
## 1 John TRUE TRUE 2 2 2 1
## 2 Joan TRUE FALSE 1 1 1 1
## 3 Jeanne FALSE TRUE 1 1 1 0
car::recode (car::Recode is an alias for it) can be used for more complex recoding schemes, too:
df <- df %>%
  mutate(items_sum3 = car::recode(items_sum,
                                  "lo:1 = 'failed';
                                   2 = 'passed';
                                   3:hi = 'cant believe'"))
df
## name item1 item2 items_sum items_sum2 items_sum3 item1r
## 1 John TRUE TRUE 2 2 passed 1
## 2 Joan TRUE FALSE 1 1 failed 1
## 3 Jeanne FALSE TRUE 1 1 failed 0
See ?Recode for details of the syntax.
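As far as I understand the recode syntax, a specification can also contain an else term that catches every value not matched by the other terms. A minimal sketch, assuming car is installed; the vector sums is made up for illustration:

```r
library(car)

# Hypothetical sum scores (assumption, not the post's data)
sums <- c(0, 1, 2, 3)

# Everything up to 1 is 'failed'; all remaining values fall through to else
grades <- car::recode(sums, "lo:1 = 'failed'; else = 'passed'")
grades
```

This saves spelling out every remaining range when only one group needs special treatment.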
Recode variable with ifelse
Another frequently used way is ifelse:
df %>%
  mutate(item2r = ifelse(item2 == TRUE, 1L, 0L)) -> df
df
## name item1 item2 items_sum items_sum2 items_sum3 item1r item2r
## 1 John TRUE TRUE 2 2 passed 1 1
## 2 Joan TRUE FALSE 1 1 failed 1 0
## 3 Jeanne FALSE TRUE 1 1 failed 0 1
The general form of ifelse is ifelse(condition, what_if_true, what_if_not).
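Because ifelse works element by element, it can be tried on a plain vector, too. A small sketch with a made-up vector items_sum_demo, which also shows that a missing value stays missing:

```r
# Hypothetical sum scores, including one missing value
items_sum_demo <- c(2, 1, NA)

# The condition is checked for each element; NA in the condition yields NA
result <- ifelse(items_sum_demo >= 2, "passed", "failed")
result
```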
By the way, the L suffix marks a number as an integer: 1L is stored as an integer, whereas a plain 1 is stored as a double.
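A quick way to check the difference is typeof. A minimal sketch; the object names are made up for illustration:

```r
type_plain  <- typeof(1)   # a plain number is a double
type_suffix <- typeof(1L)  # the L suffix makes it an integer

# The type carries over to what ifelse returns
type_ifelse_int <- typeof(ifelse(c(TRUE, FALSE), 1L, 0L))
type_ifelse_dbl <- typeof(ifelse(c(TRUE, FALSE), 1, 0))

c(type_plain, type_suffix, type_ifelse_int, type_ifelse_dbl)
```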
Recode using case_when
The more general form of ifelse is case_when. It generalizes ifelse to more than 2 cases:
df %>%
  mutate(items_sum4 = case_when(
    items_sum < 2 ~ "failed",
    items_sum == 2 ~ "passed",
    items_sum > 2 ~ "awesome")) -> df
df
## name item1 item2 items_sum items_sum2 items_sum3 item1r item2r
## 1 John TRUE TRUE 2 2 passed 1 1
## 2 Joan TRUE FALSE 1 1 failed 1 0
## 3 Jeanne FALSE TRUE 1 1 failed 0 1
## items_sum4
## 1 passed
## 2 failed
## 3 failed
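Two details of case_when are worth knowing: the conditions are checked from top to bottom, and the first one that matches wins. A final TRUE condition can therefore serve as a catch-all. A small sketch, assuming dplyr is installed; the vector sums is made up and includes a missing value:

```r
library(dplyr)

# Hypothetical sum scores (assumption, not the post's data)
sums <- c(0, 1, 2, 3, NA)

labels <- case_when(
  is.na(sums) ~ "missing",  # handle missing values first
  sums < 2    ~ "failed",
  sums == 2   ~ "passed",
  TRUE        ~ "awesome"   # catch-all for everything not matched above
)
labels
```

Without the first line, the NA would also end up in the catch-all, which is rarely what we want.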