Say you have a data frame with a number of columns, and you need to change every column in a similar way. A common example is standardizing all numeric variables. How do you do that in R? This post shows and explains an example using mutate_all() from the tidyverse.
Let’s stick to the question “how to z-standardize all columns” for the sake of simplicity (and neglect that there are precooked solutions, for example from the superb package sjmisc by strengejacke).
library(tidyverse)
data(iris)
Easy but inefficient way
iris %>%
  mutate(Sepal.Length_z = (Sepal.Length - mean(Sepal.Length)) / sd(Sepal.Length),
         Sepal.Width_z = (Sepal.Width - mean(Sepal.Width)) / sd(Sepal.Width)) %>%
  head()
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length_z
## 1 5.1 3.5 1.4 0.2 setosa -0.8976739
## 2 4.9 3.0 1.4 0.2 setosa -1.1392005
## 3 4.7 3.2 1.3 0.2 setosa -1.3807271
## 4 4.6 3.1 1.5 0.2 setosa -1.5014904
## 5 5.0 3.6 1.4 0.2 setosa -1.0184372
## 6 5.4 3.9 1.7 0.4 setosa -0.5353840
## Sepal.Width_z
## 1 1.01560199
## 2 -0.13153881
## 3 0.32731751
## 4 0.09788935
## 5 1.24503015
## 6 1.93331463
Beware the parentheses; it’s easy to get bitten (happened to me).
Clearly, this approach does not scale well. In addition, you’ll strain your hand and enjoy the funniest typos.
Define helper function
A first useful step is to define a helper function which we will apply on every column:
z_std <- function(observed) {
  # z-score: subtract the mean, then divide by the standard deviation
  (observed - mean(observed)) / sd(observed)
}
Of course, such a function already exists a myriad of times in other scripts, and yes, it is not crafted beautifully, but it will serve as a pragmatic start.
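Before putting the helper to work, a quick sanity check does not hurt. The sketch below is an illustration rather than part of the original workflow: it redefines z_std so that it runs on its own, then checks that the result has mean 0 and standard deviation 1, and that it agrees with base R's scale(). The object name z is made up for this example.

```r
# Helper as defined in this post (repeated so the snippet is self-contained)
z_std <- function(observed) {
  (observed - mean(observed)) / sd(observed)
}

# z is a hypothetical name, used only for this check
z <- z_std(iris$Sepal.Length)

# A z-standardized variable should have mean 0 and sd 1 (up to rounding error)
all.equal(mean(z), 0)
all.equal(sd(z), 1)

# scale() centers and scales by default, but returns a one-column matrix,
# hence the as.numeric() before comparing
all.equal(z, as.numeric(scale(iris$Sepal.Length)))
```

Note that the helper makes no provision for missing values: a single NA in the input would turn the whole result into NA, because mean() and sd() return NA in that case.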
Now let’s apply it:
iris %>%
  mutate(Sepal.Length_z = z_std(Sepal.Length)) %>%
  head()
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length_z
## 1 5.1 3.5 1.4 0.2 setosa -0.8976739
## 2 4.9 3.0 1.4 0.2 setosa -1.1392005
## 3 4.7 3.2 1.3 0.2 setosa -1.3807271
## 4 4.6 3.1 1.5 0.2 setosa -1.5014904
## 5 5.0 3.6 1.4 0.2 setosa -1.0184372
## 6 5.4 3.9 1.7 0.4 setosa -0.5353840
Much cleaner, simpler, more relaxing.
Now to the conveyor belt
Now let’s apply it to each column:
iris %>%
  select_if(is.numeric) %>%
  mutate_all(funs(z = z_std(.))) %>%
  head()
## Warning: funs() is soft deprecated as of dplyr 0.8.0
## please use list() instead
##
## # Before:
## funs(name = f(.)
##
## # After:
## list(name = ~f(.))
## This warning is displayed once per session.
## Sepal.Length Sepal.Width Petal.Length Petal.Width Sepal.Length_z
## 1 5.1 3.5 1.4 0.2 -0.8976739
## 2 4.9 3.0 1.4 0.2 -1.1392005
## 3 4.7 3.2 1.3 0.2 -1.3807271
## 4 4.6 3.1 1.5 0.2 -1.5014904
## 5 5.0 3.6 1.4 0.2 -1.0184372
## 6 5.4 3.9 1.7 0.4 -0.5353840
## Sepal.Width_z Petal.Length_z Petal.Width_z
## 1 1.01560199 -1.335752 -1.311052
## 2 -0.13153881 -1.335752 -1.311052
## 3 0.32731751 -1.392399 -1.311052
## 4 0.09788935 -1.279104 -1.311052
## 5 1.24503015 -1.335752 -1.311052
## 6 1.93331463 -1.165809 -1.048667
Changes in dplyr
You might have noticed this warning:
Warning: funs() is soft deprecated as of dplyr 0.8.0
So let’s change the code above to reflect the change in dplyr.
iris %>%
  select_if(is.numeric) %>%
  mutate_all(list(z = ~ z_std(.))) %>%
  head()
## Sepal.Length Sepal.Width Petal.Length Petal.Width Sepal.Length_z
## 1 5.1 3.5 1.4 0.2 -0.8976739
## 2 4.9 3.0 1.4 0.2 -1.1392005
## 3 4.7 3.2 1.3 0.2 -1.3807271
## 4 4.6 3.1 1.5 0.2 -1.5014904
## 5 5.0 3.6 1.4 0.2 -1.0184372
## 6 5.4 3.9 1.7 0.4 -0.5353840
## Sepal.Width_z Petal.Length_z Petal.Width_z
## 1 1.01560199 -1.335752 -1.311052
## 2 -0.13153881 -1.335752 -1.311052
## 3 0.32731751 -1.392399 -1.311052
## 4 0.09788935 -1.279104 -1.311052
## 5 1.24503015 -1.335752 -1.311052
## 6 1.93331463 -1.165809 -1.048667
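The names in the list are what end up as column suffixes, so the same pattern should extend to several functions at once. Here is a minimal sketch of that idea; the second helper, center, and the object iris_multi are hypothetical and exist only for illustration.

```r
library(dplyr)

# Helper from this post (repeated so the snippet is self-contained)
z_std <- function(observed) {
  (observed - mean(observed)) / sd(observed)
}

# Hypothetical second helper: center a variable without scaling it
center <- function(observed) {
  observed - mean(observed)
}

# Each list name becomes a suffix: Sepal.Length_z, Sepal.Length_c, ...
iris_multi <- iris %>%
  select_if(is.numeric) %>%
  mutate_all(list(z = ~ z_std(.), c = ~ center(.)))

names(iris_multi)
```

With four numeric columns and two functions, this gives the four originals plus eight new columns.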
This code can be written more compactly, see below.
iris %>%
  select_if(is.numeric) %>%
  mutate_all(~ z_std(.)) %>%
  head()
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 -0.8976739 1.01560199 -1.335752 -1.311052
## 2 -1.1392005 -0.13153881 -1.335752 -1.311052
## 3 -1.3807271 0.32731751 -1.392399 -1.311052
## 4 -1.5014904 0.09788935 -1.279104 -1.311052
## 5 -1.0184372 1.24503015 -1.335752 -1.311052
## 6 -0.5353840 1.93331463 -1.165809 -1.048667
Note that if you don’t supply a name (suffix) such as z in the example above, the function will silently overwrite the original variables.
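One more change in dplyr is worth a mention: from version 1.0.0 on, the scoped verbs such as mutate_all() are superseded by across(). The sketch below assumes dplyr 1.0.0 or later and shows how the same two behaviours (suffixing versus overwriting) might look with across(); the object names iris_z and iris_overwritten are made up for this example.

```r
library(dplyr)

# Helper from this post (repeated so the snippet is self-contained)
z_std <- function(observed) {
  (observed - mean(observed)) / sd(observed)
}

# Keep the originals and add suffixed copies via .names;
# Species is not numeric and is therefore left untouched
iris_z <- iris %>%
  mutate(across(where(is.numeric), z_std, .names = "{.col}_z"))

# Without .names, the numeric columns are overwritten in place
iris_overwritten <- iris %>%
  mutate(across(where(is.numeric), z_std))

names(iris_z)
names(iris_overwritten)
```

A side benefit of this variant is that no select_if() step is needed, so non-numeric columns such as Species stay in the data frame.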