Some say the pipe (#tidyverse) makes analyses in R easier. I agree. This post walks through some examples.
Let’s take the mtcars dataset as an example.
data(mtcars)
?mtcars
Say we would like to compute the correlation between fuel economy (mpg, miles per gallon) and horsepower (hp).
Base approach 1
cor(mtcars[, c("mpg", "hp")])
## mpg hp
## mpg 1.0000000 -0.7761684
## hp -0.7761684 1.0000000
We use the [ operator (itself a function) to select the columns. Note that df[, c(col1, col2)] treats the dataframe like a matrix and returns a dataframe, not a vector:
class(mtcars[, c("mpg", "hp")])
## [1] "data.frame"
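As a small sketch of this behaviour in base R only: two columns selected with [ stay a dataframe, whereas a single column selected the same way is simplified to a vector unless drop = FALSE is set. The object names below are made up for illustration.

```r
# Two columns selected with `[` stay a dataframe
two_cols <- mtcars[, c("mpg", "hp")]
class(two_cols)

# A single column selected the same way is simplified to a numeric vector ...
one_col <- mtcars[, "mpg"]
class(one_col)

# ... unless we ask R not to drop the dataframe structure
one_col_df <- mtcars[, "mpg", drop = FALSE]
class(one_col_df)
```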
That’s ok, because cor expects a matrix or a dataframe as input. Alternatively, we can treat dataframes as lists, as in the following example.
Base approach 2
cor.test(x = mtcars[["mpg"]], y = mtcars[["hp"]])
##
## Pearson's product-moment correlation
##
## data: mtcars[["mpg"]] and mtcars[["hp"]]
## t = -6.7424, df = 30, p-value = 1.788e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.8852686 -0.5860994
## sample estimates:
## cor
## -0.7761684
The [[ operator extracts a column from a list (a dataframe is technically a list) and returns it as a vector. This is useful because some functions, such as cor.test, don’t digest dataframes but want vectors as input (here x and y).
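To make the difference concrete, here is a brief base R sketch comparing the extractors on mtcars. The variable names are only illustrative.

```r
# `[[` pulls out the column itself, as a plain numeric vector
mpg_vec <- mtcars[["mpg"]]
is.numeric(mpg_vec)
length(mpg_vec)  # one value per car

# `$` is shorthand for the same extraction
identical(mpg_vec, mtcars$mpg)

# List-style single brackets keep the dataframe wrapper instead
mpg_df <- mtcars["mpg"]
class(mpg_df)
```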
Pipe approach 1
We will use dplyr to demonstrate the pipe approach.
library(dplyr)
mtcars %>%
  select(mpg, hp) %>%
  cor()
## mpg hp
## mpg 1.0000000 -0.7761684
## hp -0.7761684 1.0000000
If you are not acquainted with dplyr, the %>% operator can be read as “then do”. More specifically, the result of the left-hand side (lhs) is passed as the first argument to the function on the right-hand side (rhs).
Easy, right?
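One way to convince yourself of this reading is to write the same computation once piped and once nested, and compare. A minimal sketch, assuming dplyr is installed; piped and nested are just illustrative names.

```r
library(dplyr)

# Piped version: read top to bottom
piped <- mtcars %>%
  select(mpg, hp) %>%
  cor()

# Nested version: read inside out
nested <- cor(select(mtcars, mpg, hp))

# Both give the same correlation matrix
all.equal(piped, nested)
```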
Pipe approach 2
We will need broom here, a package that renders some R output into a nice (i.e., tidy) dataframe. For example, cor.test does not return a dataframe on its own. Applying tidy() from broom to its output gives us a tidy dataframe:
library(broom)
cor.test(x = mtcars[["mpg"]], y = mtcars[["hp"]]) %>% tidy
## estimate statistic p.value parameter conf.low conf.high
## 1 -0.7761684 -6.742389 1.787835e-07 30 -0.8852686 -0.5860994
## method alternative
## 1 Pearson's product-moment correlation two.sided
# same:
tidy(cor.test(x = mtcars[["mpg"]], y = mtcars[["hp"]]))
## estimate statistic p.value parameter conf.low conf.high
## 1 -0.7761684 -6.742389 1.787835e-07 30 -0.8852686 -0.5860994
## method alternative
## 1 Pearson's product-moment correlation two.sided
This code can be made simpler using dplyr:
mtcars %>%
  do(tidy(cor.test(.$mpg, .$hp)))
## estimate statistic p.value parameter conf.low conf.high
## 1 -0.7761684 -6.742389 1.787835e-07 30 -0.8852686 -0.5860994
## method alternative
## 1 Pearson's product-moment correlation two.sided
The function do from dplyr runs any function, provided it returns a dataframe. That’s why we first apply tidy from broom and only then hand the result to do.
The dot (.) refers to the dataframe as handed over from the previous step. We need it because cor.test does not know any variable by the name mpg (unless you have attached mtcars beforehand).
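The dot also works outside of do. A minimal sketch, assuming magrittr is installed: wrapping the right-hand side in curly braces tells the pipe not to insert the dataframe as the first argument, so only the explicit dots are used. The names r_dot and r_plain are made up for this example.

```r
library(magrittr)

# Without braces, the pipe would also pass mtcars as the first argument.
# Wrapping the call in { } suppresses that, so only the dot is used.
r_dot <- mtcars %>% { cor(.$mpg, .$hp) }

# The same computation without the pipe
r_plain <- cor(mtcars$mpg, mtcars$hp)

all.equal(r_dot, r_plain)
```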
This code produces the same result:
mtcars %>%
  do(cor.test(.$mpg, .$hp) %>% tidy) %>%
  knitr::kable()
estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
---|---|---|---|---|---|---|---|
-0.7761684 | -6.742388 | 2e-07 | 30 | -0.8852686 | -0.5860994 | Pearson’s product-moment correlation | two.sided |
Pipe approach 3
The package magrittr (load it with library(magrittr) first, since dplyr only brings along %>%) provides some pipe variants, most importantly perhaps the “exposition pipe”, %$%:
mtcars %$%
  cor.test(mpg, hp) %>%
  tidy
## estimate statistic p.value parameter conf.low conf.high
## 1 -0.7761684 -6.742389 1.787835e-07 30 -0.8852686 -0.5860994
## method alternative
## 1 Pearson's product-moment correlation two.sided
Why is it useful? Let’s spell out the code above in more detail.
- Line 1: “Hey R, pick up mtcars, but do not simply pass this dataframe over; pull out each column and pass those columns over.”
- Line 2: “Run the function cor.test with mpg and hp”, and then …
- Line 3: “Tidy the result up. Not necessary here, but quite nice.”
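If you know base R's with(), the exposition pipe can be thought of in much the same way: both make the column names visible to the expression that follows. A small sketch, assuming magrittr is installed; r_expo and r_with are illustrative names.

```r
library(magrittr)

# Exposition pipe: column names are visible to the right-hand side
r_expo <- mtcars %$% cor(mpg, hp)

# Base R's with() exposes the columns in much the same way
r_with <- with(mtcars, cor(mpg, hp))

all.equal(r_expo, r_with)
```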
Remember that cor.test, called the way we call it here, does not accept a dataframe as input. It expects two vectors. That’s why we need to transform the dataframe mtcars into a bundle of vectors (i.e., the columns).
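As an aside, base R's cor.test also has a formula interface that takes the dataframe through its data argument, which is another way around the two-vectors requirement. A minimal sketch comparing it with the vector-based call; the object names are only for illustration.

```r
# Formula method: name the two columns and pass the dataframe via `data`
res_formula <- cor.test(~ mpg + hp, data = mtcars)

# Default method with two vectors, as used throughout this post
res_vectors <- cor.test(mtcars$mpg, mtcars$hp)

# Both calls estimate the same correlation
all.equal(unname(res_formula$estimate), unname(res_vectors$estimate))
```

Whether the formula or the pipe reads better is largely a matter of taste; the pipe version has the advantage of chaining naturally into tidy.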
Recap
In sum, I think the pipe makes life easier. Of course, one needs to get used to it. But after a while, it’s much simpler than working with deeply nested [ brackets.
Enjoy the pipe!