Let’s get purrr. Recently, I ran across this issue: I had a data frame with many columns, and I wanted to select all numeric columns and submit each of them to a t-test with a grouping variable.
As this is quite a common task, and the purrr approach (package purrr by @HadleyWickham) is quite elegant, I present it in this post.
Let’s load the data, the Affairs data set, and some packages:
data(Affairs, package = "AER")
library(purrr) # functional programming
library(dplyr) # dataframe wrangling
library(ggplot2) # plotting
library(tidyr) # reshaping df
Don’t forget that these four packages, plus AER, which provides the data set, need to be installed first.
Now let’s select all numeric variables from the data set. dplyr and purrr both provide convenient functions for that purpose; here we use select_if from dplyr:
Affairs %>%
select_if(is.numeric) %>% head
## affairs age yearsmarried religiousness education occupation rating
## 4 0 37 10.00 3 18 7 4
## 5 0 27 4.00 4 14 6 4
## 11 0 32 15.00 1 12 1 4
## 16 0 57 15.00 5 18 6 5
## 23 0 22 0.75 2 17 6 3
## 29 0 32 1.50 2 17 5 5
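For comparison, here is a minimal sketch of how the same selection might look with purrr alone. It assumes that `keep()` is applied directly to the data frame, which works because a data frame is a list of columns; the name `num_cols` is just an illustrative choice, not something from the original analysis.

```r
library(purrr)
data(Affairs, package = "AER")

# keep() retains the elements (here: columns) for which the predicate is TRUE
num_cols <- keep(Affairs, is.numeric)

# show which columns survived the filter
names(num_cols)
```

Either way, the result holds the same seven numeric columns shown in the output above.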
In the next step, we “map” each of these columns to a function, here the t-test.
Affairs %>%
select_if(is.numeric) %>%
map(~t.test(. ~ Affairs$gender)$p.value)
## $affairs
## [1] 0.7739606
##
## $age
## [1] 2.848452e-06
##
## $yearsmarried
## [1] 0.458246
##
## $religiousness
## [1] 0.8513998
##
## $education
## [1] 9.772643e-24
##
## $occupation
## [1] 8.887471e-35
##
## $rating
## [1] 0.8533625
The map function may look obscure if you have not seen it before. As said, map applies the function you supply to each column in turn. The ~t.test() bit means that you define an anonymous function, just as you would for normal apply calls, for example. So equivalently, one could write:
Affairs %>%
select_if(is.numeric) %>%
#map(~t.test(. ~ Affairs$gender)$p.value) %>%
map(function(x) t.test(x ~ Affairs$gender)$p.value)
## $affairs
## [1] 0.7739606
##
## $age
## [1] 2.848452e-06
##
## $yearsmarried
## [1] 0.458246
##
## $religiousness
## [1] 0.8513998
##
## $education
## [1] 9.772643e-24
##
## $occupation
## [1] 8.887471e-35
##
## $rating
## [1] 0.8533625
The ~ is just a convenient shorthand for the normal way of writing anonymous functions. And the dot . is in turn a shorthand for the column that is passed to the function (just like x in the normal apply call).
Well, that’s basically it! The $p.value bit just extracts the p-value out of the t-test object.
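Since each call returns a single number, the list output can be flattened. The following is a small sketch, assuming purrr's `map_dbl()` variant, which returns a named numeric vector instead of a list; the names `num_cols` and `p_values` are illustrative only.

```r
library(purrr)
data(Affairs, package = "AER")

# numeric columns only
num_cols <- keep(Affairs, is.numeric)

# map_dbl() works like map(), but expects one number per element
# and returns a named numeric vector
p_values <- map_dbl(num_cols, function(x) t.test(x ~ Affairs$gender)$p.value)

# names of the variables whose p-value falls below .05
names(p_values)[p_values < .05]
```

A plain vector like this is easier to filter, sort, or round than a list of length-one elements.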
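The more familiar lapply approach would be something like this (here for three columns only, and printing the full test results rather than just the p-values):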
lapply(Affairs[c("affairs", "age", "yearsmarried")], function(x) t.test(x ~ Affairs$gender))
## $affairs
##
## Welch Two Sample t-test
##
## data: x by Affairs$gender
## t = -0.28733, df = 594.01, p-value = 0.774
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.6068861 0.4519744
## sample estimates:
## mean in group female mean in group male
## 1.419048 1.496503
##
##
## $age
##
## Welch Two Sample t-test
##
## data: x by Affairs$gender
## t = -4.7285, df = 575.26, p-value = 2.848e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -5.014417 -2.071219
## sample estimates:
## mean in group female mean in group male
## 30.80159 34.34441
##
##
## $yearsmarried
##
## Welch Two Sample t-test
##
## data: x by Affairs$gender
## t = -0.74222, df = 595.53, p-value = 0.4582
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.2306829 0.5556058
## sample estimates:
## mean in group female mean in group male
## 8.017070 8.354608
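To get only the p-values in base R, one option is `sapply()`, which simplifies the result to a named vector. This is a sketch for the same three columns; `vars` and `p_base` are names made up for this example.

```r
data(Affairs, package = "AER")

# the three columns used in the lapply() call above
vars <- c("affairs", "age", "yearsmarried")

# sapply() simplifies the list of single numbers to a named numeric vector
p_base <- sapply(Affairs[vars], function(x) t.test(x ~ Affairs$gender)$p.value)

round(p_base, 4)
```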
Now, finally, let’s plot the result for easier comprehension. Some minor wrangling of the data is necessary:
Affairs %>%
select_if(is.numeric) %>%
map(~t.test(. ~ Affairs$gender)$p.value) %>% # one p-value per numeric column
as.data.frame %>% # one-row data frame, one column per variable
gather %>% # long format with columns key and value
mutate(signif = ifelse(value < .05, "significant", "ns")) %>%
ggplot(aes(x = reorder(key, value), y = value)) +
geom_point(aes(color = signif)) +
coord_flip() +
ylab("p value")
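As an aside, the wrangling step can also be done without reshaping. The sketch below is one possible variant, not the approach used above: it assumes `map_dbl()` to get a named vector and then builds the long data frame by hand. The object names (`p_values`, `p_df`, `p_plot`) and the dashed reference line at .05 are additions for illustration.

```r
library(purrr)
library(ggplot2)
data(Affairs, package = "AER")

# named numeric vector of p-values, one per numeric column
p_values <- map_dbl(keep(Affairs, is.numeric),
                    function(x) t.test(x ~ Affairs$gender)$p.value)

# long data frame with the same column names that gather() produces
p_df <- data.frame(key = names(p_values), value = unname(p_values))
p_df$signif <- ifelse(p_df$value < .05, "significant", "ns")

p_plot <- ggplot(p_df, aes(x = reorder(key, value), y = value)) +
  geom_point(aes(color = signif)) +
  geom_hline(yintercept = .05, linetype = "dashed") +  # illustrative threshold line
  coord_flip() +
  labs(x = "variable", y = "p value")

p_plot
```

Keeping the p-values in a data frame of their own also makes it easy to inspect or export them before plotting.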