A frequent job in data visualizing is to present summary statistics. In this post, I show one way to plot mean values between groups using the tidyverse approach in comparison to the mosaic way.
library(tidyverse)
data(mtcars)
library(mosaic)
library(knitr)
library(sjmisc)
library(sjPlot)
Visualizing mean values between two groups
First, let’s compute the mean hp for automatic cars (am == 0
) vs. manual cars (am == 1
).
mtcars %>%
group_by(am) %>%
summarise(hp_am = mean(hp)) -> hp_am
Now just hand over this data frame of summarized data to ggplot:
hp_am %>%
ggplot() +
aes(x = factor(am), y = hp_am) +
geom_point(aes(shape = factor(am)), size = 5, alpha = .7) +
geom_line(group = 1) -> p1
p1
An alternative approach would be using the R package mosaic
:
favstats(hp ~ am, data = mtcars) %>%
rownames_to_column() -> mtcars_favstats
Now hand over this data frame to gf_XXX()
from mosaic
(wrappers around ggplot()
):
gf_point(mean ~ rowname, data = mtcars_favstats, shape = ~rowname, size = 5, alpha = .7) %>%
gf_line(mean ~ rowname, group = 1)
Adding uncertainty measures
If we are not primarily interested in sample description but population estimation we need to add uncertainty measures - such as the standard error (SE).
One easy solution is to make use of the package gplots
:
library(gplots)
plotmeans(hp ~ am, data = mtcars)
Note that the errors describe the 95% confidence interval of the respective mean.
Using ggplot2, there are some helper functions from hmisc
that will do to computation of the standard error for us. One way is to use the stat_summary()
function:
p1 +
stat_summary(fun.data = "mean_cl_normal", geom = "errorbar", data = mtcars, aes(y = hp))
stat_summary()
provides an alternative to geom_XXX()
for building a plot. Here the focus lies on “which summary statistic do I want to compute?
mean_cl_normal
computes the mean and the confidence limits based on a t-distribution. As an alternative mean_cl_boot
would produce a less assumption laden bootstrapped confidence interval.
geom_errorbar()
yields essentially the same result. Here, we also specify that we do not want the 2-SE-interval (95%) but the 1-SE-interval (via fun.args
).
p1 +
geom_errorbar(stat = "summary", fun.data = "mean_sdl",
fun.args = list(mult = 1),
position = position_dodge(width = 0.9),
data = mtcars,
aes(y = hp))
stat = "summary"
computes a summary at each position of x. In this case, the summary function mean_sdl
is called. fun_data
returns three values: y
, ymin
and ymax
.
Less intrusive bars
I personally find the bars to intrusive; let’s try to deemphasize their appearance.
mtcars %>%
ggplot() +
aes(x = factor(am), y = hp) +
geom_pointrange(stat = "summary", fun.data = "mean_sdl",
fun.args = list(mult = 1),
position = position_dodge(width = 0.9),
size = 2,
aes(y = hp, shape = factor(am), color = factor(am))) +
geom_line(stat = "summary", fun.data = "mean_sdl",
group = 1)
Plot table
Sometimes, we do not (only) want a diagram but also the exact figures. How to get that? kable()
provides a convenient way:
hp_am %>%
kable()
am | hp_am |
---|---|
0 | 160.2632 |
1 | 126.8462 |
The package sjmisc
provides a convenience function for summary statistics:
mtcars %>%
group_by(am) %>%
select(hp) %>%
descr()
#>
#> ## Basic descriptive statistics
#>
#> Grouped by:
#> am: 0
#> var type label n NA.prc mean sd se md trimmed range
#> hp numeric hp 19 0 160.26 53.91 12.37 175 161.06 183 (62-245)
#> skew
#> -0.02
#>
#> Grouped by:
#> am: 1
#> var type label n NA.prc mean sd se md trimmed range
#> hp numeric hp 13 0 126.85 84.06 23.31 109 114.73 283 (52-335)
#> skew
#> 1.74
Double grouping
Let’s try a double grouping and see where it takes us:
mtcars %>%
group_by(am, cyl) %>%
summarise(hp_mean = mean(hp)) %>%
ungroup() -> hp_am_cyl
hp_am_cyl
#> # A tibble: 6 x 3
#> am cyl hp_mean
#> <dbl> <dbl> <dbl>
#> 1 0 4 84.7
#> 2 0 6 115.
#> 3 0 8 194.
#> 4 1 4 81.9
#> 5 1 6 132.
#> 6 1 8 300.
Now let’s add the mean per am
group (the “overall” means):
hp_am_cyl %>%
group_by(am) %>%
mutate(overall_mean = mean(hp_mean)) -> hp_am_cyl
hp_am_cyl %>% kable()
am | cyl | hp_mean | overall_mean |
---|---|---|---|
0 | 4 | 84.66667 | 131.3611 |
0 | 6 | 115.25000 | 131.3611 |
0 | 8 | 194.16667 | 131.3611 |
1 | 4 | 81.87500 | 171.0139 |
1 | 6 | 131.66667 | 171.0139 |
1 | 8 | 299.50000 | 171.0139 |
This format is called the long format. We can shift it to the wide format:
hp_am_cyl %>%
spread(key = cyl, value = hp_mean) %>%
kable()
am | overall_mean | 4 | 6 | 8 |
---|---|---|---|---|
0 | 131.3611 | 84.66667 | 115.2500 | 194.1667 |
1 | 171.0139 | 81.87500 | 131.6667 | 299.5000 |
Plot multiple grouping
hp_am_cyl %>%
ggplot(aes(x = factor(am), color = factor(cyl), y = hp_mean)) +
geom_point(aes(shape = factor(am))) +
geom_line(aes(group = factor(cyl)))