A frequent task in data visualization is presenting summary statistics. In this post, I show one way to plot mean values between groups using the tidyverse approach and compare it with the mosaic way.
library(tidyverse)
data(mtcars)
library(mosaic)
library(knitr)
library(sjmisc)
library(sjPlot)
Visualizing mean values between two groups
First, let’s compute the mean hp for automatic cars (am == 0) vs. manual cars (am == 1).
mtcars %>% group_by(am) %>% summarise(hp_am = mean(hp)) -> hp_am
Now just hand over this data frame of summarized data to ggplot:
hp_am %>%
  ggplot() +
  aes(x = factor(am), y = hp_am) +
  geom_point(aes(shape = factor(am)), size = 5, alpha = .7) +
  geom_line(group = 1) -> p1

p1
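As a sketch of an alternative, ggplot2 can compute the group means itself, so the intermediate data frame is not strictly needed. This assumes ggplot2 3.3.0 or later, where the argument is called fun (older versions used fun.y); the object name p_direct is made up for illustration.

```r
library(ggplot2)

# Let stat_summary() compute the mean of hp per am group on the fly
p_direct <- ggplot(mtcars, aes(x = factor(am), y = hp)) +
  stat_summary(fun = mean, geom = "point", size = 5, alpha = .7) +
  # aes(group = 1) puts both means into one group so the line connects them
  stat_summary(aes(group = 1), fun = mean, geom = "line")

p_direct
```

Precomputing the means, as above, has the advantage that the same numbers can be reused for tables.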
An alternative approach would be using the R package mosaic:
favstats(hp ~ am, data = mtcars) %>% rownames_to_column() -> mtcars_favstats
Now hand over this data frame to the gf_ plotting functions that are loaded with mosaic (wrappers around ggplot2):
gf_point(mean ~ rowname, data = mtcars_favstats, shape = ~rowname, size = 5, alpha = .7) %>% gf_line(mean ~ rowname, group = 1)
Adding uncertainty measures
If we are primarily interested in estimating population values rather than describing the sample, we need to add uncertainty measures, such as the standard error (SE).
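For orientation, here is a minimal sketch of how the standard error of the mean could be computed per group by hand, as sd / sqrt(n). The column names hp_mean, hp_sd, and hp_se and the object name hp_se_tbl are chosen for illustration only.

```r
library(dplyr)

# Standard error of the mean per am group: sd / sqrt(n)
hp_se_tbl <- mtcars %>%
  group_by(am) %>%
  summarise(
    hp_mean = mean(hp),
    hp_sd   = sd(hp),
    n       = n(),
    hp_se   = hp_sd / sqrt(n)
  )

hp_se_tbl
```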
One easy solution is to make use of the package gplots:
library(gplots)
plotmeans(hp ~ am, data = mtcars)
Note that the error bars describe the 95% confidence interval of the respective mean.
Using ggplot2, there are some helper functions from Hmisc that will do the computation of the uncertainty measure for us. One way is to use stat_summary():
p1 + stat_summary(fun.data = "mean_cl_normal", geom = "errorbar", data = mtcars, aes(y = hp))
stat_summary() provides an alternative to geom_XXX() for building a plot. Here the focus lies on “which summary statistic do I want to compute?”
mean_cl_normal computes the mean and the confidence limits based on a t-distribution. As an alternative, mean_cl_boot would produce a less assumption-laden bootstrapped confidence interval.
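To make the t-based interval concrete, the following base R sketch computes it by hand for the automatic cars. It is meant as an illustration of the formula mean ± t × SE, not as a copy of the Hmisc internals; all object names are made up.

```r
# Hand-rolled t-based 95% confidence interval for one group
hp_auto <- mtcars$hp[mtcars$am == 0]

n      <- length(hp_auto)
m      <- mean(hp_auto)
se     <- sd(hp_auto) / sqrt(n)
t_crit <- qt(0.975, df = n - 1)  # critical value for a two-sided 95% interval

ci <- c(lower = m - t_crit * se, mean = m, upper = m + t_crit * se)
ci
```

The limits should agree with what t.test(hp_auto)$conf.int reports.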
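geom_errorbar() with stat = "summary" builds the same kind of plot. Here, we use mean_sdl and specify that we do not want its default interval of mean ± 2 SD but mean ± 1 SD (via fun.args = list(mult = 1)). Note that mean_sdl is based on the standard deviation, not the standard error, so these bars are wider than a 1-SE interval.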
p1 + geom_errorbar(stat = "summary", fun.data = "mean_sdl", fun.args = list(mult = 1), position = position_dodge(width = 0.9), data = mtcars, aes(y = hp))
stat = "summary" computes a summary at each position of x. In this case, the summary function mean_sdl is called.
The function passed to fun.data returns three values for each group: y, ymin, and ymax.
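The contract of such a summary function can be sketched with a small hand-written helper that returns a one-row data frame with the columns y, ymin, and ymax. The helper name mean_pm_se is hypothetical; ggplot2 ships a comparable function, mean_se(), which is used here only as a cross-check. The plot assumes ggplot2 3.3.0 or later for the fun argument.

```r
library(ggplot2)

# Hypothetical helper: mean plus/minus mult standard errors
mean_pm_se <- function(x, mult = 1) {
  m  <- mean(x)
  se <- sd(x) / sqrt(length(x))
  data.frame(y = m, ymin = m - mult * se, ymax = m + mult * se)
}

mean_pm_se(mtcars$hp)  # one row with y, ymin, ymax
mean_se(mtcars$hp)     # ggplot2's built-in counterpart

# Error bars of one standard error around each group mean
p_se <- ggplot(mtcars, aes(x = factor(am), y = hp)) +
  geom_errorbar(stat = "summary", fun.data = mean_pm_se, width = 0.2) +
  geom_point(stat = "summary", fun = mean, size = 5, alpha = .7)

p_se
```

Any function that returns these three columns can be passed to fun.data, which makes it easy to switch between SD, SE, or confidence limits.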
Less intrusive bars
I personally find the bars too intrusive; let’s try to deemphasize their appearance.
mtcars %>%
  ggplot() +
  aes(x = factor(am), y = hp) +
  geom_pointrange(stat = "summary",
                  fun.data = "mean_sdl",
                  fun.args = list(mult = 1),
                  position = position_dodge(width = 0.9),
                  size = 2,
                  aes(y = hp, shape = factor(am), color = factor(am))) +
  geom_line(stat = "summary", fun.data = "mean_sdl", group = 1)
Sometimes, we do not (only) want a diagram but also the exact figures. How do we get them?
kable() provides a convenient way:
hp_am %>% kable()
sjmisc provides a convenience function for summary statistics:
mtcars %>%
  group_by(am) %>%
  select(hp) %>%
  descr()
#>
#> ## Basic descriptive statistics
#>
#> Grouped by:
#> am: 0
#>  var    type label  n NA.prc   mean    sd    se  md trimmed        range  skew
#>   hp numeric    hp 19      0 160.26 53.91 12.37 175  161.06 183 (62-245) -0.02
#>
#> Grouped by:
#> am: 1
#>  var    type label  n NA.prc   mean    sd    se  md trimmed        range  skew
#>   hp numeric    hp 13      0 126.85 84.06 23.31 109  114.73 283 (52-335)  1.74
Let’s try a double grouping and see where it takes us:
mtcars %>%
  group_by(am, cyl) %>%
  summarise(hp_mean = mean(hp)) %>%
  ungroup() -> hp_am_cyl

hp_am_cyl
#> # A tibble: 6 x 3
#>      am   cyl hp_mean
#>   <dbl> <dbl>   <dbl>
#> 1     0     4    84.7
#> 2     0     6   115.
#> 3     0     8   194.
#> 4     1     4    81.9
#> 5     1     6   132.
#> 6     1     8   300.
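Now let’s add the mean per am group (the “overall” means). Note that this is the unweighted mean of the cyl-specific means, not the mean over all cars in each am group: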
hp_am_cyl %>%
  group_by(am) %>%
  mutate(overall_mean = mean(hp_mean)) -> hp_am_cyl

hp_am_cyl %>% kable()
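Because the cyl groups differ in size, the mean of the group means is not the same as the mean over all cars in an am group. The following sketch puts both side by side; the object and column names are made up, and the .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Unweighted: mean of the three cyl-specific means within each am group
unweighted <- mtcars %>%
  group_by(am, cyl) %>%
  summarise(hp_mean = mean(hp), .groups = "drop") %>%
  group_by(am) %>%
  summarise(mean_of_means = mean(hp_mean))

# Weighted by group size: the plain mean over all cars in each am group
weighted <- mtcars %>%
  group_by(am) %>%
  summarise(mean_all_cars = mean(hp))

comparison <- left_join(unweighted, weighted, by = "am")
comparison
```

Which of the two is the more sensible “overall” mean depends on whether each cyl group or each car should count equally.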
This format is called the long format. We can shift it to the wide format:
hp_am_cyl %>% spread(key = cyl, value = hp_mean) %>% kable()
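In current versions of tidyr, spread() is superseded by pivot_wider(). A self-contained sketch of the same reshaping, assuming tidyr 1.0.0 or later and dplyr 1.0.0 or later (the cyl_ prefix and the name hp_wide are chosen here for readability and are not part of the original code):

```r
library(dplyr)
library(tidyr)

# Long format: one row per am/cyl combination
hp_long <- mtcars %>%
  group_by(am, cyl) %>%
  summarise(hp_mean = mean(hp), .groups = "drop")

# Wide format: one row per am group, one column per cyl value
hp_wide <- hp_long %>%
  pivot_wider(names_from = cyl, values_from = hp_mean, names_prefix = "cyl_")

hp_wide
```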
Plot multiple grouping
hp_am_cyl %>% ggplot(aes(x = factor(am), color = factor(cyl), y = hp_mean)) + geom_point(aes(shape = factor(am))) + geom_line(aes(group = factor(cyl)))