Visualizing mean values between two groups - the tidyverse way

A frequent job in data visualizing is to present summary statistics. In this post, I show one way to plot mean values between groups using the tidyverse approach in comparison to the mosaic way.

library(tidyverse)
data(mtcars)
library(mosaic)
library(knitr)
library(sjmisc)
library(sjPlot)

Visualizing mean values between two groups

First, let’s compute the mean hp for automatic cars (am == 0) vs. manual cars (am == 1).

mtcars %>% 
  group_by(am) %>% 
  summarise(hp_am = mean(hp)) -> hp_am

Now just hand over this data frame of summarized data to ggplot:

hp_am %>% 
  ggplot() +
  aes(x = factor(am), y = hp_am) +
  geom_point(aes(shape = factor(am)), size = 5, alpha = .7) +
  geom_line(group = 1) -> p1
p1

An alternative approach would be using the R package mosaic:

favstats(hp ~ am, data = mtcars) %>% 
  rownames_to_column() -> mtcars_favstats

Now hand over this data frame to gf_XXX() from mosaic (wrappers around ggplot()):

gf_point(mean ~ rowname, data = mtcars_favstats, shape = ~rowname, size = 5, alpha = .7) %>% 
  gf_line(mean ~ rowname, group = 1)

Adding uncertainty measures

If we are not primarily interested in sample description but population estimation we need to add uncertainty measures - such as the standard error (SE).

One easy solution is to make use of the package gplots:

library(gplots)
plotmeans(hp ~ am, data = mtcars)

Note that the errors describe the 95% confidence interval of the respective mean.

Using ggplot2, there are some helper functions from hmisc that will do to computation of the standard error for us. One way is to use the stat_summary() function:


p1 + 
  stat_summary(fun.data = "mean_cl_normal", geom = "errorbar", data = mtcars, aes(y = hp))

stat_summary() provides an alternative to geom_XXX() for building a plot. Here the focus lies on “which summary statistic do I want to compute?

mean_cl_normal computes the mean and the confidence limits based on a t-distribution. As an alternative mean_cl_boot would produce a less assumption laden bootstrapped confidence interval.

geom_errorbar() yields essentially the same result. Here, we also specify that we do not want the 2-SE-interval (95%) but the 1-SE-interval (via fun.args).

p1 +
  geom_errorbar(stat = "summary", fun.data = "mean_sdl", 
                  fun.args = list(mult = 1),
                  position =  position_dodge(width = 0.9),
                data = mtcars,
                aes(y = hp))

stat = "summary" computes a summary at each position of x. In this case, the summary function mean_sdl is called. fun_data returns three values: y, ymin and ymax.

Less intrusive bars

I personally find the bars to intrusive; let’s try to deemphasize their appearance.


mtcars %>%
  ggplot() +
  aes(x = factor(am), y = hp) +
  geom_pointrange(stat = "summary", fun.data = "mean_sdl", 
                  fun.args = list(mult = 1),
                  position =  position_dodge(width = 0.9),
                size = 2,
                aes(y = hp, shape = factor(am), color = factor(am))) +
  geom_line(stat = "summary", fun.data = "mean_sdl",
            group = 1)

Plot table

Sometimes, we do not (only) want a diagram but also the exact figures. How to get that? kable() provides a convenient way:

hp_am %>% 
  kable()
am hp_am
0 160.2632
1 126.8462

The package sjmisc provides a convenience function for summary statistics:

mtcars %>%
  group_by(am) %>% 
  select(hp) %>% 
  descr()
#> 
#> ## Basic descriptive statistics 
#> 
#> Grouped by:
#> am: 0 
#>  var    type label  n NA.prc   mean    sd    se  md trimmed        range
#>   hp numeric    hp 19      0 160.26 53.91 12.37 175  161.06 183 (62-245)
#>   skew
#>  -0.02
#> 
#> Grouped by:
#> am: 1 
#>  var    type label  n NA.prc   mean    sd    se  md trimmed        range
#>   hp numeric    hp 13      0 126.85 84.06 23.31 109  114.73 283 (52-335)
#>  skew
#>  1.74

Double grouping

Let’s try a double grouping and see where it takes us:

mtcars %>% 
  group_by(am, cyl) %>% 
  summarise(hp_mean = mean(hp)) %>%  
  ungroup() -> hp_am_cyl

hp_am_cyl
#> # A tibble: 6 x 3
#>      am   cyl hp_mean
#>   <dbl> <dbl>   <dbl>
#> 1     0     4    84.7
#> 2     0     6   115. 
#> 3     0     8   194. 
#> 4     1     4    81.9
#> 5     1     6   132. 
#> 6     1     8   300.

Now let’s add the mean per am group (the “overall” means):

hp_am_cyl %>% 
  group_by(am) %>% 
  mutate(overall_mean = mean(hp_mean)) -> hp_am_cyl

hp_am_cyl %>% kable()
am cyl hp_mean overall_mean
0 4 84.66667 131.3611
0 6 115.25000 131.3611
0 8 194.16667 131.3611
1 4 81.87500 171.0139
1 6 131.66667 171.0139
1 8 299.50000 171.0139

This format is called the long format. We can shift it to the wide format:

hp_am_cyl %>% 
  spread(key = cyl, value = hp_mean) %>% 
  kable()
am overall_mean 4 6 8
0 131.3611 84.66667 115.2500 194.1667
1 171.0139 81.87500 131.6667 299.5000

Plot multiple grouping

hp_am_cyl %>% 
  ggplot(aes(x = factor(am), color = factor(cyl), y = hp_mean)) +
  geom_point(aes(shape = factor(am))) +
  geom_line(aes(group = factor(cyl)))