# Visualizing mean values between two groups - the tidyverse way

A frequent job in data visualizing is to present summary statistics. In this post, I show one way to plot mean values between groups using the tidyverse approach in comparison to the mosaic way.

``````library(tidyverse)
data(mtcars)
library(mosaic)
library(knitr)
library(sjmisc)
library(sjPlot)``````

# Visualizing mean values between two groups

First, let’s compute the mean hp for automatic cars (`am == 0`) vs. manual cars (`am == 1`).

``````mtcars %>%
group_by(am) %>%
summarise(hp_am = mean(hp)) -> hp_am``````

Now just hand over this data frame of summarized data to ggplot:

``````hp_am %>%
ggplot() +
aes(x = factor(am), y = hp_am) +
geom_point(aes(shape = factor(am)), size = 5, alpha = .7) +
geom_line(group = 1) -> p1
p1`````` An alternative approach would be using the R package `mosaic`:

``````favstats(hp ~ am, data = mtcars) %>%
rownames_to_column() -> mtcars_favstats``````

Now hand over this data frame to `gf_XXX()` from `mosaic` (wrappers around `ggplot()`):

``````gf_point(mean ~ rowname, data = mtcars_favstats, shape = ~rowname, size = 5, alpha = .7) %>%
gf_line(mean ~ rowname, group = 1)`````` If we are not primarily interested in sample description but population estimation we need to add uncertainty measures - such as the standard error (SE).

One easy solution is to make use of the package `gplots`:

``````library(gplots)
plotmeans(hp ~ am, data = mtcars)`````` Note that the errors describe the 95% confidence interval of the respective mean.

Using ggplot2, there are some helper functions from `hmisc` that will do to computation of the standard error for us. One way is to use the `stat_summary()` function:

``````
p1 +
stat_summary(fun.data = "mean_cl_normal", geom = "errorbar", data = mtcars, aes(y = hp))`````` `stat_summary()` provides an alternative to `geom_XXX()` for building a plot. Here the focus lies on “which summary statistic do I want to compute?

`mean_cl_normal` computes the mean and the confidence limits based on a t-distribution. As an alternative `mean_cl_boot` would produce a less assumption laden bootstrapped confidence interval.

`geom_errorbar()` yields essentially the same result. Here, we also specify that we do not want the 2-SE-interval (95%) but the 1-SE-interval (via `fun.args`).

``````p1 +
geom_errorbar(stat = "summary", fun.data = "mean_sdl",
fun.args = list(mult = 1),
position =  position_dodge(width = 0.9),
data = mtcars,
aes(y = hp))`````` `stat = "summary"` computes a summary at each position of x. In this case, the summary function `mean_sdl` is called. `fun_data` returns three values: `y`, `ymin` and `ymax`.

# Less intrusive bars

I personally find the bars to intrusive; let’s try to deemphasize their appearance.

``````
mtcars %>%
ggplot() +
aes(x = factor(am), y = hp) +
geom_pointrange(stat = "summary", fun.data = "mean_sdl",
fun.args = list(mult = 1),
position =  position_dodge(width = 0.9),
size = 2,
aes(y = hp, shape = factor(am), color = factor(am))) +
geom_line(stat = "summary", fun.data = "mean_sdl",
group = 1)`````` # Plot table

Sometimes, we do not (only) want a diagram but also the exact figures. How to get that? `kable()` provides a convenient way:

``````hp_am %>%
kable()``````
am hp_am
0 160.2632
1 126.8462

The package `sjmisc` provides a convenience function for summary statistics:

``````mtcars %>%
group_by(am) %>%
select(hp) %>%
descr()
#>
#> ## Basic descriptive statistics
#>
#> Grouped by:
#> am: 0
#>  var    type label  n NA.prc   mean    sd    se  md trimmed        range
#>   hp numeric    hp 19      0 160.26 53.91 12.37 175  161.06 183 (62-245)
#>   skew
#>  -0.02
#>
#> Grouped by:
#> am: 1
#>  var    type label  n NA.prc   mean    sd    se  md trimmed        range
#>   hp numeric    hp 13      0 126.85 84.06 23.31 109  114.73 283 (52-335)
#>  skew
#>  1.74``````

## Double grouping

Let’s try a double grouping and see where it takes us:

``````mtcars %>%
group_by(am, cyl) %>%
summarise(hp_mean = mean(hp)) %>%
ungroup() -> hp_am_cyl

hp_am_cyl
#> # A tibble: 6 x 3
#>      am   cyl hp_mean
#>   <dbl> <dbl>   <dbl>
#> 1     0     4    84.7
#> 2     0     6   115.
#> 3     0     8   194.
#> 4     1     4    81.9
#> 5     1     6   132.
#> 6     1     8   300.``````

Now let’s add the mean per `am` group (the “overall” means):

``````hp_am_cyl %>%
group_by(am) %>%
mutate(overall_mean = mean(hp_mean)) -> hp_am_cyl

hp_am_cyl %>% kable()``````
am cyl hp_mean overall_mean
0 4 84.66667 131.3611
0 6 115.25000 131.3611
0 8 194.16667 131.3611
1 4 81.87500 171.0139
1 6 131.66667 171.0139
1 8 299.50000 171.0139

This format is called the long format. We can shift it to the wide format:

``````hp_am_cyl %>%
spread(key = cyl, value = hp_mean) %>%
kable()``````
am overall_mean 4 6 8
0 131.3611 84.66667 115.2500 194.1667
1 171.0139 81.87500 131.6667 299.5000

## Plot multiple grouping

``````hp_am_cyl %>%
ggplot(aes(x = factor(am), color = factor(cyl), y = hp_mean)) +
geom_point(aes(shape = factor(am))) +
geom_line(aes(group = factor(cyl)))`````` 