A frequent job in data visualizing is to present summary statistics. In this post, I show one way to plot mean values between groups using the tidyverse approach in comparison to the mosaic way.

```
library(tidyverse)
data(mtcars)
library(mosaic)
library(knitr)
library(sjmisc)
library(sjPlot)
```

# Visualizing mean values between two groups

First, let’s compute the mean hp for automatic cars (`am == 0`

) vs. manual cars (`am == 1`

).

```
mtcars %>%
group_by(am) %>%
summarise(hp_am = mean(hp)) -> hp_am
```

Now just hand over this data frame of summarized data to ggplot:

```
hp_am %>%
ggplot() +
aes(x = factor(am), y = hp_am) +
geom_point(aes(shape = factor(am)), size = 5, alpha = .7) +
geom_line(group = 1) -> p1
p1
```

An alternative approach would be using the R package `mosaic`

:

```
favstats(hp ~ am, data = mtcars) %>%
rownames_to_column() -> mtcars_favstats
```

Now hand over this data frame to `gf_XXX()`

from `mosaic`

(wrappers around `ggplot()`

):

```
gf_point(mean ~ rowname, data = mtcars_favstats, shape = ~rowname, size = 5, alpha = .7) %>%
gf_line(mean ~ rowname, group = 1)
```

# Adding uncertainty measures

If we are not primarily interested in sample description but population estimation we need to add uncertainty measures - such as the standard error (SE).

One easy solution is to make use of the package `gplots`

:

```
library(gplots)
plotmeans(hp ~ am, data = mtcars)
```

Note that the errors describe the 95% confidence interval of the respective mean.

Using ggplot2, there are some helper functions from `hmisc`

that will do to computation of the standard error for us. One way is to use the `stat_summary()`

function:

```
p1 +
stat_summary(fun.data = "mean_cl_normal", geom = "errorbar", data = mtcars, aes(y = hp))
```

`stat_summary()`

provides an alternative to `geom_XXX()`

for building a plot. Here the focus lies on “which summary statistic do I want to compute?

`mean_cl_normal`

computes the *m*ean and the *c*onfidence *l*imits based on a t-distribution. As an alternative `mean_cl_boot`

would produce a less assumption laden bootstrapped confidence interval.

`geom_errorbar()`

yields essentially the same result. Here, we also specify that we do not want the 2-SE-interval (95%) but the 1-SE-interval (via `fun.args`

).

```
p1 +
geom_errorbar(stat = "summary", fun.data = "mean_sdl",
fun.args = list(mult = 1),
position = position_dodge(width = 0.9),
data = mtcars,
aes(y = hp))
```

`stat = "summary"`

computes a summary at each position of x. In this case, the summary function `mean_sdl`

is called. `fun_data`

returns three values: `y`

, `ymin`

and `ymax`

.

# Less intrusive bars

I personally find the bars to intrusive; let’s try to deemphasize their appearance.

```
mtcars %>%
ggplot() +
aes(x = factor(am), y = hp) +
geom_pointrange(stat = "summary", fun.data = "mean_sdl",
fun.args = list(mult = 1),
position = position_dodge(width = 0.9),
size = 2,
aes(y = hp, shape = factor(am), color = factor(am))) +
geom_line(stat = "summary", fun.data = "mean_sdl",
group = 1)
```

# Plot table

Sometimes, we do not (only) want a diagram but also the exact figures. How to get that? `kable()`

provides a convenient way:

```
hp_am %>%
kable()
```

am | hp_am |
---|---|

0 | 160.2632 |

1 | 126.8462 |

The package `sjmisc`

provides a convenience function for summary statistics:

```
mtcars %>%
group_by(am) %>%
select(hp) %>%
descr()
#>
#> ## Basic descriptive statistics
#>
#> Grouped by:
#> am: 0
#> var type label n NA.prc mean sd se md trimmed range
#> hp numeric hp 19 0 160.26 53.91 12.37 175 161.06 183 (62-245)
#> skew
#> -0.02
#>
#> Grouped by:
#> am: 1
#> var type label n NA.prc mean sd se md trimmed range
#> hp numeric hp 13 0 126.85 84.06 23.31 109 114.73 283 (52-335)
#> skew
#> 1.74
```

## Double grouping

Let’s try a double grouping and see where it takes us:

```
mtcars %>%
group_by(am, cyl) %>%
summarise(hp_mean = mean(hp)) %>%
ungroup() -> hp_am_cyl
hp_am_cyl
#> # A tibble: 6 x 3
#> am cyl hp_mean
#> <dbl> <dbl> <dbl>
#> 1 0 4 84.7
#> 2 0 6 115.
#> 3 0 8 194.
#> 4 1 4 81.9
#> 5 1 6 132.
#> 6 1 8 300.
```

Now let’s add the mean per `am`

group (the “overall” means):

```
hp_am_cyl %>%
group_by(am) %>%
mutate(overall_mean = mean(hp_mean)) -> hp_am_cyl
hp_am_cyl %>% kable()
```

am | cyl | hp_mean | overall_mean |
---|---|---|---|

0 | 4 | 84.66667 | 131.3611 |

0 | 6 | 115.25000 | 131.3611 |

0 | 8 | 194.16667 | 131.3611 |

1 | 4 | 81.87500 | 171.0139 |

1 | 6 | 131.66667 | 171.0139 |

1 | 8 | 299.50000 | 171.0139 |

This format is called the long format. We can shift it to the wide format:

```
hp_am_cyl %>%
spread(key = cyl, value = hp_mean) %>%
kable()
```

am | overall_mean | 4 | 6 | 8 |
---|---|---|---|---|

0 | 131.3611 | 84.66667 | 115.2500 | 194.1667 |

1 | 171.0139 | 81.87500 | 131.6667 | 299.5000 |

## Plot multiple grouping

```
hp_am_cyl %>%
ggplot(aes(x = factor(am), color = factor(cyl), y = hp_mean)) +
geom_point(aes(shape = factor(am))) +
geom_line(aes(group = factor(cyl)))
```