In this post we are plotting an interaction for a logistic regression. Interaction per se is a concept difficult to grasp; for a GLM it may be even more difficult especially for continuous variables’ interaction. Plotting helps to better or more easy grasp what a model tries to tell us.

First, load some packages.

`library(tidyverse)`

`## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──`

```
## ✔ ggplot2 3.0.0 ✔ purrr 0.2.5
## ✔ tibble 1.4.2 ✔ dplyr 0.7.6
## ✔ tidyr 0.8.1 ✔ stringr 1.3.1
## ✔ readr 1.1.1 ✔ forcats 0.3.0
```

`## Warning: package 'dplyr' was built under R version 3.5.1`

```
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
```

```
library(titanic)
library(broom)
library(modelr)
```

```
##
## Attaching package: 'modelr'
```

```
## The following object is masked from 'package:broom':
##
## bootstrap
```

`library(knitr)`

We will deal with the well-known Titanic data, ie., we check which predictors augment survival probability:

```
data(titanic_train)
d <- titanic_train
```

Let’s glimpse the data:

`glimpse(d) `

```
## Observations: 891
## Variables: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
## $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0,...
## $ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3,...
## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bra...
## $ Sex <chr> "male", "female", "female", "female", "male", "mal...
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, ...
## $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4,...
## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1,...
## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "1138...
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, ...
## $ Cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", ...
## $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", ...
```

Let’s compute the logistic regression using the standard `glm()`

, using the following notation, the interaction term will be included. Note that this type of glm assumes a flat, unregulatated prior and a Gaussian likelihood, in Bayesian parlance. The interaction term is also linear. If you are not satisfied with this default values, better switch to a fully Bayesian analysis.

```
glm1 <- glm(Survived ~ Fare*Sex, data = d,
family = "binomial")
tidy(glm1) %>% kable()
```

term | estimate | std.error | statistic | p.value |
---|---|---|---|---|

(Intercept) | 0.4084278 | 0.1899986 | 2.149635 | 0.0315841 |

Fare | 0.0198785 | 0.0053715 | 3.700698 | 0.0002150 |

Sexmale | -2.0993453 | 0.2302908 | -9.116061 | 0.0000000 |

Fare:Sexmale | -0.0116171 | 0.0059337 | -1.957809 | 0.0502524 |

Now, consider plotting the interaction. Note first that the glm is a linear model. However, linear for the logits, ie:

\[l(y_i) = b_0 + b_1x_i + \dots\]

where \(l(y_i)\) is the logit of the probability of the modeled event for case \(i\).

In other words: It’s just a linear model, just like “normal” regression. Hence, we can plot it using straight, linear lines; that’s true for the interaction as well. Plotting the interaction of a glm is nothing new in this sense.

```
d %>%
mutate(pred_logit = predict(glm1)) -> d
```

Let’s glimpse the distribution of the logits:

```
ggplot(d) +
aes(x = pred_logit, fill = Sex) +
geom_density(alpha = .5)
```

Now let’s plot the interaction on logit scale:

```
d %>%
ggplot() +
aes(x = Fare, y = pred_logit, color = Sex) +
geom_point() +
geom_line()
```

Just lines - diverging lines in this case, indicating the presence of an interaction effect between `Sex`

and `Fare`

.

For the sake of curiosity, let’s compute a linear model with `Survived`

as criterion:

```
lm1 <- lm(Survived ~ Fare*Sex, data = d)
tidy(lm1)
```

```
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.669 0.0286 23.4 1.14e-94
## 2 Fare 0.00165 0.000391 4.22 2.72e- 5
## 3 Sexmale -0.519 0.0346 -15.0 1.18e-45
## 4 Fare:Sexmale -0.0000950 0.000551 -0.172 8.63e- 1
```

Again, an interaction effect is present.

Now let’s plot `Survived`

on the y-axis. We just plugin the estimates from `lm1`

at the slope and intercept parameter for `geom_abline`

which draws a straight line.

```
d %>%
ggplot() +
aes(x = Fare, y = Survived, color = Sex) +
geom_point() +
geom_abline(mapping = aes(slope = 1.65e-03, intercept = 6.69e-01), color = "#F8766D") +
geom_abline(mapping = aes(slope = 1.65e-03 - 9.50e-05, intercept = 6.69e-01 - 6.19e-01), color = "#00BFC4")
```

As a side note, that’s how to get the names of the standard ggplot color palette:

`library(scales)`

```
##
## Attaching package: 'scales'
```

```
## The following object is masked from 'package:purrr':
##
## discard
```

```
## The following object is masked from 'package:readr':
##
## col_factor
```

`show_col(hue_pal()(2))`

As can be seen, not much interaction appears to be there (although a little bit of an interaction effect is there). Why that? That’s just because the scaling of the data does not readily show the interaction. Let’s use ggplot’s build in geom `geom_smooth()`

which should give the same output as above:

```
d %>%
ggplot() +
aes(x = Fare, y = Survived, color = Sex) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
```

Of course, probabilities greater 1 do not make sense. That’s the reason why we prefer a “bended” graph, such as the s-type ogive in logistic regression. Let’s plot that instead.

First, we need to get the survival probabilities:

```
d %>%
mutate(pred_prob = predict(glm1, type = "response")) -> d
```

Notice that `type = "response`

gives you the probabilities of survival (ie., of the modeled event).

Check, check, double-check:

```
ggplot(d) +
geom_density(aes(x = pred_prob, fill = Sex), alpha = .5)
```

Now let’s plot the ogive including the interaction effects:

```
d %>%
ggplot() +
aes(x = Fare, y = Survived, color = Sex) +
geom_point() +
geom_smooth(method = "glm",
aes(x = Fare, y= pred_prob),
method.args = list(family = "binomial"),
se = FALSE)
```

```
## Warning in eval(family$initialize): non-integer #successes in a binomial
## glm!
## Warning in eval(family$initialize): non-integer #successes in a binomial
## glm!
```

# Debrief

Plotting helps. Instead of inspecting the numbers in the glm output table only, it helps to plot the curves. In that way, the overall trend can be easier understood. Remember that the coefficients of a linear model are different if an interaction effect is modeled, compared with an model where no interaction is modeled. The coefficient of a predictor, such as `Fare`

means: “That’s the estimate of the mean statistical influence of Fare, if Fare is zero, which makes no sense, and if the other predictors are all at zero too, which make no sense either in the most cases”. This difficulty in interpretation is the reason why people like to center their models. Centered models have the mean value as the zero, which is easier to interpret. However, for an interaction of two groups and one single continuous parameter, things are not too difficult.