Visualisation of interaction for the logistic regression

In this post we are plotting an interaction for a logistic regression. Interaction per se is a concept difficult to grasp; for a GLM it may be even more difficult especially for continuous variables’ interaction. Plotting helps to better or more easy grasp what a model tries to tell us.

First, load some packages.

library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.0.0     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.6
## ✔ tidyr   0.8.1     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## Warning: package 'dplyr' was built under R version 3.5.1
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(titanic)
library(broom)
library(modelr)
## 
## Attaching package: 'modelr'
## The following object is masked from 'package:broom':
## 
##     bootstrap
library(knitr)

We will deal with the well-known Titanic data, ie., we check which predictors augment survival probability:

data(titanic_train)
d <- titanic_train

Let’s glimpse the data:

glimpse(d) 
## Observations: 891
## Variables: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
## $ Survived    <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0,...
## $ Pclass      <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3,...
## $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bra...
## $ Sex         <chr> "male", "female", "female", "female", "male", "mal...
## $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, ...
## $ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4,...
## $ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1,...
## $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "1138...
## $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, ...
## $ Cabin       <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", ...
## $ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", ...

Let’s compute the logistic regression using the standard glm(), using the following notation, the interaction term will be included. Note that this type of glm assumes a flat, unregulatated prior and a Gaussian likelihood, in Bayesian parlance. The interaction term is also linear. If you are not satisfied with this default values, better switch to a fully Bayesian analysis.

glm1 <- glm(Survived ~ Fare*Sex, data = d,
            family = "binomial")
tidy(glm1) %>% kable()
term estimate std.error statistic p.value
(Intercept) 0.4084278 0.1899986 2.149635 0.0315841
Fare 0.0198785 0.0053715 3.700698 0.0002150
Sexmale -2.0993453 0.2302908 -9.116061 0.0000000
Fare:Sexmale -0.0116171 0.0059337 -1.957809 0.0502524

Now, consider plotting the interaction. Note first that the glm is a linear model. However, linear for the logits, ie:

\[l(y_i) = b_0 + b_1x_i + \dots\]

where \(l(y_i)\) is the logit of the probability of the modeled event for case \(i\).

In other words: It’s just a linear model, just like “normal” regression. Hence, we can plot it using straight, linear lines; that’s true for the interaction as well. Plotting the interaction of a glm is nothing new in this sense.

d %>% 
  mutate(pred_logit = predict(glm1)) -> d

Let’s glimpse the distribution of the logits:

ggplot(d) +
  aes(x = pred_logit, fill = Sex) +
  geom_density(alpha = .5)

Now let’s plot the interaction on logit scale:

d %>% 
  ggplot() +
  aes(x = Fare, y = pred_logit, color = Sex) +
  geom_point() +
  geom_line()

Just lines - diverging lines in this case, indicating the presence of an interaction effect between Sex and Fare.

For the sake of curiosity, let’s compute a linear model with Survived as criterion:

lm1 <- lm(Survived ~ Fare*Sex, data = d)
tidy(lm1)
## # A tibble: 4 x 5
##   term           estimate std.error statistic  p.value
##   <chr>             <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   0.669      0.0286      23.4   1.14e-94
## 2 Fare          0.00165    0.000391     4.22  2.72e- 5
## 3 Sexmale      -0.519      0.0346     -15.0   1.18e-45
## 4 Fare:Sexmale -0.0000950  0.000551    -0.172 8.63e- 1

Again, an interaction effect is present.

Now let’s plot Survived on the y-axis. We just plugin the estimates from lm1 at the slope and intercept parameter for geom_abline which draws a straight line.

d %>% 
  ggplot() +
  aes(x = Fare, y = Survived, color = Sex) +
  geom_point() +
  geom_abline(mapping = aes(slope = 1.65e-03, intercept = 6.69e-01), color = "#F8766D") +
  geom_abline(mapping = aes(slope = 1.65e-03 - 9.50e-05, intercept = 6.69e-01 - 6.19e-01), color = "#00BFC4")

As a side note, that’s how to get the names of the standard ggplot color palette:

library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
show_col(hue_pal()(2))

As can be seen, not much interaction appears to be there (although a little bit of an interaction effect is there). Why that? That’s just because the scaling of the data does not readily show the interaction. Let’s use ggplot’s build in geom geom_smooth() which should give the same output as above:

 d %>% 
  ggplot() +
  aes(x = Fare, y = Survived, color = Sex) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

Of course, probabilities greater 1 do not make sense. That’s the reason why we prefer a “bended” graph, such as the s-type ogive in logistic regression. Let’s plot that instead.

First, we need to get the survival probabilities:

d %>% 
  mutate(pred_prob = predict(glm1, type = "response")) -> d

Notice that type = "response gives you the probabilities of survival (ie., of the modeled event).

Check, check, double-check:

ggplot(d) +
  geom_density(aes(x = pred_prob, fill  = Sex), alpha = .5)

Now let’s plot the ogive including the interaction effects:

 d %>% 
  ggplot() +
  aes(x = Fare, y = Survived, color = Sex) +
  geom_point() +
  geom_smooth(method = "glm",
              aes(x = Fare, y= pred_prob),
              method.args = list(family = "binomial"),
              se = FALSE)
## Warning in eval(family$initialize): non-integer #successes in a binomial
## glm!

## Warning in eval(family$initialize): non-integer #successes in a binomial
## glm!

Debrief

Plotting helps. Instead of inspecting the numbers in the glm output table only, it helps to plot the curves. In that way, the overall trend can be easier understood. Remember that the coefficients of a linear model are different if an interaction effect is modeled, compared with an model where no interaction is modeled. The coefficient of a predictor, such as Fare means: “That’s the estimate of the mean statistical influence of Fare, if Fare is zero, which makes no sense, and if the other predictors are all at zero too, which make no sense either in the most cases”. This difficulty in interpretation is the reason why people like to center their models. Centered models have the mean value as the zero, which is easier to interpret. However, for an interaction of two groups and one single continuous parameter, things are not too difficult.