# A short tutorial for the logistic regression

Here’s q quick walk-through for a logistic regression in R.

# Setup

library(tidyverse)
library(reshape2)  # dataset "tips"
library(caret)
library(mosaic)

We’ll use the tips dataset:

data(tips)

# Research question

Assume we would like to predict if a person is female based on some predictor such as the amount of tip she/he give.

How many instances of each type of the outcome variable are in the data set?

tally(~ sex, data = tips, format = "proportion")
#> sex
#>    Female      Male
#> 0.3565574 0.6434426
tally(~ sex, data = tips)
#> sex
#> Female   Male
#>     87    157

# Preparation

It is helpful (but not mandatory) to have the outcome variable as binary, ie., of type 0/1.

tips <- tips %>%
mutate(sex_binary = case_when(
sex == "Female" ~ 1,
sex == "Male" ~ 0
))

# Regression

glm1 <- glm(sex_binary ~ tip, data = tips,
family = "binomial")

summary(glm1)
#>
#> Call:
#> glm(formula = sex_binary ~ tip, family = "binomial", data = tips)
#>
#> Deviance Residuals:
#>     Min       1Q   Median       3Q      Max
#> -1.0479  -0.9743  -0.8837   1.3755   1.6630
#>
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)
#> (Intercept)  -0.1703     0.3288  -0.518    0.605
#> tip          -0.1421     0.1031  -1.378    0.168
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#>     Null deviance: 317.89  on 243  degrees of freedom
#> Residual deviance: 315.89  on 242  degrees of freedom
#> AIC: 319.89
#>
#> Number of Fisher Scoring iterations: 4

Let’s compare the AIC to the null model:

glm0 <- glm(sex_binary ~ 1, data = tips,
family = "binomial")

summary(glm0)
#>
#> Call:
#> glm(formula = sex_binary ~ 1, family = "binomial", data = tips)
#>
#> Deviance Residuals:
#>     Min       1Q   Median       3Q      Max
#> -0.9391  -0.9391  -0.9391   1.4362   1.4362
#>
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)
#> (Intercept)  -0.5903     0.1337  -4.417    1e-05 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#>     Null deviance: 317.89  on 243  degrees of freedom
#> Residual deviance: 317.89  on 243  degrees of freedom
#> AIC: 319.89
#>
#> Number of Fisher Scoring iterations: 4

As can be seen, the AICs do not differ. Our model glm1 is not better than the null model (glm0).

There’s no difference if we had used the factor variable version of sex instead:

glm2 <- glm(sex ~ tip, data = tips,
family = "binomial")

summary(glm2)
#>
#> Call:
#> glm(formula = sex ~ tip, family = "binomial", data = tips)
#>
#> Deviance Residuals:
#>     Min       1Q   Median       3Q      Max
#> -1.6630  -1.3755   0.8837   0.9743   1.0479
#>
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)
#> (Intercept)   0.1703     0.3288   0.518    0.605
#> tip           0.1421     0.1031   1.378    0.168
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#>     Null deviance: 317.89  on 243  degrees of freedom
#> Residual deviance: 315.89  on 242  degrees of freedom
#> AIC: 319.89
#>
#> Number of Fisher Scoring iterations: 4

Apparently, our model did not work out. For the sake of explanation, let’s try different predictors:

glm3 <- glm(sex ~ total_bill + time + smoker,
data = tips,
family = "binomial")
summary(glm3)
#>
#> Call:
#> glm(formula = sex ~ total_bill + time + smoker, family = "binomial",
#>     data = tips)
#>
#> Deviance Residuals:
#>     Min       1Q   Median       3Q      Max
#> -1.8592  -1.1656   0.7888   0.9022   1.3337
#>
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)
#> (Intercept)  0.30015    0.38290   0.784  0.43310
#> total_bill   0.02949    0.01684   1.751  0.07992 .
#> timeLunch   -0.83976    0.29923  -2.806  0.00501 **
#> smokerYes   -0.07373    0.28432  -0.259  0.79538
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#>     Null deviance: 317.89  on 243  degrees of freedom
#> Residual deviance: 304.61  on 240  degrees of freedom
#> AIC: 312.61
#>
#> Number of Fisher Scoring iterations: 4

OK, that’s better - at least the predictors appear to be of some use. Let’s keep on working with this model (glm3).

# Get predictions

Let’s get the predictions of the model:

tips <- tips %>%
mutate(glm3_prediction = predict(glm3,
newdata = tips,
type = "response"))

head(tips$glm3_prediction) #> 1 2 3 4 5 6 #> 0.6902262 0.6468161 0.7149859 0.7307558 0.7360029 0.7399942 FWIW, these prediction data are also stored in the GLM object: head(glm3$fitted.values)
#>         1         2         3         4         5         6
#> 0.6902262 0.6468161 0.7149859 0.7307558 0.7360029 0.7399942

What the predict functions gives back, is the probability of being female. In other words, if 0 is surely a male, and 1 surely a woman, a value of say 0.30 can be interpreted as something like “likely a man”.

In order to distill the “male” vs. “female” verdict out of the probabilities, we need a rule that says something like “if the probability is greater than my threshold of, say, 0.50, than let’s conclude it’s a female, else we conclude it’s a man.”

This can be achieved like this:

tips <- tips %>%
mutate(sex_predicted = case_when(
glm3_prediction > 0.5 ~ "Female",
glm3_prediction <= 0.5 ~ "Male")
)

head(tips$sex_predicted) #> [1] "Female" "Female" "Female" "Female" "Female" "Female" # Check Let’s see how many females and how many males our model predicts: tally(~ sex_predicted, data = tips) #> sex_predicted #> Female Male #> 193 51 What’s the data type of our predicted data and the observed (outcome) data? str(tips$sex)
#>  Factor w/ 2 levels "Female","Male": 1 2 2 2 1 2 2 2 2 2 ...
str(tips$sex_predicted) #> chr [1:244] "Female" "Female" "Female" "Female" "Female" "Female" ... Note that tips$sex_predicted is of type character. A later function will only digest factor variables, so let’s change the type now:

tips$sex_predicted <- factor(tips$sex_predicted)

# Confusion matrix

There are a number of packages and functions available, but I find this one (from the package caret) handy:

confusionMatrix(reference = tips$sex, data = tips$sex_predicted)
#> Confusion Matrix and Statistics
#>
#>           Reference
#> Prediction Female Male
#>     Female     59  134
#>     Male       28   23
#>
#>                Accuracy : 0.3361
#>                  95% CI : (0.2771, 0.3991)
#>     No Information Rate : 0.6434
#>     P-Value [Acc > NIR] : 1
#>
#>                   Kappa : -0.1379
#>  Mcnemar's Test P-Value : <2e-16
#>
#>             Sensitivity : 0.6782
#>             Specificity : 0.1465
#>          Pos Pred Value : 0.3057
#>          Neg Pred Value : 0.4510
#>              Prevalence : 0.3566
#>          Detection Rate : 0.2418
#>    Detection Prevalence : 0.7910
#>       Balanced Accuracy : 0.4123
#>
#>        'Positive' Class : Female
#> 

Our model is lousy; The no information rate is .72; our accuracy is much below. What’s the “no information rate”? From the help page of confusionMatrix:

the "no information rate," which is taken to be the largest class percentage in the data.`

The model accuracy should always be compared with the no information rate to gauge the “goodness” of the accuracy.

# Debrief

That’s only a very brief example on how to conduct a statistical classification. See this post for a slightly more advanced example.

A next step could be to compute the area under the curve (AUV) metrics.