Here’s a quick walk-through of a logistic regression in R.
Setup
library(tidyverse)
library(reshape2) # dataset "tips"
library(caret)
library(mosaic)
We’ll use the tips dataset:
data(tips)
Research question
Assume we would like to predict whether a person is female, based on predictors such as the amount of tip she or he gives.
How many instances of each level of the outcome variable are in the data set?
tally(~ sex, data = tips, format = "proportion")
#> sex
#> Female Male
#> 0.3565574 0.6434426
tally(~ sex, data = tips)
#> sex
#> Female Male
#> 87 157
Preparation
It is helpful (but not mandatory) to have the outcome variable as binary, i.e., coded as 0/1.
tips <- tips %>%
  mutate(sex_binary = case_when(
    sex == "Female" ~ 1,
    sex == "Male" ~ 0
  ))
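To double-check that the recoding worked as intended, here’s a quick sanity check using base R’s table():
# cross-tabulate old and new variable: all Females should land in the
# "1" column, all Males in the "0" column
table(tips$sex, tips$sex_binary)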
Regression
glm1 <- glm(sex_binary ~ tip,
            data = tips,
            family = "binomial")
summary(glm1)
#>
#> Call:
#> glm(formula = sex_binary ~ tip, family = "binomial", data = tips)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.0479 -0.9743 -0.8837 1.3755 1.6630
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -0.1703 0.3288 -0.518 0.605
#> tip -0.1421 0.1031 -1.378 0.168
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 317.89 on 243 degrees of freedom
#> Residual deviance: 315.89 on 242 degrees of freedom
#> AIC: 319.89
#>
#> Number of Fisher Scoring iterations: 4
Let’s compare the AIC to the null model:
glm0 <- glm(sex_binary ~ 1,
            data = tips,
            family = "binomial")
summary(glm0)
#>
#> Call:
#> glm(formula = sex_binary ~ 1, family = "binomial", data = tips)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -0.9391 -0.9391 -0.9391 1.4362 1.4362
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -0.5903 0.1337 -4.417 1e-05 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 317.89 on 243 degrees of freedom
#> Residual deviance: 317.89 on 243 degrees of freedom
#> AIC: 319.89
#>
#> Number of Fisher Scoring iterations: 4
As can be seen, the AICs do not differ (319.89 in both cases); our model glm1 is no better than the null model (glm0).
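Instead of eyeballing the two summaries, the AICs can also be compared directly with base R’s AIC() function:
AIC(glm0, glm1)  # returns a small data frame, one row per model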
We get essentially the same model if we use the factor version of sex instead; only the signs of the coefficients flip, because glm() then models the probability of the second factor level (Male), whereas our binary variable coded Female as 1:
glm2 <- glm(sex ~ tip,
            data = tips,
            family = "binomial")
summary(glm2)
#>
#> Call:
#> glm(formula = sex ~ tip, family = "binomial", data = tips)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.6630 -1.3755 0.8837 0.9743 1.0479
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.1703 0.3288 0.518 0.605
#> tip 0.1421 0.1031 1.378 0.168
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 317.89 on 243 degrees of freedom
#> Residual deviance: 315.89 on 242 degrees of freedom
#> AIC: 319.89
#>
#> Number of Fisher Scoring iterations: 4
Apparently, our model did not work out. For the sake of illustration, let’s try a different set of predictors:
glm3 <- glm(sex ~ total_bill + time + smoker,
            data = tips,
            family = "binomial")
summary(glm3)
#>
#> Call:
#> glm(formula = sex ~ total_bill + time + smoker, family = "binomial",
#> data = tips)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.8592 -1.1656 0.7888 0.9022 1.3337
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.30015 0.38290 0.784 0.43310
#> total_bill 0.02949 0.01684 1.751 0.07992 .
#> timeLunch -0.83976 0.29923 -2.806 0.00501 **
#> smokerYes -0.07373 0.28432 -0.259 0.79538
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 317.89 on 243 degrees of freedom
#> Residual deviance: 304.61 on 240 degrees of freedom
#> AIC: 312.61
#>
#> Number of Fisher Scoring iterations: 4
OK, that’s better: at least some of the predictors appear to be of use. Let’s keep working with this model (glm3).
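Before moving on, it can help to exponentiate the coefficients of glm3 so they read as odds ratios; a quick base-R sketch (confint.default() gives Wald-type intervals):
exp(coef(glm3))             # odds ratios per predictor
exp(confint.default(glm3))  # Wald 95% CIs on the odds-ratio scale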
Get predictions
Let’s get the predictions of the model:
tips <- tips %>%
  mutate(glm3_prediction = predict(glm3,
                                   newdata = tips,
                                   type = "response"))
head(tips$glm3_prediction)
#> 1 2 3 4 5 6
#> 0.6902262 0.6468161 0.7149859 0.7307558 0.7360029 0.7399942
FWIW, these predictions are also stored in the GLM object:
head(glm3$fitted.values)
#> 1 2 3 4 5 6
#> 0.6902262 0.6468161 0.7149859 0.7307558 0.7360029 0.7399942
What the predict function gives back is the probability of being female. In other words, if 0 means surely male and 1 means surely female, a value of, say, 0.30 can be interpreted as something like “likely a man”.
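If you wonder where these probabilities come from: type = "response" simply applies the inverse logit to the linear predictor. A minimal sketch using base R’s plogis():
eta <- predict(glm3, type = "link")  # predictions on the log-odds scale
head(plogis(eta))                    # inverse logit; matches the values above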
In order to distill a “male” vs. “female” verdict out of the probabilities, we need a rule along the lines of: “if the probability is greater than some threshold, say 0.50, then conclude female, else conclude male.”
This can be achieved like this:
tips <- tips %>%
  mutate(sex_predicted = case_when(
    glm3_prediction > 0.5 ~ "Female",
    glm3_prediction <= 0.5 ~ "Male"
  ))
head(tips$sex_predicted)
#> [1] "Female" "Female" "Female" "Female" "Female" "Female"
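As an aside, the same dichotomization can be written more compactly with ifelse(); this is equivalent, just a matter of taste:
tips <- tips %>%
  mutate(sex_predicted = ifelse(glm3_prediction > 0.5, "Female", "Male"))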
Check
Let’s see how many females and how many males our model predicts:
tally(~ sex_predicted, data = tips)
#> sex_predicted
#> Female Male
#> 193 51
Interestingly, the model predicts far more females (193) than there are in the data (87). What’s the data type of our predicted and our observed (outcome) data?
str(tips$sex)
#> Factor w/ 2 levels "Female","Male": 1 2 2 2 1 2 2 2 2 2 ...
str(tips$sex_predicted)
#> chr [1:244] "Female" "Female" "Female" "Female" "Female" "Female" ...
Note that tips$sex_predicted is of type character. The function we’ll use next only digests factor variables, so let’s change the type now:
tips$sex_predicted <- factor(tips$sex_predicted)
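Here the default alphabetical level order happens to match that of tips$sex; to be on the safe side, the levels can be set explicitly:
# align factor levels with those of the reference variable
tips$sex_predicted <- factor(tips$sex_predicted, levels = levels(tips$sex))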
Confusion matrix
There are a number of packages and functions available, but I find this one (from the caret package) handy:
confusionMatrix(reference = tips$sex, data = tips$sex_predicted)
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Female Male
#> Female 59 134
#> Male 28 23
#>
#> Accuracy : 0.3361
#> 95% CI : (0.2771, 0.3991)
#> No Information Rate : 0.6434
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : -0.1379
#> Mcnemar's Test P-Value : <2e-16
#>
#> Sensitivity : 0.6782
#> Specificity : 0.1465
#> Pos Pred Value : 0.3057
#> Neg Pred Value : 0.4510
#> Prevalence : 0.3566
#> Detection Rate : 0.2418
#> Detection Prevalence : 0.7910
#> Balanced Accuracy : 0.4123
#>
#> 'Positive' Class : Female
#>
Our model is lousy: the no information rate is .64, but our accuracy is merely .34, far below it. What’s the “no information rate”? From the help page of confusionMatrix:
the "no information rate," which is taken to be the largest class percentage in the data.
The model accuracy should always be compared with the no information rate to gauge the “goodness” of the accuracy.
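Both numbers are easy to compute by hand, which makes the comparison transparent:
mean(tips$sex_predicted == tips$sex)  # accuracy: share of correct predictions
max(prop.table(table(tips$sex)))      # no information rate: largest class share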
Debrief
That’s only a very brief example of how to conduct a statistical classification. See this post for a slightly more advanced example.
A next step could be to compute the area under the curve (AUC) metric.
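A minimal sketch of that step, assuming the pROC package is installed:
library(pROC)
# ROC curve based on the predicted probabilities from glm3
roc_glm3 <- roc(response = tips$sex, predictor = tips$glm3_prediction)
auc(roc_glm3)   # area under the ROC curve
plot(roc_glm3)  # visual check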