Intro
Professor Sweet is conducting some research to investigate the risk factor and drivers of student exam success. In a recent analysis he considers the variable “exam successfully passed” (vs. not passed) as the criterion (output) and the amount of time spent for preparation (aka study time) as predictor.
Setup
Please make sure that all packages are installed before proceeding. Except pradadata
, all packages are on CRAN. [ Here’s] (https://github.com/sebastiansauer/pradadatathe installation guide for pradadata
.
# devtools::install_github("sebastiansauer/pradadata") install this packages if not yet installed
data(stats_test, package = "pradadata")
library(tidyverse)
library(ggpubr)
library(glue)
library(ggExtra)
library(broom)
Let’s change some output formatting of ggplot (see this post for details)
options(ggplot2.continuous.colour="viridis")
options(ggplot2.continuous.fill = "viridis")
scale_colour_discrete <- scale_colour_viridis_d
scale_fill_discrete <- scale_fill_viridis_d
theme_set(theme_pubr())
Glimpse
First, Professor Sweet checks the uni-variate distribution of each variable of interest:
stats_test <- stats_test %>%
mutate_if(is.integer, as.numeric)
glimpse(stats_test)
#> Observations: 1,646
#> Variables: 7
#> $ date_time <chr> "08.12.2015 07:34:29", "08.12.2015 12:46:49", "08.1...
#> $ study_time <dbl> 3, 4, 3, 4, 2, 2, 4, 1, 3, 3, 2, 4, 3, 3, 3, 4, 4, ...
#> $ self_eval <dbl> 7, 8, 4, 7, 6, 6, 6, 7, 8, 7, 8, 8, 7, 7, 4, 6, 8, ...
#> $ interest <dbl> 4, 5, 1, 5, 3, 4, 3, 4, 5, 5, 3, 5, 4, 4, 1, 4, 5, ...
#> $ score <dbl> 0.7500000, 0.8250000, 0.7894737, 0.8250000, 0.80000...
#> $ row_number <dbl> 21, 22, 23, 24, 25, 26, 28, 29, 30, 31, 32, 33, 34,...
#> $ bestanden <chr> "ja", "ja", "ja", "ja", "ja", "ja", "ja", "ja", "ja...
Add some noise
Let’s add some noise to study time in order to adjust for measurement error:
stats_test <- stats_test %>%
mutate(study_time_noise = study_time + rnorm(n = nrow(.), 0,.5))
p1 <- stats_test %>%
select(study_time_noise, score) %>%
map2(., names(.), ~ggplot(data = data_frame(.x), aes(x = .x)) +
labs(x = .y,
title = glue("Variable: {.y}")) +
geom_histogram())
print(p1)
#> $study_time_noise
#>
#> $score
Bivariate association
p2 <- ggplot(stats_test) +
aes(x = study_time_noise, y = score) +
geom_jitter(alpha = .5) +
geom_smooth() +
labs(title = "Bivariate association of study time \nand score")
p3 <- ggMarginal(p2)
p3
Regression
Now, Professor Sweet computes a simple regression. Curiously, he sits and watches what the software spits out. First, a visual sketch:
stats_test %>%
ggplot() +
aes(y = score, x = study_time_noise) +
geom_jitter(alpha = .5) +
geom_smooth(method = "lm")
Then he computes the actual regression:
lm1 <- lm(score ~ study_time_noise, data = stats_test)
tidy(lm1)
#> # A tibble: 2 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 0.628 0.00827 75.9 0.
#> 2 study_time_noise 0.0493 0.00267 18.5 1.48e-69
Compute the predicted values:
stats_test <- stats_test %>%
mutate(pred_lm1 = predict(lm1, newdata = stats_test))
Visual sketch:
p4 <- stats_test %>%
ggplot() +
aes(x = study_time_noise) +
geom_jitter(aes(y = score), alpha = .5) +
geom_line(aes(y = pred_lm1), color = "red", size = 5) +
labs(title = "Regression of exam score \nby study time")
p4
But what would happen if study time was, say, 10?
data_new <- data_frame(
study_time_noise = 0:10,
score = predict(lm1, newdata = data.frame(study_time_noise = 0:10))
)
p5 <- ggplot(stats_test) +
aes(x = study_time_noise, y = score) +
geom_jitter(alpha = .5) +
geom_line(data = data_new, color = "red", size = 3) +
scale_x_continuous(limits = c(0,10)) +
scale_y_continuous(breaks = seq(0,1,.25)) +
labs(title = "Regression with high values \nof study time")
p5
Pause
Now the exhausted professor sits and pauses in an attempt to conceive whats has been achieved.
ggarrange(p3, p4, p5, labels = c("A", "B", "C"))
This approach may yield implausible score values, ie above 1. Clearly, a different approach is needed.
Sigmoid curve
A simple alternative would be to replace to straight line by a sigmoid curve, such as in the following.
In essence, that’s the formula: \[p(X=1) = \frac{e^x}{1+e^x}\]
That’s how it looks like:
logist <- function(x){
y = exp(x) / (1 + exp(x))
}
ggplot(data_frame()) +
stat_function(aes(-5:5), fun = logist)
This function computes it as the best fit to a bunch of data points:
glm1 <- glm(score ~ study_time_noise, data = stats_test, family = "binomial")
Add the according predicted values:
stats_test <- stats_test %>%
mutate(pred_glm1 = predict(glm1, newdata = stats_test, type = "response"))
Plot the predicted values:
df_glm <- data_frame(
study_time_noise = 0:10,
pred_glm1 = predict(glm1,
newdata = data.frame(study_time_noise = 0:10),
type = "response")
)
p6 <- ggplot(stats_test) +
aes(x = study_time_noise, y = score) +
geom_jitter(alpha = .5) +
labs(title = "Sigmoid regression line") +
geom_line(data = df_glm, aes(y = pred_glm1),
color = "red", size = 3)
p6
Pause again
ggarrange(p3, p4, p5, p6, labels = c("A", "B", "C", "D"))