Load packages
library(tidyverse)
Intro
A Gamma distribution is useful for modeling positive, right skewed data such as waiting times; it is a continuous function.
In this post, we’ll illustrate some properties of the Gamma distribution by simulating a toy example.
Simulate data and define structural model
Let \(X\) be a discrete variable following uniform distribution, and \(x_i \in \{1,2,3\}\).
set.seed(42)
n <- 1000
X <- sample(x = c(1,2,3), size = n, replace = TRUE)
hist(X)
Let \(y_i = 0.1 + 1x_i\), by extension let \(\hat{y_i} = 0.1 + 0.1x_i + \epsilon_i\), where \(\epsilon_i \sim \Gamma(m = 0.1, \, s= 1/m)\), and \(\Gamma\) refers to the Gamma distribution with location \(m\) and shape \(s\).
e <- rgamma(n, shape = 0.1, scale = 1/0.1)
Compute \(Y\) and \(\hat(Y)\):
Y = 0.1 + 1 * X
Y_hat = Y + e
hist(Y_hat)
plot(X, Y_hat)
Regression
glm1 <- glm(Y_hat ~ X, Gamma(link = "log"))
summary(glm1)
#>
#> Call:
#> glm(formula = Y_hat ~ X, family = Gamma(link = "log"))
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -0.6617 -0.3676 -0.2828 -0.2056 4.8995
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.54403 0.10210 5.329 1.22e-07 ***
#> X 0.29473 0.04777 6.170 9.93e-10 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for Gamma family taken to be 1.459691)
#>
#> Null deviance: 497.06 on 999 degrees of freedom
#> Residual deviance: 442.11 on 998 degrees of freedom
#> AIC: 3915.2
#>
#> Number of Fisher Scoring iterations: 5
Coefficients, delogged
exp(coef(glm1))
#> (Intercept) X
#> 1.722940 1.342758
The delogged (exponentiated) coefficients provide a factor by which the mean of the outcome variable will change per unit change of X, similar to a normal linear model. The main difference is that the the coefficient denotes a multiplicative change rather than an additive change.
Prediction time
We’d expect that, if \(x_i=1\), that \(y_i=1.72 * 1.34 \approx 2.31\).
Let’s check that.
predict(glm1, newdata = data.frame(X=1))
#> 1
#> 0.8387581
That’s in log-notation. Let’s delog:
predict(glm1, newdata = data.frame(X=1)) %>% exp()
#> 1
#> 2.313492
Works!