Simulating data for a Gamma regression

Load packages

library(tidyverse)

Intro

The Gamma distribution is a continuous distribution that is useful for modeling positive, right-skewed data such as waiting times.

In this post, we’ll illustrate some properties of the Gamma distribution by simulating a toy example.
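As a quick visual sketch (the shape and scale values here are arbitrary and are not part of the simulation below), we can plot a few Gamma densities; for shape \(k\) and scale \(\theta\), the mean is \(k\theta\) and the variance is \(k\theta^2\).

# A few Gamma densities with arbitrary shapes (scale fixed at 1);
# mean = shape * scale, variance = shape * scale^2
curve(dgamma(x, shape = 1, scale = 1), from = 0, to = 12, ylab = "density")
curve(dgamma(x, shape = 2, scale = 1), add = TRUE, lty = 2)
curve(dgamma(x, shape = 5, scale = 1), add = TRUE, lty = 3)
legend("topright", legend = c("shape = 1", "shape = 2", "shape = 5"), lty = 1:3)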

Simulate data and define structural model

Let \(X\) be a discrete variable following a uniform distribution on \(\{1, 2, 3\}\), so that \(x_i \in \{1,2,3\}\).

set.seed(42)

n <- 1000
X <- sample(x = c(1,2,3), size = n, replace = TRUE)

hist(X)

Let \(y_i = 0.1 + 1 x_i\), and, adding noise, let \(\hat{y}_i = 0.1 + 1 x_i + \epsilon_i\), where \(\epsilon_i \sim \Gamma(m = 0.1, \, s = 1/m)\) and \(\Gamma\) refers to the Gamma distribution with shape \(m\) and scale \(s\).

e <- rgamma(n, shape = 0.1, scale = 1/0.1)
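As a quick sanity check (not strictly necessary), the sample mean of e should be close to shape × scale = 0.1 × 10 = 1, and the sample variance close to shape × scale² = 10:

# theoretical mean = shape * scale = 1; theoretical variance = shape * scale^2 = 10
mean(e)
var(e)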

Compute \(Y\) and \(\hat{Y}\):

Y <- 0.1 + 1 * X
Y_hat <- Y + e
hist(Y_hat)

plot(X, Y_hat)
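Since the tidyverse is already loaded, here is an optional ggplot2 view of the same data that makes the right skew within each level of X easier to see:

# density of Y_hat within each level of X; the long right tail comes from e
tibble(X = factor(X), Y_hat = Y_hat) %>%
  ggplot(aes(x = Y_hat, fill = X)) +
  geom_density(alpha = 0.4)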

Regression

glm1 <- glm(Y_hat ~ X, family = Gamma(link = "log"))
summary(glm1)
#> 
#> Call:
#> glm(formula = Y_hat ~ X, family = Gamma(link = "log"))
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -0.6617  -0.3676  -0.2828  -0.2056   4.8995  
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  0.54403    0.10210   5.329 1.22e-07 ***
#> X            0.29473    0.04777   6.170 9.93e-10 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for Gamma family taken to be 1.459691)
#> 
#>     Null deviance: 497.06  on 999  degrees of freedom
#> Residual deviance: 442.11  on 998  degrees of freedom
#> AIC: 3915.2
#> 
#> Number of Fisher Scoring iterations: 5

Coefficients, delogged

exp(coef(glm1))
#> (Intercept)           X 
#>    1.722940    1.342758

The delogged (exponentiated) coefficients give the factor by which the mean of the outcome variable changes per one-unit change in X, analogous to the coefficients of an ordinary linear model. The main difference is that each coefficient denotes a multiplicative change rather than an additive one.
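To see the multiplicative interpretation directly, here is a small check (not part of the original output): the ratio of the predicted means at \(X = 2\) versus \(X = 1\) should equal the exponentiated X coefficient.

# predicted mean at X = 2 divided by predicted mean at X = 1 ...
exp(coef(glm1)[1] + coef(glm1)[2] * 2) / exp(coef(glm1)[1] + coef(glm1)[2] * 1)
# ... equals the exponentiated X coefficient
exp(coef(glm1)[2])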

Prediction time

We’d expect that, if \(x_i = 1\), the predicted mean of \(\hat{y}_i\) is \(1.72 \times 1.34 \approx 2.31\).

Let’s check that.

predict(glm1, newdata = data.frame(X=1))
#>         1 
#> 0.8387581

That prediction is on the log (link) scale. Let’s delog it:

predict(glm1, newdata = data.frame(X=1)) %>% exp()
#>        1 
#> 2.313492

Works!
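Equivalently, the same number falls out of the coefficients directly, since with the log link the predicted mean at \(x_i = 1\) is \(\exp(\beta_0 + \beta_1 \cdot 1)\):

# manual check: exp(intercept + slope * 1)
unname(exp(coef(glm1)["(Intercept)"] + coef(glm1)["X"] * 1))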