# Simulating data for a Gamma regression

library(tidyverse)

# Intro

A Gamma distribution is useful for modeling positive, right skewed data such as waiting times; it is a continuous function.

In this post, we’ll illustrate some properties of the Gamma distribution by simulating a toy example.

# Simulate data and define structural model

Let $$X$$ be a discrete variable following uniform distribution, and $$x_i \in \{1,2,3\}$$.

set.seed(42)

n <- 1000
X <- sample(x = c(1,2,3), size = n, replace = TRUE)

hist(X)

Let $$y_i = 0.1 + 1x_i$$, by extension let $$\hat{y_i} = 0.1 + 0.1x_i + \epsilon_i$$, where $$\epsilon_i \sim \Gamma(m = 0.1, \, s= 1/m)$$, and $$\Gamma$$ refers to the Gamma distribution with location $$m$$ and shape $$s$$.

e <- rgamma(n, shape = 0.1, scale = 1/0.1)

Compute $$Y$$ and $$\hat(Y)$$:

Y = 0.1 + 1 * X
Y_hat = Y + e
hist(Y_hat)

plot(X, Y_hat)

# Regression

glm1 <- glm(Y_hat ~ X, Gamma(link = "log"))
summary(glm1)
#>
#> Call:
#> glm(formula = Y_hat ~ X, family = Gamma(link = "log"))
#>
#> Deviance Residuals:
#>     Min       1Q   Median       3Q      Max
#> -0.6617  -0.3676  -0.2828  -0.2056   4.8995
#>
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)  0.54403    0.10210   5.329 1.22e-07 ***
#> X            0.29473    0.04777   6.170 9.93e-10 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for Gamma family taken to be 1.459691)
#>
#>     Null deviance: 497.06  on 999  degrees of freedom
#> Residual deviance: 442.11  on 998  degrees of freedom
#> AIC: 3915.2
#>
#> Number of Fisher Scoring iterations: 5

# Coefficients, delogged

exp(coef(glm1))
#> (Intercept)           X
#>    1.722940    1.342758

The delogged (exponentiated) coefficients provide a factor by which the mean of the outcome variable will change per unit change of X, similar to a normal linear model. The main difference is that the the coefficient denotes a multiplicative change rather than an additive change.

# Prediction time

We’d expect that, if $$x_i=1$$, that $$y_i=1.72 * 1.34 \approx 2.31$$.

Let’s check that.

predict(glm1, newdata = data.frame(X=1))
#>         1
#> 0.8387581

That’s in log-notation. Let’s delog:

predict(glm1, newdata = data.frame(X=1)) %>% exp()
#>        1
#> 2.313492

Works!