Load packages
library(tidyverse)
library(hrbrthemes)
library(MASS)
library(moments)
Why you should spell out your model explicitly
Often, the assumptions of widely used models, such as linear models, appear opaque. Why is heteroscedasticity important? Where can I find a list of the model assumptions I need to consider?
As it turns out, there are straightforward answers to these (and similar) questions. The solution is to spell out your model explicitly. All “assumptions” can then easily be read off from the model specification.
Example of an explicit (and complex) model
Say, we would like to model Covid-19 cases (\(C\)) as a function of tests applied in some population. We could specify our model like this:
\[ C_i \sim \mathcal{P}(\lambda_i) \\ \log\lambda_i = \alpha + \beta_1\,x_{i}^{s} \\ \alpha \sim \mathcal{E}(1) \\ \beta_1 \sim \mathcal{E}(0.1) \]
Here, we model \(C\) as a Poisson (\(\mathcal{P}\)) distributed variable, because \(C\) is a count variable, and the Poisson distribution is the maximum entropy distribution for counts. However, this is a priori reasoning; after seeing the data we might want to consider a different model, despite the danger of overfitting.
Recall that \(\lambda\) must not be negative. That’s why we put a “log” in front of \(\lambda\) in the linear model line. Similarly, we can only take the log of a positive number, so we need to make sure that \((\alpha + \beta_1 \, X) > 0\). Here, the Exponential distribution (\(\mathcal{E}\)) is handy, as it makes sure that \(\alpha\) and \(\beta_1\) are positive.
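Just to convince ourselves that an Exponential really rules out negative values, here is a quick check (a sketch, not part of the analysis):
set.seed(42)
min(rexp(n = 1e5, rate = 1))  # the smallest of 100,000 Exponential draws is still above zero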
It’s helpful to use standardized predictors, \(x_i^s\), because then the intercept \(\alpha\) corresponds (on the log scale) to the expected value of \(C\) at an average value of the predictor, among other things. It would be a fruitful discussion to contemplate the usefulness of the mean for long-tailed distributions (see Taleb’s Incerto), but that’s beyond the scope of this post.
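A minimal sketch of what standardizing means, using a made-up vector x rather than real test counts:
set.seed(42)
x <- rexp(n = 100, rate = .5)   # some hypothetical raw predictor
x_s <- (x - mean(x)) / sd(x)    # center and scale: mean 0, sd 1
round(c(mean(x_s), sd(x_s)), 2)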
We must be careful with the log-normal (\(\mathcal{LN}\)) distribution, because log-normals are fat-tailed, so the mean can get huge. Seeing is believing, so let’s consider \(A \sim \mathcal{LN}(0, 10)\) and draw some samples from that population:
set.seed(42)
a <- rlnorm(n = 1000, meanlog = 0, sdlog = 10)
options(scipen = 3)
mean(a)
#> [1] 1716379931934
format(mean(a), scientific = T)
#> [1] "1.71638e+12"
For a smaller sd:
set.seed(42)
a <- rlnorm(n = 1000, meanlog = 4, sdlog = 1)
options(scipen = 3)
mean(a)
#> [1] 88.55222
format(mean(a), scientific = T)
#> [1] "8.855222e+01"
Let’s apply the same reasoning to \(\beta_{1}\).
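A quick check of what the \(\mathcal{E}(0.1)\) prior implies (a sketch, not part of the later simulation): all draws are positive, and the mean is \(1/0.1 = 10\).
set.seed(42)
b1 <- rexp(n = 1000, rate = 0.1)
c(mean = mean(b1), min = min(b1))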
Read off modelling assumptions
Some of the “assumptions” implied by the above model are:
- Certain distributions of the variables involved
- Certain functional forms map the input variables to the output variable
- The conditional errors (conditional on \(\hat{Y}\)) are i.i.d., independent and identically distributed. That is, at each value of \(\hat{Y}\) the errors are drawn from the same distribution, and they are drawn without correlation (e.g., drawn randomly from an infinite population). This includes homoscedasticity, as the residuals then have the same variance irrespective of \(\hat{Y}\) (see the sketch after this list).
- Our ignorance about the “true” value of some parameter is incorporated. For example, we are not sure about the true value of \(\beta_1\); there is uncertainty involved, which is reflected by the fact that \(\beta_1\) is drawn from a probability distribution.
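A minimal sketch of the homoscedasticity point, with made-up fitted values and residuals (not derived from the Covid model): in the homoscedastic case the spread of the residuals stays constant across \(\hat{Y}\); in the heteroscedastic case it grows.
set.seed(42)
y_hat <- seq(1, 10, length.out = 200)
resid_homo <- rnorm(n = 200, mean = 0, sd = 1)            # same spread at every fitted value
resid_hetero <- rnorm(n = 200, mean = 0, sd = y_hat / 2)  # spread grows with the fitted value
ggplot(data.frame(y_hat, resid_homo, resid_hetero)) +
  aes(x = y_hat) +
  geom_point(aes(y = resid_homo), color = "dodgerblue") +
  geom_point(aes(y = resid_hetero), color = "firebrick")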
Simulate data
Say, \(n=1000\)
According to our model:
set.seed(42)
n <- 1000
alpha <- rexp(n = n, rate = 1)    # draws from the prior for alpha
beta <- rexp(n = n, rate = 0.1)   # draws from the prior for beta_1
See:
ggplot(data.frame(alpha)) +
aes(x = alpha) +
geom_density()
ggplot(data.frame(beta)) +
aes(x = beta) +
geom_density()
As we do not have real data for \(X\), we need to simulate it. Let’s assume that \(X\) is again exponentially distributed (to avoid negative values):
set.seed(42)
X <- rexp(1000, rate = .5)
ggplot(data.frame(X)) +
aes(x = X) +
geom_density() +
theme_ipsum_rc()
Compute the likelihood
lambda_log <- (alpha + beta*X) %>% log()  # log of the linear predictor, which is positive by construction
lambda <- exp(lambda_log)                 # back on the original scale: lambda equals alpha + beta*X
ggplot(data.frame(lambda)) +
aes(x = lambda) +
geom_density()
Oh my, that’s a long tail.
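To put a number on that, we can compute the skewness using the moments package loaded above (for a symmetric distribution, skewness is roughly 0):
skewness(lambda)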
Simulate \(C\)
set.seed(42)
C <- rpois(n = n, lambda = lambda)  # each C_i is drawn with its own lambda_i
Put the data together
C_df <- tibble(
C = C,
lambda = lambda,
lambda_log = lambda_log,
X = X,
alpha = alpha,
beta = beta)
ggplot(C_df) +
aes(x= C) +
geom_density() +
theme_ipsum_rc(grid = "Y")
Fit distributions
pois_param <- fitdistr(C_df$C,
densfun = "poisson")$estimate
str(pois_param)
#> Named num 23.2
#> - attr(*, "names")= chr "lambda"
pois_param
#> lambda
#> 23.184
Supported distributions are:
“beta”, “cauchy”, “chi-squared”, “exponential”, “gamma”, “geometric”, “log-normal”, “lognormal”, “logistic”, “negative binomial”, “normal”, “Poisson”, “t” and “weibull”
For comparison, let’s fit another distribution, say a log-normal.
lnorm_param <- fitdistr(C_df$C[C_df$C > 0],  # the log-normal needs strictly positive values
                        densfun = "lognormal")$estimate
lnorm_param
#> meanlog sdlog
#> 2.389367 1.337632
… or a Gamma:
gamma_param <- fitdistr(C_df$C[C_df$C > 0],
densfun = "gamma")$estimate
gamma_param
#> shape rate
#> 0.71162128 0.02796269
Compare fit
To compare the fits, a QQ plot can be helpful.
Poisson
C_df %>%
ggplot(aes(sample = C)) +
geom_qq(distribution = qpois,
dparams = as.list(pois_param),
color = "dodgerblue") +
stat_qq_line(distribution = qpois,
dparams = as.list(pois_param),
color = "dodgerblue") +
theme_ipsum_rc()
Gamma
C_df %>%
ggplot(aes(sample = C)) +
geom_qq(distribution = qgamma,
dparams = as.list(gamma_param),
color = "darkseagreen") +
stat_qq_line(distribution = qgamma,
dparams = as.list(gamma_param),
color = "darkseagreen") +
theme_ipsum_rc()
Log-normal
C_df %>%
ggplot(aes(sample = C)) +
geom_qq(distribution = qlnorm,
dparams = as.list(lnorm_param),
color = "firebrick") +
stat_qq_line(distribution = qlnorm,
dparams = as.list(lnorm_param),
color = "firebrick") +
theme_ipsum_rc()
As can be seen, none of the distributions fits well, even though we simulated the data to be Poisson distributed. This may be due to the long tail we induced: each observation was drawn with its own \(\lambda_i\), so the marginal distribution of \(C\) is a Poisson mixture, which is overdispersed relative to any single Poisson. Maybe there are other things going on. Any ideas? Let me know.
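One quick way to check that overdispersion hunch, using the simulated data from above (for a single Poisson, mean and variance are equal):
c(mean = mean(C_df$C), var = var(C_df$C))  # a variance far above the mean suggests overdispersion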