Following up on an earlier post, we will model the voting success of the (most prominent) populist party, AfD, in the recent federal elections. This time, Bayesian modeling techniques will be used, drawing on the excellent textbook my McElreath.
Note that this post is rather a notebook of my thinking, doing, and erring. I’ve made no efforts to hide scaffolding. I think it will be confusing to the uniniate and the initiate as well …
Note: Data is based on data published by the Bundeswahlleiter 2017, (c) Der Bundeswahlleiter, Wiesbaden 2017, https://www.bundeswahlleiter.de/info/impressum.html, Data is made available for unrestricted use provided the source is credited.
Setup
Some packages that will help us:
library(tidyverse)
library(rethinking)
library(pradadata)
library(sjmisc)
library(viridis)
First, get the election data.They can be accessed from the Bundeswahlleiter. More conveniently, I have put them in a R package (after some cleansing):
data("elec_results")
In order to combine socioeconomic data with the election results, we can make use of data from the same source as above. Again accessible via the same R pacakge:
data("socec")
Note that a code book is available for these data:
data("socec_dict")
These data will be used as predictors for modeling the election results.
Third, we will make use of geo data in order to geoplot the modeling results. The Bundeswahlleiter provides such data (again via pradadata
):
data("wahlkreise_shp")
Note: Data objects can also be downloaded from this source.
Data preparation
Now let’s merge the data frames. There will also be some janitor work such as renaming columns etc.
First, change the names of the socec
data to a common format:
socec_renamed <- socec %>%
rename(state = V01,
area_nr = V02,
area_name = V03,
total_n = V06,
germans_n = V07,
for_prop = V08,
pop_move_prop = V11,
pop_migr_background_prop = V19,
income = V26,
unemp_prop = V47)
Compute some more columns:
socec2 <- socec_renamed %>%
mutate(foreigner_n = total_n - germans_n,
pop_move_n = pop_move_prop * total_n,
unemp_n = unemp_prop * total_n / 100,
pop_migr_background_n = pop_migr_background_prop * total_n / 100) %>%
drop_na()
Same thing with the election data, here we only need the criterion (AfD success) and the ID variables for merging:
elec_results2 <- elec_results %>%
rename(afd_votes = AfD_3,
area_nr = district_nr,
area_name = district_name,
votes_total = votes_valid_3) %>%
mutate(afd_prop = afd_votes/votes_total) # valid votes only, and of the present Zweitstimme
Note that we are focusing on the Zweitstimme of the present election (hence the 3
in votes_valid_3
and in AfD_3
).
Merge
socec2 %>%
left_join(elec_results2, by = "area_name") %>%
left_join(wahlkreise_shp, by = c("area_name" = "WKR_NAME")) -> d_all_with_na
After-merge preparations
Add variable for East (1) vs. West (0):
d_all_with_na <- d_all_with_na %>%
mutate(east = case_when(
state %in% c("Mecklenburg-Vorpommern", "Brandenburg", "Berlin", "Sachsen-Anhalt", "Sachsen", "Thüringen") ~ "yes",
TRUE ~ "no"
)
)
d_all_with_na$east_num <- ifelse(d_all_with_na$east == "yes", 1, 0)
Main data frame: d_short and d_short_X
We will also provide a version without the geo data, and in pure (old school) data frame
form (ie., not as tibble)_
d_all_with_na %>%
rename(area_nr = area_nr.x) %>%
select(state,
area_nr,
area_name,
total_n,
germans_n,
foreigner_n,
for_prop,
pop_move_n,
pop_migr_background_n,
income ,
unemp_n,
unemp_prop,
votes_total,
afd_votes,
afd_prop,
state,
east,
east_num) -> d_short_with_nas
names(d_short_with_nas)
Remove NAs:
d_short_with_nas %>%
drop_na() -> d_short_nona
Add state id:
d_short_nona$state_id <- coerce_index(d_short_nona$state)
Multiply by 1000 to get the real numbers so that a count model gets the “right” data
d_short_nona %>%
mutate_at(vars(total_n, germans_n, foreigner_n, pop_move_n,
pop_migr_background_n, unemp_n), funs(. * 1000)
) -> d_short_nona_1000
glimpse(d_short_nona_1000)
Some checks
I know that there are 299 districts in Germany, so let’s check the row numbers
nrow(d_short_nona_1000) == 299
By the way, the number 316 of the data frame d_all_1000
is the sum of the districts plus the 16 federal states plus 1 for Germany itself.
Let’s do a similar check for the district names:
identical(elec_results$district_name,socec$V03)
Does not match. Maybe the order is different?
head(elec_results$district_name, 10)
head(socec$V03, 10)
Looks reassuring.
Let’s check which one is present in one but not in the other data frame:
elec_results2 %>%
select(area_name, area_nr) %>%
full_join(select(socec2, area_name, area_nr), by = "area_nr") -> merge_test
Low let’s filter for missings, ie non-matches:
merge_test %>%
filter(is.na(area_name.x))
“Land insgesamt” indicats the grand total of the federal state. We don’t need that data. Seems like all is fine.
Round values
Round values in order to work with counts:
d_short_not_rounded <- d_short_nona_1000 # backup
We need to convert the variables back to integer, bacause stan need integer for count models (makes sense):
d_short_nona_1000 %>%
mutate_at(vars(total_n:afd_votes), funs(as.integer)) -> d_short_as_int
glimpse(d_short_as_int)
Main dataframe:
d_short <- d_short_as_int
Any NAs left:
anyNA(d_short)
z-transformation
Transform to z-values:
d_short %>%
sjmisc::std() %>%
select(-c(state_z, area_nr_z, area_name_z, state_id_z, east_z, east_num_z)) -> d_short_z
names(d_short_z)
center
d_short_as_int %>%
sjmisc::center() %>%
select(-c(state_c, area_nr_c, area_name_c, state_id_c, east_c, east_num_c)) -> d_short_c
names(d_short_c)
Coerce to normal data frame, no tibble
d_short <- data.frame(d_short)
d_short_z <- data.frame(d_short_z)
d_short_c <- data.frame(d_short_c)
Data export
On a side note, this last data frame appears useful. I will upload it to a common repository, so that others can work with it.
election_modeling_data %>%
write_rds("election_modeling_data.rds")
d_short %>%
write_csv("d_short.csv")
d_short_c %>%
write_csv("d_short_c.csv")
d_short_z %>%
write_csv("d_short_z.csv")
This dataset is now made available via osf, DOI: 10.17605/OSF.IO/2YHR9. Similarly, our data frame d
is available via the same plattform (note the file names of the CSV files).
model output list
I use this list to store the model outputs.
models_stan <- list()
Helper functions
The traceplot function from rethinking
appears to have a bug, or does not work as I expected. Here’s a helper function:
my_traceplot <- function(model) {
rstan::traceplot( model@stanfit, pars=names(model@start[[1]]))
}
m9: Some (metric) linear models with east/west, foreigner, unemp as predictors
d <- d_short_z[, c("afd_prop", "for_prop_z", "unemp_prop_z", "east")]
m9_stan <- map2stan(
alist(
afd_prop ~ dnorm(mu, sigma),
mu <- beta0[east] + beta1*for_prop_z + beta2*unemp_prop_z,
beta0[east] ~ dnorm(0, 1),
beta1 ~ dnorm(0, 1),
beta2 ~ dnorm(0, 1),
sigma ~ dnorm(0, 1)
),
data = d,
chains = 2,
cores = 2
)
my_traceplot(m9_stan)
precis(m9_stan, depth = 2)
models_stan[["m9_stan"]] <- m9_stan
WAIC(m9_stan)
DIC is not being produced, some issues exist.
m9a: metric model with no east/west
d <- d_short_z[, c("afd_prop", "for_prop_z", "unemp_prop_z")]
m9a_stan <- map2stan(
alist(
afd_prop ~ dnorm(mu, sigma),
mu <- alpha + beta1*for_prop_z + beta2*unemp_prop_z,
alpha ~ dnorm(0, 1),
beta1 ~ dnorm(0, 1),
beta2 ~ dnorm(0, 1),
sigma ~ dnorm(0, 1)
),
data = d,
chains = 2,
cores = 2)
Looks good:
my_traceplot(m9a_stan)
precis(m9a_stan)
models_stan[["m9a_stan"]] <- m9a_stan
Here’s the data in the S4 object for the MCMC chains:
m9_stan@stanfit@sim$samples[[1]] %>% str()
precis(m9a_stan, depth = 2)
WAIC(m9a_stan)
DIC(m9a_stan)
m10: East/West as numeric predictor, not as varying intercept
d <- d_short_z[, c("afd_prop", "for_prop_z", "unemp_prop_z", "east_num")]
m10_stan <- map2stan(
alist(
afd_prop ~ dnorm(mu, sigma),
mu <- alpha + beta1*for_prop_z + beta2*unemp_prop_z + beta3*east_num,
alpha ~ dnorm(0, 1),
beta1 ~ dnorm(0, 1),
beta2 ~ dnorm(0, 1),
beta3 ~ dnorm(0, 1),
sigma ~ dnorm(0, 1)
),
data = d,
chains = 2
)
Model output:
precis(m10_stan)
my_traceplot(m10_stan)
DIC(m10_stan)
WAIC(m10_stan)
models_stan[["m10_stan"]] <- m10_stan
Looks ok, but why are the coefficients so large? Strange.
Predictor importance
Let’s compare two models with only one predictor - either unemployment or migration or foreigners to see which model has a better entropy.
foreigner prop
d <- d_short_z[, c("afd_prop", "for_prop_z", "unemp_prop_z", "east")]
m11a_stan <- map2stan(
alist(
afd_prop ~ dnorm(mu, sigma),
mu <- alpha + beta1*for_prop_z,
alpha ~ dnorm(0, 1),
beta1 ~ dnorm(0, 1),
sigma ~ dnorm(0, 1)
),
data = d,
chains = 2)
Looks good:
my_traceplot(m11a_stan)
precis(m11a_stan)
models_stan[["m11a_stan"]] <- m11a_stan
unemployment
d <- d_short_z[, c("afd_prop", "for_prop_z", "unemp_prop_z", "east")]
m11c_stan <- map2stan(
alist(
afd_prop ~ dnorm(mu, sigma),
mu <- alpha + beta1*unemp_prop_z,
alpha ~ dnorm(0, 1),
beta1 ~ dnorm(0, 1),
sigma ~ dnorm(0, 1)
),
data = d,
chains = 2,
cores = 2)
my_traceplot(m11c_stan)
precis(m11c_stan)
models_stan[["m11c_stan"]] <- m11c_stan
east only
d <- d_short_z[, c("afd_prop", "for_prop_z", "east")]
d$east_num <- ifelse(d$east == "yes", 1, 0)
m11d_stan <- map2stan(
alist(
afd_prop ~ dnorm(mu, sigma),
mu <- beta1*east_num,
beta1 ~ dnorm(0, 1),
sigma ~ dnorm(0, 1)
),
data = d,
chains = 2,
cores = 2)
Looks good:
my_traceplot(m11d_stan)
precis(m11d_stan)
WAIC(m11d_stan)
models_stan[["m11d_stan"]] <- m11d_stan
Multilevel normal models
area as multilevel
d <- d_short_z[, c("afd_prop", "area_nr")]
m12_stan <- map2stan(
alist(
afd_prop ~ dnorm(mu, sigma),
mu <- beta0[area_nr],
beta0[area_nr] ~ dnorm(a, sigma),
a ~ dnorm(0, 1),
sigma ~ dnorm(0, 1),
alpha ~ dnorm(0, 1)
),
data = d,
chains = 2,
cores = 2
)
#> Error in map2stan(alist(afd_prop ~ dnorm(mu, sigma), mu <- beta0[area_nr], : Something went wrong in at least one chain. Debug your model while setting chains=1 and cores=1. Once the model is working with a single chain and core, try using multiple chains/cores again.
#> argument is of length zero
precis(m12_stan, depth = 2) %>% head()
#coeftab(m12_stan)
my_traceplot(m12_stan)
models_stan[["m12_stan"]] <- m12_stan
Too many coefficients…
state as multilevel
d <- d_short_z[, c("afd_prop", "state_id")]
m13_stan <- map2stan(
alist(
afd_prop ~ dnorm(mu, sigma),
mu <- beta0[state_id],
beta0[state_id] ~ dnorm(0, sigma2),
sigma ~ dnorm(0, 1),
sigma2 ~ dnorm(0, 1)
),
data = d,
cores = 2,
chains = 2
)
#> Error in map2stan(alist(afd_prop ~ dnorm(mu, sigma), mu <- beta0[state_id], : Something went wrong in at least one chain. Debug your model while setting chains=1 and cores=1. Once the model is working with a single chain and core, try using multiple chains/cores again.
#> argument is of length zero
precis(m13_stan, depth = 2)
coeftab(m13_stan)
WAIC(m13_stan)
my_traceplot(m13_stan)
models_stan[["m13_stan"]] <- m13_stan
Rhat is ok. Traceplot indicates problems. neff indicates problems.
east + for_prop + unemp_prop
d <- d_short_z[, c("afd_prop", "for_prop_z", "unemp_prop_z", "east")]
m14_stan <- map2stan(
alist(
afd_prop ~ dnorm(mu, sigma),
mu <- beta0[east] + beta1*for_prop_z + beta2*unemp_prop_z,
beta0[east] ~ dnorm(0, sigma2),
sigma ~ dnorm(0, 1),
beta1 ~ dnorm(0, 1),
beta2 ~ dnorm(0, 1),
sigma2 ~ dnorm(0, 1)
),
data = d,
chains = 2,
cores = 2
)
#> Error in map2stan(alist(afd_prop ~ dnorm(mu, sigma), mu <- beta0[east] + : Something went wrong in at least one chain. Debug your model while setting chains=1 and cores=1. Once the model is working with a single chain and core, try using multiple chains/cores again.
#> argument is of length zero
precis(m14_stan, depth = 2)
coeftab(m14_stan)
my_traceplot(m14_stan)
models_stan[["m14_stan"]] <- m14_stan
state + for_prop + unemp_prop
d <- d_short_z[, c("afd_prop", "for_prop_z", "unemp_prop_z", "state_id")]
m15_stan <- map2stan(
alist(
afd_prop ~ dnorm(mu, sigma),
mu <- beta0[state_id] + beta1*for_prop_z + beta2*unemp_prop_z,
beta0[state_id] ~ dnorm(0, sigma2),
sigma ~ dnorm(0, 1),
sigma2 ~ dnorm(0, 1),
beta1 ~ dnorm(0, 1),
beta2 ~ dnorm(0, 1)
),
data = d,
chains = 2,
cores = 2
)
#> Error in map2stan(alist(afd_prop ~ dnorm(mu, sigma), mu <- beta0[state_id] + : Something went wrong in at least one chain. Debug your model while setting chains=1 and cores=1. Once the model is working with a single chain and core, try using multiple chains/cores again.
#> argument is of length zero
precis(m15_stan, depth = 2)
coeftab(m15_stan)
my_traceplot(m15_stan)
models_stan[["m15_stan"]] <- m15_stan
Normal null model
d <- d_short_z[, c("afd_prop"), drop = FALSE]
m16_stan <- map2stan(
alist(
afd_prop ~ dnorm(mu, sigma),
mu <- alpha,
alpha ~ dnorm(0, 1),
sigma ~ dnorm(0, 1)
),
data = d,
chains = 2
)
#> Error in map2stan(alist(afd_prop ~ dnorm(mu, sigma), mu <- alpha, alpha ~ : Something went wrong, when calling Stan. Check any debug messages for clues, detective.
#> argument is of length zero
precis(m16_stan, depth = 2)
coeftab(m16_stan)
my_traceplot(m16_stan)
models_stan[["m16_stan"]] <- m16_stan
Compare Normal models
Now check which one has a lower DIC or WAIC:
stan_normal_models <- list("m9_stan" = m9_stan,
"m9a_stan" = m9a_stan,
"m10_stan" = m10_stan,
"m11a_stan" = m11a_stan,
"m11c_stan" = m11c_stan,
"m11d_stan" = m11d_stan,
"m12_stan" = m12_stan,
"m13_stan" = m13_stan,
"m14_stan" = m14_stan,
"m15_stan" = m15_stan,
"m16_stan" = m16_stan)
stan_normal_models_vec <- c("m9_stan" = m9_stan,
"m9a_stan" = m9a_stan,
"m10_stan" = m10_stan,
"m11a_stan" = m11a_stan,
"m11c_stan" = m11c_stan,
"m11d_stan" = m11d_stan,
"m12_stan" = m12_stan,
"m13_stan" = m13_stan,
"m14_stan" = m14_stan,
"m15_stan" = m15_stan,
"m16_stan" = m16_stan)
#stan_normal_models
#save(stan_normal_models, file = "stan_normal_models.Rda")
stan_model_comparison <- compare(m9_stan, m9a_stan,
m10_stan,
m11a_stan, m11c_stan, m11d_stan,
m12_stan, m13_stan, m14_stan, m15_stan,
m16_stan)
stan_model_comparison
save(stan_model_comparison, file = "stan_model_comparison.Rda")
m11b
is better than m11a
- unemployment apears to be more important to predict AfD success compared to migration rate in the area.
Get best (normal) model
Which model ist best? Let’s define the model with the lowest WAIC as best:
stan_model_comparison@output %>%
rownames_to_column(var = "model_name") %>%
slice(which.min(WAIC)) -> best_model
And now extract the name of the best model:
best_model_name <-
best_model %>%
pull(model_name)
Geo plotting
AfD success in the election:
wahlkreise_shp %>%
left_join(select(d_short, area_nr, afd_prop), by = c("WKR_NR" = "area_nr")) %>%
ggplot() +
geom_sf(aes(fill = afd_prop)) +
theme_void() +
scale_fill_viridis() +
labs(fill="Afd votes\n(Zweitstimme)",
caption = "Data provided by the Bundeswahlleiter 2017") -> p_afd
p_afd
Unemployment rates in Germany per district:
wahlkreise_shp %>%
left_join(select(d_short, area_nr, unemp_n, total_n), by = c("WKR_NR" = "area_nr")) %>%
mutate(unemp_prop = unemp_n / total_n) %>%
ggplot() +
geom_sf(aes(fill = unemp_prop)) +
theme_void() +
scale_fill_viridis() +
labs(fill="unemployment rate",
caption = "Data provided by the Bundeswahlleiter 2017") -> p_unemp
p_unemp
Foreigner rates:
wahlkreise_shp %>%
left_join(select(d_short, area_nr, for_prop, total_n), by = c("WKR_NR" = "area_nr")) %>%
ggplot() +
geom_sf(aes(fill = for_prop)) +
theme_void() +
scale_fill_viridis() +
labs(fill="Foreigner rate",
caption = "Data provided by the Bundeswahlleiter 2017") -> p_foreign
p_foreign
Joint diagrams
library(gridExtra)
grid.arrange(p_unemp, p_afd, nrow = 1)
grid.arrange(p_foreign, p_afd, nrow = 1)
Computing prediction errors
Here’s a function to compute the modeling error, defined as the absolute difference of the estimated model value (of afd proportion) minus the observed value (of afd proportion).
comp_error <- function(model, data = d_short_z, fun = mean) {
posterior_per_person <- link(model)
as_tibble(posterior_per_person) %>%
summarise_all(fun) %>%
gather() %>%
rename(estimate = value) %>%
mutate(afd_prop = data$afd_votes / data$votes_total,
error = abs(estimate - afd_prop)) %>%
pull(error) -> error_vec
return(error_vec)
}
Apply the function on all models:
model_error <- lapply(stan_normal_models, comp_error)
#model_error
#save(model_error, file = "model_error.Rda")
Compute the median absolute error:
md_abs_error_all_models <- sapply(model_error, median) %>% unlist()
md_abs_error_all_models
best_model_name <- md_abs_error_all_models[which.min(md_abs_error_all_models)] %>% names()
Also, compute the IQR of the errors:
modell_error_IQR <- lapply(stan_normal_models, comp_error, fun = IQR)
Visualizing prediction error
Some preparation:
model_error %>%
as.data.frame() -> model_error_df
#glimpse(model_error_df)
# model_names_binom <- list("m0_stan", "m1_stan", "m2_stan", "m3_stan",
# "m4_stan", "m5_stan", "m6_stan", "m7_stan")
model_names <- c("m9_stan",
"m9a_stan",
"m10_stan" ,
"m11a_stan",
"m11c_stan",
"m11d_stan",
"m12_stan",
"m13_stan",
"m14_stan",
"m15_stan" ,
"m16_stan")
names(model_error_df) <- model_names
model_error_df %>%
mutate(afd_prop = d_short$afd_prop,
id = 1:nrow(model_error_df)) -> model_error_df
modell_error_IQR %>%
as.data.frame() -> model_error_IQR_df
names(model_error_IQR_df) <- model_names
model_error_IQR_df %>%
mutate(id = 1:nrow(model_error_IQR_df)) -> model_error_IQR_df
Convert to long version for plotting:
model_error_IQR_df %>%
gather(key = model, value = iqr, -c(id)) %>%
mutate(stat = "IQR") -> model_error_IQR_df_long
model_error_df %>%
gather(key = model, value = error, -c(afd_prop, id)) %>%
mutate(stat = "median") -> model_error_df_long
model_error_df_long %>%
bind_rows(model_error_IQR_df_long) -> model_error_long
model_error_df_long %>%
left_join(model_error_IQR_df_long, by = c("id", "model")) %>%
select(-c(stat.x, stat.y)) -> model_error_md_iqr
glimpse(model_error_md_iqr)
#save(model_error_md_iqr, file = "model_error_md_iqr.RDa")
Now plot:
as_tibble(md_abs_error_all_models) %>%
mutate(model = unlist(model_names),
best_model = ifelse(model_names == best_model_name, TRUE, FALSE)) -> md_abs_error_all_models
glimpse(md_abs_error_all_models)
model_error_md_iqr %>%
arrange(-error) %>%
ggplot(aes(x = id)) +
facet_wrap(~model) +
geom_hline(aes(yintercept = value), data = md_abs_error_all_models) +
geom_errorbar(aes(ymin = error - (iqr/2),
ymax = error + (iqr/2)),
alpha = .8,
color = "gray40") +
geom_point(aes(y = error)) +
geom_label(aes(label = round(value, 3),
color = best_model), x = 1, y = .2,
data = md_abs_error_all_models,
hjust = 0) +
guides(color=FALSE) +
scale_color_manual(values = c("red", "darkgreen"))
Plotting prediction error against observed values
posterior_per_person_best_model <- link(m15_stan)
posterior_per_person_best_model %>%
as_tibble() %>%
summarise_all(median) %>%
gather() %>%
rename(estimate = value) %>%
add_column(area_nr = 1:nrow(.)) %>%
full_join(d_short) %>%
mutate(error = abs(estimate - afd_prop),
top05 = percent_rank(error) >= .95) -> d_short_w_pred_err
polygon_pos <- data.frame(
x = c(0, 0.4, 0.4, 0, 0.4, 0, 0 ),
y = c(0, 0, 0.4, 0, 0.4, 0.4, 0),
value = c("underestimates", "underestimates", "underestimates", "overestimates", "overestimates", "overestimates", "overestimates")
)
d_short_w_pred_err %>%
ggplot() +
aes(x = afd_prop, y = estimate) +
geom_abline(slope = 1, intercept = 0, color = "grey60") +
geom_polygon(data = polygon_pos, aes(x = x, y = y, fill = value), alpha = .1) +
geom_point(aes(color = error,
shape = east),
alpha = .6,
size = 2) +
ggrepel::geom_label_repel(aes(label = area_name), data = filter(d_short_w_pred_err, top05 == TRUE)) +
annotate("text", x = 0.4, y = 0, label = "model understimates", hjust = 1, vjust = 0) +
annotate("text", x = 0, y = 0.4, label = "model overestimates", hjust = 0, vjust = 1) +
labs(x = "Observed AfD votes (proportion)",
y = "Estimated AfD votes (proportion)",
caption = "n=299 electoral districts; data provided by Bundeswahlleiter 2017") +
guides(fill = FALSE)
Sensecheck
I just wondered what the bivariate correlations of the predictors to afd_vote
is`. Let’s check that.
library(GGally)
ggpairs(d_short, columns = c("foreigner_n", "unemp_n", "afd_votes"))
There is a substantial negatie correlations with number of foreigners and a positive correlation with unemployment, but (nearly) no correlation with unemployment numbers. But our analysis above suggest that these associations are spurious once the state is controlled for. To make things easy, let’s pick two states, say Sachsen und Bayern and check in each state the assoctions.
d_short %>%
filter(state == "Bayern") %>%
ggpairs( columns = c("foreigner_n", "unemp_n", "afd_votes"))
d_short %>%
filter(state == "Sachsen") %>%
ggpairs( columns = c("foreigner_n", "unemp_n", "afd_votes"))
Interestingly, in Sachsen manifests a strong negative correlation between AfD and foreigner rates.
Check some model assumptions
Let’s compute the predictions for each model:
model_predictions <- lapply(stan_normal_models, link)
Now plot the predictions against the error, as advised by Gelman and Hill.
First, get predictions of the best moel:
model_predictions[[best_model_name]] %>%
as_tibble %>%
summarise_all(mean) %>%
gather() -> best_model_preds
# save(best_model_preds, file = "best_model_preds.Rda")
Each observation is one row in this data frame.
Similarly, get errors of the best model:
model_error[[best_model_name]] %>%
as_tibble() %>%
rename(error = value) %>%
mutate(pred = best_model_preds$value) -> best_model_pred_err
best_model_pred_err %>%
ggplot() +
aes(x = pred, y = error) +
geom_hline(yintercept = quantile(best_model_pred_err$error, .5),
color = "grey60") +
geom_hline(yintercept = quantile(best_model_pred_err$error, .975),
color = "grey80", , linetype = "dashed") +
geom_hline(yintercept = quantile(best_model_pred_err$error, .025),
color = "grey80", , linetype = "dashed") +
geom_point() +
labs(title = best_model_name,
xlab = "Model predictions",
ylab = "Model error",
caption = "Note. Horizontal lines denote .025, .5, and .975 quantiles, respectively")
Check posterior distribution of predictors
Let’s have a look at the posterior distribution of the best_model
.
post_best_model <- extract.samples(stan_normal_models[[best_model_name]])
This object is a list. Let’s convert it to a data frame for easier plotting.
post_best_model_df <- data_frame(
sigma = post_best_model[["sigma"]],
sigma2 = post_best_model[["sigma2"]],
beta1 = post_best_model[["beta1"]],
beta2= post_best_model[["beta2"]]
)
And now we plot a number of histograms:
post_best_model_df %>%
gather() %>%
rename(coef = key) %>%
ggplot() +
aes(x = value) +
facet_wrap(~coef, scales = "free") +
geom_histogram() -> p_post_best_model
p_post_best_model
Now compute summary statistics:
post_best_model_df %>%
gather() %>%
rename(coef = key) %>%
group_by(coef) %>%
summarise(q05 = quantile(value, .05),
q50 = quantile(value, .5),
q95 = quantile(value, .95),
value = mean(value)
) -> post_best_model_df_sum
#gather(key = my_quantile, value = value, -coef) -> post_best_model_df_sum
head(post_best_model_df_sum)
Now plot both:
p_post_best_model +
geom_rect(data = post_best_model_df_sum,
aes(xmin = q05,
xmax = q95,
ymin = 0,
ymax = Inf),
fill = "red",
alpha = .2) +
theme(axis.text.y=element_blank(),
axis.ticks.y = element_blank()) +
labs(title = best_model_name,
y = "",
caption = "Note. Shaded areas demark 90% mass intervals")