Linear regression is among the most commonly used methods for modeling cause-effect relationships.
Take, for example, the idea that the Gross Domestic Product (GDP) drives religiosity. Of course, we should have a strong theory that defends this choice and this direction. Without a convincing theory, it may be argued that the cause-effect relationship runs the other way round, or is something completely different (i.e., some third variable accounts for any association between GDP and religiosity).
Let’s have a look at some data to illustrate.
First, we load some packages.
library(tidyverse)  # data wrangling and plotting
library(broom)      # tidy model output
library(DiagrammeR) # path diagrams
This dataset deals with the aforementioned association:
data <- read_csv("https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/Stat2Data/ReligionGDP.csv")
## Warning: Missing column names filled in: 'X1' [1]
glimpse(data)
## Rows: 44
## Columns: 10
## $ X1 <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Country <chr> "United_States", "Canada", "Argentina", "Bolivia", "Brazil…
## $ Religiosity <dbl> 1.70, 0.88, 1.07, 1.86, 2.34, 1.35, 1.38, 1.95, 1.54, 0.58…
## $ GDP <dbl> 47440, 45085, 8171, 1656, 8295, 10117, 10200, 4448, 11388,…
## $ Africa <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ EastEurope <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1…
## $ MiddleEast <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Asia <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ WestEurope <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0…
## $ Americas <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
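Before drawing any diagrams, a quick scatterplot gives a first impression of the raw association; a minimal sketch using the columns shown above:
# Quick look at the raw association between GDP and religiosity (sketch)
qplot(x = GDP, y = Religiosity, data = data)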
The path (cause-effect) diagram would look like this:
DiagrammeR::grViz("
digraph rmarkdown{
rankdir=LR;
GDP -> Religiosity;
}",
height = 200)
Let’s say our model posits this:
DiagrammeR::grViz("
digraph rmarkdown{
rankdir=LR;
GDP -> Religiosity;
Region -> Religiosity
}",
height = 200)
Of course, a more realistic path diagram might look like this:
DiagrammeR::grViz("
digraph rmarkdown{
graph [fontsize = 10; rankdir=LR]
node [fontname = Helvetica]
GDP; Religiosity; X; Y1; Y2; M1; M2; Region
GDP -> Religiosity
GDP -> M1
M1 -> Religiosity
X -> Religiosity
Y1 -> GDP
Region -> GDP
Region -> Religiosity
Region -> M2
Y2 -> Region
M2 -> Religiosity
}")
So, in short, this model holds that both Region and GDP have an indirect and a direct effect on Religiosity, but that there are other influences on Religiosity, too. And GDP and Region are, in turn, influenced by some other factors.
Anyway, let’s try to fit the simpler model using linear regression.
I have no column Region
My path diagram says I need a column Region, but there’s none in my data set! Instead, there are a bunch of columns such as “Africa” or “MiddleEast”, where each country is coded with \(1\) or \(0\) depending on whether or not it belongs to that region. Variables like these are called dummy variables (or indicator variables, or binary variables).
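As a quick sanity check, we can confirm that the dummy columns partition the countries, i.e., each country carries exactly one region flag; a small sketch using the column names from above:
# Each country should belong to exactly one region (sketch)
data %>%
  mutate(n_regions = Africa + EastEurope + MiddleEast + Asia + WestEurope + Americas) %>%
  count(n_regions)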
Now what?
Basically, a linear model only understands numerical data as predictor(s). Simply put, on the X-axis there must always be numeric data. A dummy variable can be seen as numeric, since it possesses only two distinct values, \(0\) and \(1\). That’s why dummy variables work in a regression. By the way, many statistical models operate this way. So we are in a position to run our regression right away:
lm1 <- lm(Religiosity ~ GDP + Africa + MiddleEast + Asia + WestEurope + Americas, data = data)
tidy(lm1)
## # A tibble: 7 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.926 0.208 4.46 0.0000746
## 2 GDP -0.0000173 0.00000636 -2.72 0.0100
## 3 Africa 1.51 0.253 5.97 0.000000680
## 4 MiddleEast 1.32 0.273 4.84 0.0000230
## 5 Asia 1.26 0.262 4.81 0.0000253
## 6 WestEurope 0.409 0.336 1.22 0.231
## 7 Americas 0.919 0.249 3.68 0.000730
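To see the “numeric data only” point concretely, we could peek at the design matrix that lm() actually built from the formula; a minimal sketch:
# The numeric design matrix behind lm1 (sketch)
head(model.matrix(lm1))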
That’s the way a linear model operates. The output tells us, for each dummy variable, the difference in Religiosity between the states \(0\) and \(1\) of that variable. Hm, that works numerically, but it does not make much sense on its own. What does it mean if all region variables are at \(0\)? Not much, at least not at a glance: implicitly it refers to whichever region was left out of the formula, but the output does not make that explicit.
We are better off taking one region out, defining it as the baseline (comparison) region, and comparing every other region against it. If all the remaining region variables are at \(0\), that means we are looking at this baseline region.
Build a long data frame
The good news is that lm() takes care of that. We don’t have to do much. We just give lm() one column Region and it will sort out the rest.
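Under the hood, lm() converts a character or factor predictor into dummy variables via treatment contrasts, with the first level serving as the baseline. A tiny illustration with a made-up factor (just a sketch, not our data):
# How R dummy-codes a factor by default: treatment contrasts,
# with the first level (alphabetically) as the baseline (sketch)
contrasts(factor(c("Africa", "Americas", "Asia")))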
So let’s build this column:
# reshape the region dummy columns into one key-value pair: Region / is_part
data_long <- gather(data, key = Region, value = is_part, -c(X1, Country, Religiosity, GDP))
# keep only the rows where the country actually belongs to that region
data_long <- filter(data_long, is_part == 1)
glimpse(data_long)
## Rows: 44
## Columns: 6
## $ X1 <dbl> 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 16, 17, 18, 19, 20…
## $ Country <chr> "Ethiopia", "Ghana", "Ivory_Coast", "Kenya", "Mali", "Nige…
## $ Religiosity <dbl> 2.39, 2.26, 2.43, 2.40, 2.60, 2.63, 2.83, 2.11, 2.19, 2.33…
## $ GDP <dbl> 333, 739, 1132, 838, 657, 1401, 1066, 5685, 520, 455, 6561…
## $ Region <chr> "Africa", "Africa", "Africa", "Africa", "Africa", "Africa"…
## $ is_part <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
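As an aside, gather() still works but has been superseded in current tidyr; an equivalent reshaping with pivot_longer() might look like this (a sketch; the rows may come out in a different order, but the content is the same):
# Equivalent reshaping with the newer pivot_longer() (sketch)
data_long_alt <- data %>%
  pivot_longer(cols = -c(X1, Country, Religiosity, GDP),
               names_to = "Region", values_to = "is_part") %>%
  filter(is_part == 1)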
Run the regression with Region
lm2 <- lm(Religiosity ~ GDP + Region, data = data_long)
tidy(lm2)
## # A tibble: 7 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 2.44 0.149 16.4 1.61e-18
## 2 GDP -0.0000173 0.00000636 -2.72 1.00e- 2
## 3 RegionAmericas -0.594 0.237 -2.51 1.65e- 2
## 4 RegionAsia -0.251 0.239 -1.05 3.00e- 1
## 5 RegionEastEurope -1.51 0.253 -5.97 6.80e- 7
## 6 RegionMiddleEast -0.191 0.261 -0.732 4.69e- 1
## 7 RegionWestEurope -1.10 0.362 -3.05 4.22e- 3
As we can see, the effect of GDP remains the same. What has changed is that Africa, being the first value in alphabetical order, is now defined as the baseline region. Hence, Africa does not appear as a predictor; the remaining regions are compared against Africa.
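If we preferred a different baseline, we could set the reference level of the factor ourselves before fitting; a sketch using West Europe as an arbitrary choice:
# Refit with West Europe as the baseline region (sketch)
data_long2 <- mutate(data_long, Region = relevel(factor(Region), ref = "WestEurope"))
lm2b <- lm(Religiosity ~ GDP + Region, data = data_long2)
tidy(lm2b)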
What about model performance?
glance(lm2)
## # A tibble: 1 x 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.721 0.676 0.471 15.9 0.00000000604 6 -25.5 67.0 81.2
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Is model 2 better than model 1 in terms of adjusted \(R^2\)?
round(glance(lm2)$adj.r.squared, 2) == round(glance(lm1)$adj.r.squared, 2)
## [1] TRUE
Both models are equal in their performance, which makes sense: they use the same information, only coded differently.
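As a further check (a sketch), the two fits should even produce identical fitted values country by country, since they are the same model in two different dummy codings; sorting sidesteps the different row order of data and data_long:
# Same model, two codings: the fitted values should coincide (sketch)
all.equal(sort(unname(fitted(lm1))), sort(unname(fitted(lm2))))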
Viz
For visualization and many typical reports, it is necessary to have one column Region.
qplot(y = GDP,
x = Region,
data = data_long,
geom = "boxplot")
Let’s look at Religiosity this time, sort the boxplots, and flip the axes:
qplot(y = Religiosity,
x = reorder(Region, Religiosity),
data = data_long,
geom = "boxplot") +
coord_flip() +
labs(x = "Region")
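Since qplot() is deprecated in recent versions of ggplot2, the same plot can also be written with ggplot() directly; a sketch:
# The same sorted, flipped boxplot with ggplot() (sketch)
ggplot(data_long, aes(x = reorder(Region, Religiosity), y = Religiosity)) +
  geom_boxplot() +
  coord_flip() +
  labs(x = "Region")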