For modeling cause-effect relationships, linear regression is among the most typically used methods.

Take, for example, the idea that the Gross Domestic Product (GDP) drives religiosity. Of course, we should have a strong theory that defends this choice and this directionality. Without a convincing theory it may be argued that the cause-relationship is the other way round or complete different (ie., some third variable accounts for any association between GDP and religiosity).

Let’s have a look at some data to illustrate.

First, we load some packages.

```
library(tidyverse) # do all
library(broom) # tidy
library(DiagrammeR) # graphs
```

This dataset deals with the aforementioned association:

`data <- read_csv("https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/Stat2Data/ReligionGDP.csv")`

`## Warning: Missing column names filled in: 'X1' [1]`

`glimpse(data)`

```
## Rows: 44
## Columns: 10
## $ X1 <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Country <chr> "United_States", "Canada", "Argentina", "Bolivia", "Brazil…
## $ Religiosity <dbl> 1.70, 0.88, 1.07, 1.86, 2.34, 1.35, 1.38, 1.95, 1.54, 0.58…
## $ GDP <dbl> 47440, 45085, 8171, 1656, 8295, 10117, 10200, 4448, 11388,…
## $ Africa <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ EastEurope <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1…
## $ MiddleEast <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Asia <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ WestEurope <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0…
## $ Americas <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
```

The path (cause-effect) diagram would look like this:

```
DiagrammeR::grViz("
digraph rmarkdown{
rankdir=LR;
GDP -> Religiosity;
}",
height = 200)
```

Let’s say our model insists this:

```
DiagrammeR::grViz("
digraph rmarkdown{
rankdir=LR;
GDP -> Religiosity;
Region -> Religiosity
}",
height = 200)
```

Of course, a more realistic path diagram might look like this:

```
DiagrammeR::grViz("
digraph rmarkdown{
graph [fontsize = 10; rankdir=LR]
node [fontname = Helvetica]
GDP; Religiosity; X; Y1; Y2; M1; M2; Region
GDP -> Religiosity
GDP -> M1
M -> Religiosity
X -> Religiosity
Y1 -> GDP
Region -> GDP
Region -> Religiosity
Region -> M2
Y2 -> Region
M2 -> Religiosity
}")
```

So, in short, this model holds that both Region and GDP have an indirect and and direct effect on Religiosity, but there are other influences on Religiosity, too. And GDP and Region are in turn influenced by some other factors.

Anyway, let’s try to model the simple model using a linear model.

# I have no column `Region`

My path diagram says I need a column `Region`

but there’s none in my data set! Instead, there are a bunch of columns like “Africa”, “MiddleEast” where each country is coded with \(1\) or \(0\) if it is part of the region or not, respectively. Variables like those are called *dummy variables* (or indicator variables, or binary variables).

# Now what?

Basically, a linear model only understands numerical data as predictor(s). Simply put, on the X-axis there must always be numeric data. A dummy variable can be seen as numeric - possessing only two distinct values, \(0\) and \(1\)!. That’s why dummy variables work for regression. By the way, many models think this way. So we are in the position to run our regression right away:

```
lm1 <- lm(Religiosity ~ GDP + Africa + MiddleEast + Asia + WestEurope + Americas, data = data)
tidy(lm1)
```

```
## # A tibble: 7 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.926 0.208 4.46 0.0000746
## 2 GDP -0.0000173 0.00000636 -2.72 0.0100
## 3 Africa 1.51 0.253 5.97 0.000000680
## 4 MiddleEast 1.32 0.273 4.84 0.0000230
## 5 Asia 1.26 0.262 4.81 0.0000253
## 6 WestEurope 0.409 0.336 1.22 0.231
## 7 Americas 0.919 0.249 3.68 0.000730
```

That’s the way a linear model operates.

The output tells us the *difference* in Religiosity between the states `0`

and `1`

of that variable. Hm, that works numerically, but does not make much sense. What does it mean if *all* regions are at values \(0\)? Not much.

We are better to take out one region, define it as baseline or comparison region, and compare each region against that region. If all the remaining region variables are at \(0\) that means that we are looking at this baseline region.

# Build a long data frame

The good news is that `lm()`

takes care of that. We don’t have to do much. We just give one column `Region`

and `lm()`

will sort out the rest.

So let’s build this column:

```
data_long <- gather(data, key = Region, value = is_part, -c(X1, Country, Religiosity, GDP))
data_long <- filter(data_long, is_part == 1)
glimpse(data_long)
```

```
## Rows: 44
## Columns: 6
## $ X1 <dbl> 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 16, 17, 18, 19, 20…
## $ Country <chr> "Ethiopia", "Ghana", "Ivory_Coast", "Kenya", "Mali", "Nige…
## $ Religiosity <dbl> 2.39, 2.26, 2.43, 2.40, 2.60, 2.63, 2.83, 2.11, 2.19, 2.33…
## $ GDP <dbl> 333, 739, 1132, 838, 657, 1401, 1066, 5685, 520, 455, 6561…
## $ Region <chr> "Africa", "Africa", "Africa", "Africa", "Africa", "Africa"…
## $ is_part <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
```

# Run the regression with `Region`

```
lm2 <- lm(Religiosity ~ GDP + Region, data = data_long)
tidy(lm2)
```

```
## # A tibble: 7 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 2.44 0.149 16.4 1.61e-18
## 2 GDP -0.0000173 0.00000636 -2.72 1.00e- 2
## 3 RegionAmericas -0.594 0.237 -2.51 1.65e- 2
## 4 RegionAsia -0.251 0.239 -1.05 3.00e- 1
## 5 RegionEastEurope -1.51 0.253 -5.97 6.80e- 7
## 6 RegionMiddleEast -0.191 0.261 -0.732 4.69e- 1
## 7 RegionWestEurope -1.10 0.362 -3.05 4.22e- 3
```

As we can see, the effect of `GDP`

remains the same. What has changed is that `Africa`

, being the first value in alphabetical order, is not defined as the baseline region. Hence, `Africa`

does not appear as predictor, because the remaining regions are compared with Africa.

What about model performance:

`glance(lm2)`

```
## # A tibble: 1 x 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.721 0.676 0.471 15.9 0.00000000604 6 -25.5 67.0 81.2
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

Is model 2 better (\(R^2\)) than model 1?

`round(glance(lm2)$adj.r.squared, 2) == round(glance(lm1)$adj.r.squared, 2)`

`## [1] TRUE`

Both models are equal in their performance, which makes sense since the data is the same.

# Viz

For visualization and many typical reports, it is necessary to have on column `Region`

.

```
qplot(y = GDP,
x = Region,
data = data_long,
geom = "boxplot")
```

Let’s sort the boxplots and flip the axes:

```
qplot(y = Religiosity,
x = reorder(Region, Religiosity),
data = data_long,
geom = "boxplot") +
coord_flip() +
labs(x = "Region")
```