Dummy variables and regression

For modeling cause-effect relationships, linear regression is among the most commonly used methods.

Take, for example, the idea that the Gross Domestic Product (GDP) drives religiosity. Of course, we should have a strong theory that defends this choice and this directionality. Without a convincing theory it may be argued that the causal relationship runs the other way round, or is something completely different (i.e., some third variable accounts for any association between GDP and religiosity).

Let’s have a look at some data to illustrate.

First, we load some packages.

library(tidyverse)  # do all
library(broom)  # tidy
library(DiagrammeR)  # graphs

This dataset deals with the aforementioned association:

data <- read_csv("https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/Stat2Data/ReligionGDP.csv")
## Warning: Missing column names filled in: 'X1' [1]
glimpse(data)
## Rows: 44
## Columns: 10
## $ X1          <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ Country     <chr> "United_States", "Canada", "Argentina", "Bolivia", "Brazil…
## $ Religiosity <dbl> 1.70, 0.88, 1.07, 1.86, 2.34, 1.35, 1.38, 1.95, 1.54, 0.58…
## $ GDP         <dbl> 47440, 45085, 8171, 1656, 8295, 10117, 10200, 4448, 11388,…
## $ Africa      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ EastEurope  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1…
## $ MiddleEast  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Asia        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ WestEurope  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0…
## $ Americas    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

The path (cause-effect) diagram would look like this:

DiagrammeR::grViz("
                  digraph rmarkdown{
                  rankdir=LR;
                  GDP -> Religiosity;
                  }",
                  height = 200)

Let’s say our model posits this:

DiagrammeR::grViz("
                  digraph rmarkdown{
                  rankdir=LR;
                  GDP -> Religiosity;
                  Region -> Religiosity
                  }",
                  height = 200)

Of course, a more realistic path diagram might look like this:

DiagrammeR::grViz("
                  digraph rmarkdown{
                  graph [fontsize = 10; rankdir=LR]

                  node [fontname = Helvetica]
                  GDP; Religiosity; X; Y1; Y2; M1; M2; Region
                  
                  GDP -> Religiosity
                  GDP -> M1
                  M1 -> Religiosity
                  X -> Religiosity
                  Y1 -> GDP
                  Region -> GDP
                  Region -> Religiosity
                  Region -> M2
                  Y2 -> Region
                  M2 -> Religiosity
                  }")

So, in short, this model holds that both Region and GDP have an indirect and a direct effect on Religiosity, but that there are other influences on Religiosity, too. And GDP and Region are in turn influenced by some other factors.

Anyway, let’s start by fitting the simple model using linear regression.

I have no column Region

My path diagram says I need a column Region but there’s none in my data set! Instead, there are a bunch of columns like “Africa”, “MiddleEast” where each country is coded with \(1\) or \(0\) if it is part of the region or not, respectively. Variables like those are called dummy variables (or indicator variables, or binary variables).

Now what?

Basically, a linear model only understands numerical data as predictor(s). Simply put, on the X-axis there must always be numeric data. A dummy variable qualifies: it is numeric, possessing only two distinct values, \(0\) and \(1\). That’s why dummy variables work for regression. By the way, many other models handle categorical predictors the same way. So we are in the position to run our regression right away:
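To see how a categorical predictor gets turned into numeric dummy columns behind the scenes, we can inspect the design matrix with model.matrix(). A minimal sketch on a toy data frame (the values here are made up for illustration):

```r
# Toy data: four countries in three regions (made-up values)
toy <- data.frame(
  religiosity = c(1.2, 2.1, 0.8, 1.9),
  region      = c("Africa", "Asia", "Africa", "Americas")
)

# model.matrix() shows the numeric predictors lm() actually sees:
# an intercept column plus one 0/1 dummy per non-baseline region
model.matrix(religiosity ~ region, data = toy)
```

Note that only two dummy columns appear for three regions; the first level in alphabetical order ("Africa") is absorbed into the intercept.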

lm1 <- lm(Religiosity ~ GDP + Africa + MiddleEast + Asia + WestEurope + Americas, data = data)
tidy(lm1)
## # A tibble: 7 x 5
##   term          estimate  std.error statistic     p.value
##   <chr>            <dbl>      <dbl>     <dbl>       <dbl>
## 1 (Intercept)  0.926     0.208           4.46 0.0000746  
## 2 GDP         -0.0000173 0.00000636     -2.72 0.0100     
## 3 Africa       1.51      0.253           5.97 0.000000680
## 4 MiddleEast   1.32      0.273           4.84 0.0000230  
## 5 Asia         1.26      0.262           4.81 0.0000253  
## 6 WestEurope   0.409     0.336           1.22 0.231      
## 7 Americas     0.919     0.249           3.68 0.000730

That’s the way a linear model operates.

For each dummy variable, the output tells us the difference in Religiosity between its states \(0\) and \(1\). Hm, that works numerically, but it does not make much sense. What does it mean if all region variables are at \(0\)? Not much.

We are better off taking out one region, defining it as the baseline or comparison region, and comparing each other region against it. If all the remaining region variables are at \(0\), that means we are looking at the baseline region.
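In R, the baseline is simply the first level of the factor, and we can change it with relevel() if we prefer a different comparison region. A quick sketch on a toy factor:

```r
# Factor levels are sorted alphabetically by default,
# so the first level (the baseline) is "Africa"
region <- factor(c("Africa", "Asia", "Americas", "Africa"))
levels(region)[1]
# [1] "Africa"

# Make "Asia" the baseline instead
region2 <- relevel(region, ref = "Asia")
levels(region2)[1]
# [1] "Asia"
```

Passing region2 to lm() would then report every other region relative to Asia.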

Build a long data frame

The good news is that lm() takes care of that. We don’t have to do much. We just provide one column Region and lm() will sort out the rest.

So let’s build this column:

data_long <- gather(data, key = Region, value = is_part, -c(X1, Country, Religiosity, GDP))
data_long <- filter(data_long, is_part == 1)
glimpse(data_long)
## Rows: 44
## Columns: 6
## $ X1          <dbl> 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 16, 17, 18, 19, 20…
## $ Country     <chr> "Ethiopia", "Ghana", "Ivory_Coast", "Kenya", "Mali", "Nige…
## $ Religiosity <dbl> 2.39, 2.26, 2.43, 2.40, 2.60, 2.63, 2.83, 2.11, 2.19, 2.33…
## $ GDP         <dbl> 333, 739, 1132, 838, 657, 1401, 1066, 5685, 520, 455, 6561…
## $ Region      <chr> "Africa", "Africa", "Africa", "Africa", "Africa", "Africa"…
## $ is_part     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
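As an aside, gather() still works but has been superseded in tidyr by pivot_longer(). The same reshaping can be written like this (a sketch on a toy frame mimicking the dummy columns):

```r
library(tidyr)
library(dplyr)

# Toy stand-in for the dummy-coded data (made-up values)
toy <- tibble(
  Country = c("A", "B"),
  Africa  = c(1, 0),
  Asia    = c(0, 1)
)

# Stack the region dummies into one Region column,
# then keep only the rows where the dummy fires
toy_long <- toy %>%
  pivot_longer(cols = c(Africa, Asia),
               names_to = "Region",
               values_to = "is_part") %>%
  filter(is_part == 1)

toy_long
```

For the real data set, cols = c(Africa, EastEurope, MiddleEast, Asia, WestEurope, Americas) would replace the toy columns.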

Run the regression with Region

lm2 <- lm(Religiosity ~ GDP + Region, data = data_long)
tidy(lm2)
## # A tibble: 7 x 5
##   term               estimate  std.error statistic  p.value
##   <chr>                 <dbl>      <dbl>     <dbl>    <dbl>
## 1 (Intercept)       2.44      0.149         16.4   1.61e-18
## 2 GDP              -0.0000173 0.00000636    -2.72  1.00e- 2
## 3 RegionAmericas   -0.594     0.237         -2.51  1.65e- 2
## 4 RegionAsia       -0.251     0.239         -1.05  3.00e- 1
## 5 RegionEastEurope -1.51      0.253         -5.97  6.80e- 7
## 6 RegionMiddleEast -0.191     0.261         -0.732 4.69e- 1
## 7 RegionWestEurope -1.10      0.362         -3.05  4.22e- 3

As we can see, the effect of GDP remains the same. What has changed is that Africa, being the first value in alphabetical order, is now defined as the baseline region. Hence, Africa does not appear as a predictor, because the remaining regions are compared with Africa.

What about model performance:

glance(lm2)
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic       p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>         <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.721         0.676 0.471      15.9 0.00000000604     6  -25.5  67.0  81.2
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Is model 2 better (\(R^2\)) than model 1?

round(glance(lm2)$adj.r.squared, 2) == round(glance(lm1)$adj.r.squared, 2)
## [1] TRUE

Both models are equal in their performance, which makes sense: they encode the same information, only parameterized differently.
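This equivalence is no coincidence: a set of hand-made \(0\)/\(1\) dummy columns and a single factor column span exactly the same model space. A quick self-contained check with the built-in mtcars data, treating cyl as a three-level category:

```r
# Dummy-column version: code cyl by hand as 0/1 indicators
# (cyl = 4 is the implicit baseline)
mtcars2 <- transform(mtcars,
                     cyl6 = as.numeric(cyl == 6),
                     cyl8 = as.numeric(cyl == 8))
m_dummy  <- lm(mpg ~ wt + cyl6 + cyl8, data = mtcars2)

# Factor version: let lm() build the dummies itself
m_factor <- lm(mpg ~ wt + factor(cyl), data = mtcars)

# Identical fitted values, hence identical R^2
all.equal(fitted(m_dummy), fitted(m_factor))
# [1] TRUE
```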

Viz

For visualization and many typical reports, it is necessary to have one column Region.

qplot(y = GDP,
      x = Region,
      data = data_long,
      geom  = "boxplot")

Let’s sort the boxplots and flip the axes:

qplot(y = Religiosity, 
      x = reorder(Region, Religiosity), 
      data = data_long,
      geom = "boxplot") +
  coord_flip() +
  labs(x = "Region")
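Note that qplot() is deprecated in current versions of ggplot2, so the same figure can be written with ggplot() directly. A sketch on a toy stand-in for data_long (made-up values):

```r
library(ggplot2)

# Toy stand-in for data_long (made-up values)
toy <- data.frame(
  Region      = c("Africa", "Africa", "Asia", "Asia", "Americas", "Americas"),
  Religiosity = c(2.4, 2.6, 1.2, 1.4, 1.7, 1.9)
)

# Same idiom as above: order regions by Religiosity, flip the axes
p <- ggplot(toy, aes(x = reorder(Region, Religiosity), y = Religiosity)) +
  geom_boxplot() +
  coord_flip() +
  labs(x = "Region")
p
```

With the real data, swap toy for data_long and the plot matches the qplot() version.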