1 Context

This post refers to this case study.

2 Setup

library(tidyverse)  # data wrangling

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.4     ✓ purrr   0.3.4
## ✓ tibble  3.1.2     ✓ dplyr   1.0.6
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(moderndive)  # data
library(modelr)  # for "add_predictions()"
data("house_prices")

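3 LogY-LogX model

In a LogY-LogX model, the slope k is read as follows: if X increases by 1%, Y changes by approximately k%.

In a LogY-X model, by contrast: if X increases by 1 unit, Y changes by a fixed percentage (approximately 100·k% when the natural logarithm is used and k is small).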
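
Why the percentage reading holds can be sketched in a few lines. The symbols \(a\) (intercept) and \(k\) (slope) are notation assumed for this sketch, and \(\lg\) is the base-10 logarithm. For the LogY-LogX model:

\[
\lg Y = a + k \,\lg X \quad\Longrightarrow\quad Y = 10^{a} X^{k}
\]

\[
\frac{10^{a}\,(1.01\,X)^{k}}{10^{a}\,X^{k}} = 1.01^{k} \approx 1 + \frac{k}{100}
\]

So a 1% increase in X changes Y by approximately k%, regardless of the base of the logarithm. For the LogY-X model:

\[
\lg Y = a + k X \quad\Longrightarrow\quad \frac{10^{a + k(X+1)}}{10^{a + kX}} = 10^{k}
\]

Here a one-unit increase in X multiplies Y by the constant factor \(10^{k}\). With the natural logarithm the factor is \(e^{k} \approx 1 + k\) for small k.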

house_prices <- house_prices %>%
  mutate(
    log10_price = log10(price),
    log10_size = log10(sqft_living)
    )

Here the base of the logarithm is 10, so that \(\lg(100) = 2\), \(\lg(1000) = 3\), and so on.
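
A quick way to see this in R is a round trip through the transformation. This is a minimal sketch with made-up input values, not part of the case study:

```r
x <- c(100, 1000, 500000)  # arbitrary example values

lg_x <- log10(x)           # 2, 3, and about 5.7
x_back <- 10^lg_x          # undo the transformation with the inverse function

all.equal(x_back, x)       # TRUE up to floating-point tolerance
```

The same round trip is what the back-transformation of the predictions relies on.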

4 Model 1

We refer to this model.

# Fit regression model:
price_interaction <- lm(log10_price ~ log10_size * condition, 
                        data = house_prices)

# Get regression table:
get_regression_table(price_interaction)

## # A tibble: 10 x 7
##    term                  estimate std_error statistic p_value lower_ci upper_ci
##    <chr>                    <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
##  1 intercept                3.33      0.451     7.38    0        2.45     4.22 
##  2 log10_size               0.69      0.148     4.65    0        0.399    0.98 
##  3 condition2               0.047     0.498     0.094   0.925   -0.93     1.02 
##  4 condition3              -0.367     0.452    -0.812   0.417   -1.25     0.519
##  5 condition4              -0.398     0.453    -0.879   0.38    -1.29     0.49 
##  6 condition5              -0.883     0.457    -1.93    0.053   -1.78     0.013
##  7 log10_size:condition2   -0.024     0.163    -0.148   0.882   -0.344    0.295
##  8 log10_size:condition3    0.133     0.148     0.893   0.372   -0.158    0.424
##  9 log10_size:condition4    0.146     0.149     0.979   0.328   -0.146    0.437
## 10 log10_size:condition5    0.31      0.15      2.07    0.039    0.016    0.604

5 Prediction for the example from the case study

preds1 <- predict(price_interaction,
                  newdata = data.frame(
                    condition = factor("5"),  # nominal scale
                    log10_size = 3.28))
preds1  # 5.73

##        1 
## 5.725459

This is the value of the house on the log scale. Since 5.73 lies between 5 and 6, the price lies between 10^5 = 100,000 and 10^6 = 1,000,000 dollars, roughly around 500,000. To get the raw value we have to undo the logarithm, that is, apply its inverse function. That is the exponential function, here to base 10:

10^5.73  # back-transform: undo the log10

## [1] 537031.8

According to the model, the property is worth a little over 500 thousand dollars.
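
The prediction can also be reproduced by hand from the regression table, which shows how the interaction terms enter. The sketch below uses the rounded coefficients from the table, so it only approximately matches the output of `predict()`. The variable names are chosen for illustration.

```r
# Rounded coefficients copied from the regression table
intercept_c5 <- 3.33 + (-0.883)  # baseline intercept + condition5 offset
slope_c5     <- 0.69 + 0.31      # baseline slope + log10_size:condition5

log10_size <- 3.28               # same input as in the predict() call

pred_log10_by_hand <- intercept_c5 + slope_c5 * log10_size
pred_log10_by_hand               # close to 5.73; the gap comes from rounding

10^pred_log10_by_hand            # back on the dollar scale
```

Because the exponential function magnifies small differences, it is safer to exponentiate the unrounded prediction directly, for example with `10^preds1`, than a value rounded to two decimals.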

6 Predictions as in the prediction contest

house_prices <-
  house_prices %>% 
  add_predictions(price_interaction,
                  var = "pred_log10") %>% 
  mutate(pred = 10^pred_log10)  # since the log base is 10

If the base of the logarithm were e instead of 10, the code would be:

house_prices <-
  house_prices %>% 
  add_predictions(price_interaction,  # would have to be fitted on log(price), base e
                  var = "pred_ln") %>% 
  mutate(pred_e = exp(pred_ln))  # since the log base is e
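
For the natural logarithm, `exp()` is the exact inverse of `log()`. A truncated constant such as 2.71 in place of e introduces a systematic error. The following sketch, with a few example prices, illustrates the difference:

```r
price <- c(221900, 538000, 1225000)  # example prices in dollars

# Exact inverses: each back-transformation recovers the original prices
back_10 <- 10^log10(price)
back_e  <- exp(log(price))

# Using the truncated constant 2.71 instead of e biases the result
back_271 <- 2.71^log(price)

rel_error_271 <- (back_271 - price) / price
round(rel_error_271, 3)              # a few percent too low
```

The error grows with the size of the logged value, so it is largest for the most expensive houses.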

7 Check

A look at the data:

house_prices %>% 
  select(price, pred_log10, pred) %>% 
  slice(1:10)

## # A tibble: 10 x 3
##      price pred_log10     pred
##      <dbl>      <dbl>    <dbl>
##  1  221900       5.49  308270.
##  2  538000       5.77  584591.
##  3  180000       5.34  217028.
##  4  604000       5.74  546648.
##  5  510000       5.62  412162.
##  6 1225000       6.03 1079632.
##  7  257500       5.62  419208.
##  8  291850       5.45  282254.
##  9  229500       5.64  432227.
## 10  323000       5.66  454069.

Our model appears to make plausible predictions.
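
One way to make "plausible" more concrete is to look at the relative prediction error on the dollar scale. The sketch below copies the ten rows printed above into plain vectors. This is an illustrative check on those ten houses only, not an evaluation of the model as a whole.

```r
# Observed prices and back-transformed predictions from the ten rows above
price <- c(221900, 538000, 180000, 604000, 510000,
           1225000, 257500, 291850, 229500, 323000)
pred  <- c(308270, 584591, 217028, 546648, 412162,
           1079632, 419208, 282254, 432227, 454069)

rel_error <- (pred - price) / price  # positive: model overestimates

round(rel_error, 2)                  # relative error per house
median(abs(rel_error))               # typical size of the relative error
```

On the full data set the same computation could be done inside `mutate()` and `summarise()`.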

Back-transforming log-transformed y values