Hands-on data exploration using R

Sebastian Sauer

last update: 2018-11-21

Setup

Overview

Setup
Tidyverse 101
Data diagrams 101
Case study

whoami

 system("whoami")

R enthusiast
Data analyst/scientist
Professor at FOM Hochschule

The lights are on

Upfront preparation

Please install the following software upfront:

R
RStudio

Starting RStudio will start R automatically.

Please also make sure:

Your OS is up to date
You have internet access during the course
You can reach the nearest power socket (you may want to bring a power cable)

You, after this workshop

Well, kinda off...

Learning goals

Understanding basic tidyverse goals
Applying tidyverse tools
Visualizing data
Basic modeling

Intended audience

Analysts who have used R somewhat, but who are new to the tidyverse
Practitioners in data analysis who want to learn modern techniques

A basic working knowledge of R is needed to thoroughly follow the workshop

You should know how to get R and RStudio running (and packages installed)
You should know some basics of R (such as object definition and indexing, using functions)
You should have some experience with coding (not strictly necessary, but helpful)

We'll use the following R packages

pckgs <- c("nycflights13", "mosaic",  "broom", "corrr", "lubridate", "viridis", 
           "GGally", "ggmap", "pacman", "sjmisc", "leaflet", "knitr",  "tidyverse", 
           "tidyimpute", "na.tools", "checkpoint")

Please install each missing package prior to the workshop from within R:

install.packages("nycflights13")
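
To install only what is still missing, something like the following sketch may help. It assumes the `pckgs` vector from the slide above; `find_missing()` is a hypothetical helper written for this example, not part of any package.

```r
# Packages used in the workshop (copied from the slide above).
pckgs <- c("nycflights13", "mosaic", "broom", "corrr", "lubridate", "viridis",
           "GGally", "ggmap", "pacman", "sjmisc", "leaflet", "knitr", "tidyverse",
           "tidyimpute", "na.tools", "checkpoint")

# Hypothetical helper: return the wanted packages that are not installed yet.
find_missing <- function(wanted, installed = rownames(installed.packages())) {
  setdiff(wanted, installed)
}

missing_pckgs <- find_missing(pckgs)
print(missing_pckgs)

# Uncomment to install everything that is missing (needs internet access):
# if (length(missing_pckgs) > 0) install.packages(missing_pckgs)
```

The install call is left commented out so that nothing is downloaded by accident.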

Load the packages each time you start RStudio

library(pacman)
p_load(pckgs, character.only = TRUE)

Tip: Use search() to see loaded packages and installed.packages() for installed packages.

Simpler: Check the "packages pane" in RStudio.

Data we'll use: `mtcars`

mtcars is a toy dataset built into R (no installation needed).
The data come from the 1974 Motor Trend US magazine and describe 32 automobiles.
Columns: e.g., horsepower, weight, fuel consumption

Load the dataset:

data(mtcars)

Get help:

?mtcars

Data we'll use: `flights`

flights is a dataset from R package nycflights13 (package must be installed).
Data come from flights leaving the NYC airports in 2013.
Columns: e.g., delay, air time, carrier name

Load the dataset:

data(flights, package = "nycflights13")

Get help:

?flights

Load the data each time you open RStudio (during this workshop).

RStudio running

The tidyverse

The data analysis (science) pipeline

Get the power of the uni tidyverse

But I love the old way ...

Nice data

Tidy data

More Details

Dataset `mtcars`

glimpse(mtcars)
#> Observations: 32
#> Variables: 11
#> $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19....
#> $ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, ...
#> $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 1...
#> $ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, ...
#> $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.9...
#> $ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3...
#> $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 2...
#> $ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, ...
#> $ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ...
#> $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, ...
#> $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, ...

Data wrangling

Two tidyverse principles

Knock-down principle

Pipe principle

Atoms of the knock-down principle

filter()
select()
mutate()
group_by()
...

Filtering rows with `filter()`

Extract rows that meet logical criteria.

Filter table mtcars such that only rows remain where cyl equals 6.

filter(mtcars, cyl == 6)
#>    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> 1 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#> 2 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
#> 3 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
#> 4 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
#> 5 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
#> 6 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
#> 7 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6

`filter()` - exercises

Filter the automatic cars.

Filter the automatic cars with more than 4 cylinders.

Filter cars with either very low consumption or the super-thirsty ones.

`filter()` - solutions to exercises

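data(mtcars)  # only if dataset is not yet loaded
filter(mtcars, am == 0)  # am: 0 = automatic, 1 = manual (see ?mtcars)
filter(mtcars, am == 0, cyl > 4)  # comma-separated conditions are combined with AND
filter(mtcars, mpg > 30 | mpg < 12)  # | means OR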
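
How `filter()` combines several conditions can be checked with a short sketch. It assumes that dplyr is installed; the object names are made up for this example.

```r
library(dplyr)

# Comma-separated conditions are combined with AND ...
auto_big <- filter(mtcars, am == 0, cyl > 4)
# ... which is the same as writing & explicitly.
auto_big2 <- filter(mtcars, am == 0 & cyl > 4)

# OR needs the | operator.
extremes <- filter(mtcars, mpg > 30 | mpg < 12)

nrow(auto_big)   # automatic cars (am == 0) with more than 4 cylinders
nrow(extremes)   # very economical or very thirsty cars
```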

Select columns with `select()`

Extract columns by name.

Select the columns cyl and hp. Discard the rest.

select(mtcars, cyl, hp)

#>                   cyl  hp
#> Mazda RX4           6 110
#> Mazda RX4 Wag       6 110
#> Datsun 710          4  93
#> Hornet 4 Drive      6 110
#> Hornet Sportabout   8 175
#> Valiant             6 105

`select()` - exercises

Select the first three columns.

Select the first and third column.

Select all columns containing the letter "c".

`select()` - solutions to exercises

select(mtcars, 1:3)
select(mtcars, 1, disp)
select(mtcars, contains("c"))  # literal match; use matches() for a regex
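
The selection helpers differ in how they read their argument. A small sketch, assuming dplyr is installed (the object names are illustrative):

```r
library(dplyr)

# contains() looks for a literal substring (case-insensitive by default).
with_c <- names(select(mtcars, contains("c")))

# matches() takes a regular expression: here, names starting with "c".
regex_c <- names(select(mtcars, matches("^c")))

# starts_with() expresses the same intent without a regex.
starts_c <- names(select(mtcars, starts_with("c")))

with_c
regex_c
starts_c
```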

Add or change a column with `mutate()`

Apply vectorized functions to columns to create new columns.

Define weight in kg for each car.

mtcars <- mutate(mtcars, 
                 weight_kg = wt * 1000 * 0.4536)  # wt is in 1000 lbs; 1 lb is about 0.4536 kg
head(select(mtcars, wt, weight_kg))
#>      wt weight_kg
#> 1 2.620  1188.432
#> 2 2.875  1304.100
#> 3 2.320  1052.352
#> 4 3.215  1458.324
#> 5 3.440  1560.384
#> 6 3.460  1569.456

`mutate()` - exercises

Compute a variable for consumption (liters per 100 km).

Compute two variables in one mutate-call.

`mutate()` - solutions to exercises

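# 1/mpg is gallons per mile; times 3.8 (liters per gallon), divided by 1.6 (km per mile), times 100
mtcars <- mutate(mtcars, consumption = (1/mpg) * 100 * 3.8 / 1.6)  # liters per 100 km
mtcars <- mutate(mtcars, 
                 consumption_g_per_m = (1/mpg),  # gallons per mile
                 consumption_l_per_100_k = consumption_g_per_m  * 3.8 / 1.6 * 100)  # liters per 100 km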
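
The factors 3.8 and 1.6 are rounded. As a sketch of how much the rounding matters, the conversion can be wrapped in a function with the exact definitions (1 US gallon = 3.785411784 liters, 1 mile = 1.609344 km). `mpg_to_l_per_100km()` is a hypothetical helper for this example only.

```r
# Hypothetical helper: convert miles per (US) gallon to liters per 100 km.
mpg_to_l_per_100km <- function(mpg) {
  liters_per_gallon <- 3.785411784
  km_per_mile <- 1.609344
  (1 / mpg) * liters_per_gallon / km_per_mile * 100
}

consumption_exact <- mpg_to_l_per_100km(mtcars$mpg)
consumption_approx <- (1 / mtcars$mpg) * 100 * 3.8 / 1.6  # rounded factors

# Relative difference between rounded and exact factors (roughly 1 %).
rel_diff <- max(abs(consumption_approx / consumption_exact - 1))
rel_diff
```

For exploratory work the rounded factors are close enough; the function merely makes the units explicit.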

Summarise a column with `summarise()`

Apply a function that summarises a column to a single value.

Summarise the values of hp by their mean.

summarise(mtcars, 
          mean_hp = mean(hp))
#>    mean_hp
#> 1 146.6875

`summarise()` - exercises

Compute the median of consumption.

Compute multiple statistics at once.

`summarise()` - solution to exercises

summarise(mtcars, median(consumption))
#>   median(consumption)
#> 1            12.36979
summarise(mtcars, 
          consumption_md = median(consumption),
          consumption_avg = mean(consumption)
          )
#>   consumption_md consumption_avg
#> 1       12.36979        12.87897

Group with `group_by()`

Create a "grouped" copy of a table. dplyr functions will then manipulate each group separately and combine the results.

Group cars by am (automatic vs. manual). Then summarise to mean in each group.

mtcars_grouped <- group_by(mtcars, am)
summarise(mtcars_grouped, mean_hp = mean(hp))
#> # A tibble: 2 x 2
#>      am mean_hp
#>   <dbl>   <dbl>
#> 1     0    160.
#> 2     1    127.
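
The "split, apply, combine" idea behind `group_by()` plus `summarise()` can be sketched in base R. This is only an illustration of the principle, not how dplyr is implemented.

```r
# Split: one vector of hp values per level of am.
hp_by_am <- split(mtcars$hp, mtcars$am)

# Apply: compute the mean within each group.
# Combine: sapply() collects the results into one named vector.
mean_hp <- sapply(hp_by_am, mean)
mean_hp
```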

`group_by()` - exercises

Compute the mean consumption, grouped by cylinder.

Compute the mean and standard deviation of consumption, grouped by cylinder and am.

`group_by()` - solution to exercises

mtcars_grouped <- group_by(mtcars, cyl)
summarise(mtcars_grouped, mean_hp = mean(consumption))  # note: mean_hp holds mean consumption, not horsepower
#> # A tibble: 3 x 2
#>     cyl mean_hp
#>   <dbl>   <dbl>
#> 1     4    9.14
#> 2     6   12.1 
#> 3     8   16.2
mtcars_grouped <- group_by(mtcars, cyl, am)
# mean_hp and sd_hp hold the mean and sd of consumption; the plots later on use these names
mtcars_summarized <- summarise(mtcars_grouped, 
                               mean_hp = mean(consumption),
                               sd_hp = sd(consumption))

Enter the pipe

Life without the pipe operator

summarise(
  raise_to_power(
    compute_differences(data, mean), 
    2
  ),
  mean
)

Life with the pipe operator

data %>%
  compute_differences(mean) %>% 
  raise_to_power(2) %>% 
  summarise(mean)
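
The two slides above use placeholder functions. A runnable comparison with mtcars might look like the sketch below (assuming dplyr is installed; the chosen steps are only an example):

```r
library(dplyr)

# Nested calls: read from the inside out.
nested <- summarise(group_by(filter(mtcars, cyl > 4), am), mean_hp = mean(hp))

# Piped calls: read from top to bottom; x %>% f(y) is the same as f(x, y).
piped <- mtcars %>%
  filter(cyl > 4) %>%
  group_by(am) %>%
  summarise(mean_hp = mean(hp))

all.equal(nested, piped)
```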

Life with the pipe operator II

Advanced.

data <- mtcars$hp
data %>% 
  `-`(mean(data)) %>% 
  `^`(2) %>% 
  mean()
#> [1] 4553.965
var(mtcars$hp) * (length(data)-1) / length(data)
#> [1] 4553.965

Data diagrams

Why we need diagrams

Anatomy of a diagram

First plot with `ggplot`

mtcars %>% 
  ggplot() +  # initialize plot
  aes(x = hp, y = mpg) +  # define axes etc.
  geom_point() +  # draw points 
  geom_smooth()  # draw smoothing line

Notice the + in contrast to the pipe %>%.
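
One way to see the difference: the pipe passes data into a function, whereas `+` adds a layer onto an existing plot object. A minimal sketch, assuming ggplot2 is installed (object names are illustrative):

```r
library(ggplot2)

# Build a plot object with a single layer.
p_base <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point()

# A stored plot can be extended later with further layers.
p_lm <- p_base + geom_smooth(method = "lm")

length(p_base$layers)  # number of layers in the first plot
length(p_lm$layers)    # one more layer after adding the smoother
```

Printing `p_base` or `p_lm` draws the respective plot.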

Groups and colors

mtcars %>% 
  ggplot(aes(x = hp, y = mpg, color = am)) +
  geom_point() +
  geom_smooth() +
  scale_color_viridis_c() +  # ships with ggplot2 (>= 3.0.0); package "viridis" offers scale_color_viridis()
  theme_bw()

Diagrams - exercises

Plot the mean consumption for each cylinder group, split by am (dataset mtcars).

Now add a visualization for the variation in the data.

Diagrams - solutions to exercises 1

mtcars_summarized %>% 
  ggplot() +
  aes(x = cyl, y = mean_hp, color = factor(am), 
      shape = factor(am)) +
  geom_point(size = 5)

Diagrams - solutions to exercises 2

mtcars_summarized %>% 
  ggplot(aes(x = cyl, color = factor(am), shape = factor(am))) +
  geom_errorbar(aes(ymin = mean_hp - sd_hp, ymax = mean_hp + sd_hp), width = .2, 
                position =  position_dodge(width=0.9)) + 
  geom_point(aes(y = mean_hp), size = 5, position =  position_dodge(width=0.9))

Case study: Why are flights delayed?

Know thy data

Don't forget to load it from the package via:

data(flights, package = "nycflights13")

A look at the help page:

?flights

Glimpse data

glimpse(flights)
#> Observations: 336,776
#> Variables: 19
#> $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
#> $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
#> $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
#> $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 55...
#> $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 60...
#> $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2...
#> $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 7...
#> $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 7...
#> $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -...
#> $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV",...
#> $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79...
#> $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN...
#> $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR"...
#> $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL"...
#> $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138...
#> $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 94...
#> $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5,...
#> $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, ...
#> $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013...

Data sanity - quantitative variables

flights %>% descr()
#> 
#> ## Basic descriptive statistics
#> 
#>             var    type          label      n NA.prc    mean      sd   se
#>            year integer           year 336776   0.00 2013.00    0.00 0.00
#>           month integer          month 336776   0.00    6.55    3.41 0.01
#>             day integer            day 336776   0.00   15.71    8.77 0.02
#>        dep_time integer       dep_time 328521   2.45 1349.11  488.28 0.85
#>  sched_dep_time integer sched_dep_time 336776   0.00 1344.25  467.34 0.81
#>       dep_delay numeric      dep_delay 328521   2.45   12.64   40.21 0.07
#>        arr_time integer       arr_time 328063   2.59 1502.05  533.26 0.93
#>  sched_arr_time integer sched_arr_time 336776   0.00 1536.38  497.46 0.86
#>       arr_delay numeric      arr_delay 327346   2.80    6.90   44.63 0.08
#>          flight integer         flight 336776   0.00 1971.92 1632.47 2.81
#>        air_time numeric       air_time 327346   2.80  150.69   93.69 0.16
#>        distance numeric       distance 336776   0.00 1039.91  733.23 1.26
#>            hour numeric           hour 336776   0.00   13.18    4.66 0.01
#>          minute numeric         minute 336776   0.00   26.23   19.30 0.03
#>    md trimmed           range  skew
#>  2013 2013.00   0 (2013-2013)   NaN
#>     7    6.56       11 (1-12) -0.01
#>    16   15.70       30 (1-31)  0.01
#>  1401 1346.82   2399 (1-2400) -0.02
#>  1359 1341.60 2253 (106-2359) -0.01
#>    -2    3.32 1344 (-43-1301)  4.80
#>  1535 1526.42   2399 (1-2400) -0.47
#>  1556 1550.67   2358 (1-2359) -0.35
#>    -5   -1.03 1358 (-86-1272)  3.72
#>  1496 1830.51   8499 (1-8500)  0.66
#>   129  140.03    675 (20-695)  1.07
#>   872  955.27  4966 (17-4983)  1.13
#>    13   13.15       22 (1-23)  0.00
#>    29   25.64       59 (0-59)  0.09

Data sanity - qualitative variables

flights %>%
  select_if(is.character) %>% 
  inspect()
#> 
#> categorical variables:  
#>      name     class levels      n missing
#> 1 carrier character     16 336776       0
#> 2 tailnum character   4043 334264    2512
#> 3  origin character      3 336776       0
#> 4    dest character    105 336776       0
#>                                    distribution
#> 1 UA (17.4%), B6 (16.2%), EV (16.1%) ...       
#> 2 N725MQ (0.2%), N722MQ (0.2%) ...             
#> 3 EWR (35.9%), JFK (33%), LGA (31.1%)          
#> 4 ORD (5.1%), ATL (5.1%), LAX (4.8%) ...

Distribution - quantitative variables

flights %>% 
  ggplot(aes(x = dep_delay)) +
  geom_histogram() +
  scale_x_continuous(limits = c(-10, 60))

Note the long right tail ("anomaly")

Deal with missing data - omit

flights_nona <- flights %>% 
  drop_na()
nrow(flights_nona) / nrow(flights)
#> [1] 0.9719992

Deal with missing data - replace by some value

flights_nona2 <- flights %>% 
  mutate(dep_delay = ifelse(is.na(dep_delay),
                            mean(dep_delay, na.rm = TRUE), dep_delay))
flights_nona2 %>% 
  summarise(sum(is.na(dep_delay)))
#> # A tibble: 1 x 1
#>   `sum(is.na(dep_delay))`
#>                     <int>
#> 1                       0
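
What mean imputation does to a variable is easiest to see on a tiny vector. The values below are made up for illustration; only base R is used.

```r
x <- c(1, NA, 3)

# Replace each missing value by the mean of the observed values.
x_imputed <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)
x_imputed

# The mean is unchanged by construction ...
mean(x, na.rm = TRUE) == mean(x_imputed)
# ... but the spread shrinks, because the imputed values sit exactly at the mean.
sd(x, na.rm = TRUE) > sd(x_imputed)
```

This shrinkage of the standard deviation is one reason to treat mean imputation with some care.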

Deal with missing data - tidy approach

flights_nona3 <- flights %>% 
  impute_all(na.mean)
flights_nona3 %>% 
  purrr::map(~sum(is.na(.)))

Hint: Use package::function() to disambiguate.

Count missings per row

Advanced.

flights %>% 
  mutate(NA_row = rowSums(is.na(.))) %>% 
  ggplot(aes(x = NA_row)) + geom_histogram()

Descriptive statistics for delay

flights %>% 
  drop_na() %>% 
  summarise(mean(dep_delay), median(dep_delay),
            sd(dep_delay), iqr(dep_delay))
#> # A tibble: 1 x 4
#>   `mean(dep_delay)` `median(dep_delay)` `sd(dep_delay)` `iqr(dep_delay)`
#>               <dbl>               <dbl>           <dbl>            <dbl>
#> 1              12.6                  -2            40.1               16

Descriptive statistics by origin

flights %>% 
  drop_na() %>% 
  group_by(origin) %>% 
  summarise(mean(dep_delay), median(dep_delay),
            sd(dep_delay), iqr(dep_delay))
#> # A tibble: 3 x 5
#>   origin `mean(dep_delay… `median(dep_de… `sd(dep_delay)` `iqr(dep_delay)`
#>   <chr>             <dbl>           <dbl>           <dbl>            <dbl>
#> 1 EWR                15.0              -1            41.2               19
#> 2 JFK                12.0              -1            38.8               15
#> 3 LGA                10.3              -3            39.9               13

Start modeling

Delay as a function of origin?

delay = f(origin)

More R-ish:

dep_delay ~ origin

Linear models

lm(dep_delay ~ origin, data = drop_na(flights)) %>% tidy()
#> # A tibble: 3 x 5
#>   term        estimate std.error statistic   p.value
#>   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
#> 1 (Intercept)    15.0      0.117     128.  0.       
#> 2 originJFK      -2.99     0.168     -17.7 2.65e- 70
#> 3 originLGA      -4.72     0.172     -27.5 3.33e-166

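Same as above, stated differently: the intercept is the mean delay at EWR, and the other two coefficients are the differences of JFK and LGA from it.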
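
That a linear model with one categorical predictor reproduces the group means can be verified on the built-in mtcars data. This sketch uses hp and am instead of the flights data, purely to keep it small and free of extra packages.

```r
# Model hp by transmission type; am == 0 is the reference group.
fit <- lm(hp ~ factor(am), data = mtcars)
group_means <- tapply(mtcars$hp, mtcars$am, mean)

# Intercept = mean of the reference group (am == 0).
intercept <- unname(coef(fit)[1])
# Intercept + slope = mean of the other group (am == 1).
other_group <- unname(coef(fit)[1] + coef(fit)[2])

c(intercept, other_group)
group_means
```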

Does `distance` predict `dep_delay`?

flights %>% 
  ggplot(aes(x = distance, y = dep_delay)) +
  geom_point(alpha = .1) +
  geom_lm() +
  coord_cartesian(ylim = c(-10, 60))  # zoom in without dropping data

Does `distance` predict `dep_delay`?

Alternative visualization (binned data, code)

flights %>% 
  mutate(distance_bins = case_when(
    distance < 250 ~ 250,
    distance < 500 ~ 500,
    distance < 1000 ~ 1000,
    distance < 2000 ~ 2000,
    distance < 3000 ~ 3000,
    TRUE ~ 5000 )) %>% 
  ggplot(aes (y = dep_delay)) +
  geom_boxplot(aes(x = distance_bins, 
  group = distance_bins)) +
  geom_smooth(aes(x = distance)) +
  coord_cartesian(ylim = c(-10, 60))

Use case_when() for binning and recoding of data values.
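
`case_when()` checks its conditions from top to bottom and uses the first one that is TRUE, which is why the bins above can be written with `<` only. A small sketch with made-up distances (assuming dplyr is installed):

```r
library(dplyr)

distance <- c(100, 300, 2500, 4000)  # illustrative values

distance_bins <- case_when(
  distance < 250  ~ 250,   # 100 matches here first, although it is also < 500
  distance < 500  ~ 500,
  distance < 3000 ~ 3000,
  TRUE            ~ 5000   # fallback for everything else
)
distance_bins
```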

Alternative visualization (binned data)

Correlation of distance and delay

flights %>% 
  select(distance, dep_delay, origin) %>%
  group_by(origin) %>% 
  drop_na() %>% 
  summarise(cor_delay_dist = cor(dep_delay, distance))
#> # A tibble: 3 x 2
#>   origin cor_delay_dist
#>   <chr>           <dbl>
#> 1 EWR           -0.0361
#> 2 JFK           -0.0398
#> 3 LGA            0.0114

Delay as a function of distance

lm(dep_delay ~ I(distance/1000) + origin, data = flights) %>% 
  tidy()
#> # A tibble: 4 x 5
#>   term             estimate std.error statistic   p.value
#>   <chr>               <dbl>     <dbl>     <dbl>     <dbl>
#> 1 (Intercept)         16.8     0.157      107.  0.       
#> 2 I(distance/1000)    -1.60    0.0988     -16.2 9.43e- 59
#> 3 originJFK           -2.66    0.170      -15.7 3.09e- 55
#> 4 originLGA           -5.21    0.174      -29.9 3.80e-196

Delay per month (code)

p1 <- flights %>% 
  group_by(origin, month, day) %>% 
  summarise(dep_delay_avg_day = mean(dep_delay, na.rm = TRUE)) %>% 
  ungroup() %>% 
  mutate(dep_dt = make_date(2013, month, day)) %>% 
  ggplot(aes(x = dep_dt, y = dep_delay_avg_day, shape = origin, color = origin)) +
  geom_point(alpha = .3) +
  geom_smooth() +
  scale_color_viridis_d()
p1

Delay per month (output)

Delay per month - boxplot (code)

flights %>% 
  ggplot(aes(x = month, y = dep_delay)) +
  geom_boxplot(aes(group = month)) +
  geom_smooth() +
  coord_cartesian(ylim = c(-10, 60)) +
  scale_x_continuous(breaks = 1:12)

Delay per month - boxplot (output)

Is it the weekends? (code)

flights <- flights %>%
  mutate(dow = wday(time_hour),  # lubridate default: 1 = Sunday, ..., 7 = Saturday
         weekend = dow %in% c(1, 7))  # Sunday or Saturday
delay_dow <- 
  flights %>% 
  group_by(dow) %>% 
  drop_na() %>% 
  summarise(delay_m = mean(dep_delay),
            delay_md = median(dep_delay),
            q_05 = quantile(x = dep_delay, probs = .05),
            q_95 = quantile(x = dep_delay, probs = .95))
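
Which number `wday()` assigns to which day depends on where the week starts. A quick check with two known dates (5 January 2013 was a Saturday), assuming lubridate is installed and its default week start has not been changed via options:

```r
library(lubridate)

saturday <- as.Date("2013-01-05")
sunday <- as.Date("2013-01-06")

wday(saturday)                  # default: week starts on Sunday
wday(sunday)
wday(saturday, week_start = 1)  # week starts on Monday instead
wday(saturday, label = TRUE)    # labelled factor, easier to read
```

Using `label = TRUE` while exploring helps to avoid off-by-one mistakes when coding weekends.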

Is it the weekends? (data processed)

delay_dow %>% kable(format='html')

dow	delay_m	delay_md	q_05	q_95
1	11.477476	-2	-9	80
2	14.718728	-1	-9	101
3	10.588355	-2	-9	80
4	11.643321	-2	-9	84
5	16.043451	-1	-9	101
6	14.653974	-1	-8	94
7	7.594406	-2	-9	60

Is it the weekends? (code)

flights %>% 
  ggplot(aes(x = dow)) +
  geom_boxplot(aes(group = dow, y = dep_delay, color = weekend)) +
  geom_point(data = delay_dow, aes(y = delay_m), color = "red", 
             size = 5) + 
  coord_cartesian(ylim = c(-10, 100)) +
  scale_x_continuous(breaks = 1:7) +
  geom_hline(yintercept = mean(flights$dep_delay, na.rm = TRUE), 
             linetype = "dashed") +
  geom_hline(yintercept = median(flights$dep_delay, na.rm = TRUE), 
             linetype = "dashed") +
  geom_errorbar(aes(ymin = q_05, ymax = q_95), data = delay_dow)

Is it the weekends? (output)

Delay per time of the day (code)

Let's check whether delays add up during the day, a popular opinion among travellers.

flights %>% 
  select(dep_delay, hour) %>% 
  ggplot(aes(x = hour, y = dep_delay)) +
  geom_boxplot(aes(group = hour)) +
  geom_smooth(method = "lm") +
  coord_cartesian(ylim = c(-10, 60))  +
  scale_x_continuous(breaks = 1:24)

Delay per time of the day (output)

Delay as function of month, hour, origin, and weekday

lm_hour <- lm(dep_delay ~ hour + month + origin + I(dow == 7), 
              data = flights)
rsquared(lm_hour)
#> [1] 0.04408207
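
What R squared measures can be sketched by computing it by hand as 1 minus the ratio of residual to total sum of squares. The example uses mtcars and base R only, so it runs without the flights data; the model itself is arbitrary.

```r
fit <- lm(mpg ~ hp + wt, data = mtcars)

ss_res <- sum(residuals(fit)^2)                    # variation left unexplained
ss_tot <- sum((mtcars$mpg - mean(mtcars$mpg))^2)   # total variation
r2_by_hand <- 1 - ss_res / ss_tot

all.equal(r2_by_hand, summary(fit)$r.squared)
```

Read this way, the value of about 0.04 above means that the predictors account for roughly 4 % of the variation in departure delay.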

Geoplotting

Advanced.

Join airport data

data("airports")
flights_airports <-  # join destination long/lat
  flights %>% 
  left_join(airports, by = c("dest" = "faa")) %>%  
  rename(long = lon)
origin_latlong <-
  airports %>% 
  filter(faa %in% c("LGA", "JFK", "EWR")) %>% 
  rename(lat_origin = lat,
         long_origin = lon)
flights_airports <-  # join origin long/lat
  flights_airports %>%
  left_join(origin_latlong, by = c("origin" = "faa"))
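
How `left_join()` treats differently named key columns and unmatched rows can be seen on two tiny tables. The data below are made up for illustration ("XXX" is a deliberately unknown code); dplyr is assumed to be installed.

```r
library(dplyr)

flights_small <- data.frame(dest = c("ATL", "ORD", "XXX"),
                            stringsAsFactors = FALSE)
airports_small <- data.frame(faa = c("ATL", "ORD"),
                             lat = c(33.6, 42.0),  # illustrative values
                             stringsAsFactors = FALSE)

# by = c("dest" = "faa"): match column dest (left) with column faa (right).
joined <- left_join(flights_small, airports_small, by = c("dest" = "faa"))
joined
```

All rows of the left table are kept; the row without a match gets NA in the joined columns.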

Dataframe for plotting (code)

flights_airports_sum <- flights_airports %>% 
  group_by(dest, origin) %>% 
  summarise(n = n(),
            long = max(long),
            lat = max(lat),
            long_origin = max(long_origin),
            lat_origin = max(lat_origin))

Dataframe for plotting (output)

head(flights_airports_sum)
#> # A tibble: 6 x 7
#> # Groups:   dest [5]
#>   dest  origin     n   long   lat long_origin lat_origin
#>   <chr> <chr>  <int>  <dbl> <dbl>       <dbl>      <dbl>
#> 1 ABQ   JFK      254 -107.   35.0       -73.8       40.6
#> 2 ACK   JFK      265  -70.1  41.3       -73.8       40.6
#> 3 ALB   EWR      439  -73.8  42.7       -74.2       40.7
#> 4 ANC   EWR        8 -150.   61.2       -74.2       40.7
#> 5 ATL   EWR     5022  -84.4  33.6       -74.2       40.7
#> 6 ATL   JFK     1930  -84.4  33.6       -73.8       40.6

Geo plot flights (code)

ggplot(data = map_data("usa")) +
  aes(x = long, y = lat, group = group) +
  geom_path(color = "grey40", size = .1) +
  geom_point(data = flights_airports_sum, 
             aes(size = n, color = n, group = NULL), alpha = .2) +
  geom_segment(data = flights_airports_sum, 
               aes(color = n, group = NULL,
                   x = long_origin, y = lat_origin,
                   xend = long, yend = lat), alpha = .5) +
  geom_text(data = flights_airports_sum %>% filter(n > 6000), 
            aes(x = long, y = lat, label = dest, group = NULL),
            color = "grey40") +  
  theme_map() +
  xlim(-130, -70) + ylim(+20, +50) +
  scale_color_viridis()

Geo plot flights (output)

Outlook

Advanced.

Map columns to function with `map()`

data(mtcars)
purrr::map(select(mtcars, 1:2), ~ {ggplot(mtcars, aes(x = .)) +
    geom_histogram()})
#> $mpg
#> 
#> $cyl

Map TWO columns to function with `map2()`

flights %>% 
  select_if(~!is.numeric(.)) %>% 
  map2(., names(.), ~ 
         {ggplot(data = flights, aes(x = .x)) + 
             geom_bar() + labs(x = .y, title = .y)})

Take the flights dataset,
select all non-numeric columns,
map a ggplot call to each such that
a barplot is plotted where
the axis label and title of each barplot are the corresponding column name.
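
The mechanics of `map2()` are easier to follow without plots: it walks over two inputs in parallel, with `.x` taken from the first and `.y` from the second. A minimal sketch with a made-up list, assuming purrr is installed; `map2_chr()` is the variant that returns a character vector.

```r
library(purrr)

values <- list(a = 1:3, b = 4:6)  # illustrative data

# .x is an element of values, .y is the matching name.
labels <- map2_chr(values, names(values),
                   ~ paste0(.y, ": sum = ", sum(.x)))
labels
```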

Reshape (transform) dataframe

Transform dataframe for plotting

mtcars %>% 
  select_if(is.numeric) %>% 
  gather(key = variable, value = value) %>% 
  ggplot(aes(x = value)) +
  geom_density() +
  facet_wrap(~ variable, ncol = 3, scales = "free")
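
In tidyr 1.0.0 and later, `pivot_longer()` supersedes `gather()`. The reshaping step above could be sketched as follows (assuming a recent tidyr is installed; the column names `variable` and `value` are kept from the slide):

```r
library(tidyr)

# One row per car and variable: 32 cars x 11 columns = 352 rows.
mtcars_long <- pivot_longer(mtcars,
                            cols = everything(),
                            names_to = "variable",
                            values_to = "value")
dim(mtcars_long)
head(mtcars_long, 3)
```

The resulting long table can be passed to the same ggplot call as above.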

`ggplot` niceties: Themes

library(hrbrthemes)
p2 <- p1 + theme_ipsum() + theme(legend.position = "bottom")
p2

You may need to install fonts upfront; see ?hrbrthemes.

`ggplot` niceties: Combining plots

library(patchwork)
p2 + p2

Resources

Modern Dive

ModernDive: An Introduction to Statistical and Data Sciences via R, by Chester Ismay and Albert Y. Kim

R for Data Science

Moderne Datenanalyse mit R

Cheatsheets

https://www.rstudio.com/resources/cheatsheets/

Disclaimer: There may be issues at times

StackOverflow is your friend

Wrap-up

That was quick, but it was a start

Thank you

Sebastian Sauer

sebastiansauer

https://data-se.netlify.com/

sebastian.sauer@fom.de

Licence: MIT

Credit to

Built using R, RMarkdown, Xaringan. Thanks to the R community and the tidyverse developers. Thanks to Yihui Xie and Antoine Bichat, among others, for Xaringan inspiration. Thanks to FOM Hochschule for supporting me.

Images:

Icons from FontAwesome

SessionInfo

#> [1] "R version 3.5.1 (2018-07-02)"

See si.RData for package version (same folder as this presentation).

This document is made reproducible using checkpoint with day set to 2018-09-30.
