Case study: data visualization on flight delays using tidyverse tools

1 Load packages

library(tidyverse)  # data wrangling

2 Load data
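
The column names used in the solutions (dep_delay, origin, carrier, time_hour) and the carrier means shown in section 4.5 match the flights table shipped with the nycflights13 package. Assuming that this is the data set meant here, it could be loaded roughly like this:

```r
# Assumption: the data are the `flights` table from the nycflights13 package
# install.packages("nycflights13")  # run once if the package is not installed
library(nycflights13)

data(flights)   # make the flights tibble available in the workspace
dim(flights)    # number of rows and columns
```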



3 Exercises/questions

See here

4 Solutions

4.1 Plot the distribution of the delays. Describe your insights.

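flights %>%
  ggplot() +
  aes(x = dep_delay) +
  geom_histogram()  # assumed geom: histogram of departure delays

flights %>%
  ggplot() +
  aes(x = dep_delay) +
  geom_density()  # assumed geom: density curve of the same variable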

The distribution is skewed to the right. Most flights leave close to schedule, but some are extremely late compared to the majority.
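
The skew can also be checked numerically: in a right-skewed distribution the mean lies above the median, and the upper quantiles sit far away from the bulk of the data. A minimal sketch, assuming that flights comes from the nycflights13 package; the summary column names (mean_delay and so on) are made up for illustration:

```r
library(dplyr)
library(nycflights13)  # assumption: source of the flights data

# Compare location measures and upper tail of the departure delay
delay_summary <- flights %>%
  summarise(
    mean_delay   = mean(dep_delay, na.rm = TRUE),
    median_delay = median(dep_delay, na.rm = TRUE),
    q99_delay    = quantile(dep_delay, 0.99, na.rm = TRUE),
    max_delay    = max(dep_delay, na.rm = TRUE)
  )

delay_summary
```

A mean well above the median, together with a maximum far beyond the 99th percentile, would support what the plots suggest.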

4.2 Plot the distribution of the delays per origin airport.

flights %>% 
  ggplot() +
  aes(x = dep_delay) +
  geom_density() +
  facet_wrap(~ origin)

4.3 Visualize the association of delay and time of the day. Find a way to reduce overplotting.

Hint: Try out geom_bin2d() or geom_density2d() instead of using a scatter plot.

flights %>% 
  ggplot() +
  aes(x = dep_time, y = dep_delay) +
  geom_density2d()  # assumed geom: contour lines instead of single points

Adding a “smoother” line:

flights %>% 
  ggplot() +
  aes(x = dep_time, dep_delay) +
  geom_density2d() +
  geom_smooth()  # smoothing line


flights %>% 
  ggplot() +
  aes(x = dep_time, dep_delay) +
  geom_bin2d() +
  geom_smooth(method = "lm")  # straight regression line instead of the default smoother
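
One detail affects how the x-axis reads: in nycflights13 (assumed here to be the data source), dep_time is stored as a clock time in HHMM format, so 559 is followed by 600 and the values 560 to 599 never occur. A small helper, hypothetical and not part of any package, can convert such values to decimal hours:

```r
# Hypothetical helper: convert a clock time coded as HHMM (e.g. 517 for 5:17)
# into decimal hours (e.g. about 5.28)
to_decimal_hour <- function(hhmm) {
  hours   <- hhmm %/% 100   # integer division keeps the hour part
  minutes <- hhmm %% 100    # the remainder is the minute part
  hours + minutes / 60
}

example_times <- c(517, 1230, 2359)
to_decimal_hour(example_times)
```

In the pipelines above, adding mutate(dep_hour = to_decimal_hour(dep_time)) before ggplot() and mapping x = dep_hour would put the time of day on a continuous scale.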

4.4 Visualize the association of delay and distance to destination. Separate by origin and month.

flights %>% 
  ggplot() +
  aes(x = distance, dep_delay) +
  geom_density2d() +
  facet_grid(origin ~ month)

4.5 Visualize the association of delay and time of the day. Only include the three airlines where the delay is highest.

Reduce overplotting.

flights %>% 
  group_by(carrier) %>% 
  summarise(dep_delay_carrier = mean(dep_delay, na.rm = TRUE)) %>% 
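  arrange(-dep_delay_carrier) %>%
  slice(1:3)  # keep the three carriers with the highest mean delay
#> # A tibble: 3 x 2
#>   carrier dep_delay_carrier
#>   <chr>               <dbl>
#> 1 F9                   20.2
#> 2 EV                   20.0
#> 3 YV                   19.0

flights %>%
  filter(carrier %in% c("F9", "EV", "YV")) %>%
  ggplot() +
  aes(x = dep_time, y = dep_delay) +
  geom_density2d()  # assumed geom: contour lines to reduce overplotting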

4.6 Visualize the proportion of delayed flights per origin.

flights %>% 
  mutate(is_delayed = dep_delay > 0) %>% 
  group_by(origin) %>% 
  summarise(delay_n = sum(is_delayed == TRUE, na.rm = TRUE),
            delay_prop = delay_n / n()) %>% 
  ggplot() +
  aes(x = origin, y = delay_prop) +
  geom_col()  # assumed geom: one bar per origin, height = proportion


flights %>% 
  mutate(is_delayed = dep_delay > 0) %>% 
  group_by(origin) %>% 
  ggplot() +
  aes(x = origin, fill = is_delayed) +
  geom_bar()  # assumed geom: stacked counts per origin

Or even this way:

flights %>% 
  mutate(is_delayed = dep_delay > 0) %>% 
  group_by(origin) %>% 
  ggplot() +
  aes(x = origin, fill = is_delayed) +
  geom_bar(position = "fill")
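
Flights without a recorded departure delay (NA) are treated differently by the approaches above: dividing by n() keeps them in the denominator, while the filled bar chart shows them as a category of their own. The following sketch uses made-up toy data (not the flights table) to show how the two ways of computing the proportion can differ:

```r
library(dplyr)

# Toy data standing in for flights; NA marks a flight without a recorded delay
toy_flights <- tibble(
  origin    = c("EWR", "EWR", "EWR", "JFK", "JFK"),
  dep_delay = c(10, -2, NA, 5, NA)
)

toy_props <- toy_flights %>%
  mutate(is_delayed = dep_delay > 0) %>%
  group_by(origin) %>%
  summarise(
    prop_all   = sum(is_delayed, na.rm = TRUE) / n(),  # NAs stay in the denominator
    prop_known = mean(is_delayed, na.rm = TRUE)        # NAs are dropped first
  )

toy_props
```

Which version is appropriate depends on whether flights with a missing delay should count as "not delayed" or be excluded altogether.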

4.7 Visualize the proportion of delayed flights per time of the day

flights %>% 
  mutate(is_delayed = dep_delay > 0) %>% 
  group_by(origin) %>% 
  drop_na(is_delayed, origin) %>% 
  ggplot() +
  aes(x = origin, fill = is_delayed) +
  geom_bar(position = "fill") +
  facet_wrap(~ hour) +
  scale_fill_viridis_d()  # d as in "discrete"

4.8 Visualize the proportion of delayed flights per weekday

There’s a package that does the heavy lifting for us when it comes to working with times and dates:

library(lubridate)  # provides wday()

flights %>% 
  mutate(is_delayed = dep_delay > 0) %>% 
  mutate(day_of_week = wday(time_hour)) %>%   # day of the week
  group_by(origin) %>% 
  drop_na(is_delayed, origin) %>% 
  ggplot() +
  aes(x = origin, fill = is_delayed) +
  geom_bar(position = "fill") +
  facet_wrap(~ day_of_week) +
  scale_fill_viridis_d()  # d as in "discrete"
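
By default, wday() returns numeric codes, so the facets are titled 1 to 7, with 1 standing for Sunday unless the week start is changed. A short sketch of the label and week_start arguments of lubridate's wday(), using three made-up example dates:

```r
library(lubridate)

example_dates <- as.Date(c("2013-01-01", "2013-01-05", "2013-01-06"))

# Numeric codes; with the default week start, 1 = Sunday
day_codes <- wday(example_dates)
day_codes

# Ordered factor with abbreviated day names (names depend on the locale)
day_labels <- wday(example_dates, label = TRUE)
day_labels

# Same labels, but the factor levels start on Monday
wday(example_dates, label = TRUE, week_start = 1)
```

Using wday(time_hour, label = TRUE) in the mutate() call above would give the facets readable day names instead of numbers.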

5 Reproducibility
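
A common way to make an analysis like this reproducible is to record the R version and the attached packages. A minimal sketch using base R's sessionInfo(); what it prints depends on the packages loaded on your machine:

```r
# Record R version, platform, and attached packages for later reference
session_details <- sessionInfo()
print(session_details)
```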

