# Case study: data visualization of flight delays using tidyverse tools

``````library(tidyverse)     # data wrangling and plotting
library(nycflights13)  # provides the flights data set

data("flights")``````

# 4 Solutions

## 4.1 Plot the distribution of the delays. Describe your insights.

``````flights %>%
ggplot() +
aes(x = dep_delay) +
geom_histogram()``````

Alternatively:

``````flights %>%
ggplot() +
aes(x = dep_delay) +
geom_density()``````

The distribution is skewed to the right: most flights depart close to schedule, while a few are delayed far longer than the majority.
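One way to back up this visual impression is to compare the mean with the median, since a long right tail pulls the mean above the median. The following is only a minimal sketch; the object name `delay_summary` and its column names are made up for illustration.

```r
library(dplyr)
library(nycflights13)

# Summarise departure delays (in minutes), ignoring missing values
delay_summary <- flights %>%
  summarise(
    mean_delay   = mean(dep_delay, na.rm = TRUE),
    median_delay = median(dep_delay, na.rm = TRUE),
    q99_delay    = quantile(dep_delay, probs = 0.99, na.rm = TRUE, names = FALSE),
    max_delay    = max(dep_delay, na.rm = TRUE)
  )

delay_summary
```

If the mean is clearly larger than the median and the maximum lies far beyond the 99th percentile, that is consistent with the right skew seen in the plots.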

## 4.2 Plot the distribution of the delays per origin airport.

``````flights %>%
ggplot() +
aes(x = dep_delay) +
geom_density() +
facet_wrap(~ origin)``````

## 4.3 Visualize the association of delay and time of the day. Find a way to reduce overplotting.

Hint: Try out `geom_bin2d()` or `geom_density2d()` instead of using a scatter plot.

``````flights %>%
ggplot() +
aes(x = dep_time, y = dep_delay) +
geom_density2d()``````

``````flights %>%
ggplot() +
aes(x = dep_time, y = dep_delay) +
geom_density2d() +
geom_smooth()  # smoothing line``````

Alternatively:

``````flights %>%
ggplot() +
aes(x = dep_time, y = dep_delay) +
geom_bin2d() +
geom_smooth(method = "lm")  # linear regression line``````
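Note that `dep_time` in `nycflights13` is stored in HHMM format (for example, 517 means 5:17), so the x-axis in the plots above has gaps between minute 59 of one hour and minute 0 of the next. A possible refinement, sketched here with the hypothetical helper column `dep_hour`, is to convert the values to decimal hours before plotting:

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

# Convert HHMM (e.g. 517 = 5:17) to decimal hours; dep_hour is an illustrative name
flights_hours <- flights %>%
  filter(!is.na(dep_time), !is.na(dep_delay)) %>%
  mutate(dep_hour = dep_time %/% 100 + (dep_time %% 100) / 60)

# Binned counts instead of single points to reduce overplotting
p_hours <- flights_hours %>%
  ggplot() +
  aes(x = dep_hour, y = dep_delay) +
  geom_bin2d(bins = 50) +
  labs(x = "Departure time (hours)", y = "Departure delay (minutes)")

p_hours
```

The number of bins (50) is an arbitrary choice and may need tuning.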

## 4.4 Visualize the association of delay and distance to destination. Separate by origin and month.

``````flights %>%
ggplot() +
aes(x = distance, y = dep_delay) +
geom_density2d() +
facet_grid(origin ~ month)``````

## 4.5 Visualize the association of delay and time of the day. Only include the three airlines where the delay is highest.

Reduce overplotting.

``````flights %>%
group_by(carrier) %>%
summarise(dep_delay_carrier = mean(dep_delay, na.rm = TRUE)) %>%
arrange(-dep_delay_carrier) %>%
slice(1:3)
#> # A tibble: 3 x 2
#>   carrier dep_delay_carrier
#>   <chr>               <dbl>
#> 1 F9                   20.2
#> 2 EV                   20.0
#> 3 YV                   19.0``````
``````flights %>%
filter(carrier %in% c("F9", "EV", "YV")) %>%
ggplot() +
aes(x = dep_time, y = dep_delay) +
geom_density2d()``````
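The carrier codes above are typed in by hand after reading them off the summary table. As a sketch of an alternative, the two steps could be chained so that the filter follows the data; `top_carriers` is a made-up name, and the extra `facet_wrap()` is an addition that is not part of the original solution.

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

# Carriers with the three highest mean departure delays
top_carriers <- flights %>%
  group_by(carrier) %>%
  summarise(dep_delay_carrier = mean(dep_delay, na.rm = TRUE)) %>%
  slice_max(dep_delay_carrier, n = 3) %>%
  pull(carrier)

# Same density plot as before, with one panel per carrier (optional addition)
p_top <- flights %>%
  filter(carrier %in% top_carriers) %>%
  ggplot() +
  aes(x = dep_time, y = dep_delay) +
  geom_density2d() +
  facet_wrap(~ carrier)

p_top
```

`slice_max()` requires dplyr 1.0.0 or later; with older versions, `arrange()` followed by `slice(1:3)` as shown above does the same job.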

## 4.6 Visualize the proportion of delayed flights per origin.

``````flights %>%
mutate(is_delayed = dep_delay > 0) %>%
group_by(origin) %>%
summarise(delay_n = sum(is_delayed == TRUE, na.rm = TRUE),
delay_prop = delay_n / n()) %>%
ggplot() +
aes(x = origin, y = delay_prop) +
geom_col()
``````

Alternatively, as stacked counts per origin:

``````flights %>%
mutate(is_delayed = dep_delay > 0) %>%
group_by(origin) %>%
ggplot() +
aes(x = origin, fill = is_delayed) +
geom_bar()``````

Or even this way, with each bar normalized to proportions:

``````flights %>%
mutate(is_delayed = dep_delay > 0) %>%
group_by(origin) %>%
ggplot() +
aes(x = origin, fill = is_delayed) +
geom_bar(position = "fill")``````
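In the first solution of this section, `n()` counts every row, including flights whose `dep_delay` is missing, so the proportion refers to all scheduled flights. If the proportion should instead refer only to flights with a recorded delay, one possible variant (a sketch, with `delay_by_origin` as an illustrative name) is:

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

# Share of delayed flights among flights with a recorded departure delay
delay_by_origin <- flights %>%
  filter(!is.na(dep_delay)) %>%
  group_by(origin) %>%
  summarise(delay_prop = mean(dep_delay > 0))

p_origin <- delay_by_origin %>%
  ggplot() +
  aes(x = origin, y = delay_prop) +
  geom_col()

p_origin
```

Taking the mean of a logical vector gives the share of `TRUE` values, which avoids computing the count and the total separately.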

## 4.7 Visualize the proportion of delayed flights per time of the day

``````flights %>%
mutate(is_delayed = dep_delay > 0) %>%
group_by(origin) %>%
drop_na(is_delayed, origin) %>%
ggplot() +
aes(x = origin, fill = is_delayed) +
geom_bar(position = "fill") +
facet_wrap(~ hour) +
scale_fill_viridis_d()  # d as in "discrete"``````
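The faceted bar chart splits the result into one panel per hour and origin. If only the overall trend over the day is of interest, a more compact option might be a single line chart. This is a sketch under the assumption that `hour` (the scheduled departure hour) is the right notion of "time of the day"; `delay_by_hour` and `n_flights` are illustrative names.

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

# Proportion of delayed flights per scheduled departure hour
delay_by_hour <- flights %>%
  filter(!is.na(dep_delay)) %>%
  group_by(hour) %>%
  summarise(delay_prop = mean(dep_delay > 0),
            n_flights  = n())

p_hour <- delay_by_hour %>%
  ggplot() +
  aes(x = hour, y = delay_prop) +
  geom_line() +
  geom_point()

p_hour
```

The `n_flights` column helps to judge how much weight each hour deserves, since hours with few flights give noisy proportions.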

## 4.8 Visualize the proportion of delayed flights per weekday

There’s a package that does the heavy lifting for us when it comes to working with times and dates:

``library(lubridate)``
``````flights %>%
mutate(is_delayed = dep_delay > 0) %>%
mutate(day_of_week = wday(time_hour)) %>%   # day of the week
group_by(origin) %>%
drop_na(is_delayed, origin) %>%
ggplot() +
aes(x = origin, fill = is_delayed) +
geom_bar(position = "fill") +
facet_wrap(~ day_of_week) +
scale_fill_viridis_d()  # d as in "discrete"``````
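By default, `wday()` returns the weekday as a number, so the facets are labelled 1 to 7. A small sketch of how the numbering works and how to get names instead (the object name `new_year` is only for illustration):

```r
library(lubridate)

new_year <- ymd("2013-01-01")  # a Tuesday

# With the default week start (Sunday = 1), Tuesday is day 3
wday(new_year)

# label = TRUE returns an ordered factor of weekday names instead of numbers
wday(new_year, label = TRUE)
```

Using `wday(time_hour, label = TRUE)` in the `mutate()` call above should therefore give facet titles with weekday names; the exact abbreviations depend on the locale.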

