class: inverse <br/> <br/> .right[ # Hands-on data exploration using R <br/> ### Sebastian Sauer ] <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/0e/Hochschule_f%C3%BCr_Oekonomie_%26_Management_2012_logo.svg/768px-Hochschule_f%C3%BCr_Oekonomie_%26_Management_2012_logo.svg.png" width="20%" style="display: block; margin: auto auto auto 0;" /> .right[ .tiny[ last update: 2018-11-21 ]] --- class: inverse, center, middle # Setup --- class: top, left ## Overview .Large[ .full-width[ .content-box-blue[ - Setup - Tidyverse 101 - Data diagrams 101 - Case study ] ] ] --- ## whoami ```r system("whoami") ``` - R enthusiast - Data analyst/scientist - Professor at FOM Hochschule --- ## The lights are on .center[
] --- ## Upfront preparation ### Please install the following software upfront: - [R](https://www.r-project.org/) - [RStudio Desktop](https://www.rstudio.com/products/rstudio/download/) Starting RStudio will start R automatically. <br/> <br/> ### Please also make sure: - Your OS is up to date - You have internet access during the course - You reach the next power socket (maybe better bring a power cable) --- ## You, after this workshop <img src="img/kermit.gif" width="70%" style="display: block; margin: auto;" /> Well, kinda off... --- ## Learning goals .large[ .full-width[ .content-box-blue[ - Understanding basic tidyverse goals - Applying tidyverse tools - Visualizing data - Basic modeling ] ] ] --- ## Intended audience - Analysts who have used R somewhat, but who are new to the tidyverse - Applied workers in the field of data analyses who want to learn modern techniques <br/> <br/> ### A basic working knowledge of R is needed to thoroughly follow the workshop - You should know how to get R and Rstudio running (and packages installed) - You should know some basics of R (such as object definition and indexing, using functions) - You should have some experience with coding (not strictly necessary, but helpful) --- ## We'll use the following R packages ```r pckgs <- c("nycflights13", "mosaic", "broom", "corrr", "lubridate", "viridis", "GGally", "ggmap", "pacman", "sjmisc", "leaflet", "knitr", "tidyverse", "tidyimpute", "na.tools", "checkpoint") ``` #### Please install each missing package prior to the workshop from within R: ```r install.packages("nycflights13") ``` --- ## Load each package after each start of Rstudio ```r library(pacman) p_load(pckgs, character.only = TRUE) ``` Tip: Use `search()` to see loaded packages and `installed.packages()` for installed packages. Simpler: Check the "packages pane" in RStudio. --- ## Data we'll use: `mtcars` - `mtcars` is a toy dataset built into R (no need for installing). - Data come from 1974 motor sports magazine describing some automotive. - Columns: e.g., horsepower, weight, fuel consumption ### Load the dataset: ```r data(mtcars) ``` Get help: ```r ?mtcars ``` --- ## Data we'll use: `flights` - `flights` is a dataset from R package `nycflights13` (package must be installed). - Data come from flights leaving the NYC airports in 2013. - Columns: e.g.., delay, air time, carrier name ### Load the dataset: ```r data(flights, package = "nycflights13") ``` Get help: ```r ?flights ``` </br> Load the data each time you open RStudio (during this workshop). --- ## RStudio running <img src="http://paulyeo21.github.io/cell_bio_gen_lab/images/331f4ea3.rstudio_pane.png" width="65%" style="display: block; margin: auto;" /> --- class: center, inverse, middle # The tidyverse --- background-image: url("https://pbs.twimg.com/media/CvzEQcfWIAAIs-N.jpg") background-size: cover --- ## The data analysis (science) pipeline <br/> <br/> <img src="https://d33wubrfki0l68.cloudfront.net/86cc45e87bb755a3bcecce462a6524e68d13a466/90635/images/tidy1.png" width="100%" style="display: block; margin: auto;" /> --- ## Get the power of the ~~uni~~ tidyverse <img src="img/git-push-force.gif" width="50%" style="display: block; margin: auto;" /> --- ## But I love the old way ... <img src="img/kid-waves.gif" width="50%" style="display: block; margin: auto;" /> --- ## Nice data <img src="https://biostat2.uni.lu/lectures/img/02/data_rectangle.png" width="50%" style="display: block; margin: auto;" /> --- ## Tidy data <br/> <br/> <br/> <img src="https://raw.githubusercontent.com/sebastiansauer/modar/master/static/images/09_Praxisprobleme_der_Datenaufbereitung/06_tidy-1.png" width="100%" style="display: block; margin: auto;" /> [More Details](https://biostat2.uni.lu/lectures/lecture02_tidyverse.html#1) --- ## Dataset `mtcars` ```r glimpse(mtcars) #> Observations: 32 #> Variables: 11 #> $ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.... #> $ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, ... #> $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 1... #> $ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, ... #> $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.9... #> $ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3... #> $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 2... #> $ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, ... #> $ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ... #> $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, ... #> $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, ... ``` --- class: inverse, center, middle # Data wrangling --- class: top, left ## Two tidyverse principles <br/> <br/> .pull-left[ ### Knock-down principle <img src="https://raw.githubusercontent.com/sebastiansauer/modar/master/static/images/07_Datenjudo/knockdown.png" width="70%" style="display: block; margin: auto;" /> ] .pull-right[ ### Pipe princriple <img src="https://raw.githubusercontent.com/sebastiansauer/modar/master/static/images/07_Datenjudo/11_800px-Pipa_savinelli.jpg" width="70%" style="display: block; margin: auto;" /> ] --- background-image: url("https://raw.githubusercontent.com/sebastiansauer/modar/master/static/images/07_Datenjudo/knockdown.png") background-size: contain --- ## Atoms of the knock-down principle .Large[ - `filter()` - `select()` - `mutate()` - `group_by()` - ... ] --- ## Filtering rows with `filter()` .pull-left[ Extract rows that meet logical criteria. <img src="img/filter.png" width="40%" style="display: block; margin: auto;" /> ] .pull-right[ Filter table `mtcars` such that only rows remain where `cols` equal `6` ```r filter(mtcars, cyl == 6) #> mpg cyl disp hp drat wt qsec vs am gear carb #> 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 #> 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 #> 3 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 #> 4 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 #> 5 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 #> 6 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 #> 7 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 ``` ] --- ## `filter()` - exercises <svg style="height:0.8em;top:.04em;position:relative;fill:steelblue;" viewBox="0 0 640 512"><path d="M104 96H56c-13.3 0-24 10.7-24 24v104H8c-4.4 0-8 3.6-8 8v48c0 4.4 3.6 8 8 8h24v104c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V120c0-13.3-10.7-24-24-24zm528 128h-24V120c0-13.3-10.7-24-24-24h-48c-13.3 0-24 10.7-24 24v272c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V288h24c4.4 0 8-3.6 8-8v-48c0-4.4-3.6-8-8-8zM456 32h-48c-13.3 0-24 10.7-24 24v168H256V56c0-13.3-10.7-24-24-24h-48c-13.3 0-24 10.7-24 24v400c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V288h128v168c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V56c0-13.3-10.7-24-24-24z"/></svg> <svg style="height:0.8em;top:.04em;position:relative;fill:steelblue;" viewBox="0 0 640 512"><path d="M104 96H56c-13.3 0-24 10.7-24 24v104H8c-4.4 0-8 3.6-8 8v48c0 4.4 3.6 8 8 8h24v104c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V120c0-13.3-10.7-24-24-24zm528 128h-24V120c0-13.3-10.7-24-24-24h-48c-13.3 0-24 10.7-24 24v272c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V288h24c4.4 0 8-3.6 8-8v-48c0-4.4-3.6-8-8-8zM456 32h-48c-13.3 0-24 10.7-24 24v168H256V56c0-13.3-10.7-24-24-24h-48c-13.3 0-24 10.7-24 24v400c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V288h128v168c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V56c0-13.3-10.7-24-24-24z"/></svg> Filter the automatic cars. <svg style="height:0.8em;top:.04em;position:relative;fill:steelblue;" viewBox="0 0 640 512"><path d="M104 96H56c-13.3 0-24 10.7-24 24v104H8c-4.4 0-8 3.6-8 8v48c0 4.4 3.6 8 8 8h24v104c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V120c0-13.3-10.7-24-24-24zm528 128h-24V120c0-13.3-10.7-24-24-24h-48c-13.3 0-24 10.7-24 24v272c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V288h24c4.4 0 8-3.6 8-8v-48c0-4.4-3.6-8-8-8zM456 32h-48c-13.3 0-24 10.7-24 24v168H256V56c0-13.3-10.7-24-24-24h-48c-13.3 0-24 10.7-24 24v400c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V288h128v168c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V56c0-13.3-10.7-24-24-24z"/></svg> Filter the automatic cars with more than 4 cylinders. <svg style="height:0.8em;top:.04em;position:relative;fill:steelblue;" viewBox="0 0 640 512"><path d="M104 96H56c-13.3 0-24 10.7-24 24v104H8c-4.4 0-8 3.6-8 8v48c0 4.4 3.6 8 8 8h24v104c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V120c0-13.3-10.7-24-24-24zm528 128h-24V120c0-13.3-10.7-24-24-24h-48c-13.3 0-24 10.7-24 24v272c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V288h24c4.4 0 8-3.6 8-8v-48c0-4.4-3.6-8-8-8zM456 32h-48c-13.3 0-24 10.7-24 24v168H256V56c0-13.3-10.7-24-24-24h-48c-13.3 0-24 10.7-24 24v400c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V288h128v168c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V56c0-13.3-10.7-24-24-24z"/></svg> Filter cars with either low consumption or the the super. thirsty ones --- ## `filter()` - solutions to exercises ```r data(mtcars) # only if dataset is not yet loaded filter(mtcars, am == 1) filter(mtcars, cyl > 4) filter(mtcars, mpg > 30 | mpg < 12) ``` --- ## Select columns with `select()` .pull-left[ Extract columns by name. <img src="img/select.png" width="40%" style="display: block; margin: auto;" /> ] .pull-right[ Select the columns `cyl` and `hp`. Discard the rest. ```r select(mtcars, cyl, hp) ``` ``` #> cyl hp #> Mazda RX4 6 110 #> Mazda RX4 Wag 6 110 #> Datsun 710 4 93 #> Hornet 4 Drive 6 110 #> Hornet Sportabout 8 175 #> Valiant 6 105 ``` ] --- ## `select()` - exercises <svg style="height:0.8em;top:.04em;position:relative;fill:steelblue;" viewBox="0 0 640 512"><path d="M104 96H56c-13.3 0-24 10.7-24 24v104H8c-4.4 0-8 3.6-8 8v48c0 4.4 3.6 8 8 8h24v104c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V120c0-13.3-10.7-24-24-24zm528 128h-24V120c0-13.3-10.7-24-24-24h-48c-13.3 0-24 10.7-24 24v272c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V288h24c4.4 0 8-3.6 8-8v-48c0-4.4-3.6-8-8-8zM456 32h-48c-13.3 0-24 10.7-24 24v168H256V56c0-13.3-10.7-24-24-24h-48c-13.3 0-24 10.7-24 24v400c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V288h128v168c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V56c0-13.3-10.7-24-24-24z"/></svg> Select the first three columns. <svg style="height:0.8em;top:.04em;position:relative;fill:steelblue;" viewBox="0 0 640 512"><path d="M104 96H56c-13.3 0-24 10.7-24 24v104H8c-4.4 0-8 3.6-8 8v48c0 4.4 3.6 8 8 8h24v104c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V120c0-13.3-10.7-24-24-24zm528 128h-24V120c0-13.3-10.7-24-24-24h-48c-13.3 0-24 10.7-24 24v272c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V288h24c4.4 0 8-3.6 8-8v-48c0-4.4-3.6-8-8-8zM456 32h-48c-13.3 0-24 10.7-24 24v168H256V56c0-13.3-10.7-24-24-24h-48c-13.3 0-24 10.7-24 24v400c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V288h128v168c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V56c0-13.3-10.7-24-24-24z"/></svg> Select the first and third column. <svg style="height:0.8em;top:.04em;position:relative;fill:steelblue;" viewBox="0 0 640 512"><path d="M104 96H56c-13.3 0-24 10.7-24 24v104H8c-4.4 0-8 3.6-8 8v48c0 4.4 3.6 8 8 8h24v104c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V120c0-13.3-10.7-24-24-24zm528 128h-24V120c0-13.3-10.7-24-24-24h-48c-13.3 0-24 10.7-24 24v272c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V288h24c4.4 0 8-3.6 8-8v-48c0-4.4-3.6-8-8-8zM456 32h-48c-13.3 0-24 10.7-24 24v168H256V56c0-13.3-10.7-24-24-24h-48c-13.3 0-24 10.7-24 24v400c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V288h128v168c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V56c0-13.3-10.7-24-24-24z"/></svg> Select all columns containing the letter "c". --- ## `select()` - solutions to exercises ```r select(mtcars, 1:3) select(mtcars, 1, disp) select(mtcars, contains("c")) # regex supported ``` --- ## Add or change a column with `mutate` .pull-left[ Apply vectorized functions to columns to create new columns. <img src="img/mutate.png" width="40%" style="display: block; margin: auto;" /> ] .pull-right[ Define weight in kg for each car. ```r mtcars <- mutate(mtcars, weight_kg = wt * 2) head(select(mtcars, wt, weight_kg)) #> wt weight_kg #> 1 2.620 5.24 #> 2 2.875 5.75 #> 3 2.320 4.64 #> 4 3.215 6.43 #> 5 3.440 6.88 #> 6 3.460 6.92 ``` ] --- ## `mutate()` - exercises <svg style="height:0.8em;top:.04em;position:relative;fill:steelblue;" viewBox="0 0 640 512"><path d="M104 96H56c-13.3 0-24 10.7-24 24v104H8c-4.4 0-8 3.6-8 8v48c0 4.4 3.6 8 8 8h24v104c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V120c0-13.3-10.7-24-24-24zm528 128h-24V120c0-13.3-10.7-24-24-24h-48c-13.3 0-24 10.7-24 24v272c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V288h24c4.4 0 8-3.6 8-8v-48c0-4.4-3.6-8-8-8zM456 32h-48c-13.3 0-24 10.7-24 24v168H256V56c0-13.3-10.7-24-24-24h-48c-13.3 0-24 10.7-24 24v400c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V288h128v168c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V56c0-13.3-10.7-24-24-24z"/></svg> Compute a variable for consumption (gallons per 100 miles). <svg style="height:0.8em;top:.04em;position:relative;fill:steelblue;" viewBox="0 0 640 512"><path d="M104 96H56c-13.3 0-24 10.7-24 24v104H8c-4.4 0-8 3.6-8 8v48c0 4.4 3.6 8 8 8h24v104c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V120c0-13.3-10.7-24-24-24zm528 128h-24V120c0-13.3-10.7-24-24-24h-48c-13.3 0-24 10.7-24 24v272c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V288h24c4.4 0 8-3.6 8-8v-48c0-4.4-3.6-8-8-8zM456 32h-48c-13.3 0-24 10.7-24 24v168H256V56c0-13.3-10.7-24-24-24h-48c-13.3 0-24 10.7-24 24v400c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V288h128v168c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V56c0-13.3-10.7-24-24-24z"/></svg> Compute two variables in one mutate-call. --- ## `mutate()` - solutions to exercises ```r mtcars <- mutate(mtcars, consumption = (1/mpg) * 100 * 3.8 / 1.6) mtcars <- mutate(mtcars, consumption_g_per_m = (1/mpg), consumption_l_per_100_k = consumption_g_per_m * 3.8 / 1.6 * 100) ``` --- ## Summarise a column with `summarise()` .pull-left[ Apply function to summarise column to single value. <img src="img/summary.png" width="40%" style="display: block; margin: auto;" /> ] .pull-right[ Summarise the `values` to their mean. ```r summarise(mtcars, mean_hp = mean(hp)) #> mean_hp #> 1 146.6875 ``` ] --- ## `summarise()` - exercises <svg style="height:0.8em;top:.04em;position:relative;fill:steelblue;" viewBox="0 0 640 512"><path d="M104 96H56c-13.3 0-24 10.7-24 24v104H8c-4.4 0-8 3.6-8 8v48c0 4.4 3.6 8 8 8h24v104c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V120c0-13.3-10.7-24-24-24zm528 128h-24V120c0-13.3-10.7-24-24-24h-48c-13.3 0-24 10.7-24 24v272c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V288h24c4.4 0 8-3.6 8-8v-48c0-4.4-3.6-8-8-8zM456 32h-48c-13.3 0-24 10.7-24 24v168H256V56c0-13.3-10.7-24-24-24h-48c-13.3 0-24 10.7-24 24v400c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V288h128v168c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V56c0-13.3-10.7-24-24-24z"/></svg> Compute the median of consumption. <svg style="height:0.8em;top:.04em;position:relative;fill:steelblue;" viewBox="0 0 640 512"><path d="M104 96H56c-13.3 0-24 10.7-24 24v104H8c-4.4 0-8 3.6-8 8v48c0 4.4 3.6 8 8 8h24v104c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V120c0-13.3-10.7-24-24-24zm528 128h-24V120c0-13.3-10.7-24-24-24h-48c-13.3 0-24 10.7-24 24v272c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V288h24c4.4 0 8-3.6 8-8v-48c0-4.4-3.6-8-8-8zM456 32h-48c-13.3 0-24 10.7-24 24v168H256V56c0-13.3-10.7-24-24-24h-48c-13.3 0-24 10.7-24 24v400c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V288h128v168c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V56c0-13.3-10.7-24-24-24z"/></svg> Compute multiple statistics at once. --- ## `summarise()` - solution to exercises ```r summarise(mtcars, median(consumption)) #> median(consumption) #> 1 12.36979 summarise(mtcars, consumption_md = median(consumption), consumption_avg = mean(consumption) ) #> consumption_md consumption_avg #> 1 12.36979 12.87897 ``` --- ## Group with `group_by()` .pull-left[ Create "grouped" copy of table. dplyr functions will manipulate each group separately and then combine the results. <img src="img/group-by.png" width="40%" style="display: block; margin: auto;" /> ] .pull-right[ Group cars by `am` (automatic vs. manual). Then summarise to mean in each group. ```r mtcars_grouped <- group_by(mtcars, am) summarise(mtcars_grouped, mean_hp = mean(hp)) #> # A tibble: 2 x 2 #> am mean_hp #> <dbl> <dbl> #> 1 0 160. #> 2 1 127. ``` ] --- ## `group_by()` - exercises <svg style="height:0.8em;top:.04em;position:relative;fill:steelblue;" viewBox="0 0 640 512"><path d="M104 96H56c-13.3 0-24 10.7-24 24v104H8c-4.4 0-8 3.6-8 8v48c0 4.4 3.6 8 8 8h24v104c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V120c0-13.3-10.7-24-24-24zm528 128h-24V120c0-13.3-10.7-24-24-24h-48c-13.3 0-24 10.7-24 24v272c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V288h24c4.4 0 8-3.6 8-8v-48c0-4.4-3.6-8-8-8zM456 32h-48c-13.3 0-24 10.7-24 24v168H256V56c0-13.3-10.7-24-24-24h-48c-13.3 0-24 10.7-24 24v400c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V288h128v168c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V56c0-13.3-10.7-24-24-24z"/></svg> Compute the median consumption, grouped by cylinder. <svg style="height:0.8em;top:.04em;position:relative;fill:steelblue;" viewBox="0 0 640 512"><path d="M104 96H56c-13.3 0-24 10.7-24 24v104H8c-4.4 0-8 3.6-8 8v48c0 4.4 3.6 8 8 8h24v104c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V120c0-13.3-10.7-24-24-24zm528 128h-24V120c0-13.3-10.7-24-24-24h-48c-13.3 0-24 10.7-24 24v272c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V288h24c4.4 0 8-3.6 8-8v-48c0-4.4-3.6-8-8-8zM456 32h-48c-13.3 0-24 10.7-24 24v168H256V56c0-13.3-10.7-24-24-24h-48c-13.3 0-24 10.7-24 24v400c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V288h128v168c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V56c0-13.3-10.7-24-24-24z"/></svg> Compute the median consumption, grouped by cylinder and `am`. --- ## `group_by()` - solution to exercises ```r mtcars_grouped <- group_by(mtcars, cyl) summarise(mtcars_grouped, mean_hp = mean(consumption)) #> # A tibble: 3 x 2 #> cyl mean_hp #> <dbl> <dbl> #> 1 4 9.14 #> 2 6 12.1 #> 3 8 16.2 mtcars_grouped <- group_by(mtcars, cyl,am) mtcars_summarized <- summarise(mtcars_grouped, mean_hp = mean(consumption), sd_hp = sd(consumption)) ``` --- class: center, middle, inverse # Enter the pipe --- background-image: url("https://www.rstudio.com/wp-content/uploads/2014/04/magrittr.png") background-size: cover --- ## Life without the pipe operator ```r summarise( raise_to_power( compute_differences(data, mean), 2 ), mean ) ``` --- ## Life with the pipe operator ```r data %>% compute_differences(mean) %>% raise_to_power(2) %>% summarise(mean) ``` --- ## Life with the pipe operator II <svg style="height:0.8em;top:.04em;position:relative;fill:steelblue;" viewBox="0 0 384 512"><path d="M272 428v28c0 10.449-6.68 19.334-16 22.629V488c0 13.255-10.745 24-24 24h-80c-13.255 0-24-10.745-24-24v-9.371c-9.32-3.295-16-12.18-16-22.629v-28c0-6.627 5.373-12 12-12h136c6.627 0 12 5.373 12 12zm-143.107-44c-9.907 0-18.826-6.078-22.376-15.327C67.697 267.541 16 277.731 16 176 16 78.803 94.805 0 192 0s176 78.803 176 176c0 101.731-51.697 91.541-90.516 192.673-3.55 9.249-12.47 15.327-22.376 15.327H128.893zM112 176c0-44.112 35.888-80 80-80 8.837 0 16-7.164 16-16s-7.163-16-16-16c-61.757 0-112 50.243-112 112 0 8.836 7.164 16 16 16s16-7.164 16-16z"/></svg> Advanced. <br/> ```r data <- mtcars$hp data %>% `-`(mean(data)) %>% `^`(2) %>% mean() #> [1] 4553.965 var(mtcars$hp) * (length(data)-1) / length(data) #> [1] 4553.965 ``` --- class: center, middle, inverse # Data diagrams --- ## Why we need diagrams <img src="https://raw.githubusercontent.com/sebastiansauer/modar/master/static/images/11_Daten_visualisieren/02_anscombe-crop.png" width="70%" style="display: block; margin: auto;" /> --- ## Anatomy of a diagram <img src="https://raw.githubusercontent.com/sebastiansauer/modar/master/static/images/11_Daten_visualisieren/anatomie_en.png" width="60%" style="display: block; margin: auto;" /> --- ## First plot with `ggplot` ```r mtcars %>% ggplot() + # initialize plot aes(x = hp, y = mpg) + # define axes etc. geom_point() + # graw points geom_smooth() # draw smoothing line ``` <img src="handson-data-workshop_2018-11-21_files/figure-html/unnamed-chunk-44-1.png" width="50%" style="display: block; margin: auto;" /> Notice the `+` in contrast to the pipe `%>%`. --- ## Groups and colors ```r mtcars %>% ggplot(aes(x = hp, y = mpg, color = am)) + geom_point() + geom_smooth() + scale_color_viridis_c() + # package "viridis" needed theme_bw() ``` <img src="handson-data-workshop_2018-11-21_files/figure-html/unnamed-chunk-45-1.png" width="70%" style="display: block; margin: auto;" /> --- ## Diagrams - exercises <svg style="height:0.8em;top:.04em;position:relative;fill:steelblue;" viewBox="0 0 640 512"><path d="M104 96H56c-13.3 0-24 10.7-24 24v104H8c-4.4 0-8 3.6-8 8v48c0 4.4 3.6 8 8 8h24v104c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V120c0-13.3-10.7-24-24-24zm528 128h-24V120c0-13.3-10.7-24-24-24h-48c-13.3 0-24 10.7-24 24v272c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V288h24c4.4 0 8-3.6 8-8v-48c0-4.4-3.6-8-8-8zM456 32h-48c-13.3 0-24 10.7-24 24v168H256V56c0-13.3-10.7-24-24-24h-48c-13.3 0-24 10.7-24 24v400c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V288h128v168c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V56c0-13.3-10.7-24-24-24z"/></svg> Plot the mean and the median for each cylinder group (dataset `mtcars`). <svg style="height:0.8em;top:.04em;position:relative;fill:steelblue;" viewBox="0 0 640 512"><path d="M104 96H56c-13.3 0-24 10.7-24 24v104H8c-4.4 0-8 3.6-8 8v48c0 4.4 3.6 8 8 8h24v104c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V120c0-13.3-10.7-24-24-24zm528 128h-24V120c0-13.3-10.7-24-24-24h-48c-13.3 0-24 10.7-24 24v272c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V288h24c4.4 0 8-3.6 8-8v-48c0-4.4-3.6-8-8-8zM456 32h-48c-13.3 0-24 10.7-24 24v168H256V56c0-13.3-10.7-24-24-24h-48c-13.3 0-24 10.7-24 24v400c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V288h128v168c0 13.3 10.7 24 24 24h48c13.3 0 24-10.7 24-24V56c0-13.3-10.7-24-24-24z"/></svg> Now add a visualization for the variation in the data. --- ## Diagrams - solutions to exercises 1 ```r mtcars_summarized %>% ggplot() + aes(x = cyl, y = mean_hp, color = factor(am), shape = factor(am)) + geom_point(size = 5) ``` <img src="handson-data-workshop_2018-11-21_files/figure-html/unnamed-chunk-46-1.png" width="70%" style="display: block; margin: auto;" /> --- ## Diagrams - solutions to exercises 2 ```r mtcars_summarized %>% ggplot(aes(x = cyl, color = factor(am), shape = factor(am))) + geom_errorbar(aes(ymin = mean_hp - sd_hp, ymax = mean_hp + sd_hp), width = .2, position = position_dodge(width=0.9)) + geom_point(aes(y = mean_hp), size = 5, position = position_dodge(width=0.9)) ``` <img src="handson-data-workshop_2018-11-21_files/figure-html/unnamed-chunk-47-1.png" width="50%" style="display: block; margin: auto;" /> --- class: inverse, center, middle # Case study Why are flights delayed? --- ## Know thy data Don't forget to load it from the package via: ```r data(flights) ``` A look to the help page: ```r ?flights ``` --- class: top, left ## Glimpse data ```r glimpse(flights) #> Observations: 336,776 #> Variables: 19 #> $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,... #> $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,... #> $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,... #> $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 55... #> $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 60... #> $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2... #> $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 7... #> $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 7... #> $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -... #> $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV",... #> $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79... #> $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN... #> $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR"... #> $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL"... #> $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138... #> $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 94... #> $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5,... #> $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, ... #> $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013... ``` --- ## Data sanity - quantitative variables .small-code[ ```r flights %>% descr() #> #> ## Basic descriptive statistics #> #> var type label n NA.prc mean sd se #> year integer year 336776 0.00 2013.00 0.00 0.00 #> month integer month 336776 0.00 6.55 3.41 0.01 #> day integer day 336776 0.00 15.71 8.77 0.02 #> dep_time integer dep_time 328521 2.45 1349.11 488.28 0.85 #> sched_dep_time integer sched_dep_time 336776 0.00 1344.25 467.34 0.81 #> dep_delay numeric dep_delay 328521 2.45 12.64 40.21 0.07 #> arr_time integer arr_time 328063 2.59 1502.05 533.26 0.93 #> sched_arr_time integer sched_arr_time 336776 0.00 1536.38 497.46 0.86 #> arr_delay numeric arr_delay 327346 2.80 6.90 44.63 0.08 #> flight integer flight 336776 0.00 1971.92 1632.47 2.81 #> air_time numeric air_time 327346 2.80 150.69 93.69 0.16 #> distance numeric distance 336776 0.00 1039.91 733.23 1.26 #> hour numeric hour 336776 0.00 13.18 4.66 0.01 #> minute numeric minute 336776 0.00 26.23 19.30 0.03 #> md trimmed range skew #> 2013 2013.00 0 (2013-2013) NaN #> 7 6.56 11 (1-12) -0.01 #> 16 15.70 30 (1-31) 0.01 #> 1401 1346.82 2399 (1-2400) -0.02 #> 1359 1341.60 2253 (106-2359) -0.01 #> -2 3.32 1344 (-43-1301) 4.80 #> 1535 1526.42 2399 (1-2400) -0.47 #> 1556 1550.67 2358 (1-2359) -0.35 #> -5 -1.03 1358 (-86-1272) 3.72 #> 1496 1830.51 8499 (1-8500) 0.66 #> 129 140.03 675 (20-695) 1.07 #> 872 955.27 4966 (17-4983) 1.13 #> 13 13.15 22 (1-23) 0.00 #> 29 25.64 59 (0-59) 0.09 ``` ] --- ## Data sanity - qualitative variables ```r flights %>% select_if(is.character) %>% inspect() #> #> categorical variables: #> name class levels n missing #> 1 carrier character 16 336776 0 #> 2 tailnum character 4043 334264 2512 #> 3 origin character 3 336776 0 #> 4 dest character 105 336776 0 #> distribution #> 1 UA (17.4%), B6 (16.2%), EV (16.1%) ... #> 2 N725MQ (0.2%), N722MQ (0.2%) ... #> 3 EWR (35.9%), JFK (33%), LGA (31.1%) #> 4 ORD (5.1%), ATL (5.1%), LAX (4.8%) ... ``` --- ## Distribution - quantitative variables ```r flights %>% ggplot(aes(x = dep_delay)) + geom_histogram() + scale_x_continuous(limits = c(-10, 60)) ``` <img src="handson-data-workshop_2018-11-21_files/figure-html/unnamed-chunk-53-1.png" width="70%" style="display: block; margin: auto;" /> Note the long right tail ("anomaly") --- ## Deal with missing data - omit ```r flights_nona <- flights %>% drop_na() nrow(flights_nona) / nrow(flights) #> [1] 0.9719992 ``` --- ## Deal with missing data - replace by some value ```r flights_nona2 <- flights %>% mutate(dep_delay = ifelse(is.na(dep_delay), mean(dep_delay, na.rm = TRUE), dep_delay)) flights_nona2 %>% summarise(sum(is.na(dep_delay))) #> # A tibble: 1 x 1 #> `sum(is.na(dep_delay))` #> <int> #> 1 0 ``` --- ## Deal with missing data - tidy approach ```r flights_nona3 <- flights %>% impute_all(na.mean) flights_nona3 %>% purrr::map(~sum(is.na(.))) ``` Hint: Use `package::function()` to disambiguate. --- ## Count missings per row <svg style="height:0.8em;top:.04em;position:relative;fill:steelblue;" viewBox="0 0 384 512"><path d="M272 428v28c0 10.449-6.68 19.334-16 22.629V488c0 13.255-10.745 24-24 24h-80c-13.255 0-24-10.745-24-24v-9.371c-9.32-3.295-16-12.18-16-22.629v-28c0-6.627 5.373-12 12-12h136c6.627 0 12 5.373 12 12zm-143.107-44c-9.907 0-18.826-6.078-22.376-15.327C67.697 267.541 16 277.731 16 176 16 78.803 94.805 0 192 0s176 78.803 176 176c0 101.731-51.697 91.541-90.516 192.673-3.55 9.249-12.47 15.327-22.376 15.327H128.893zM112 176c0-44.112 35.888-80 80-80 8.837 0 16-7.164 16-16s-7.163-16-16-16c-61.757 0-112 50.243-112 112 0 8.836 7.164 16 16 16s16-7.164 16-16z"/></svg> Advanced. <br/> ```r flights %>% mutate(NA_row = rowSums(is.na(.))) %>% ggplot(aes(x = NA_row)) + geom_histogram() ``` <img src="handson-data-workshop_2018-11-21_files/figure-html/unnamed-chunk-57-1.png" width="70%" style="display: block; margin: auto;" /> --- ## Descriptive statistics for delay ```r flights %>% drop_na() %>% summarise(mean(dep_delay), median(dep_delay), sd(dep_delay), iqr(dep_delay)) #> # A tibble: 1 x 4 #> `mean(dep_delay)` `median(dep_delay)` `sd(dep_delay)` `iqr(dep_delay)` #> <dbl> <dbl> <dbl> <dbl> #> 1 12.6 -2 40.1 16 ``` --- ## Descriptive statistics by origin ```r flights %>% drop_na() %>% group_by(origin) %>% summarise(mean(dep_delay), median(dep_delay), sd(dep_delay), iqr(dep_delay)) #> # A tibble: 3 x 5 #> origin `mean(dep_delay… `median(dep_de… `sd(dep_delay)` `iqr(dep_delay)` #> <chr> <dbl> <dbl> <dbl> <dbl> #> 1 EWR 15.0 -1 41.2 19 #> 2 JFK 12.0 -1 38.8 15 #> 3 LGA 10.3 -3 39.9 13 ``` --- class: center, middle, inverse # Start modeling --- ## Delay as a function of origin? <br/> <br/> <br/> ``` delay = f(origin) ``` <br/> <br/> More Rish: ``` dep_delay ~ origin ``` --- ## Linear models ```r lm(dep_delay ~ origin, data = drop_na(flights)) %>% tidy() #> # A tibble: 3 x 5 #> term estimate std.error statistic p.value #> <chr> <dbl> <dbl> <dbl> <dbl> #> 1 (Intercept) 15.0 0.117 128. 0. #> 2 originJFK -2.99 0.168 -17.7 2.65e- 70 #> 3 originLGA -4.72 0.172 -27.5 3.33e-166 ``` Some as above, stated differently. --- ## Does `distance` predicts `dep_delay`? ```r flights %>% ggplot(aes (x = distance, y = dep_delay)) + geom_point(alpha = .1) + geom_lm() + coord_cartesian(ylim(-10, 60)) ``` --- ## Does `distance` predicts `dep_delay`? <img src="handson-data-workshop_2018-11-21_files/figure-html/unnamed-chunk-62-1.png" width="70%" style="display: block; margin: auto;" /> --- ## Alternative visualization (binned data, code) ```r flights %>% mutate(distance_bins = case_when( distance < 250 ~ 250, distance < 500 ~ 500, distance < 1000 ~ 1000, distance < 2000 ~ 2000, distance < 3000 ~ 3000, TRUE ~ 5000 )) %>% ggplot(aes (y = dep_delay)) + geom_boxplot(aes(x = distance_bins, group = distance_bins)) + geom_smooth(aes(x = distance)) + coord_cartesian(ylim = c(-10, 60)) ``` Use `case_when()` for binning and recoding of data values. --- ## Alternative visualization (binned data) <img src="handson-data-workshop_2018-11-21_files/figure-html/unnamed-chunk-64-1.png" width="70%" style="display: block; margin: auto;" /> --- ## Correlation of distance and delay ```r flights %>% select(distance, dep_delay, origin) %>% group_by(origin) %>% drop_na() %>% summarise(cor_delay_dist = cor(dep_delay, distance)) #> # A tibble: 3 x 2 #> origin cor_delay_dist #> <chr> <dbl> #> 1 EWR -0.0361 #> 2 JFK -0.0398 #> 3 LGA 0.0114 ``` --- ## Delay as a function of distance ```r lm(dep_delay ~ I(distance/1000) + origin, data = flights) %>% tidy() #> # A tibble: 4 x 5 #> term estimate std.error statistic p.value #> <chr> <dbl> <dbl> <dbl> <dbl> #> 1 (Intercept) 16.8 0.157 107. 0. #> 2 I(distance/1000) -1.60 0.0988 -16.2 9.43e- 59 #> 3 originJFK -2.66 0.170 -15.7 3.09e- 55 #> 4 originLGA -5.21 0.174 -29.9 3.80e-196 ``` --- ## Delay per month (code) ```r p1 <- flights %>% group_by(origin, month, day) %>% summarise(dep_delay_avg_day = mean(dep_delay, na.rm = TRUE)) %>% ungroup %>% mutate(dep_dt = make_date(2013, month, day)) %>% ggplot(aes(x = dep_dt, y = dep_delay_avg_day, shape = origin, color = origin)) + geom_point(alpha = .3) + geom_smooth() + scale_color_viridis_d() p1 ``` --- ## Delay per month (output) <img src="handson-data-workshop_2018-11-21_files/figure-html/unnamed-chunk-68-1.png" width="100%" style="display: block; margin: auto;" /> --- ## Delay per month - boxplot (code) ```r flights %>% ggplot(aes(x = month, y = dep_delay)) + geom_boxplot(aes(group = month)) + geom_smooth() + coord_cartesian(ylim = c(-10, 60)) + scale_x_continuous(breaks = 1:12) ``` --- ## Delay per month - boxplot (output) <img src="handson-data-workshop_2018-11-21_files/figure-html/unnamed-chunk-70-1.png" width="70%" style="display: block; margin: auto;" /> --- ## Is it the weekends? (code) ```r flights <- flights %>% mutate(dow = wday(time_hour), weekend = case_when( dow %in% c(6, 7) ~ TRUE, TRUE ~ FALSE)) delay_dow <- flights %>% group_by(dow) %>% drop_na() %>% summarise(delay_m = mean(dep_delay), delay_md = median(dep_delay), q_05 = quantile(x = dep_delay, prob = .05), q_95 = quantile(x = dep_delay, prob = .95)) ``` --- ## Is it the weekends? (data processed) ```r delay_dow %>% kable(format='html') ``` <table> <thead> <tr> <th style="text-align:right;"> dow </th> <th style="text-align:right;"> delay_m </th> <th style="text-align:right;"> delay_md </th> <th style="text-align:right;"> q_05 </th> <th style="text-align:right;"> q_95 </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 11.477476 </td> <td style="text-align:right;"> -2 </td> <td style="text-align:right;"> -9 </td> <td style="text-align:right;"> 80 </td> </tr> <tr> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 14.718728 </td> <td style="text-align:right;"> -1 </td> <td style="text-align:right;"> -9 </td> <td style="text-align:right;"> 101 </td> </tr> <tr> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 10.588355 </td> <td style="text-align:right;"> -2 </td> <td style="text-align:right;"> -9 </td> <td style="text-align:right;"> 80 </td> </tr> <tr> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 11.643321 </td> <td style="text-align:right;"> -2 </td> <td style="text-align:right;"> -9 </td> <td style="text-align:right;"> 84 </td> </tr> <tr> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 16.043451 </td> <td style="text-align:right;"> -1 </td> <td style="text-align:right;"> -9 </td> <td style="text-align:right;"> 101 </td> </tr> <tr> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 14.653974 </td> <td style="text-align:right;"> -1 </td> <td style="text-align:right;"> -8 </td> <td style="text-align:right;"> 94 </td> </tr> <tr> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 7.594406 </td> <td style="text-align:right;"> -2 </td> <td style="text-align:right;"> -9 </td> <td style="text-align:right;"> 60 </td> </tr> </tbody> </table> --- ##Is it the weekends? (code) ```r flights %>% ggplot(aes(x = dow)) + geom_boxplot(aes(group = dow, y = dep_delay, color = weekend)) + geom_point(data = delay_dow, aes(y = delay_m), color = "red", size = 5) + coord_cartesian(ylim = c(-10, 100)) + scale_x_continuous(breaks = 1:7) + geom_hline(yintercept = mean(flights$dep_delay, na.rm = TRUE), linetype = "dashed") + geom_hline(yintercept = median(flights$dep_delay, na.rm = TRUE), linetype = "dashed") + geom_errorbar(aes(ymin = q_05, ymax = q_95), data = delay_dow) ``` --- ## Is it the weekends? (output) <img src="handson-data-workshop_2018-11-21_files/figure-html/unnamed-chunk-74-1.png" width="70%" style="display: block; margin: auto;" /> --- ## Delay per time of the day (code) Let's check whether delays add up during the day, a popular opinion among travellers. ```r flights %>% select(dep_delay, hour) %>% ggplot(aes(x = hour, y = dep_delay)) + geom_boxplot(aes(group = hour)) + geom_smooth(method = "lm") + coord_cartesian(ylim = c(-10, 60)) + scale_x_continuous(breaks = 1:24) ``` --- ## Delay per time of the day (output) <img src="handson-data-workshop_2018-11-21_files/figure-html/unnamed-chunk-76-1.png" width="70%" style="display: block; margin: auto;" /> --- # Delay as function of month, hour, origin, and weekday ```r lm_hour <- lm(dep_delay ~ hour + month + origin + I(dow == 7), data = flights) rsquared(lm_hour) #> [1] 0.04408207 ``` --- class: inverse, center, middle # Geoplotting <svg style="height:0.8em;top:.04em;position:relative;fill:steelblue;" viewBox="0 0 384 512"><path d="M272 428v28c0 10.449-6.68 19.334-16 22.629V488c0 13.255-10.745 24-24 24h-80c-13.255 0-24-10.745-24-24v-9.371c-9.32-3.295-16-12.18-16-22.629v-28c0-6.627 5.373-12 12-12h136c6.627 0 12 5.373 12 12zm-143.107-44c-9.907 0-18.826-6.078-22.376-15.327C67.697 267.541 16 277.731 16 176 16 78.803 94.805 0 192 0s176 78.803 176 176c0 101.731-51.697 91.541-90.516 192.673-3.55 9.249-12.47 15.327-22.376 15.327H128.893zM112 176c0-44.112 35.888-80 80-80 8.837 0 16-7.164 16-16s-7.163-16-16-16c-61.757 0-112 50.243-112 112 0 8.836 7.164 16 16 16s16-7.164 16-16z"/></svg> Advanced. <br/> --- class: top, left ## Join airport data ```r data("airports") flights_airports <- # join destination long/lat flights %>% left_join(airports, by = c("dest" = "faa")) %>% rename(long = lon) origin_latlong <- airports %>% filter(faa %in% c("LGA", "JFK", "EWR")) %>% rename(lat_origin = lat, long_origin = lon) flights_airports <- # join origin long/lat flights_airports %>% left_join(origin_latlong, by = c("origin" = "faa")) ``` --- ## Dataframe for plotting (code) ```r flights_airports_sum <- flights_airports %>% group_by(dest, origin) %>% summarise(n = n(), long = max(long), lat = max(lat), long_origin = max(long_origin), lat_origin = max(lat_origin)) ``` --- ## Dataframe for plotting (output) ```r head(flights_airports_sum) #> # A tibble: 6 x 7 #> # Groups: dest [5] #> dest origin n long lat long_origin lat_origin #> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> #> 1 ABQ JFK 254 -107. 35.0 -73.8 40.6 #> 2 ACK JFK 265 -70.1 41.3 -73.8 40.6 #> 3 ALB EWR 439 -73.8 42.7 -74.2 40.7 #> 4 ANC EWR 8 -150. 61.2 -74.2 40.7 #> 5 ATL EWR 5022 -84.4 33.6 -74.2 40.7 #> 6 ATL JFK 1930 -84.4 33.6 -73.8 40.6 ``` --- ## Geo plot flights (code) ```r ggplot(data = map_data("usa")) + aes(x = long, y = lat, group = group) + geom_path(color = "grey40", size = .1) + geom_point(data = flights_airports_sum, aes(size = n, color = n, group = NULL), alpha = .2) + geom_segment(data = flights_airports_sum, aes(color = n, group = NULL, x = long_origin, y = lat_origin, xend = long, yend = lat), alpha = .5) + geom_text(data = flights_airports_sum %>% filter(n > 6000), aes(x = long, y = lat, label = dest, group = NULL), color = "grey40") + theme_map() + xlim(-130, -70) + ylim(+20, +50) + scale_color_viridis() ``` --- ## Geo plot flights (output) <img src="handson-data-workshop_2018-11-21_files/figure-html/unnamed-chunk-82-1.png" width="80%" style="display: block; margin: auto;" /> --- class: center, inverse, middle # Outlook <svg style="height:0.8em;top:.04em;position:relative;fill:steelblue;" viewBox="0 0 384 512"><path d="M272 428v28c0 10.449-6.68 19.334-16 22.629V488c0 13.255-10.745 24-24 24h-80c-13.255 0-24-10.745-24-24v-9.371c-9.32-3.295-16-12.18-16-22.629v-28c0-6.627 5.373-12 12-12h136c6.627 0 12 5.373 12 12zm-143.107-44c-9.907 0-18.826-6.078-22.376-15.327C67.697 267.541 16 277.731 16 176 16 78.803 94.805 0 192 0s176 78.803 176 176c0 101.731-51.697 91.541-90.516 192.673-3.55 9.249-12.47 15.327-22.376 15.327H128.893zM112 176c0-44.112 35.888-80 80-80 8.837 0 16-7.164 16-16s-7.163-16-16-16c-61.757 0-112 50.243-112 112 0 8.836 7.164 16 16 16s16-7.164 16-16z"/></svg> Advanced. <br/> --- class: top, left ## Map columns to function with `map()` ```r data(mtcars) purrr::map(select(mtcars, 1:2), ~ {ggplot(mtcars, aes(x = .)) + geom_histogram()}) #> $mpg #> #> $cyl ``` <img src="handson-data-workshop_2018-11-21_files/figure-html/unnamed-chunk-83-1.png" width="20%" style="display: block; margin: auto;" /><img src="handson-data-workshop_2018-11-21_files/figure-html/unnamed-chunk-83-2.png" width="20%" style="display: block; margin: auto;" /> --- ## Map TWO columns to function with `map2()` ```r flights %>% select_if(~!is.numeric(.)) %>% map2(., names(.), ~ {ggplot(data = flights, aes(x = .x)) + geom_bar() + labs(x = .y, title = .y)}) ``` <br/> ``` Take flights dataset select all numeric cols map ggplot call to each such that a barplot is plotted where the names of each barplot is an element from the vector of col names. ``` --- ## Reshape (transform) dataframe <br/> <img src="https://raw.githubusercontent.com/sebastiansauer/modar/master/static/images/09_Praxisprobleme_der_Datenaufbereitung/07_tidy-dataframes-crop.png" width="70%" style="display: block; margin: auto;" /> --- ## Transform dataframe for plotting ```r mtcars %>% select_if(is.numeric) %>% gather(key = variable, value = value) %>% ggplot(aes(x = value)) + geom_density() + facet_wrap(~ variable, ncol = 3, scales = "free") ``` <img src="handson-data-workshop_2018-11-21_files/figure-html/unnamed-chunk-86-1.png" width="30%" style="display: block; margin: auto;" /> --- ## `ggplot` niceties: Themes ```r library(hrbrthemes) p2 <- p1 + theme_ipsum() + theme(legend.position = "bottom") p2 ``` <img src="handson-data-workshop_2018-11-21_files/figure-html/unnamed-chunk-87-1.png" width="70%" style="display: block; margin: auto;" /> May may need to install fonts upfront; see `?hrbrthemes`. --- ## ggplot niceties: Combining plots ```r library(patchwork) p2 + p2 ``` <img src="handson-data-workshop_2018-11-21_files/figure-html/unnamed-chunk-88-1.png" width="70%" style="display: block; margin: auto;" /> --- class: center, inverse, left # Resources --- ## Modern Dive <br/> <br/> <img src="https://moderndive.com/wide_format.png" width="70%" style="display: block; margin: auto;" /> [Modern Dive -- An Introduction to Statistical and Data Sciences via R Chester Ismay and Albert Y. Kim](https://moderndive.com/index.html) --- ## R for Data Science <br/> <br/> <img src="https://covers.oreillystatic.com/images/0636920034407/lrg.jpg" width="20%" style="display: block; margin: auto;" /> [R for Data Science](https://r4ds.had.co.nz/) --- ## Moderne Datenanalyse mit R <img src="https://images.springer.com/sgw/books/medium/9783658215866.jpg" width="30%" style="display: block; margin: auto;" /> [Moderne Datenanalyse mit R](https://www.springer.com/us/book/9783658215866) --- ## Cheatsheets <img src="img/cheatsheet-dplyr.png" width="50%" style="display: block; margin: auto;" /> https://www.rstudio.com/resources/cheatsheets/ --- ## Disclaimer: There may be issues at times <img src="img/overhead-bin.gif" width="50%" style="display: block; margin: auto;" />
[StackOverflow](https://stackoverflow.com/) is your friend --- class: center, inverse, middle class: center, inverse, middle # Wrap-up --- ## That was quick, but it was a start .center[ ![](img/get-into-the-car.gif) ] --- class: right, top, inverse # Thank you <br/> <br/> Sebastian Sauer <br/> .left[
[sebastiansauer](https://github.com/sebastiansauer)
<https://data-se.netlify.com/>
sebastian.sauer@fom.de
[Sebastian Sauer](https://www.linkedin.com/in/dr-sebastian-sauer-4791762)
Get slides [here](http://data-se.netlify.com/slides/hands-on-data-exploration/handson-data-workshop_2018-11-21.html)
: Get PDF of slides [here](http://data-se.netlify.com/slides/hands-on-data-exploration/hands-on-data-exploration-using-R.pdf)
: Get Rmd source code of slides [here](http://data-se.netlify.com/slides/hands-on-data-exploration/handson-data-workshop_2018-11-21.Rmd) <br/> Licence: MIT ] --- ## Credit to Built using R, RMarkdown, Xaringan. Thanks to the R community and the tidyverse developers. Thanks to [Yihui Xie](https://slides.yihui.name/2017-rmarkdown-UNL-Yihui-Xie.htm) and [Antoine Bichat](https://abichat.github.io/Slides/ScienceCommunicationSOTR/ScienceCommunicationSOTR.html), among others, for Xaringan inspiration. Thanks to FOM Hochschule for supporting me. Images: - [Data Transformation with dplyr Cheat Sheet, by RStudio](https://www.rstudio.com/resources/cheatsheets/) - [Modern Dive, Chester Ismay and Albert Y. Kim](https://moderndive.com/) - [Kermit typping](https://www.reddit.com/r/reactiongifs/comments/2zk2ds/mrw_i_have_a_paper_due_the_following_morning/) - [RStudio running](http://paulyeo21.github.io/cell_bio_gen_lab/images/331f4ea3.rstudio_pane.png) - [tidyverse](http://paulyeo21.github.io/cell_bio_gen_lab/images/331f4ea3.rstudio_pane.png) - [Process of data analysis](https://d33wubrfki0l68.cloudfront.net/86cc45e87bb755a3bcecce462a6524e68d13a466/90635/images/tidy1.png) - [Yoda](http://i.imgur.com/CZ4D2pf.gifv) - [kid waves](https://media.giphy.com/media/XrDT8BuYB3I2s/giphy.gif) - [Nice data](https://biostat2.uni.lu/lectures/img/02/data_rectangle.png) - [tidy data](http://garrettgman.github.io/tidying/) - [Magrittr pipe](https://www.rstudio.com/wp-content/uploads/2014/04/magrittr.png) - [Jump car](https://wifflegif.com/gifs/598112-skill-wtf-gif) - [Overhead locker](https://i.gifer.com/DUrV.gif) Icons from [FontAwesome](https://fontawesome.com/icons) --- ## SessionInfo ``` #> [1] "R version 3.5.1 (2018-07-02)" ``` See `si.RData` for package version (same folder as this presentation). This document is made reproducible using [checkpoint](https://mran.microsoft.com/package/checkpoint) with day set to 2018-09-30.