sesa blog

sesa blog https://data-se.netlify.app/ Recent content on sesa blog Hugo -- gohugo.io en-us Sun, 13 Oct 2024 00:00:00 +0000 Working with list columns - an example https://data-se.netlify.app/2024/10/13/working-with-list-columns-an-example/ Sun, 13 Oct 2024 00:00:00 +0000 https://data-se.netlify.app/2024/10/13/working-with-list-columns-an-example/ 1 Load packages 2 Introduction 3 Example data 4 Add list column 1 5 Add list column 2 6 Extract list column 7 Reproducibility 1 Load packages library(tidyverse) # data wrangling 2 Introduction In this post, I want to show you how to work with list columns in R. List columns are a powerful feature of the tidyverse that allows you to store multiple objects in a single column of a data frame. Checking the R packages required for a project https://data-se.netlify.app/2024/10/11/ben%C3%B6tigte-r-pakete-f%C3%BCr-ein-projekt-pr%C3%BCfen/ Fri, 11 Oct 2024 00:00:00 +0000 https://data-se.netlify.app/2024/10/11/ben%C3%B6tigte-r-pakete-f%C3%BCr-ein-projekt-pr%C3%BCfen/ 1 Packages 2 Motivation 3 Find out missing packages 4 Install missing packages from CRAN 5 Install non-CRAN packages 1 Packages library(renv) 2 Motivation Assume you find a cool repo or an online book and you want to run the R code. You might want to check whether you have all the required packages installed. This is what this post is about. We will use the usethis package to check if all required packages are installed. Dead Man Bias in correlation https://data-se.netlify.app/2024/03/05/dead-man-bias-in-correlation/ Tue, 05 Mar 2024 00:00:00 +0000 https://data-se.netlify.app/2024/03/05/dead-man-bias-in-correlation/ 1 Load packages 2 Background 3 Data example 4 Discussion 5 Reproducibility 1 Load packages library(tidyverse) # data wrangling 2 Background Nassim Taleb points out in the paper Fooled by Correlation: Common Misinterpretations in Social “Science” that spurious correlation may appear due to various reasons. 
One reason is what he calls the “Dead Man Bias”, occurring if constant data is added to uniformly distributed data. That is, if the data is uniformly distributed and uncorrelated, you will get a spurious correlation once constant data is added. Adjust labels in ggplot https://data-se.netlify.app/2024/02/25/adjust-labels-in-ggplot/ Sun, 25 Feb 2024 00:00:00 +0000 https://data-se.netlify.app/2024/02/25/adjust-labels-in-ggplot/ 1 Load packages 2 Data 3 Unadjusted labels 4 Adjusted labels manually 5 Adjust labels automatically 6 Expanding the limits 7 Duckdive the problem: tinyfy the label 8 Reproducibility 1 Load packages library(tidyverse) # data wrangling library(see) # okabeito_colors 2 Data data("mariokart", package = "openintro") 3 Unadjusted labels mario_quantile <- mariokart %>% filter(total_pr < 100) %>% summarise(q25 = quantile(total_pr, .25), q50 = quantile(total_pr, .50), q75 = quantile(total_pr, . Dictionaries in R https://data-se.netlify.app/2024/02/17/dictionaries-in-r/ Sat, 17 Feb 2024 00:00:00 +0000 https://data-se.netlify.app/2024/02/17/dictionaries-in-r/ 1 Load packages 2 Are there Dictionaries in R? 3 Named vectors as dictionaries 4 Assign keys to a dictionary 5 Adding elements to a dictionary 6 Changing the order of the keys 7 Combining dictionaries 8 Looking up the keys for a given value 9 Using position index to look-up values 10 Searching for some value 11 Searching for the value given some key fragments 12 Check whether the dictionary contains some key 13 Sort values alphabetically 14 Sort keys alphabetically 15 Lists instead of vectors 16 Further reading 17 Reproducibility 1 Load packages library(tidyverse) # data wrangling 2 Are there Dictionaries in R? 
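The outline of the "Dictionaries in R" entry above (named vectors as dictionaries, adding elements, checking for a key, looking up keys for a value, sorting keys) can be illustrated with a minimal base R sketch. The vector `capitals` and its keys and values are made up for illustration and are not taken from the post.

```r
# A named character vector used as a simple dictionary: names are the keys.
# NOTE: `capitals` and its contents are hypothetical example data.
capitals <- c(germany = "Berlin", france = "Paris")

# Look up a value by key
capitals[["germany"]]

# Add an element under a new key
capitals["italy"] <- "Rome"

# Check whether the dictionary contains some key
"spain" %in% names(capitals)

# Look up the key(s) for a given value
names(capitals)[capitals == "Paris"]

# Sort the dictionary by key, alphabetically
capitals[order(names(capitals))]
```

A named vector can only hold values of one atomic type; for mixed value types, a named list is the usual alternative, which is presumably what the "Lists instead of vectors" section covers.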
Using dynamic variables in ggplot2 for faceting and more https://data-se.netlify.app/2024/02/04/using-dynamic-variables-in-ggplot2-for-facetting-and-more/ Sun, 04 Feb 2024 00:00:00 +0000 https://data-se.netlify.app/2024/02/04/using-dynamic-variables-in-ggplot2-for-facetting-and-more/ 1 Load packages 1 Load packages library(tidyverse) # data wrangling Prevent dropping of non-occurring levels using dplyr https://data-se.netlify.app/2024/01/30/prevent-dropping-from-non-occuring-levels-using-dplyr/ Tue, 30 Jan 2024 00:00:00 +0000 https://data-se.netlify.app/2024/01/30/prevent-dropping-from-non-occuring-levels-using-dplyr/ 1 Load packages 2 Problem 3 Solution 4 Reproducibility 1 Load packages library(tidyverse) # data wrangling 2 Problem Consider the following situation: mtcars |> group_by(high_hp = hp > 1000) |> count(high_hp) #> # A tibble: 1 × 2 #> # Groups: high_hp [1] #> high_hp n #> <lgl> <int> #> 1 FALSE 32 The summary table does not show the level TRUE, as it does not occur in the data. Numerical similarity of some Bayes and Classical models https://data-se.netlify.app/2024/01/29/numerical-similarity-of-some-bayes-and-classical-models/ Mon, 29 Jan 2024 00:00:00 +0000 https://data-se.netlify.app/2024/01/29/numerical-similarity-of-some-bayes-and-classical-models/ 1 Load packages 2 Motivation 2.1 Bayes and Frequentist 2.2 Technical setup for Bayes analysis provides a barrier 3 Numerical convergence of Bayes and Frequentist approaches 4 Example 1: mtcars 4.1 Frequentist model 4.2 Bayesian numerically equivalent model 4.3 Conclusion 5 Example 2: penguins 5.1 Frequentist model 5.2 Bayesian numerically equivalent model 5.3 Conclusion 6 Example 3: diamonds 6.1 Frequentist model 6. 
Simulating multiple event collision https://data-se.netlify.app/2024/01/29/simulating-multiple-event-collision/ Mon, 29 Jan 2024 00:00:00 +0000 https://data-se.netlify.app/2024/01/29/simulating-multiple-event-collision/ 1 Motivation 2 Setup 3 Constants/Parameters 4 Model 5 Some Assumptions 6 Example 7 Analytical approach 7.1 Limiting to picking 1 option 7.2 Generalizing to picking $p$ options 8 Monte Carlo as an alternative 8.1 Setup 8.2 Test 9 Modelling without dependency 9.1 Examples 9.2 Sampling distribution 9.3 o=10; Make a matching more probable 9.4 o=5; Make a matching highly probable 9.5 Grid of different parameter values 9. Using quizzes on markdown html sites https://data-se.netlify.app/2024/01/18/using-quizzes-on-markdown-html-sites/ Thu, 18 Jan 2024 00:00:00 +0000 https://data-se.netlify.app/2024/01/18/using-quizzes-on-markdown-html-sites/ 1 Motivation 2 Simple quiz implementation using HTML and JS 3 Demo 4 Reproducibility 1 Motivation As a teacher, I often write exercises for my students and post the exercises on my Datenwerk site. As many questions are of the multiple-choice type, a quiz function would come in handy. I first hoped that Quarto markdown would supply such a feature out of the box. However, this is not (yet) the case. Simple contingency tables in R https://data-se.netlify.app/2024/01/12/simple-contingency-tables-in-r/ Fri, 12 Jan 2024 00:00:00 +0000 https://data-se.netlify.app/2024/01/12/simple-contingency-tables-in-r/ 1 Load packages 2 Motivation 3 Toy data 4 Using table and friends 5 Using count 6 Don’t drop unused factor levels 7 See also 8 Conclusions 9 Reproducibility 1 Load packages library(tidyverse) # data wrangling 2 Motivation Assume we would like to compute contingency tables in R without much ado. Let’s explore some ways. 
3 Toy data data(mtcars) 4 Using table and friends mtcars |> select(vs, am) |> table() #> am #> vs 0 1 #> 0 12 6 #> 1 7 7 Let’s add margins: Logistic regression using z-standardized values https://data-se.netlify.app/2023/12/20/logistic-regression-using-z-standardized-values/ Wed, 20 Dec 2023 00:00:00 +0000 https://data-se.netlify.app/2023/12/20/logistic-regression-using-z-standardized-values/ 1 Load packages 2 Data 3 Motivation 4 EDA 5 Model with raw values 6 Model with am as factor variable 7 Visualizing 8 Standardizing predictors 9 Model with z-scaled predictors 10 Model with all variables z-scaled 11 Conclusion 12 Reproducibility 1 Load packages library(tidyverse) # data wrangling library(easystats) 2 Data data(mtcars) 3 Motivation In this post, we’ll investigate the consequences of z-standardizing the predictor variables, and in addition the outcome variable, in a simple logistic regression setting. Testing if return value is in tolerance https://data-se.netlify.app/2023/12/13/test-if-return-value-is-in-tolerance2/ Wed, 13 Dec 2023 00:00:00 +0000 https://data-se.netlify.app/2023/12/13/test-if-return-value-is-in-tolerance2/ 1 Load packages 2 Motivation 3 But in practice, how large is the difference? 4 Check if in tolerance region 4.1 b0 4.2 b1 4.3 R2 4.4 Count 5 Check variability 6 Conclusions 7 Reproducibility 1 Load packages library(tidyverse) # data wrangling library(prada) # function "is_in_tolerance" library(rstanarm) # Bayes regression library(easystats) # R2 etc library(DataExplorer) # data vis library(tictoc) 2 Motivation Bayes models (using MCMC) build on drawing random numbers. Unicode in R and in Markdown https://data-se.netlify.app/2023/11/22/unicode/ Wed, 22 Nov 2023 00:00:00 +0000 https://data-se.netlify.app/2023/11/22/unicode/ 1 Unicode in R 2 Unicode in Markdown 3 Emojis in Markdown 4 FontAwesome 5 Why icons, not emojis? 
6 FontAwesome - Quarto 7 FontAwesome - R package 8 Latex 9 Reproducibility 1 Unicode in R 25FB is the Unicode code point for the white square: cat("\u25FB\n") #> ◻ 2 Unicode in Markdown In Markdown, you can use the HTML code, e.g., &#x25FB;. This renders as: ◻ 3 Emojis in Markdown Since emojis of course also have a Unicode code point, emojis can easily be displayed this way, too. Speed test for parallel processing https://data-se.netlify.app/2023/11/15/speed-test-for-parallel-processing/ Wed, 15 Nov 2023 00:00:00 +0000 https://data-se.netlify.app/2023/11/15/speed-test-for-parallel-processing/ 0.0.1 How fast is fast? 0.0.2 Tidymodels pipeline 0.0.3 Setup 0.0.4 Simple Fit 0.0.5 Resampling 0.0.6 Tuning 0.0.7 More tuning params 0.0.8 Parallel processing 0.0.9 Parallel processing - explicitly 0.0.10 ANOVA race 0.0.11 Acknowledgements 0.0.12 Reproducibility 0.0.1 How fast is fast? Let’s see how quickly some predictive model runs, in order to estimate time consumption for larger machine learning pipelines. In addition, let’s see how much time is saved when using multiple cores, i.e. normal densities animated https://data-se.netlify.app/2023/11/04/normal-densities-animanted/ Sat, 04 Nov 2023 00:00:00 +0000 https://data-se.netlify.app/2023/11/04/normal-densities-animanted/ Background Let’s visualize the quantiles of a normal distribution using a density plot. Setup library(tidyverse) ## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ── ## ✔ dplyr 1.1.3 ✔ readr 2.1.4 ## ✔ forcats 1.0.0 ✔ stringr 1.5.0 ## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1 ## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0 ## ✔ purrr 1.0.2 ## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ── ## ✖ dplyr::filter() masks stats::filter() ## ✖ dplyr::lag() masks stats::lag() ## ℹ Use the conflicted package (<http://conflicted. 
normal distribution animated https://data-se.netlify.app/2023/11/04/normal-distribution-animated/ Sat, 04 Nov 2023 00:00:00 +0000 https://data-se.netlify.app/2023/11/04/normal-distribution-animated/ Background Let’s visualize the quantiles of a normal distribution. Setup library(tidyverse) ## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ── ## ✔ dplyr 1.1.3 ✔ readr 2.1.4 ## ✔ forcats 1.0.0 ✔ stringr 1.5.0 ## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1 ## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0 ## ✔ purrr 1.0.2 ## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ── ## ✖ dplyr::filter() masks stats::filter() ## ✖ dplyr::lag() masks stats::lag() ## ℹ Use the conflicted package (<http://conflicted. Gantt charts with ganttrify https://data-se.netlify.app/2023/09/22/gantt-diagramme-mit-ganttrify/ Fri, 22 Sep 2023 00:00:00 +0000 https://data-se.netlify.app/2023/09/22/gantt-diagramme-mit-ganttrify/ Filtering vectors in R https://data-se.netlify.app/2023/07/15/filtering-vectors/ Sat, 15 Jul 2023 00:00:00 +0000 https://data-se.netlify.app/2023/07/15/filtering-vectors/ 1 Motivation 2 Setup 3 Way 1: Base R 4 Way 2: magrittr 5 Way 3: tidyverse 6 Way 4: purrr 7 Conclusions 1 Motivation We have a vector and we want to filter it by name. 2 Setup library(tidyverse) ## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ── ## ✔ dplyr 1.1.2 ✔ readr 2.1.4 ## ✔ forcats 1.0.0 ✔ stringr 1.5.0 ## ✔ ggplot2 3. 
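For the "Filtering vectors in R" entry above, whose stated goal is to filter a vector by name, a base R sketch along the lines of "Way 1" might look as follows. The vector `x` and the selection `wanted` are made-up example data, and the post's own code may differ.

```r
# NOTE: `x` and `wanted` are hypothetical example data.
x <- c(a = 1, b = 2, c = 3)
wanted <- c("a", "c")

# Subset by name directly; a name that is absent from x would yield NA
by_name <- x[wanted]

# Subset via a logical test on the names; absent names are silently skipped
by_test <- x[names(x) %in% c("a", "z")]

by_name
by_test
```

The second form is often the safer choice when the requested names are not guaranteed to exist in the vector.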
Color palettes for nominal variables https://data-se.netlify.app/2023/06/30/farbpaletten/ Fri, 30 Jun 2023 00:00:00 +0000 https://data-se.netlify.app/2023/06/30/farbpaletten/ 1 Setup 2 tl;dr 3 Example of color choice for a nominal variable 4 Requirements for a color palette (for nominal variables) 5 Selection 6 Helper function 7 AWTools 8 ggthemes 9 Further palettes 10 rtist 11 ggsci 12 jcolors 13 Viridis 14 Magma 15 Color Lisa 16 Okabe-Ito 17 Tableau 10 18 Color names 19 Show color 2 20 Color blindness 20.1 Okabe Ito 20.2 Gene Davis 20.3 X18 20.4 Tableau 10 20. Estimating simulation variance in Stan models https://data-se.netlify.app/2023/03/17/estimating-simulation-variance-in-stan-models/ Fri, 17 Mar 2023 00:00:00 +0000 https://data-se.netlify.app/2023/03/17/estimating-simulation-variance-in-stan-models/ 1 Load packages 2 Motivation 3 Model 4 Workhorse function 5 Function for summarizing the simulation results 6 Dataset mtcars 7 Dataset msleep 8 Dataset penguins 9 Dataset tips 10 Dataset gtcars 11 Dataset Boston 12 Dataset TeachingRatings 13 Results overview 14 Conclusion 15 Reproducibility 1 Load packages library(tidyverse) # data wrangling library(rstanarm) library(gt) 2 Motivation stan_glm() allows for setting a seed value, thereby eliminating the variance induced by random numbers. Tables, plotted as ggplot objects https://data-se.netlify.app/2023/02/03/tables-plotted-as-ggplot-objects/ Fri, 03 Feb 2023 00:00:00 +0000 https://data-se.netlify.app/2023/02/03/tables-plotted-as-ggplot-objects/ 1 Load packages 2 Show case 1: grid.table 3 Show case 2: tableGrob 4 Show case 3: Reduce whitespace 5 Show case 4: ggpubr 6 Reproducibility 1 Load packages library(tidyverse) # data wrangling library(gridExtra) library(grid) library(gt) 2 Show case 1: grid.table d <- head(iris[,1:3]) grid.table(d) grid.table does the job nicely. 
Just plotting gives a somewhat too raw object: plot(tableGrob(d)) 3 Show case 2: tableGrob The following R code is taken from this source: Playing around with spirographs https://data-se.netlify.app/2023/01/30/playing-around-with-spirographs/ Mon, 30 Jan 2023 00:00:00 +0000 https://data-se.netlify.app/2023/01/30/playing-around-with-spirographs/ 1 Load packages 2 Spiro 3 You’re my favorite 4 Reproducibility 1 Load packages library(tidyverse) # data wrangling library(spiro) library(viridisLite) 2 Spiro These images and their code are taken from the fantastic Spiro R package by W. J. Schneider. 3 You’re my favorite k <- 36 files <- paste0("s", 1:k, ".svg") pen_radii <- seq(3.8, 1.5, length.out = k) alphas <- rep_len(c(0.85, rep(0.2, 4)), k) colors <- rep_len(viridis(6, alpha = alphas, begin = 0, end = 1, direction = 1, option = "D"), k) #colors <- rep_len(scico(6, palette = "devon"), k) %>% # alpha(. Differences according to importing CSV using different functions https://data-se.netlify.app/2023/01/19/differences-according-to-importing-csv-using-different-functions/ Thu, 19 Jan 2023 00:00:00 +0000 https://data-se.netlify.app/2023/01/19/differences-according-to-importing-csv-using-different-functions/ 1 Load packages 2 Motivation 3 Data 4 Method 1: read.csv 5 Method 2: read_csv 6 Method 3: data_read 7 First glimpse 8 Hashes 9 Not exactly identical 10 Data comparison 11 Conclusion 12 Reproducibility 1 Load packages library(tidyverse) # data wrangling library(easystats) library(digest) # hashes 2 Motivation Importing a CSV file can yield slightly different results, depending on which function is used for importing the file. 
A quick demo how to compute rowwise means with the tidyverse https://data-se.netlify.app/2023/01/16/a-quick-demo-how-to-compute-rowwise-means-with-the-tidyverse/ Mon, 16 Jan 2023 00:00:00 +0000 https://data-se.netlify.app/2023/01/16/a-quick-demo-how-to-compute-rowwise-means-with-the-tidyverse/ 1 Load packages 2 Motivation 3 Minimal example 4 See also 5 Reproducibility 1 Load packages library(tidyverse) # data wrangling 2 Motivation Sometimes it is necessary to compute functions, such as mean values, rowwise, i.e., summarizing the values of multiple variables (my_vars) for each observation. 3 Minimal example For the sake of simplicity, we’ll make use of the mtcars dataset. data(mtcars) my_vars <- c("mpg", "cyl", "hp") mtcars <- mtcars |> select(all_of(my_vars)) |> rowwise() |> mutate(mtcars_score = mean(c_across(all_of(my_vars)), na. Setting to NA, conditionally https://data-se.netlify.app/2023/01/16/setting-to-na-conditionally/ Mon, 16 Jan 2023 00:00:00 +0000 https://data-se.netlify.app/2023/01/16/setting-to-na-conditionally/ 1 Load packages 2 Motivation 3 Minimal example 4 Reproducibility 1 Load packages library(tidyverse) # data wrangling 2 Motivation Let’s assume we would like to change the values of multiple variables depending on the state of another variable. For the sake of concreteness, let’s say we have some variable called data_trustworthiness. If this variable (indicating whether or not we can have confidence in some other variables) has the value FALSE for some cases, we would like to set the variables measure1 and measure2 to NA, thus reflecting that the data from our measurements are not reliable. 
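The situation described in the "Setting to NA, conditionally" entry above can be sketched with dplyr as follows. The variable names `data_trustworthiness`, `measure1`, and `measure2` come from the teaser; the toy values and the choice of `across()` with `if_else()` are assumptions for illustration, not necessarily the approach taken in the post.

```r
library(dplyr)

# NOTE: toy values, made up for illustration.
d <- tibble(
  data_trustworthiness = c(TRUE, FALSE, TRUE),
  measure1 = c(1.2, 3.4, 5.6),
  measure2 = c(10, 20, 30)
)

# Keep each measurement where the data are trustworthy, otherwise set it to NA
d2 <- d |>
  mutate(across(
    c(measure1, measure2),
    \(x) if_else(data_trustworthiness, x, NA_real_)
  ))

d2
```

Using `across()` keeps the rule in one place, so adding a further measurement column only means extending the column selection.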
Consistency of set.seed across different systems https://data-se.netlify.app/2022/12/13/consistency-of-set-seed-across-different-systems/ Tue, 13 Dec 2022 00:00:00 +0000 https://data-se.netlify.app/2022/12/13/consistency-of-set-seed-across-different-systems/ 1 Load packages 2 Motivation 3 User error 4 Your help needed 5 Same random numbers 5.1 Without seed 5.2 With seed 5.3 Using a hash 6 Seeds in regression models 6.1 lm 6.2 Stan mtcars 6.3 Stan penguins 7 Session info 1 Load packages library(tidyverse) # data wrangling library(digest) library(rstanarm) 2 Motivation Reproducibility of results is a major concern in science and industry alike. Plot timelines using ggplot https://data-se.netlify.app/2022/11/30/plot-timelines-using-ggplot/ Wed, 30 Nov 2022 00:00:00 +0000 https://data-se.netlify.app/2022/11/30/plot-timelines-using-ggplot/ 1 Load packages 2 Motivation 3 Sample data 4 Visualization 5 Debrief 6 Reproducibility 1 Load packages library(tidyverse) # data wrangling library(vistime) # timeline 2 Motivation For project planning, a visualization of some timeline is often useful. If it’s not the dates but rather the steps of a process, a graph of steps is more appropriate. However, if the sequence of steps is simple and rather linear, and the dates are the important piece of information to be transmitted, a kind of timeline graph is warranted. Accessing Google Trends https://data-se.netlify.app/2022/11/04/accessing-google-trends/ Fri, 04 Nov 2022 00:00:00 +0000 https://data-se.netlify.app/2022/11/04/accessing-google-trends/ 1 Load packages 2 Motivation 3 Restrictions and quotas 4 Access via R 5 Options 6 Get data 7 Plot it 8 Reproducibility 1 Load packages library(tidyverse) # data wrangling 2 Motivation Google Trends is, according to Wikipedia: Google Trends is a website by Google that analyzes the popularity of top search queries in Google Search across various regions and languages. 
The website uses graphs to compare the search volume of different queries over time. Programmatically plotting with ggplot2 https://data-se.netlify.app/2022/09/28/programmatically-plotting-with-ggplot2/ Wed, 28 Sep 2022 00:00:00 +0000 https://data-se.netlify.app/2022/09/28/programmatically-plotting-with-ggplot2/ 1 Setup 2 Let’s go 2.1 Way 1 2.2 Way 2 2.3 Way 2 2.4 Way 3 2.5 Way 4 3 Further reading 4 Reproducibility 1 Setup library(tidyverse) # data wrangling library(easystats) # comfort in stats data(mtcars) In essence, we want to build this kind of plot programmatically: mtcars %>% ggplot(aes(x=hp)) + geom_histogram() 2 Let’s go 2.1 Way 1 Let’s use unquoted variable names. Some ways to plot the distribution of each variable of a data frame https://data-se.netlify.app/2022/09/26/some-ways-to-plot-the-distribution-of-each-variable-of-a-data-frame/ Mon, 26 Sep 2022 00:00:00 +0000 https://data-se.netlify.app/2022/09/26/some-ways-to-plot-the-distribution-of-each-variable-of-a-data-frame/ 1 Motivation 2 Load packages 3 Load data 4 Let’s plot 4.1 Way 1 4.2 Way 2 4.3 Way 3 4.4 Way 4 5 Reproducibility 1 Motivation Oftentimes, in exploratory data analysis, one would like to plot the distribution of the relevant variables. Whereas ggplot provides handy tools to plot one variable after another, it would come in handy to plot ’em all in one go. Great open-access data sets of public interest https://data-se.netlify.app/2022/09/12/great-open-access-data-sets-of-public-interest/ Mon, 12 Sep 2022 00:00:00 +0000 https://data-se.netlify.app/2022/09/12/great-open-access-data-sets-of-public-interest/ 1 Data 1.1 Licences 1.2 General 1.3 Environment 1.4 Health 1.5 Psychology 1 Data This post lists some great open-access data sets of public or broad interest. This list is by no means comprehensive; it’s just a casual curation of interesting sources. Note that this post is not updated, so more frequent releases than documented here are possible. 
You’ll find links to repositories as well as to particular datasets, in no special order. FontAwesome in ggplot https://data-se.netlify.app/2022/07/27/fontawesome-in-ggplot/ Wed, 27 Jul 2022 00:00:00 +0000 https://data-se.netlify.app/2022/07/27/fontawesome-in-ggplot/ 1 Load packages 2 Reproducibility 1 Load packages library(tidyverse) # data wrangling Use Case Sometimes it is nice to decorate your posts with some FontAwesome Icons. The easiest way is to use {fontawesome} with the fa() function. However, to get images instead of a font, try the following approach: library(emojifont) library(patchwork) p1 <- ggplot() + geom_fontawesome("fa-bolt", color='steelblue') + theme_void() p2 <- ggplot() + geom_fontawesome("fa-rocket", color = "steelblue") + theme_void() p1 + p2 As an alternative to p1 + p2: FontAwesome in R and R Markdown https://data-se.netlify.app/2022/07/27/fontawesome-in-r-and-r-markdown/ Wed, 27 Jul 2022 00:00:00 +0000 https://data-se.netlify.app/2022/07/27/fontawesome-in-r-and-r-markdown/ 1 Load packages 2 Use Case 3 Way 4 Reproducibility 1 Load packages library(tidyverse) # data wrangling 2 Use Case Sometimes some nice emojis or icons are of benefit for your new post, right? But what’s a useful way to implement icons? 
3 Way Here’s a quick way of incorporating FontAwesome icons to your RMarkdown document: <center> <font size="15"> ```r library(fontawesome) fa("r-project", fill = "steelblue") fa("bolt-lightning", fill = "steelblue") fa("discourse", fill = "steelblue") fa("rocket", fill = "steelblue") ``` </font> </center> Which renders as: German weather https://data-se.netlify.app/2022/07/24/german-weather/ Sun, 24 Jul 2022 00:00:00 +0000 https://data-se.netlify.app/2022/07/24/german-weather/ 1 Load packages 2 Motivation 3 Load data 4 Main temperature trajectory over time 4.1 Visualization 4.2 Linear model 5 Temperature change per month 5.1 Vis 1: Change per Month for whole of Germany 5.2 Linear model 5.3 Vis 2: Trend by Bundesland 6 Change per decade 6.1 Vis 1 6.2 Vis 2: Temperature change per decade 7 Change in variability 7. Minimal tidymodels example with the Lasso https://data-se.netlify.app/2022/07/24/minimal-tidymodels-example-with-the-lasso/ Sun, 24 Jul 2022 00:00:00 +0000 https://data-se.netlify.app/2022/07/24/minimal-tidymodels-example-with-the-lasso/ 1 Intro 2 Load packages 3 Data 4 Minimal code for fitting a model 5 Results 6 Reproducibility 1 Intro In this post, we try to find a minimal setup for running/fitting a predictive model using the tidymodels approach. 2 Load packages library(tidyverse) # data wrangling library(tidymodels) 3 Data data("penguins", package = "modeldata") 4 Minimal code for fitting a model m1 <- linear_reg(engine = "glmnet", penalty = 1, mixture = 1) %>% fit(body_mass_g ~ . 
Penguins Lasso with Tidymodels https://data-se.netlify.app/2022/07/24/penguins-lasso-with-tidymodels/ Sun, 24 Jul 2022 00:00:00 +0000 https://data-se.netlify.app/2022/07/24/penguins-lasso-with-tidymodels/ 1 Load packages 2 Data 3 A bit more than minimal 4 Results 5 Extract fit 6 Reproducibility 1 Load packages library(tidyverse) # data wrangling library(tidymodels) 2 Data data("penguins", package = "modeldata") 3 A bit more than minimal rec1 <- recipe(body_mass_g ~ ., data = penguins) %>% step_dummy(all_nominal()) %>% step_normalize(all_numeric_predictors()) %>% step_nzv(all_numeric_predictors()) %>% step_naomit(all_predictors()) Checks: summary(rec1) #> # A tibble: 7 × 4 #> variable type role source #> <chr> <chr> <chr> <chr> #> 1 species nominal predictor original #> 2 island nominal predictor original #> 3 bill_length_mm numeric predictor original #> 4 bill_depth_mm numeric predictor original #> 5 flipper_length_mm numeric predictor original #> 6 sex nominal predictor original #> 7 body_mass_g numeric outcome original tidy(rec1) #> # A tibble: 5 × 6 #> number operation type trained skip id #> <int> <chr> <chr> <lgl> <lgl> <chr> #> 1 1 step dummy FALSE FALSE dummy_rc5a2 #> 2 2 step normalize FALSE FALSE normalize_U3yg4 #> 3 3 step nzv FALSE FALSE nzv_vruQ8 #> 4 4 step naomit FALSE TRUE naomit_PqP3J #> 5 5 step novel FALSE FALSE novel_6pjBL rec1 %>% prep() %>% bake(new_data = NULL) %>% head() #> # A tibble: 6 × 9 #> bill_length_mm bill_depth_mm flipper_length_mm body_mass_g species_Chinstrap #> <dbl> <dbl> <dbl> <int> <dbl> #> 1 -0. 
Preparing German weather data https://data-se.netlify.app/2022/07/24/preparing-german-weather-data/ Sun, 24 Jul 2022 00:00:00 +0000 https://data-se.netlify.app/2022/07/24/preparing-german-weather-data/ 1 Load packages 2 Motivation 3 Licence 4 It’s a playful approach 5 Download data 5.1 Air temperature means 6 Download multiple files and bind them together rowwise 7 Format to long 8 More post-processing 9 Save to disk 10 Precipitation 11 Debrief 12 Reproducibility 1 Load packages library(tidyverse) # data wrangling library(glue) 2 Motivation In this post, we’ll prepare official German weather data. Free resources for aspiring data adepts https://data-se.netlify.app/2022/06/13/free-resources-for-aspiring-data-adepts/ Mon, 13 Jun 2022 00:00:00 +0000 https://data-se.netlify.app/2022/06/13/free-resources-for-aspiring-data-adepts/ 1 Why data science? 2 Free resources overview 2.1 Machine learning concepts 2.2 Math basics 2.3 R basics 2.4 Machine learning framework with R 2.5 R online environment 2.6 Online course 2.7 Beautiful intuition 2.8 Blogs 2.9 Help 2.10 YouTube channels 1 Why data science? Data science is one of the most vibrant fields of research and industry at present. Its ubiquity and importance are likely on the rise. Before-after measurement and comparison between groups https://data-se.netlify.app/2022/06/04/vorher-nachher-messung-und-vergleich-zwischen-gruppen/ Sat, 04 Jun 2022 00:00:00 +0000 https://data-se.netlify.app/2022/06/04/vorher-nachher-messung-und-vergleich-zwischen-gruppen/ 1 Load packages 2 Research question 3 Simulated data 4 Computing the difference score 5 Visualizing 6 Descriptive statistics 7 Descriptive statistics as a nice table 8 Cohen’s d 9 Inferential statistics 10 Plotting the parameters (coefficients of the model) 11 So, is the effect large or not? 12 ROPE 13 What about R-squared? 
14 Conclusion 15 Reproducibility 1 Load packages library(tidyverse) # data wrangling library(easystats) # make stats easy again library(rstanarm) # Bayes library(gt) # nice tables 2 Research question Imagine you have conducted an experiment. Derivation of the coefficients of simple regression https://data-se.netlify.app/2022/05/23/ableitung-der-koeffizienten-der-einfachen-regression/ Mon, 23 May 2022 00:00:00 +0000 https://data-se.netlify.app/2022/05/23/ableitung-der-koeffizienten-der-einfachen-regression/ 1 What is regression? 2 How do you find the regression line? 2.1 $b_0$ 2.2 $b_1$ 2.3 Further rearrangement of $b_1$ 3 Sources 4 Conclusion library(tidyverse) 1 What is regression? This post is about simple regression (i.e., with one predictor); more precisely, about the question of how to arrive at the formulas for the coefficients of simple regression. Let us start from some two-dimensional data points that were measured for some phenomenon: $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n,y_n)\}$. A simple solution to ditch the question "what's the path of my data" when importing data to R https://data-se.netlify.app/2022/05/11/a-simple-solution-to-ditch-the-question-what-s-the-path-of-my-data-when-importing-data-to-r/ Wed, 11 May 2022 09:52:34 +0000 https://data-se.netlify.app/2022/05/11/a-simple-solution-to-ditch-the-question-what-s-the-path-of-my-data-when-importing-data-to-r/ Load packages library(tidyverse) Motivation: Get your data into R, different ways Importing data into R can cause headaches for newbies. For some, the concept of relative and absolute paths is new. That’s why I compiled here some recommendations on how to import data into R and on how to ditch the “what’s my path” problem. Pragmatic goal If you are in a hurry, just pick one way, maybe the first approach. 
Importing data into R https://data-se.netlify.app/2022/05/11/importing-data-into-r/ Wed, 11 May 2022 00:00:00 +0000 https://data-se.netlify.app/2022/05/11/importing-data-into-r/ 1 Load packages 2 Motivation: Get your data into R, different ways 3 Pragmatic goal 4 Approach 1: Quick and easy 5 Approach 2: Start an RStudio project 6 Approach 3: Import from an online source 7 Approach 4: Learn what a path means 8 Example time – dataset tips 1 Load packages library(tidyverse) 2 Motivation: Get your data into R, different ways Importing data into R can cause headaches for newbies. Comparing Jamovi and rstanarm https://data-se.netlify.app/2022/05/09/comparing-jamovi-and-rstanarm/ Mon, 09 May 2022 00:00:00 +0000 https://data-se.netlify.app/2022/05/09/comparing-jamovi-and-rstanarm/ 1 Load packages 2 Motivation 3 data 4 Model 1 4.1 rstanarm 4.2 Jamovi 5 Model 2 5.1 rstanarm 5.2 Jamovi 5.3 Interim conclusion 6 Reproducibility 1 Load packages library(tidyverse) # data wrangling 2 Motivation Let’s try to see how much the results of Jamovi (2.2.5) and rstanarm (2.21.1) converge. It’s probably difficult to say because the defaults are different, and it may not be straightforward to translate back and forth. Rowwise NA https://data-se.netlify.app/2022/05/09/rowwise-na/ Mon, 09 May 2022 00:00:00 +0000 https://data-se.netlify.app/2022/05/09/rowwise-na/ 1 Load packages 2 Sample data 3 Count NA rowwise 4 Way 1: rowwise sum with mutate and c_across 5 Way 2: apply() with margin 1 6 Way 3: rowSums 7 Way 4: cur_data() 8 Why not map()? 9 Reproducibility 1 Load packages library(tidyverse) # data wrangling 2 Sample data data("mtcars") Create some NA: mtcars$mpg[c(1,2,3)] <- NA mtcars$hp[c(1,2,3)] <- NA 3 Count NA rowwise What we would like to achieve is to comfortably count the missing values per row. 
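The goal stated in the "Rowwise NA" entry above, counting missing values per row, can be sketched with base R's `rowSums()`, which the outline lists as "Way 3". The sample data follow the teaser; the column name `n_na` is an assumption chosen for illustration.

```r
# Sample data as in the teaser: set mpg and hp to NA in the first three rows
data("mtcars")
mtcars$mpg[c(1, 2, 3)] <- NA
mtcars$hp[c(1, 2, 3)] <- NA

# is.na() returns a logical matrix; rowSums() counts the TRUE values per row.
# NOTE: the column name `n_na` is hypothetical.
mtcars$n_na <- rowSums(is.na(mtcars))

head(mtcars$n_na)
```

Since `rowSums()` works on the whole logical matrix at once, this variant avoids the per-row overhead of `rowwise()` or `apply()`.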
Empirical distribution function https://data-se.netlify.app/2022/05/02/empirische-verteilungsfunktion/ Mon, 02 May 2022 00:00:00 +0000 https://data-se.netlify.app/2022/05/02/empirische-verteilungsfunktion/ 1 R packages 2 Background 3 Distribution function of the normal distribution 4 Empirical distribution function 4.1 Tidyverse 4.1.1 Tidyverse 1 4.1.2 Tidyverse 2 4.1.3 Plotting the ECDF 4.1.4 Quantiles 4.2 Base R 4.2.1 Quantiles 4.2.2 ECDF 4.2.3 Plot 4.3 Mosaic 4.3.1 ECDF 4.3.2 Quantiles 5 Reproducibility 1 R packages library(tidyverse) # data wrangling theme_set(theme_minimal()) # style sheet for ggplot2 2 Background When examining a distribution, the distribution function $F$ and the quantile function $F^{-1}$ are important quantities. Saving energy (!) https://data-se.netlify.app/2022/05/02/fallstudie-spritverbrauch/ Mon, 02 May 2022 00:00:00 +0000 https://data-se.netlify.app/2022/05/02/fallstudie-spritverbrauch/ 1 Preparation 1.1 R packages 1.2 Research question 1.3 Experimental data 2 Description of the dataset 2.1 Questions 2.2 Conversion 3 Relationship between speed and consumption 3.1 Linearity of the relationship 3.2 Questions 3.3 Beautifying 3.4 Pros and cons of scaling 4 Consumption data 4.1 Questions about the dataset 4.2 Description of the dataset 4.2.1 Interpretation of the distribution 4.3 Empirical distribution function 4.3.1 Questions 4.4 Quantiles of the empirical distribution 4. Contingency tables in R https://data-se.netlify.app/2022/05/02/kontigenztabellen-in-r/ Mon, 02 May 2022 00:00:00 +0000 https://data-se.netlify.app/2022/05/02/kontigenztabellen-in-r/ 1 Packages and data 2 Computing frequencies 2.1 Tidyverse 2.1.1 Univariate 2.1.2 Bivariate 2.1.3 Relative to what? 2.1.4 Contingency table via pivoting 2.2 Easystats 2.3 sjmisc 2.3.1 Contingency table 2.3.2 Proportions 2.3.3 Grouped contingency table 2.4 Base R 2.4.1 Contingency table 2.4.2 ftable 2.4.3 Proportions 3 Nice tables in HTML 3.1 gt 3.1.1 flat_table 3.1.2 pivot_wider 4 Exporting 4. 
3D Regression plane with scatter plot https://data-se.netlify.app/2022/04/19/3d-regression-plane-with-scatter-plot/ Tue, 19 Apr 2022 00:00:00 +0000 https://data-se.netlify.app/2022/04/19/3d-regression-plane-with-scatter-plot/ 1 Load packages 2 Define model 3 Define grid for regression plane 4 Scatter Plot 5 Scatter plot with 3D surface 6 Reproducibility 1 Load packages library(tidyverse) # data wrangling library(plotly) # interactive 3D plots 2 Define model Here’s the linear model with 2 predictors, giving us a model that can be visualized in 3D: lm1 <- lm(mpg ~ hp + disp, data = mtcars) As is standard, we’ll predict mpg.

The arithmetic mean minimizes the sum of squared deviations https://data-se.netlify.app/2022/04/08/mittelwert-minimiert-abweichungsquadrate/ Fri, 08 Apr 2022 00:00:00 +0000 https://data-se.netlify.app/2022/04/08/mittelwert-minimiert-abweichungsquadrate/ 1 Claim 2 Proof 3 Sources 1 Claim The arithmetic mean $\bar{x}=\frac{1}{n}\sum_{i=1}^n x_i$ minimizes the sum of squared deviations of the $x_i$ from a value $c$; that is, the minimizing value is the arithmetic mean: $\bar{x} = \text{arg min}_c \sum_{i=1}^n(x_i - c)^2$. In other words, the claim is that no other number yields a smaller value for this sum. Let us call the sum of squared deviations $s(c) = \sum_{i=1}^n(x_i -c)^2$.
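Before turning to the proof, the claim can be checked numerically. The following is a minimal sketch in base R; the data vector `x` is made up for illustration, and `s()` implements the sum $s(c)$ defined above.

```r
# Hypothetical example data; any numeric vector would do
x <- c(2, 4, 4, 7, 13)

# Sum of squared deviations of the x_i from a candidate value c
s <- function(c) sum((x - c)^2)

# Numerical minimization over the range of the data
opt <- optimize(s, interval = range(x))

# The numerical minimizer should agree with the arithmetic mean
# up to the tolerance of optimize()
opt$minimum
mean(x)

# Any other candidate value gives a larger sum
s(mean(x)) < s(mean(x) + 0.5)
```

A numerical check like this is no substitute for the proof, but it makes the statement concrete for one data set.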
2 Proof \[ \begin{aligned} s(c) &= \sum_{i=1}^n (x_i -c)^2 \\ &= \sum_{i=1}^n (x_i^2 - 2x_ic + c^2) \\ &= \sum_{i=1}^n x_i^2 - \sum_{i=1}^n 2x_ic + \sum_{i=1}^n c^2 \\ &= \sum_{i=1}^n x_i^2 - 2c \sum_{i=1}^n x_i + n c^2 \end{aligned} \]

The median minimizes the absolute deviations https://data-se.netlify.app/2022/04/08/median-minimiert-absolutabweichungen/ Fri, 08 Apr 2022 00:00:00 +0000 https://data-se.netlify.app/2022/04/08/median-minimiert-absolutabweichungen/ 1 Claim 2 Proof 1 3 Proof 2 4 Sources 5 Reproducibility library(tidyverse) 1 Claim The median $md$ minimizes the sum of absolute deviations of the $x_i$ from a value $c$; that is, the minimizing value is the median: $md = \text{arg min}_c \sum_{i=1}^n|(x_i - c)|$. In other words, the claim is that no other number yields a smaller value for this sum. Let us call the sum of absolute deviations $e(c) = \sum_{i=1}^n|(x_i - c)|$.

How to import GoogleSheets into R https://data-se.netlify.app/2022/04/02/how-to-import-googlesheets-into-r/ Sat, 02 Apr 2022 00:00:00 +0000 https://data-se.netlify.app/2022/04/02/how-to-import-googlesheets-into-r/ 1 Load packages 2 Motivation 3 Find your GoogleSheets File 4 Authenticate 5 Read it 6 Check 7 Rename 8 Some caveats 9 Further reading 10 Reproducibility 1 Load packages library(tidyverse) # data wrangling library(googlesheets4) # GSheets API library(gt) # html tables 2 Motivation Data sharing is of primary concern for science and, increasingly, technology. Whereas there are specialized repositories for data storage and exchange (which are very useful), at times a quicker, more makeshift solution is desirable.
Simple nomnoml in R examples https://data-se.netlify.app/2022/04/02/simple-nomnoml-in-r-examples/ Sat, 02 Apr 2022 00:00:00 +0000 https://data-se.netlify.app/2022/04/02/simple-nomnoml-in-r-examples/ 1 Load packages 2 Motivation 3 Introducing Nomnoml 4 R API 5 Adjust the size 6 Change the direction 7 Size of the HTML container 8 Save to disk 9 Load from SVG 10 Caveats 11 Reproducibility 1 Load packages library(tidyverse) # data wrangling library(nomnoml) # graphs library(magick) # render SVG image 2 Motivation Sketching diagrams such as flow charts is a useful thing. Visualizing variation in data, simple ideas https://data-se.netlify.app/2022/04/02/visualizing-variation-in-data-simple-ideas/ Sat, 02 Apr 2022 00:00:00 +0000 https://data-se.netlify.app/2022/04/02/visualizing-variation-in-data-simple-ideas/ 1 Load packages 2 Simulate data 3 Plot 1 4 Plot 2 5 Plot 3 6 Plot 4 7 Reproducibility 1 Load packages library(tidyverse) # data wrangling 2 Simulate data low_spread <- tibble(var = rnorm(n = 100), id = 1:100, type = "low spread") high_spread <- tibble(var= rnorm(n = 100, sd = 10), id = 1:100, type = "high spread") d <- low_spread %>% bind_rows(high_spread) 3 Plot 1 ggplot(d) + aes(x = id, y = var) + facet_wrap(~ type) + geom_hline(yintercept = 0, color = "grey40") + geom_point() + theme_minimal() 4 Plot 2 ggplot(d) + aes(x = type, y = var) + geom_boxplot() 5 Plot 3 ggplot(d) + aes(x = var, fill = type) + geom_density(alpha = . Simulation des wiederholten Stichprobenziehens https://data-se.netlify.app/2022/03/28/simulation-des-wiederholten-stichprobenziehens/ Mon, 28 Mar 2022 00:00:00 +0000 https://data-se.netlify.app/2022/03/28/simulation-des-wiederholten-stichprobenziehens/ 1 Vorbereitung 2 Kann man wirklich von einer Stichprobe auf eine Grundgesamtheit schließen? 
3 Hier ist eine Population 4 Wir ziehen eine Stichprobe 5 Moment 6 Also gut, ziehen wir viele Stichproben 7 Zusammenfassen der Stichproben 8 Visualisierung 9 Fazit 10 Reproduzierbarkeit 1 Vorbereitung library(tidyverse) # Datenjudo library(infer) # Inferenzstatistik 2 Kann man wirklich von einer Stichprobe auf eine Grundgesamtheit schließen? Alle Welt behauptet, dass man von einer Stichprobe auf eine Grundgesamtheit schließen könne.

Streaming from the lecture hall, a simple approach https://data-se.netlify.app/2022/03/23/streaming-aus-den-h%C3%B6rsaal-ein-einfacher-ansatz/ Wed, 23 Mar 2022 00:00:00 +0000 https://data-se.netlify.app/2022/03/23/streaming-aus-den-h%C3%B6rsaal-ein-einfacher-ansatz/ 1 Background 2 Warning 3 Equipment 4 Setup 5 Streaming in the lecture hall 6 No camera 7 Procedure 8 Problems and solutions 8.1 No sound 8.2 Battery empty 8.3 The online students cannot hear what the in-person students say 8.4 Echo 8.5 Writing on the board does not work 9 Conclusion 1 Background At many universities, in-person teaching is currently returning, or is at least the declared order of the day.

Programming the tidyverse: quoted and unquoted parameters https://data-se.netlify.app/2022/03/11/programming-the-tidyverse-quoted-and-unqouted-parameters/ Fri, 11 Mar 2022 00:00:00 +0000 https://data-se.netlify.app/2022/03/11/programming-the-tidyverse-quoted-and-unqouted-parameters/ 1 Load packages 2 Motivation 3 First: Quoted (string) parameter 4 Second: Unquoted parameter 5 Check 6 Bonus 1 Load packages library(tidyverse) # data wrangling 2 Motivation If a project reaches some level of complexity, sooner or later more systematic coding measures need to be employed. In the tidyverse ecosystem, programming (as opposed to interactive use) may feel different or unusual, and it may take some time to wrap your head around it.
Data sets for teaching https://data-se.netlify.app/2022/02/23/data-sets-for-for-teaching/ Wed, 23 Feb 2022 00:00:00 +0000 https://data-se.netlify.app/2022/02/23/data-sets-for-for-teaching/ 1 Load packages 2 Data 3 Data repositories 4 How to import into R 5 Reproducibility 1 Load packages library(tidyverse) # data wrangling 2 Data Here’s an opinionated list of data sets useful for teaching purposes: mtcars (csv, doc), tips (csv, doc), flights (NYCflights13; csv, doc), Saratoga houses (csv, doc), diamonds (csv, doc), wo_men (csv), OECD well-being (source), penguins (csv, doc), Ames housing (source on Kaggle, login needed), teacher rating (csv, doc). Note that the data sets are provided as standard CSV files (comma-separated, with dots as decimal delimiters).

Logistic regression (glm) models the second level https://data-se.netlify.app/2022/02/11/die-logististische-regression-glm-modelliert-die-zweite-stufe/ Fri, 11 Feb 2022 00:00:00 +0000 https://data-se.netlify.app/2022/02/11/die-logististische-regression-glm-modelliert-die-zweite-stufe/ Which level does logistic regression model in R? Say we want to predict whether a person is a woman or a man (only these two levels) from the size of the tip that person gives. For this we use the function glm() in R. Preparation library(tidyverse) ## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ── ## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4 ## ✓ tibble 3.1.6 ✓ dplyr 1.0.8 ## ✓ tidyr 1.

tidyeval, some musings on dplyr::filter https://data-se.netlify.app/2022/02/09/tidyeval-some-musings-on-dplyr-filter/ Wed, 09 Feb 2022 00:00:00 +0000 https://data-se.netlify.app/2022/02/09/tidyeval-some-musings-on-dplyr-filter/ Programming with the tidyverse … is not exactly self-evident. It takes some getting used to, at least in my experience. In this post, we explore some aspects of programming when filtering rows. Let’s see.
Setup library(tidyverse) Some filtering chunk Let’s say we would like to filter observations according to some variable and a given threshold in some data set: mtcars %>% filter(hp > 200) ## mpg cyl disp hp drat wt qsec vs am gear carb ## Duster 360 14.

Checking Moodle test log data https://data-se.netlify.app/2022/02/08/checking-moodle-test-log-data/ Tue, 08 Feb 2022 00:00:00 +0000 https://data-se.netlify.app/2022/02/08/checking-moodle-test-log-data/ Motivation: Let’s check whether server blackouts seem probable After one particular exam, a student complained that Moodle had not been responding during a specified time period. In this post, we’ll check whether we find evidence for or against a server outage. Setup library("tidyverse") ## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ── ## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4 ## ✓ tibble 3.1.6 ✓ dplyr 1.0.7 ## ✓ tidyr 1.

Erbie: Simple, reproducible examples of your problem with (R) syntax https://data-se.netlify.app/2022/01/31/erbie-einfache-reproduzierbare-beispiele-ihres-problems-mit-r-syntax/ Mon, 31 Jan 2022 00:00:00 +0000 https://data-se.netlify.app/2022/01/31/erbie-einfache-reproduzierbare-beispiele-ihres-problems-mit-r-syntax/ Help, my R isn’t working? What should I do? Suppose you have a problem with R … Or, more precisely, you have a problem with a particular piece of R syntax (whether R can also have a problem with us is not known). In any case, you want R to do something specific. But it doesn’t. Now you could try yelling at it; machines endure that patiently. You could throw the computer out of the window, which might also bring relief … That really showed it.
Bayes in fünf Minuten, für Fortgeschrittene https://data-se.netlify.app/2022/01/28/bayes-in-f%C3%BCnf-minuten-f%C3%BCr-fortgeschrittene/ Fri, 28 Jan 2022 00:00:00 +0000 https://data-se.netlify.app/2022/01/28/bayes-in-f%C3%BCnf-minuten-f%C3%BCr-fortgeschrittene/ Das ist wieder ein Fünf-Minuten-Bayes-Kurs Sie würden gerne Bayes lernen und dafür zwischen 1-3 Wochen Zeit investieren? Dann sind Sie hier falsch. Dieser Post zeigt einen Kurzüberblick in leicht fortgechrittenen Bayes-Statistik in fünf Minuten. Naja, ich probiere es jedenfalls. Forschungsfrage Sagen wir, uns interessiert folgende Forschungsfrage, die mit Methoden der Inferenz-Statistik untersucht werden soll. In diesem Fall Bayes-Inferenz (nicht Frequentistische Statistik). Verbrauchen Autos mit Automatik-Getriebe im Durchschnitt mehr Sprit als Autos mit manuellem Getriebe? Bayes-Software installieren für R https://data-se.netlify.app/2022/01/28/bayes-software-installieren-f%C3%BCr-r/ Fri, 28 Jan 2022 00:00:00 +0000 https://data-se.netlify.app/2022/01/28/bayes-software-installieren-f%C3%BCr-r/ R und seine Freunde installieren. Schauen Sie, dass Sie zuerst R und seine Freunde installiert haben. Bayes-Software Bayes-Inferenz kann rechenintensiv sein. Daher braucht’s Software, die schnell rechnen kann. Außerdem sollte die Software sich gut mit Wahrscheinlichkeitsrechnung auskennen, denn Bayes ist nichts anderes als angewandte Wahrscheinlichkeitsrechnung. Aktuell ist die Software Stan die führende Software für diesen Zweck. Bevor Sie aber Stan installieren können, brauchen Sie eine (Software für eine) “schnelle Rechenmaschine” auf Ihrem Computer installiert. Kurs "Bayes:Start" https://data-se.netlify.app/2022/01/28/kurs-bayes-start/ Fri, 28 Jan 2022 00:00:00 +0000 https://data-se.netlify.app/2022/01/28/kurs-bayes-start/ Bayes lernen Sie möchten (oder müssen) Bayes lernen? Hier ist ein Kurs dazu. Alle Materialien des Kurses sind frei verfügbar, können kostenfrei genutzt werden und sind quelloffen. 
Kurs: "Vorhersage-Modellierung" https://data-se.netlify.app/2022/01/28/kurs-vorhersage-modellierung/ Fri, 28 Jan 2022 00:00:00 +0000 https://data-se.netlify.app/2022/01/28/kurs-vorhersage-modellierung/ Einführung in die Vorhersage-Modellierung 🔮 Ein Kurs zur Grundlagen der Datenanalyse und der Vorhersage-Modellierung mit R. Hier geht es zum Kurs. Alle Materialien des Kurses sind frei verfügbar, können kostenfrei genutzt werden und sind quelloffen. Wissenschaftliche Notation in R an und ausstellen https://data-se.netlify.app/2022/01/28/wissenschaftliche-notation-in-r-an-und-ausstellen/ Fri, 28 Jan 2022 00:00:00 +0000 https://data-se.netlify.app/2022/01/28/wissenschaftliche-notation-in-r-an-und-ausstellen/ Wissenschaftliche Notation, was is das? Zahlen können in der “fixierten” oder normalen Notation geschrieben sein: 1 ## [1] 1 oder 10 ## [1] 10 und so weiter. Die sog. wissenschaftliche Notation von Zahlen sieht so aus: ## [1] 1e+15 Die wissenschaftliche Notation dieser großen Zahl sagt uns: “Das ist eine Zahl, die mit der Ziffer 1 beginnt und dann folgen 15 Nullen”. Das e steht für Exponent. Eigentlich nutzt der Computer die typische Taschenrechner-Schreibweise, von dem, was in der Mathe so geschrieben würde: Bayes in fünf Minuten https://data-se.netlify.app/2022/01/27/bayes-in-f%C3%BCnf-minuten/ Thu, 27 Jan 2022 00:00:00 +0000 https://data-se.netlify.app/2022/01/27/bayes-in-f%C3%BCnf-minuten/ Das ist ein Fünf-Minuten-Bayes-Kurs Sie würden gerne Bayes lernen und dafür zwischen 1-3 Wochen Zeit investieren? Dann sind Sie hier falsch. Dieser Post zeigt einen Kurzüberblick in Bayes-Statistik in fünf Minuten. Naja, ich probiere es jedenfalls. Forschungsfrage Sagen wir, uns interessiert folgende Forschungsfrage, die mit Methoden der Inferenz-Statistik untersucht werden soll. In diesem Fall Bayes-Inferenz (nicht Frequentistische Statistik). Verbrauchen Autos mit Automatik-Getriebe im Durchschnitt mehr Sprit als Autos mit manuellem Getriebe? 
Visualizing error distribution in regression analysis https://data-se.netlify.app/2022/01/27/visualizing-residual-distribution-in-regression-analysis/ Thu, 27 Jan 2022 00:00:00 +0000 https://data-se.netlify.app/2022/01/27/visualizing-residual-distribution-in-regression-analysis/ Errors and residuals in Regression A residual is defined as $r_i = y_i - X_i \hat{\beta}$. That is, a residual is a tangible thing in the sense that it describes observables (cf. Gelman 2021, chap. 11.3, p. 161). In other words, the residuals are the differences between observed and predicted values. In contrast, the error term is defined as the difference between the observed value and the true (unobserved) value:

Why Bayes instead of frequentism? https://data-se.netlify.app/2022/01/27/warum-bayes/ Thu, 27 Jan 2022 00:00:00 +0000 https://data-se.netlify.app/2022/01/27/warum-bayes/ A plea for Bayes This post is a plea for using Bayesian statistics in statistics education and in applied research. None of the arguments presented here is new. The topic has been discussed a thousand times, often more extensively and systematically, indeed better, than in this post. Here I briefly summarize my view and point to further literature. I don’t know Bayesian inference! Classical statistics training in the social sciences usually includes little or no Bayes.

Visualizing a log-y regression model https://data-se.netlify.app/2022/01/14/visualizing-a-log-y-regression-model/ Fri, 14 Jan 2022 00:00:00 +0000 https://data-se.netlify.app/2022/01/14/visualizing-a-log-y-regression-model/ Setup library(tidyverse) data(mtcars) Using a log-Y regression Gelman et al., in “Regression and other stories”, state that “when additivity and linearity are not reasonable assumptions” it may make sense to “take the logarithms of outcomes that are all positive” (p. 189).
A log-y regression can be defined as follows, in the simplest case: \[\text{log} \, y = b_0 + b_1X_1 + \ldots + \epsilon\] Exponentiating both sides yields

Simulation on controlling confounders https://data-se.netlify.app/2021/12/01/simulation-on-controlling-confounders/ Wed, 01 Dec 2021 00:00:00 +0000 https://data-se.netlify.app/2021/12/01/simulation-on-controlling-confounders/ Confounder A confounder is one of the few (maybe three) “atoms” of causality, following the framework of Judea Pearl and others. A confounder can be depicted like this: The example follows a study that reported a strong correlation between chocolate consumption and Nobel prizes. Simulating a confounder structure Now let’s simulate a simple confounder structure. Here’s some code that will help us: Let’s have a look at the code:

Installing R and its friends https://data-se.netlify.app/2021/11/30/installation-von-r-und-seiner-freunde/ Tue, 30 Nov 2021 00:00:00 +0000 https://data-se.netlify.app/2021/11/30/installation-von-r-und-seiner-freunde/ 1 Overview 2 Version info and updates 3 Installation 3.1 R 3.2 RStudio 3.3 RStudio Cloud 3.3.1 Create an account 3.3.2 Projects 3.4 R packages 3.4.1 What are R packages? 3.4.2 Which R packages do I need? 3.5 Notes 4 When nothing else helps … 1 Overview We do not want to deal here with the question Why R? (nor with Why, R?). Instead, this page is meant to help you install R and everything else that goes with it.
Analyse einiger RKI-Coronadaten: Eine reproduzierbare Fallstudie https://data-se.netlify.app/2021/11/27/analyse-der-rki-coronadaten/ Sat, 27 Nov 2021 00:00:00 +0000 https://data-se.netlify.app/2021/11/27/analyse-der-rki-coronadaten/ 1 R-Pakete 2 Hintergrund 3 Inzidenzen in Deutschland - Daten vom RKI 4 Hospitalisierungen in Deutschland 4.1 Adjustierte Daten 4.1.1 Daten importieren 4.1.2 EDA 4.2 Unadjustierte Daten 4.2.1 Daten importieren 4.2.2 EDA 5 Impfungen in Deutschland 5.1 Neueste Daten 5.1.1 Daten laden 5.1.2 EDA 5.2 Impfquoten im Zeitverlauf 5.2.1 Daten laden 1 5.2.2 Daten laden 2 5. Jedes dritte Corona-Tote ist geimpft, also bringt Impfen nix? Falsch. https://data-se.netlify.app/2021/11/15/jedes-dritte-intensivbett-mit-geimpften-belegt-also-bringt-impfen-nix-falsch/ Mon, 15 Nov 2021 00:00:00 +0000 https://data-se.netlify.app/2021/11/15/jedes-dritte-intensivbett-mit-geimpften-belegt-also-bringt-impfen-nix-falsch/ 1 Der dritte Corona-Tote geimpft?! 2 tl;dr 3 Frage: Wie viele Menschen mit Corona wurden ins Krankenhaus eingeliefert? 4 Antwort: Hospitalisierungsquoten lagen jüngst zwischen 1% und 15% 5 Frage: Ist jede Dritte Corona-Tote geimpft? 6 Antwort: Ja, diese Zahl stimmt oder ist höher 7 Frage: Wenn es so viele geimpfte Corona-Opfer gibt, dann ist die Impfung also kaum wirksam? 8 Antwort: Der Anteil der Impfdurchbrüche ist abhängig von der Impfquote 9 Frage: Wie hoch ist die Impfquote unter den Coronatoten? 
Simulation sample and interval sizes for proportions https://data-se.netlify.app/2021/09/16/simulation-sample-and-interval-sizes-for-proportions/ Thu, 16 Sep 2021 00:00:00 +0000 https://data-se.netlify.app/2021/09/16/simulation-sample-and-interval-sizes-for-proportions/ 1 Exemplary Research question 2 Task definition 3 Technical setup 4 Define constants 5 Prepare data frame for the simulation 6 Simulation 7 Check some distributions 8 Minimum sample size 9 Plot results 10 Summary 11 Discussion 12 Suggested reading Bibliography 1 Exemplary Research question What is the sample size needed to estimate the proportion of the event “high quality study” with an error margin of ±5% and a confidence level of 99%?

MAD SD and 1.483 https://data-se.netlify.app/2021/08/11/mad-sd-und-1-483/ Wed, 11 Aug 2021 00:00:00 +0000 https://data-se.netlify.app/2021/08/11/mad-sd-und-1-483/ Setup library(tidyverse) library(mosaic) Aren’t we all a little bit MAD? The MAD, or median absolute deviation, is a robust measure of the variability (of a quantitative variable). Definition of the MAD Let $X_1, X_2, ..., X_n$ be the observations of a sample on a variable $X$. Then the MAD is defined as $\text {MAD} =\operatorname{median} (|X_{i}-{\tilde {X}}|)$, where $\tilde{X}$ is the median of the sample. In other words, the MAD is the median of the absolute values of the residuals. Robust? Put briefly (and simplified), robust means that the statistic is not (overly) influenced by extreme values.
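The definition and the robustness property can be illustrated with a short base R sketch. The simulated data and the variable names are assumptions made here for illustration; the scaling constant 1.4826 is the default of `stats::mad()`, which corresponds to the rounded 1.483 in the post title.

```r
set.seed(42)

# Hypothetical sample from a normal distribution with sd = 15
x <- rnorm(1000, mean = 100, sd = 15)

# MAD as defined above: median of the absolute deviations from the median
mad_raw <- median(abs(x - median(x)))

# Scaled by 1.4826, the MAD estimates the standard deviation for normal data
mad_sd <- 1.4826 * mad_raw

# stats::mad() applies the same constant by default
all.equal(mad_sd, mad(x))

# Robustness: a single extreme value inflates sd() but barely moves the MAD
x_out <- c(x, 1e6)
c(sd = sd(x), sd_outlier = sd(x_out))
c(mad = mad(x), mad_outlier = mad(x_out))
```

Calling `mad(x, constant = 1)` returns the unscaled version, which matches the plain definition.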
Vergleich verschiedener Signifikanztstests bei einem Datensatz https://data-se.netlify.app/2021/07/29/vergleich-verschiedener-signifikanztstests-bei-einem-datensatz/ Thu, 29 Jul 2021 00:00:00 +0000 https://data-se.netlify.app/2021/07/29/vergleich-verschiedener-signifikanztstests-bei-einem-datensatz/ 1 Setup 2 Fallbeispiel 3 Datensatz 3.1 Funktion, um Daten zu simulieren 3.1.1 Kleine Stichprobe 3.1.2 Große Stichprobe 3.2 Unit Testing 4 Signifikanz-Tests 4.1 Simulationsbasierte Inferenz (SBI) 4.1.1 Kleine Stichprobe 4.1.2 Große Stichprobe 4.2 $\chi^2$-Test 4.2.1 Kleine Stichprobe 4.2.2 Große Stichprobe 4.3 Binomialtest 4.3.1 kleine Stichprobe 4.3.2 Große Stichprobe 4.4 Logistische Regression 5 Bayes-Test mit gleichverteilter Priorverteilung und MCMC-Sampler 5. Links in Markdown-Tabellen https://data-se.netlify.app/2021/07/14/links-in-markdown-tabellen/ Wed, 14 Jul 2021 00:00:00 +0000 https://data-se.netlify.app/2021/07/14/links-in-markdown-tabellen/ 1 Hintergrund 2 Beispiel-Daten laden 3 Daten aufbereiten 4 Tabelle 1: gt() 5 Tabelle 2: kable() 6 Tabelle 3: pander 7 Tabelle 4: datatable() 8 Fazit: library(tidyverse) library(gt) library(here) 1 Hintergrund Tabellen in Markdown sind mitunter nervig zu erstellen. Am einfachsten ist es, wenn die Daten in Form einer CSV- oder Excel-Tabelle vorliegen. Tipp: Große Mengen von (nur) Text (keine Zahlen) sind vielleicht besser nicht in Form einer Tabelle, sondern einer Liste anzuführen. 
Metadaten von Forschungsartikeln herunterladen https://data-se.netlify.app/2021/07/08/metadaten-von-forschungsartikeln-herunterladen/ Thu, 08 Jul 2021 00:00:00 +0000 https://data-se.netlify.app/2021/07/08/metadaten-von-forschungsartikeln-herunterladen/ 1 Vorbereitung 2 Via Crossref 2.1 Abfragen, einfach 3 Filter 3.1 Anzahl 4 Dois rausziehen 5 Zitationen herunterladen 6 Abstracts herunterladen 6.1 “Safely” Abstracts herunterladen 6.2 Artikel nur mit Abstracts 6.3 Abstract mit cr_abstract 6.4 Check 6.5 Auf einen Haps 7 Andere APIs 7.1 Google Scholar hat keine API, wie es aussieht 7.1.1 Weitere API 1 Vorbereitung library(tidyverse) library(printr) library(rcrossref) library(gt) 2 Via Crossref Von der Crossref-Webseite: Zeitungsartikel per API herunterladen https://data-se.netlify.app/2021/07/07/zeitungsartikel-per-api-herunterladen/ Wed, 07 Jul 2021 00:00:00 +0000 https://data-se.netlify.app/2021/07/07/zeitungsartikel-per-api-herunterladen/ library(tidyverse) library(newsanchor) library(printr) library(httr) library(jsonlite) News API Es gibt eine Seite News API, die es erlaubt, per API News (Artikel, Schlagzeilen) von weltweiten Quellen herunterzuladen, per JSON API. Gibt’s da auch ein R-Paket? Ja - NewsAnchor. Setup Zuerst muss man sich bei der Seite eine API Key holen, für Entwicklerzwecke kostenlos. Komfortabel ist, sich den Schlüssel in die R-environment-Datei (.Renviron) zu schreiben, s. hier für mehr Infos. Vorhersage-Modellierung des Diamantenpreises https://data-se.netlify.app/2021/07/06/diamantenpreis-vorhersagen/ Tue, 06 Jul 2021 00:00:00 +0000 https://data-se.netlify.app/2021/07/06/diamantenpreis-vorhersagen/ 1 Vorbereitung 1.1 Forschungsfrage 1.2 Aufgabe 1.3 Pakete laden 1.4 Daten laden 1.5 ID-Spalte ergänzen 2 Vorwissen 3 Woran erkennt man einen “starken Haupteffekt”? 
4 Wichtige Prädiktoren 4.1 train2 4.2 test2 5 Feature Engineering 5.1 train3/test3 5.2 Korrelation 5.3 preds_important 6 Funktionale Form der Zusammenhänge 7 Filtern 7.1 train4 8 Transformationen 8.1 Log-Transformation (train5) 8.2 z-Transformation (train6) 9 Vorhersage-Modellierung 9.

Diagrams with mermaid https://data-se.netlify.app/2021/07/01/diagrams-with-mermaid/ Thu, 01 Jul 2021 00:00:00 +0000 https://data-se.netlify.app/2021/07/01/diagrams-with-mermaid/ Setup library(tidyverse) library(DiagrammeR) Separating concept and appeal It can be useful to separate the content or concept from its graphical/visual implementation. For this reason, slide shows have disadvantages: You spend a lot of time dragging arrows and boxes. This time would be better spent thinking about why and where to move your arrows and boxes. In addition, software that intermingles concept and representation typically creates vendor lock-in: You cannot (easily) get out if you find some more useful software.

Talk on the quantitative method in the sciences https://data-se.netlify.app/2021/06/25/talk-on-the-the-quantitative-method-in-the-sciences/ Fri, 25 Jun 2021 00:00:00 +0000 https://data-se.netlify.app/2021/06/25/talk-on-the-the-quantitative-method-in-the-sciences/ The slides of my talk on the use of quantitative methods in the (neuro) sciences can be downloaded here (PDF file). Licensed under CC-BY-SA.

Talent and Looks -- Collider bias https://data-se.netlify.app/2021/06/24/talent-and-looks-collider-bias/ Thu, 24 Jun 2021 00:00:00 +0000 https://data-se.netlify.app/2021/06/24/talent-and-looks-collider-bias/ Background Some musings on collider bias. Let’s try to reverse engineer this image Setup library(tidyverse) library(ggdag) Simulate some data n <- 1000 d <- tibble( x = rnorm(n, mean = 0, sd = 1), y = rnorm(n, mean = 0, sd = 1), e = rnorm(n, mean = 0, sd = 0.3), z = abs(x) * abs(y)) d: Uncorrelated data The farther from the centroid, the lighter the color.
Overlaying facetted histograms with normal curve using ggplot2 https://data-se.netlify.app/2021/06/23/overlaying-facetted-histograms-with-normal-curve-using-ggplot2/ Wed, 23 Jun 2021 00:00:00 +0000 https://data-se.netlify.app/2021/06/23/overlaying-facetted-histograms-with-normal-curve-using-ggplot2/ Overlaying histograms with a normal curve Overlaying a histogram (possibly facetted) with a normal curve is not far-fetched when analyzing data. Surprisingly, it appears (to the best of my knowledge) that there’s no convenient out-of-the-box solution in ggplot2, although it can of course be achieved with a few lines of code. Here’s my take. Setup library(tidyverse) Some data d <- read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/openintro/speed_gender_height.csv") ## Warning: Missing column names filled in: 'X1' [1] ## ## ── Column specification ──────────────────────────────────────────────────────── ## cols( ## X1 = col_double(), ## speed = col_double(), ## gender = col_character(), ## height = col_double() ## ) glimpse(d) ## Rows: 1,325 ## Columns: 4 ## $ X1 <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, … ## $ speed <dbl> 85, 40, 87, 110, 110, 120, 90, 90, 80, 95, 110, 90, 110, 70, 10… ## $ gender <chr> "female", "male", "female", "female", "male", "female", "female… ## $ height <dbl> 69, 71, 64, 60, 70, 61, 65, 65, 61, 69, 63, 72, 70, 68, 63, 78,… d %>% slice_head(n = 5) ## # A tibble: 5 x 4 ## X1 speed gender height ## <dbl> <dbl> <chr> <dbl> ## 1 1 85 female 69 ## 2 2 40 male 71 ## 3 3 87 female 64 ## 4 4 110 female 60 ## 5 5 110 male 70 Preparing data We’ll use a “total” histogram for the whole sample; to that end, we’ll need to remove the grouping information from the data.
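One way to prepare such data is sketched below in base R. This is an assumption about the approach, not the code of the original post: the simulated data frame stands in for the speed_gender_height data, and `d_total`, `normal_curve()` and `curves` are hypothetical names.

```r
set.seed(1)

# Hypothetical stand-in for the gender and height columns of the data set
d <- data.frame(
  gender = rep(c("female", "male"), each = 200),
  height = c(rnorm(200, mean = 65, sd = 3), rnorm(200, mean = 70, sd = 3))
)

# Copy of the data without the grouping column: this is the "total" sample
d_total <- d["height"]

# For one group, evaluate the normal density with that group's mean and sd
normal_curve <- function(df) {
  grid <- seq(min(df$height), max(df$height), length.out = 100)
  data.frame(
    gender = df$gender[1],
    height = grid,
    density = dnorm(grid, mean = mean(df$height), sd = sd(df$height))
  )
}

# One curve per group, stacked into a single data frame for plotting
curves <- do.call(rbind, lapply(split(d, d$gender), normal_curve))

head(curves, 3)
```

In ggplot2, a layer whose data lacks the faceting variable is drawn in every facet, so a histogram layer built on `d_total` would appear behind each group's panel. The rows of `curves` could then be added as a line layer, provided the histograms are drawn on the density scale rather than as counts.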
Rücktransformation logarithmierter y-Werte https://data-se.netlify.app/2021/06/18/r%C3%BCcktransformation-logarithmierter-y-werte/ Fri, 18 Jun 2021 00:00:00 +0000 https://data-se.netlify.app/2021/06/18/r%C3%BCcktransformation-logarithmierter-y-werte/ 1 Kontext 2 Vorbereitung 3 LogY-LogX-Modell 4 Modell 1 5 Vorhersage zum Beispiel aus der Fallstudie 6 Vorhersagen wie im Prognose-Wettbewerb 7 Check 1 Kontext Dieser Post bezieht sich auf diese Fallstudie. 2 Vorbereitung library(tidyverse) # Datenjudo ## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ── ## ✓ ggplot2 3.3.4 ✓ purrr 0.3.4 ## ✓ tibble 3.1.2 ✓ dplyr 1.0.6 ## ✓ tidyr 1.1.3 ✓ stringr 1. Beispiel zur Interpretation des Interaktionseffekts https://data-se.netlify.app/2021/06/17/beispiel-zur-interpretation-des-interaktionseffekts/ Thu, 17 Jun 2021 00:00:00 +0000 https://data-se.netlify.app/2021/06/17/beispiel-zur-interpretation-des-interaktionseffekts/ 1 Der Interaktionseffekt in der Regressionsanalyse 2 Vorbereitung 3 Regression mit Interaktionseffekt, die Erste 3.1 z-Transformation des Prädiktors hp 3.2 lm1 4 Regression mit Interaktionseffekt, die Zweite 4.1 Daten 4.2 Ohne z-Transformation 4.3 Mit z-Transformation 4.4 Visualisierung 4.5 Interpretation zum Vorhandensein eines Interpretationseffekts 4.6 Interpretation der Koeffizienten 4.7 Viel besser mit z-Transformation 4.8 Berechnen eines vorhergesagten Wertes mit der Hand 5 Fazit 1 Der Interaktionseffekt in der Regressionsanalyse Der Interaktionseffekt in der Regressionsanalyse ist nicht einfach zu interpretieren. 
Ein Beispiel zum Nutzen einer Log-Transformation https://data-se.netlify.app/2021/06/17/ein-beispiel-zum-nutzen-einer-log-transformation/ Thu, 17 Jun 2021 00:00:00 +0000 https://data-se.netlify.app/2021/06/17/ein-beispiel-zum-nutzen-einer-log-transformation/ 1 Vorbereitung 2 Ein unschuldiger Datensatz 3 lm1: additiv 4 lm2: multiplikativ (exponenziell) 5 Fazit 6 Take home message 1 Vorbereitung library(tidyverse) library(arm) 2 Ein unschuldiger Datensatz Gehen wir davon aus, uns ist ein Datensatz gegeben. Die Hintergründe der Entstehung verlieren sich im Dunkel. Ich habe hier einen Datensatz simuliert; diese Details können Sie getrost überspringen. Nehmen Sie den Datensatz einfach als gegeben hin. Kurzprofil: Datenvisualisierung Praxiskurs https://data-se.netlify.app/2021/06/16/datenvisualisierung-praxiskurs/ Wed, 16 Jun 2021 00:00:00 +0000 https://data-se.netlify.app/2021/06/16/datenvisualisierung-praxiskurs/ 1 Überblick 2 Inhalte 2.1 Publikationsreife Diagramme 2.2 Dashboards 2.3 Automatisierte Berichte 2.4 Daten-Apps 3 Prüfung 4 Voraussetzungen Literatur 1 Überblick In diesem Modul lernen Sie Methoden der Datenvisualisierung für den Einsatz in der (Wirtschafts-)Praxis. Der Schwerpunkt liegt auf der praktischen Fähigkeit; theoretische Grundlagen spielen eine Nebenrolle. Alle Diagramme werden mit der Programmiersprache R erstellt. 2 Inhalte 2.1 Publikationsreife Diagramme Im Folgenden sind einige Beispiele für Diagramme dargestellt, die im Unterricht besprochen und “nachgebaut” werden. ARM, Kap. 
4 Syntax im Tidyverse-Stil https://data-se.netlify.app/2021/06/15/arm-kap-4-syntax-im-tidyverse-stil/ Tue, 15 Jun 2021 00:00:00 +0000 https://data-se.netlify.app/2021/06/15/arm-kap-4-syntax-im-tidyverse-stil/ 1 Pakete laden 2 Lineare Transformationen 2.1 Daten laden: kidsiq 2.2 lm1: Interaktionseffekt 2.3 lm2: Zentrieren 2.4 lm3: z-Transformation 3 Modelle mit log(y) 3.1 Daten laden 3.2 lm4: earn_log 3.3 lm5: earn_log mit zwei Prädiktoren 3.4 lm6: Mit z-Transformation und Interaktion 4 LogY-LogX-Modelle 5 Weitere Transformationen 5.1 Diskretisierung metrischer Prädiktoren 6 “Buschbeispiel” - mesquite 6.1 Daten laden 6.2 Lineares Modell mit allen Prädiktoren 6. Vektorisierter Mittelwert in R https://data-se.netlify.app/2021/06/15/vektorisierter-mittelwert-in-r/ Tue, 15 Jun 2021 00:00:00 +0000 https://data-se.netlify.app/2021/06/15/vektorisierter-mittelwert-in-r/ Setup library(tidyverse) ## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ── ## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4 ## ✓ tibble 3.1.2 ✓ dplyr 1.0.6 ## ✓ tidyr 1.1.3 ✓ stringr 1.4.0 ## ✓ readr 1.4.0 ✓ forcats 0.5.1 ## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ── ## x dplyr::filter() masks stats::filter() ## x dplyr::lag() masks stats::lag() Einige Funktionen in R sind vektorisiert, andere nicht Einige Funktionen in R sind vektorisiert: sie führen ihren Dienst für jedes Element eines Vektors aus. ARM, Kap. 
3 Syntax im Tidyverse-Stil https://data-se.netlify.app/2021/06/05/arm-kap-3-syntax-im-tidyverse-stil/ Sat, 05 Jun 2021 00:00:00 +0000 https://data-se.netlify.app/2021/06/05/arm-kap-3-syntax-im-tidyverse-stil/ 1 Introduction 2 Load packages 3 Load data 4 One predictor 4.1 lm1: Binary predictor 4.1.1 Plot 4.1.2 Difference in means 4.2 lm2: One continuous predictor 5 Multiple predictors 5.1 lm3: Without interaction effect 5.2 lm4: With interaction effect 6 Restricted range of values 6.1 lm5: Regression model 7 Visualizing uncertainty in the model 7.1 Varying one predictor while holding the others constant 7. Normalverteilung der Residuen, nicht Normalverteilung von Y https://data-se.netlify.app/2021/06/05/normalverteilung-der-residuen-nicht-normalverteilung-von-y/ Sat, 05 Jun 2021 00:00:00 +0000 https://data-se.netlify.app/2021/06/05/normalverteilung-der-residuen-nicht-normalverteilung-von-y/ 1 Motivation 1.1 Setup 2 Data example 2.1 Simulate data 2.2 Distribution of Y 2.3 Distribution of the residuals 2.4 Y vs. X 2.5 Residuals 1 Motivation One sometimes hears that regression requires the Y variable to be normally distributed. That is not an assumption of regression. Instead, the residuals should be normally distributed. Incidentally, according to Gelman and Hill (2007), normality of the residuals is not an important assumption in many situations: the course of the regression line is not affected by the normality of the residuals (cf. Logarithmen und Exponenten in Regressionen: Wer braucht sowas? https://data-se.netlify.app/2021/05/31/logarithmen-und-exponenten-in-regressionen-wer-braucht-sowas/ Mon, 31 May 2021 00:00:00 +0000 https://data-se.netlify.app/2021/05/31/logarithmen-und-exponenten-in-regressionen-wer-braucht-sowas/ The slides of the talk (HTML version) are available here. An internet connection is needed to view the slides. The Rmd source file is available here.
Lizenz: CC-BY Analyse der Impfbereitschaft von Studentis https://data-se.netlify.app/2021/05/30/analyse-der-impfbereitschaft-von-studentis/ Sun, 30 May 2021 00:00:00 +0000 https://data-se.netlify.app/2021/05/30/analyse-der-impfbereitschaft-von-studentis/ 1 Analyse der Impfbereitschaft 2 Vorbereitung 2.1 Pakete laden 2.2 Daten laden 2.3 Daten und Variablen 2.4 Sind die Items schon umgepolt? 2.5 Liegen Mittelwerte für die Persönlichkeits-Dimensionen vor? 3 Daten verstehen 3.1 Fehlende Werte 3.2 Nominal skalierte Variablen in numerische umwandeln 3.3 Welche Variablen korrelieren mit der Impfbereitschaft? 3.4 Korrelation der Items pro Big-Five-Dimension 4 Modell mit den Big-Five-Dimensionen als Prädiktoren 5 Zur Datenqualität 6 Visualisierung 1 6. YACSDA Seitensprünge https://data-se.netlify.app/2021/05/28/yacsda-seitenspr%C3%BCnge/ Fri, 28 May 2021 00:00:00 +0000 https://data-se.netlify.app/2021/05/28/yacsda-seitenspr%C3%BCnge/ 1 Setup 2 Forschungsfrage und Hintergrund 3 ACHTUNG 4 Daten laden 5 Aufgaben 6 Los geht’s 6.1 Geben Sie zentrale deskriptive Statistiken an für Affärenhäufigkeit und Ehezufriedenheit! 6.2 Visualisieren Sie zentrale Variablen! 6.2.1 affairs 6.2.2 rating 6.3 Wer ist zufriedener mit der Partnerschaft: Personen mit Kindern oder ohne? 6.4 Wie viele fehlende Werte gibt es? Was machen wir am besten damit? 6.5 Wer ist glücklicher in der Partnerschaft: Männer oder Frauen? 
Beispiel für pivot_longer() https://data-se.netlify.app/2021/05/27/beispiel-f%C3%BCr-pivot-longer/ Thu, 27 May 2021 00:00:00 +0000 https://data-se.netlify.app/2021/05/27/beispiel-f%C3%BCr-pivot-longer/ 1 Setup 2 Load data 3 From wide to long 4 Plotting 5 Comment 1 Setup library(tidyverse) 2 Load data d <- read_csv("https://raw.githubusercontent.com/sebastiansauer/2021-sose/master/data/Impfbereitschaft/d3.csv") 3 From wide to long d2 <- d %>% select(willingness:open2) %>% pivot_longer(extra1:open2) d2 %>% slice_head(n = 7) #> # A tibble: 7 x 6 #> willingness health fear cases name value #> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> #> 1 10 9 5 1 extra1 2 #> 2 10 9 5 1 agree1 2 #> 3 10 9 5 1 cons1 3 #> 4 10 9 5 1 neuro1 2 #> 5 10 9 5 1 open1 4 #> 6 10 9 5 1 extra2 1 #> 7 10 9 5 1 agree2 4 4 Plotting d2 %>% ggplot() + aes(x = willingness, y = value) + facet_wrap(~ name) + geom_point() + geom_smooth(method = "lm") Each plot shows the association between willingness to be vaccinated and one Big Five item. Datensatz flights: Finde den Tag mit den meisten Abflügen https://data-se.netlify.app/2021/05/27/datensatz-flights-finde-den-tag-mit-den-meisten-abfl%C3%BCgen/ Thu, 27 May 2021 00:00:00 +0000 https://data-se.netlify.app/2021/05/27/datensatz-flights-finde-den-tag-mit-den-meisten-abfl%C3%BCgen/ 1 Task: Find the day with the most departures (flights)! 2 Setup 3 Load data 4 Add a date column 5 Summarise the data set 6 Maximum of column n 1 Task: Find the day with the most departures (flights)! On which day in 2013 did the most flights depart from NYC?
2 Setup library(tidyverse) # data wrangling library(nycflights13) # data library(lubridate) # dates 3 Load data data(flights) 4 Add a date column flights <- flights %>% mutate(date = date(time_hour)) 5 Summarise the data set flights2 <- flights %>% group_by(date) %>% summarise(n = n()) Synonym: Zeilenweise Operationen (tidyverse-Stil) https://data-se.netlify.app/2021/05/27/zeilenweise-operationen-tidyverse-stil/ Thu, 27 May 2021 00:00:00 +0000 https://data-se.netlify.app/2021/05/27/zeilenweise-operationen-tidyverse-stil/ 1 Task 2 Setup 3 Create data 4 Adding columns, take one 5 Adding columns, take two 6 Adding columns, take three 7 From the first column to the last column 8 Conclusion 1 Task Compute row sums! … Or row means, or some other row-wise function. 2 Setup library(tidyverse) # data wrangling 3 Create data d <- tribble( ~"x1", ~"x2", ~"x3", 1, 2, 3, 4, 5, 6, 7, 8, 9 ) d #> # A tibble: 3 x 3 #> x1 x2 x3 #> <dbl> <dbl> <dbl> #> 1 1 2 3 #> 2 4 5 6 #> 3 7 8 9 4 Adding columns, take one d %>% mutate(summe = x1 + x2 + x3) %>% mutate(mw = (x1 + x2 + x3)/3) #> # A tibble: 3 x 5 #> x1 x2 x3 summe mw #> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 1 2 3 6 2 #> 2 4 5 6 15 5 #> 3 7 8 9 24 8 Works! Modellierung Diamantenpreis 2 https://data-se.netlify.app/2021/05/25/modellierung-diamantenpreis-2/ Tue, 25 May 2021 00:00:00 +0000 https://data-se.netlify.app/2021/05/25/modellierung-diamantenpreis-2/ 1 Modelling the price of diamonds 2 Load packages 3 Load data 4 Understand the data set 5 Modelling 5.1 Model 1 5.1.1 Model fit 5.1.2 Checking the assumptions 5.2 Model 2 5.2.1 A closer look at the association 5.2.2 Log model 5.3 Model 3 5.3.1 Model fit 5.3.2 Checking the assumptions 5.4 Model 3a and 4 5.4.1 Confounding of cut and carat 5.4.2 lm3a 5.4.3 lm4: Cut as predictor 5.
Vorhersage-Modellierung des Preises von Diamanten https://data-se.netlify.app/2021/05/19/vohrersgage-modellierung-des-preises-von-diamanten/ Wed, 19 May 2021 00:00:00 +0000 https://data-se.netlify.app/2021/05/19/vohrersgage-modellierung-des-preises-von-diamanten/ 1 Background and goal 2 Load packages 3 Load data 4 Split into train and test data sets 5 EDA 6 Modelling 6.1 Model 1 6.2 Model 2 7 Prediction in the test data set 8 R squared in the test data set 9 Further considerations 10 Submission 1 Background and goal In this post, we predict the price of diamonds. Suppose you have signed on with a large online retailer and your boss would like to know what price she can expect to get for certain diamonds. Deutschlandkarten zeichnen mit R, für Anfänger https://data-se.netlify.app/2021/04/19/deutschlandkarten-zeichnen-mit-r-f%C3%BCr-anf%C3%A4nger/ Mon, 19 Apr 2021 00:00:00 +0000 https://data-se.netlify.app/2021/04/19/deutschlandkarten-zeichnen-mit-r-f%C3%BCr-anf%C3%A4nger/ 1 Load packages 2 Draw a world map 3 Draw a map of Germany 4 More 5 Reproducibility 1 Load packages library(tidyverse) # data wrangling library(maps) Adjust the ggplot theme: theme_set( theme_void() ) 2 Draw a world map world <- map_data("world") ggplot(world) + aes(x = long, y = lat, group = group) + geom_polygon(color = "white", fill = "lightgray") 3 Draw a map of Germany Select Germany from the list of countries: de <- map_data("world", region = "Germany") ggplot(de, aes(x= long, y= lat)) + geom_polygon(aes(group = group), fill = "lightgray", color = "white") + geom_polygon(aes(group = group), color = "black", fill = NA) 4 More More pointers on maps, in particular choropleth maps, can be found, e.g.
Modeling your research data: A crash course using R https://data-se.netlify.app/2021/03/24/modeling-your-research-data-a-crash-course-using-r/ Wed, 24 Mar 2021 00:00:00 +0000 https://data-se.netlify.app/2021/03/24/modeling-your-research-data-a-crash-course-using-r/ 1 Course description 2 We’re on a crash course 3 More on modelling 4 Course prerequisites 5 Learning objectives 6 Course website (book) 7 Course Literature 8 Course logistics 9 UPFRONT student preparation 10 Didactic outline 11 Schedule 11.1 Overview on topics covered 11.2 Block 1: Explorative Data Analysis 11.2.1 Visualization 11.2.2 Data Wrangling 11.2.3 Exercises / Case study 11.3 Block 2: Statistical Modelling: Basic 11. Fallstudie: Modellierung von Flugverspätungen https://data-se.netlify.app/2021/03/10/fallstudie-modellierung-von-flugversp%C3%A4tungen/ Wed, 10 Mar 2021 00:00:00 +0000 https://data-se.netlify.app/2021/03/10/fallstudie-modellierung-von-flugversp%C3%A4tungen/ 1 Hintergrund und Forschungsfrage 2 Pakete laden 3 Daten laden 4 flights2: Nicht benötigte Variablen entfernen und ID hinzufügen 5 Aufteilung in Train- und Testsample 6 flights_train2, flights_test2 7 lm0: Nullmodell 8 lm1: origin 9 lm2: All in 10 flights_train3: Textvariablen in Faktorvariablen umwandeln 10.1 flights_test3 11 flights_train4: Faktorstufen zusammenfassen 11.1 flights_test4 12 lm3: Alle zusammengefassten Faktorvariablen 13 lm4: Alle metrischen Variablen 14 lm5: Alle metrischen und alle (zusammengefassten) nominalen Variablen 15 Wetter-Daten ergänzen 15. EDA zu Flugverspätungen https://data-se.netlify.app/2021/03/08/eda-zu-flugversp%C3%A4tungen/ Mon, 08 Mar 2021 00:00:00 +0000 https://data-se.netlify.app/2021/03/08/eda-zu-flugversp%C3%A4tungen/ 1 Pakete laden 2 Hintergrund und Ziel 3 Daten laden 4 Was ist Verspätung? 4.1 Wie ähnlich sind Ankunfts- und Abflugsverspätung? 
5 Distribution of the delay 5.1 flights2: Defining extreme values (of the delay) 5.1.1 Boxplot method 5.2 Computing missing values 5.3 flights3 6 Descriptive statistics 6.1 With summarise 6.2 With skimr 7 Correlates of delay 7.1 Metric predictors 7.1.1 Using cor only 7. Estimating population effect size, some thoughts https://data-se.netlify.app/2021/03/04/estimating-population-effect-size-some-thoughts/ Thu, 04 Mar 2021 00:00:00 +0000 https://data-se.netlify.app/2021/03/04/estimating-population-effect-size-some-thoughts/ 1 Load packages 2 Motivation 3 True value of the parameter 4 Drawing samples 5 Define some constants 6 Simulating samples 6.1 Summary function to compute sample statistics 6.2 Run the function multiple ($k$) times 6.3 Run the summary function $k$ times for all sample sizes 7 Plot the results 7.1 Estimated population mean 7.2 Width of the CI 8 Summarise the results 8. How to standardize variables in R https://data-se.netlify.app/2021/02/26/how-to-standardize-variables-in-r/ Fri, 26 Feb 2021 00:00:00 +0000 https://data-se.netlify.app/2021/02/26/how-to-standardize-variables-in-r/ 1 Motivation 2 Load packages 3 Some data 4 Research question 5 Regression with unstandardized input variables 6 Standardize input variables 7 Regression with standardized input variables 8 The models (lm1 and lm2) are identical 9 Interpretation of a standardized regression coefficient 10 Reproducibility 1 Motivation Running a regression in R yields unstandardized coefficients, not standardized ones. However, as spelled out by, e.g., Gelman and Hill (2007), standardizing values is advantageous in many situations.
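The idea behind the standardization entry above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the built-in `mtcars` data and the predictors `hp` and `wt` are stand-ins, not the data or variable names used in the post. The sketch z-transforms the predictors with `scale()` and checks that the model fit is unchanged while the coefficients change units.

```r
# Illustrative sketch: mtcars, hp and wt are assumptions,
# not the data used in the post.
d <- mtcars

# Regression with unstandardized predictors
lm1 <- lm(mpg ~ hp + wt, data = d)

# z-standardize the predictors: (x - mean(x)) / sd(x)
d$hp_z <- as.numeric(scale(d$hp))
d$wt_z <- as.numeric(scale(d$wt))

# Regression with standardized predictors
lm2 <- lm(mpg ~ hp_z + wt_z, data = d)

# Same fit, different units
r2_lm1 <- summary(lm1)$r.squared
r2_lm2 <- summary(lm2)$r.squared

# A standardized slope equals the raw slope times the predictor's SD
b_hp_rescaled <- unname(coef(lm1)["hp"] * sd(d$hp))
```

Read this way, a standardized slope says how much the outcome is expected to change when the predictor increases by one standard deviation, which makes predictors on different scales easier to compare.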
Case study: data vizualization on flight delays using tidyverse tools https://data-se.netlify.app/2021/02/24/case-study-data-vizualization-on-flight-delays-using-tidyverse-tools/ Wed, 24 Feb 2021 00:00:00 +0000 https://data-se.netlify.app/2021/02/24/case-study-data-vizualization-on-flight-delays-using-tidyverse-tools/ 1 Load packages 2 Load data 3 Exercises/questions 4 Solutions 4.1 Plot the distribution of the delays. Describe your insights. 4.2 Plot the distribution of the delays per origin airport. 4.3 Visualize the association of delay and time of the day. Find a way to reduce overplotting. 4.4 Visualize the association of delay and distance to destination. Separate by origin and month. 4.5 Visualize the association of delay and time of the day. Exercises (no solutions): data vizualization on flight delays using tidyverse tools https://data-se.netlify.app/2021/02/24/exercises-no-solutions-data-vizualization-on-flight-delays-using-tidyverse-tools/ Wed, 24 Feb 2021 00:00:00 +0000 https://data-se.netlify.app/2021/02/24/exercises-no-solutions-data-vizualization-on-flight-delays-using-tidyverse-tools/ 1 Load packages 2 Get the data 2.1 Alternative way to get the data 2.2 Code book 3 Exercises 4 Solutions 5 Reproducibility 1 Load packages library(tidyverse) # data wrangling 2 Get the data We’ll be analyzing the data set flights, which describes the flights that departed from NYC in 2013.
Here’s how to get the data set: library(tidyverse) library(nycflights13) data("flights") 2.1 Alternative way to get the data Alternatively, import the data from a csv file: Exercises to data wrangling with the tidyverse https://data-se.netlify.app/2021/02/24/exercises-to-data-wrangling-with-the-tidyverse/ Wed, 24 Feb 2021 00:00:00 +0000 https://data-se.netlify.app/2021/02/24/exercises-to-data-wrangling-with-the-tidyverse/ 1 Exercise collection: Life expectancy 2 Disclosure 3 Research questions 4 First steps 5 Getting help 6 Exercises 6.1 Data Wrangling 6.2 Data Visualization 7 Solutions 8 Reproducibility library(tidyverse) 1 Exercise collection: Life expectancy Get the data from this source. gapminder_raw <- read_csv("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder-FiveYearData.csv") 2 Disclosure These exercises are based on a tutorial by Rebekka Barter. Great work! 3 Research questions How did life expectancy change over the course of the last decades?
Modelling movie successes: linear regression https://data-se.netlify.app/2021/02/24/modelling-movie-successes-linear-regression/ Wed, 24 Feb 2021 00:00:00 +0000 https://data-se.netlify.app/2021/02/24/modelling-movie-successes-linear-regression/ 1 Load packages 2 Load data 3 Research question 4 Disclaimer 5 Get overview 5.1 Descriptive statistics 5.2 Missing values 5.3 Distribution of the output variable 5.4 Distribution of the predictors 5.5 Transform budget (via logarithm) 5.6 ggscatterstats 5.7 Pivot data set 5.8 Drop unused variables 5.9 Drop cases with missing values 6 Model 0 7 Model 1: budget_log10 8 Model 2: Adding number of votes 9 Model 3: Number of votes, quadratic 10 Model 4: Number of votes, 3rd degree 11 Model 5: Multiple regression 12 Model 6: Interaction 13 Model selection: ANOVA 14 Regression diagnostics: testing the assumptions 15 Reproducibility 1 Load packages library(tidyverse) # data wrangling library(broom) # nice formatting of output library(skimr) # gives overview on descriptives library(ggfortify) # plotting regression diagnostics library(ggstatsplot) # fancy scatter plot 2 Load data Load this package to access the data set: Data Science Memes https://data-se.netlify.app/2021/02/23/data-science-memes/ Tue, 23 Feb 2021 00:00:00 +0000 https://data-se.netlify.app/2021/02/23/data-science-memes/ 1 What’s in here? 2 Some memes I like 1 What’s in here? This is a fun post. Let’s celebrate some memes about data science, statistics, and the like. 
2 Some memes I like Scraping Cochrane Reviews, some trials https://data-se.netlify.app/2021/02/19/scraping-cochrane-reviews/ Fri, 19 Feb 2021 00:00:00 +0000 https://data-se.netlify.app/2021/02/19/scraping-cochrane-reviews/ 1 Load packages 2 Parse one review 3 Parse the title 4 Parse the abstract 5 Segment the abstract 5.1 Background 5.2 Objectives 5.3 And so forth 6 Summary of Findings table 6.1 Parse node of class ‘summaryOfFindings’ 6.2 Table by ID 6.3 Looking for tables 7 Extract (Primary) Outcomes with the GRADE 7.1 Get column with outcomes 7.2 Delete non-data rows 8 Delete footer 8. Explorative Datenanalyse zum Datensatz "OECD Wellbeing" https://data-se.netlify.app/2021/02/11/explorative-datenanalyse-zum-datensatz-oecd-wellbeing/ Thu, 11 Feb 2021 00:00:00 +0000 https://data-se.netlify.app/2021/02/11/explorative-datenanalyse-zum-datensatz-oecd-wellbeing/ 1 Load packages 2 Benötigte Pakete 3 Datensatz laden 4 Erster Blick 5 Metrische Variablen einzeln (univariat) 5.1 Histogramm nach Gruppen 5.2 VERTIEFUNG: Histogramm für alle Variablen 6 Forschungsfrage 7 Datensatz filtern - nur Länder, keine Landesteile 8 Vergleich der Lebenszufriedenheit der Länder 8.1 Umwandling in eine Faktor-Variable 8.2 Ranking und Top-10-Prozent der Zufriedenheit 8.3 Vertiefung 8.4 Vertiefung 8.5 Vertiefung 9 Zusammenhang zweier metrischer Variablen – Punktediagramm 9. YACSDA: Topgear https://data-se.netlify.app/2021/02/11/yacda-topgear/ Thu, 11 Feb 2021 00:00:00 +0000 https://data-se.netlify.app/2021/02/11/yacda-topgear/ 1 Load packages 1.1 Numerischer Überblick 1.2 Wie verteilen sich die Preise? 1.3 Wie ist der Zusammenhang von Preis und Beurteilung des Autos? 2 Wie verteilt sich das Gewicht der Autos? 3 Hängt Gewicht mit Preis zusammen? 4 Wie verteilt sich die Geschwindigkeit der Autos? 5 Hängt Preis mit Geschwindigkeit zusammen? 5.1 Wie hängt Geschwindigkeit mit Beurteilung zusammen? 5.2 Welche Hersteller hat die meisten Autotypen? 
Plotting multiple plots using purrr::map and ggplot https://data-se.netlify.app/2021/02/06/plotting-multiple-plots-using-purrr-map-and-ggplot/ Sat, 06 Feb 2021 00:00:00 +0000 https://data-se.netlify.app/2021/02/06/plotting-multiple-plots-using-purrr-map-and-ggplot/ 1 Load packages 2 Sample data 3 Motivation 4 Way 1 5 Way 2 6 Way 3 7 More general 8 Introducing curly-curly 9 Reproducibility 1 Load packages library(tidyverse) # data wrangling 2 Sample data mtcars to the rescue! mtcars <- read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/datasets/mtcars.csv") 3 Motivation Say we have a data frame where we would like to plot each numeric variable’s distribution. There are a number of good solutions out there such as this one, or here, or here. Grading a prediction contest https://data-se.netlify.app/2021/01/20/grading-a-prediction-contest/ Wed, 20 Jan 2021 00:00:00 +0000 https://data-se.netlify.app/2021/01/20/grading-a-prediction-contest/ 1 Motivation 2 Setup 3 Helper functions 3.1 Function to parse data 3.2 Function to compute $R^2$ 3.3 Function to compute $MSE$ 3.4 Function to compute generalized error function 4 Import solution (true) data (i.e., solution) 5 Parse the data 6 Build master data frame 6.1 List df where each submission is one row 6.2 Change character to numeric 6.3 Add observed (true) values 6. Vorhersagen mit lm https://data-se.netlify.app/2020/12/15/vorhersagen-mit-lm/ Tue, 15 Dec 2020 00:00:00 +0000 https://data-se.netlify.app/2020/12/15/vorhersagen-mit-lm/ 1 Load packages 2 Load data 3 Research question 3.1 Prepare data 3.2 Fit the model 4 Prediction with predict() – without interval 5 Prediction with predict() – with interval 6 Reproducibility 1 Load packages library(tidyverse) # data wrangling library(moderndive) 2 Load data data(movies, package = "ggplot2movies") 3 Research question How popular can we expect an action movie released after 2000 to be if it falls in the top 10 percent of the budget distribution?
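The two `predict()` variants named in the outline of the last entry can be sketched as follows. This is a hedged illustration, not the post's code: the built-in `mtcars` data stands in for the movie data, and it is an assumption that the interval in the post is a prediction interval rather than a confidence interval.

```r
# Illustrative sketch: mtcars stands in for the movie data used in the post.
lm1 <- lm(mpg ~ wt, data = mtcars)

# New observations to predict; the column name must match the predictor
new_cars <- data.frame(wt = c(2.5, 3.5))

# Point prediction only
pred_point <- predict(lm1, newdata = new_cars)

# Point prediction plus a 95% prediction interval
# (use interval = "confidence" for the uncertainty of the mean instead)
pred_int <- predict(lm1, newdata = new_cars,
                    interval = "prediction", level = 0.95)
pred_int
```

With `interval` set, `predict()` returns a matrix with the columns `fit`, `lwr` and `upr`, one row per new observation.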
titanic-tidymodels: boost https://data-se.netlify.app/2020/12/14/titanic-tidymodels-boost/ Mon, 14 Dec 2020 00:00:00 +0000 https://data-se.netlify.app/2020/12/14/titanic-tidymodels-boost/ 1 Objective 2 Detect available cores 3 Load and prepare data 3.1 Hide details in a function 4 Split data into train and test 5 Define recipe 6 Define model 7 Define cross validation scheme 8 Define workflow 9 Define analysis and validation (oob) set 10 Fit the grid 11 View results 12 Get best model 13 Final fit (on train data) 13.1 Fit final workflow (on test data) 14 Predict test data 15 Save predictions to disk 16 Reproducibility library(tidyverse) # data wrangling library(tidymodels) # modelling library(broom) # tidy model output library(skimr) # overview on descriptives library(parallel) # multiple cores -- unix only 1 Objective Predicting the survival in the Titanic disaster. titanic-tidymodels: boost simple https://data-se.netlify.app/2020/12/14/titanic-tidymodels-boost-simple/ Mon, 14 Dec 2020 00:00:00 +0000 https://data-se.netlify.app/2020/12/14/titanic-tidymodels-boost-simple/ 1 Load packages 2 Objective 3 Load and prepare data 3.1 Hide details in a function 4 Split data into train and test 5 Define recipe 6 Define model 7 Define workflow 8 Fit the model 9 Predict the test data 10 Save csv file to disk 11 Reproducibility 1 Load packages library(tidyverse) # data wrangling library(tidymodels) # modelling 2 Objective Predicting the survival in the Titanic disaster. 
titanic-tidymodels: glm1 https://data-se.netlify.app/2020/12/14/titanic-tidymodels-glm1/ Mon, 14 Dec 2020 00:00:00 +0000 https://data-se.netlify.app/2020/12/14/titanic-tidymodels-glm1/ 1 Load packages 2 Objective 3 Load and prepare data 3.1 Hide details in a function 4 Split data into train and test 5 Define recipe 6 Define model 7 Define workflow 8 Fit the model 9 Predict the test data 10 Save csv file to disk 11 Reproducibility 1 Load packages library(tidyverse) # data wrangling library(tidymodels) # modelling library(broom) # tidy model output library(skimr) # overview on descriptives library(testthat) # unit testing 2 Objective Predicting the survival in the Titanic disaster. titanic-tidymodels: rf1 https://data-se.netlify.app/2020/12/14/titanic-tidymodels-rf1/ Mon, 14 Dec 2020 00:00:00 +0000 https://data-se.netlify.app/2020/12/14/titanic-tidymodels-rf1/ 1 Load packages 2 Objective 3 Load and prepare data 3.1 Hide details in a function 4 Split data into train and test 5 Define recipe 6 Define model 7 Define workflow 8 Fit the model 9 Predict the test data 10 Save csv file to disk 11 Reproducibility 1 Load packages library(tidyverse) # data wrangling library(tidymodels) # modelling 2 Objective Predicting the survival in the Titanic disaster. 
titanic-tidymodels: rf2 https://data-se.netlify.app/2020/12/14/titanic-itdymodels-rf2/ Mon, 14 Dec 2020 00:00:00 +0000 https://data-se.netlify.app/2020/12/14/titanic-itdymodels-rf2/ 1 Load packages 2 Objective 3 Setup 4 Load and prepare data 4.1 Hide details in a function 5 Split data into train and test 6 Define recipe 7 Define model 8 Define cross validation scheme 9 Define workflow 10 Fit the grid 11 View results 12 Get best model 13 Final fit (on train data) 13.1 Fit final workflow (on test data) 14 Predict test data 15 Save predictions to disk 16 Reproducibility 1 Load packages library(tidyverse) # data wrangling library(tidymodels) # modelling library(broom) # tidy model output library(skimr) # overview on descriptives library(parallel) # multiple cores -- unix only 2 Objective Predicting the survival in the Titanic disaster. titanic-tidymodels: tree https://data-se.netlify.app/2020/12/14/titani-tidymodels-tree/ Mon, 14 Dec 2020 00:00:00 +0000 https://data-se.netlify.app/2020/12/14/titani-tidymodels-tree/ 1 Load packages 2 Objective 3 Load and prepare data 3.1 Hide details in a function 4 Split data into train and test 5 Define recipe 6 Define model 7 Define workflow 8 Fit the model 9 Predict the test data 10 Save csv file to disk 11 Reproducibility 1 Load packages library(tidyverse) # data wrangling library(tidymodels) # modelling 2 Objective Predicting the survival in the Titanic disaster. Kaggle Notebook on the Titanic competition using tidymodels https://data-se.netlify.app/2020/12/12/kaggle-notebook-on-the-titanic-competition-using-tidymodels/ Sat, 12 Dec 2020 00:00:00 +0000 https://data-se.netlify.app/2020/12/12/kaggle-notebook-on-the-titanic-competition-using-tidymodels/ Here is a Kaggle notebook on the Titanic prediction (i.e., classification) competition.
Trying tidymodels: step_num2factor https://data-se.netlify.app/2020/12/12/trying-tidymodels-step-num2factor/ Sat, 12 Dec 2020 00:00:00 +0000 https://data-se.netlify.app/2020/12/12/trying-tidymodels-step-num2factor/ 1 Load packages 2 Understanding recipes and preprocessing 3 Load data 4 Define recipe 5 Prepare (prep()) the recipe 6 Reproducibility 1 Load packages library(tidyverse) # data wrangling library(tidymodels) # modelling 2 Understanding recipes and preprocessing Having defined a recipe in this Kaggle competition, I was left wondering about some details of the recipe definition. Let’s explore that. 3 Load data traindata_url <- "https://raw. Beispiel für eine Vorwärts-schrittweise-Regression https://data-se.netlify.app/2020/12/10/beispiel-f%C3%BCr-eine-vorw%C3%A4rts-schrittweise-regression/ Thu, 10 Dec 2020 00:00:00 +0000 https://data-se.netlify.app/2020/12/10/beispiel-f%C3%BCr-eine-vorw%C3%A4rts-schrittweise-regression/ 1 Background 2 Caution 3 Packages 4 Load data 5 Missing values 6 Model 0 7 Models with one variable (lm1) 7.1 lm1a 7.2 lm1b 7.3 lm1c 7.4 Wait a moment… 8 Automated forward regression 9 Model fit of the models with one predictor 10 Reproducibility 1 Background This exercise refers to ISRS, ch. 6.2. 2 Caution Gelman hates stepwise regression … 3 Packages library(tidyverse) # data wrangling library(broom) # tidy regression output library(skimr) # EDA library(moderndive) # convenience library(olsrr) # stepwise regression 4 Load data The data can be found on this page.
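A forward stepwise search like the one in the entry above can be sketched with base R alone. Note the swap: the post loads the olsrr package, whereas this sketch uses `stats::step()`, which selects by AIC; the `mtcars` data and the three candidate predictors are likewise assumptions for illustration only. The caution from the post applies here too: stepwise selection is a contested technique.

```r
# Illustrative sketch: stats::step() and mtcars are stand-ins; the post uses
# the olsrr package and a different data set.
lm0 <- lm(mpg ~ 1, data = mtcars)   # model 0: intercept only

# Forward selection by AIC over a small set of candidate predictors
lm_fwd <- step(lm0,
               scope = list(lower = ~ 1, upper = ~ wt + hp + qsec),
               direction = "forward",
               trace = 0)           # trace = 0 suppresses the step log

# Predictors that made it into the final model
selected <- attr(terms(lm_fwd), "term.labels")
selected
```

Starting from the null model and naming the largest admissible model in `scope` mirrors the "Model 0, then models with one variable" progression of the outline.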
Modellannahmen grafisch überprüfen https://data-se.netlify.app/2020/12/10/modellannahmen-grafisch-%C3%BCberpr%C3%BCfen/ Thu, 10 Dec 2020 00:00:00 +0000 https://data-se.netlify.app/2020/12/10/modellannahmen-grafisch-%C3%BCberpr%C3%BCfen/ 1 Background 2 Packages 3 Load data 4 Missing values 5 Model 1 6 Checking the assumptions 6.1 Linearity 6.2 Equal variance of the residuals 6.3 Normality of the residuals 7 Reproducibility 1 Background This exercise refers to ISRS, ch. 6.3. 2 Packages library(tidyverse) # data wrangling #library(broom) # tidy regression output library(skimr) # EDA library(moderndive) # convenience 3 Load data The data can be found on this page. Example for Meng's 2018 article on big data bias https://data-se.netlify.app/2020/12/09/example-for-meng-s-2018-article-on-big-data-bias/ Wed, 09 Dec 2020 00:00:00 +0000 https://data-se.netlify.app/2020/12/09/example-for-meng-s-2018-article-on-big-data-bias/ 1 Load packages 2 Motivation 3 Computing the effective sample size in the 2016 US federal elections 4 Conclusion 5 Further reading 6 Reproducibility 1 Load packages library(tidyverse) # data wrangling 2 Motivation My colleague Karsten Lübke, a first-rate statistician, pointed me to a paper … In 2018, the statistician Meng wrote a paper about biases in big data, see here. In a nutshell, he argues that the problems of non-random samples get worse as the data gets larger.
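The "effective sample size" mentioned in the outline of the last entry can be sketched as a small function. This is a rough sketch under assumptions: it uses the approximate identity n_eff ≈ f / (1 − f) / ρ², with f = n/N the sampled fraction and ρ the "data defect correlation", as it is commonly attributed to Meng (2018). The function name `n_eff` and all input values are hypothetical and are not taken from the post.

```r
# Sketch of the effective-sample-size identity attributed to Meng (2018).
# f   = n / N, the fraction of the population in the sample
# rho = "data defect correlation" between being sampled and the outcome
# All numbers below are illustrative assumptions, not results from the post.
n_eff <- function(n, N, rho) {
  f <- n / N
  f / (1 - f) / rho^2
}

N   <- 230e6   # assumed population size
n   <- 2.3e6   # assumed size of the non-random sample (f = 1%)
rho <- 0.005   # assumed data defect correlation

n_eff(n, N, rho)
```

With these assumed inputs the function returns about 404: a non-random sample of millions carries roughly the information of a simple random sample of a few hundred, which is the gist of the argument summarized above.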
Plotting a regression surface (3D) https://data-se.netlify.app/2020/12/08/plotting-a-regression-surface-3d/ Tue, 08 Dec 2020 00:00:00 +0000 https://data-se.netlify.app/2020/12/08/plotting-a-regression-surface-3d/ Load packages library(tidyverse) library(plotly) Data Some sample data data(tips, package= "reshape2") Regression model lm1 <- lm(tip ~ total_bill + size, data = tips) lm1_coef <- coef(lm1) Sequence x1_seq <- seq(min(tips$total_bill), max(tips$total_bill), length.out = 25) x2_seq <- seq(min(tips$size), max(tips$size), length.out = 6) Compute grid z2 <- t(outer(x1_seq, x2_seq, function(x,y) lm1_coef[1]+lm1_coef[2]*x+lm1_coef[3]*y)) z2 #> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] #> [1,] 1. Ex: Visualizing diamonds https://data-se.netlify.app/2020/12/07/ex-visualizing-diamonds/ Mon, 07 Dec 2020 00:00:00 +0000 https://data-se.netlify.app/2020/12/07/ex-visualizing-diamonds/ 1 Load packages 2 Load data 3 Objective 4 Plot 1 5 Plot 2 6 Plot 3: Interactive plot 7 Reproducibility 1 Load packages library(tidyverse) # data wrangling library(plotly) # make interactive JS plots library(printr) # print dataframes as tables 2 Load data data_url <- "https://vincentarelbundock.github.io/Rdatasets/csv/ggplot2/diamonds.csv" diamonds <- read_csv(data_url) glimpse(diamonds) #> Rows: 53,940 #> Columns: 11 #> $ X1 <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18… #> $ carat <dbl> 0. Comparison of R and Knime: Largish data set 1 (taxi rides 2020-06) https://data-se.netlify.app/2020/12/05/comparison-of-r-and-knime-largish-data-set-1-taxi-rides-2020-06/ Sat, 05 Dec 2020 00:00:00 +0000 https://data-se.netlify.app/2020/12/05/comparison-of-r-and-knime-largish-data-set-1-taxi-rides-2020-06/ Motivation Knime and R have their specific strengths (and weaknesses). Let’s compare the R workflow in this post with this Knime workflow. Comparison What do you think? As an old-fart R user I feel pressed to admit that Knime appears to be a useful and handy tool.
Caveat When I repeated the workflow for a larger data set, NYC yellow cabs 2019-01, my Knime seemed to get stuck (on a 2020 MacBook Pro, 16 GB machine). Execution time for largish data https://data-se.netlify.app/2020/12/05/execution-time-for-largish-data/ Sat, 05 Dec 2020 00:00:00 +0000 https://data-se.netlify.app/2020/12/05/execution-time-for-largish-data/ 1 Motivation 2 Setup 3 Data set 1 3.1 Import data 3.1.1 Download from website 3.1.2 Import from local disk 3.1.3 Using read_csv() 3.1.4 Using fread() 3.2 Data set size 3.3 Typical data wrangling 4 Data set 2 4.1 Import data 4.1.1 Using read_csv() 4.1.2 Using fread() 4.2 Data set size 4.3 Typical data wrangling 4.4 Data viz 5 Reproducibility 1 Motivation In this post, we play around with some largish data set, approx. Simple Knime workflow for the Titanic Kaggle competition using a random forest model https://data-se.netlify.app/2020/12/05/simple-knime-workflow-for-the-titanic-kaggle-competation-using-a-random-forest-model/ Sat, 05 Dec 2020 00:00:00 +0000 https://data-se.netlify.app/2020/12/05/simple-knime-workflow-for-the-titanic-kaggle-competation-using-a-random-forest-model/ 1 Kaggle Competition: Titanic Disaster 2 Simple Random Forest model 3 Enjoy! 4 Reproducibility 1 Kaggle Competition: Titanic Disaster The Titanic disaster Kaggle Competition is a well-known, beginner-friendly playground for predictive modelling. 2 Simple Random Forest model Here, I present a simple Random Forest model for predicting Survival: The respective workflow can be found here. 3 Enjoy! 4 Reproducibility #> ─ Session info ─────────────────────────────────────────────────────────────────────────────────────────────────────── #> setting value #> version R version 4.
ModernDive, Chapter 05 - Exercises/Aufgaben (in Deutsch) https://data-se.netlify.app/2020/12/02/moderndive-chapter-05-exercises-aufgaben-in-deutsch/ Wed, 02 Dec 2020 00:00:00 +0000 https://data-se.netlify.app/2020/12/02/moderndive-chapter-05-exercises-aufgaben-in-deutsch/ 0.1 Overview 1 Strongest univariate predictor of the instructor rating 1.1 Task 1.2 Help 1.3 Hints 1.4 Solution 1.5 For advanced readers 2 $R^2$ for the univariate regression of score on the strongest predictor 2.1 Task 2.2 Solution 3 Visualize the univariate regression 3.1 Task 3.2 Solution 3.3 Variant 4 Comparison with the correlation 4.1 Task 4.2 Solution 5 Standardized predictors 5.1 Task 5. Comparing Knime and R: ETL_Basics https://data-se.netlify.app/2020/11/28/comparing-knime-and-r-etl-basics/ Sat, 28 Nov 2020 00:00:00 +0000 https://data-se.netlify.app/2020/11/28/comparing-knime-and-r-etl-basics/ Knime workflow R translation Setup Chunk 1: Read, sort, filter Chunk 2: group and aggregate Chunk 3: filter Chunk 4: concatenate Chunk 5: join Chunk 6: write to csv Knime workflow Consider this Knime workflow: R translation Setup library(tidyverse) library(lubridate) library(knitr) Chunk 1: Read, sort, filter datafile <- "https://raw.githubusercontent.com/sebastiansauer/sesa-blog/main/static/datasets/sales_2008-2011.csv" d <- read_csv(datafile) ## ## ── Column specification ──────────────────────────────────────────────────────── ## cols( ## product = col_character(), ## country = col_character(), ## date = col_date(format = ""), ## quantity = col_double(), ## amount = col_double(), ## card = col_character(), ## Cust_ID = col_character() ## ) glimpse(d) ## Rows: 47 ## Columns: 7 ## $ product <chr> "prod_4", "prod_3", "prod_3", "prod_3", "prod_3", "prod_3", … ## $ country <chr> "unknown", "China", "China", "China", "USA", "Brazil", "USA"… ## $ date <date> 2008-12-12, 2009-04-10, 2009-04-10, 2009-05-10, 2009-05-20,… ## $ quantity <dbl> 1, 2, 2, 2, 20, 15, 2, 2, 20, 15, 15, 1, 1, 20, 
1, 1, 25, 2,… ## $ amount <dbl> 3, 160, 160, 160, 1600, 1200, 70, 70, 1600, 600, 600, 35, 35… ## $ card <chr> NA, "N", "Y", NA, NA, NA, "Y", NA, NA, NA, "N", "Y", "Y", NA… ## $ Cust_ID <chr> "Cust_8", "Cust_2", "Cust_5", "Cust_2", "Cust_3", "Cust_7", … The date column is already recognized as a date; no transformation is needed. Comparing Knime and R: Simple Random Forest https://data-se.netlify.app/2020/11/28/comparing-knime-and-r-simple-random-forest/ Sat, 28 Nov 2020 00:00:00 +0000 https://data-se.netlify.app/2020/11/28/comparing-knime-and-r-simple-random-forest/ Knime Workflow Translate it to R! Load Packages Load Data Stratified sampling Random Forest classification model in R Define and run Random Forest classification model Define learner (model) Define recipe Put workflow together Fit the model to the train data OOB results Model results in test data Multiclass accuracy ROC Confusion Matrix Random Forest regression model in R Define and run the model Update model to regression Define recipe Put workflow together OOB results Model results in test data Variable importance Collect performance metrics Knime Workflow Consider this Knime workflow: derivation-of-the-logistic-regression https://data-se.netlify.app/2020/11/28/derivation-of-the-logistic-regression/ Sat, 28 Nov 2020 00:00:00 +0000 https://data-se.netlify.app/2020/11/28/derivation-of-the-logistic-regression/ Logistic regression is an incredibly useful tool, partly because binary outcomes are so frequent in life (“she loves me - she doesn’t love me”), and partly because we can make use of well-known “normal” regression instruments. But the formula of logistic regression appears opaque to many (beginners or those without much math background). Let’s try to shed some light on the formula by discussing an accessible explanation of how to derive it.
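As a brief sketch of one common textbook derivation (not necessarily the route the post takes; the symbols $p$, $\beta_0$, $\beta_1$, and $x$ are notation assumed here): let $p$ be the probability of the outcome given a predictor value $x$. The odds $p/(1-p)$ range over $(0, \infty)$, so their logarithm ranges over the whole real line and can be modeled linearly: \[ \ln\frac{p}{1-p} = \beta_0 + \beta_1 x \] Exponentiating both sides and solving for $p$ gives the familiar logistic formula: \[ \frac{p}{1-p} = e^{\beta_0 + \beta_1 x} \quad\Rightarrow\quad p = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}} \]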
Simple derivation of linear regression coefficients https://data-se.netlify.app/2020/11/18/simple-derivation-of-linear-regression-coefficients2/ Wed, 18 Nov 2020 00:00:00 +0000 https://data-se.netlify.app/2020/11/18/simple-derivation-of-linear-regression-coefficients2/ Load packages library(tidyverse) Motivation The (simple) linear regression is a standard tool in data analysis and statistics. Its properties are well known, though the applied analyst may not know them in detail, which is ok. However, if one wishes to understand the internals more deeply, the question may arise of how to derive the coefficients of the linear regression. Here’s one way. This approach focuses on simple calculus and derivatives; no matrix algebra, and only the simple case of one predictor. The mean minimizes the sum of squares https://data-se.netlify.app/2020/11/18/the-mean-minimizes-the-sum-of-squares/ Wed, 18 Nov 2020 00:00:00 +0000 https://data-se.netlify.app/2020/11/18/the-mean-minimizes-the-sum-of-squares/ Load packages library(tidyverse) Properties of the arithmetic mean The material presented here is far from new; it is all well known and basic. See here for a source. 
The arithmetic mean has a number of properties … Residuals cancel out … such as that the residuals cancel out, i.e., the deviations from the mean (the residuals) sum to zero: \[\sum (x_i - \bar{x}) = \sum x_i - \sum \bar{x} = n \cdot \bar{x} - n \cdot \bar{x} = 0\] Fallstudie zur Regressionsanalyse -- ggplot2movies https://data-se.netlify.app/2020/11/13/fallstudie-zur-regressionsanalyse-ggplot2movies/ Fri, 13 Nov 2020 00:00:00 +0000 https://data-se.netlify.app/2020/11/13/fallstudie-zur-regressionsanalyse-ggplot2movies/ 1 Load packages 2 Load data 3 Research question 4 Your severability clause 5 Overview of the summary statistics 6 Missing values 7 Distribution of the output variables 8 Distribution of the input variables 9 Exploratory analysis 10 Log-transform budget 11 Reshape (pivot) the data set: moderating effect of genre 12 Correlation between the groups 13 Influence of genre 14 Simplify the data set 15 Split the data set (train and test) 16 Model 0 (“null model”) 17 Model 1: budget_log10 18 Remove infinite values 19 Model 1 results 20 Number of votes (votes) 21 Model 2: number of votes, linear 22 Model 3: number of votes as polynomial model, quadratic 23 Model 4: number of votes as polynomial model, 3. Fallstudie zur Datenvisualisierung -- Datensatz "flights" https://data-se.netlify.app/2020/11/12/fallstudie-zur-datenvisualisierung-datensatz-flights/ Thu, 12 Nov 2020 00:00:00 +0000 https://data-se.netlify.app/2020/11/12/fallstudie-zur-datenvisualisierung-datensatz-flights/ 1 Preparation 2 Data visualization exercises 3 Hints 4 Solutions 4.1 1. Visualize the distribution of the flight delays. 4.2 2. Visualize the distribution of the flight delays per origin airport. 4.3 3. Visualize the association between delay and time of day. Reduce overplotting while doing so. 4.4 4. Visualize the association between delay and flight distance (distance), split by origin airport and by month! 
4.5 5. Visualize the association between delay and time of day for the three airlines with the highest average delay. On a popular confidence interval myth https://data-se.netlify.app/2020/11/04/on-a-popular-confidence-interval-myth/ Wed, 04 Nov 2020 00:00:00 +0000 https://data-se.netlify.app/2020/11/04/on-a-popular-confidence-interval-myth/ Load packages Setup A story about data Confidence interval around the mean Plot the CI CI using simulation Myth time Draw many samples from the population Myth is wrong What is actually true Does this information help? Now what? UPDATE 2020-11-30 based on discussion with Norman Markgraf, see disqus below. Load packages library(tidyverse) library(mosaic) Setup data(flights, package = "nycflights13") A story about data Say we have a decent sample of $n=100$, and we would like to compute a standard, plain vanilla confidence interval (95% CI). Prove of a local optimum of k-means (exercise in Witten et al., 2013) https://data-se.netlify.app/2020/11/02/prove-of-a-local-optimum-of-k-means-exercise-in-witten-et-al-2013/ Mon, 02 Nov 2020 00:00:00 +0000 https://data-se.netlify.app/2020/11/02/prove-of-a-local-optimum-of-k-means-exercise-in-witten-et-al-2013/ Load packages library(tidyverse) The K-Means optimization reduces the variance in each iteration. To illustrate this, Witten et al. present the following identity in An Introduction to Statistical Learning (2013, p. 388, chap. 10): \[\frac{1}{|C_k|} \sum\limits_{i,i^{\prime} \in C_k} \sum\limits_{j=1}^p (x_{ij} - x_{i^\prime j})^2 = 2 \sum\limits_{i \in C_k} \sum\limits_{j=1}^{p} (x_{ij} - \bar{x}_{kj})^2\] A proof can be found here; I’ll add some explanations. Note 1. Note that $\sum\limits_{i,i^{\prime} \in C_k}(\dots)$ essentially amounts to $\sum\limits_{i \in C_k}\sum\limits_{i^{\prime} \in C_k}(\dots)$, when the order of summation does not matter. A simple solution to ditch the question "what's the path of my data?" 
when importing data to R https://data-se.netlify.app/2020/10/19/what-s-my-path/ Mon, 19 Oct 2020 00:00:00 +0000 https://data-se.netlify.app/2020/10/19/what-s-my-path/ Load packages library(tidyverse) Motivation Importing data to R can cause headaches for newbies. For some, the concept of relative and absolute paths is new. That’s why I compiled some recommendations here on how to import data into R and on how to ditch the “what’s my path” problem. Approach 1: Start an RStudio project That’s an approach I generally recommend. Start an RStudio project. Put your code files and your data files in the very folder that you just defined as your RStudio project folder. How to import data without whats-the-path-pain https://data-se.netlify.app/2020/10/19/how-to-import-data-without-whats-the-path-pain/ Mon, 19 Oct 2020 00:00:00 +0000 https://data-se.netlify.app/2020/10/19/how-to-import-data-without-whats-the-path-pain/ Load packages library(tidyverse) Visualizing decision trees https://data-se.netlify.app/2020/10/17/visualizing-decision-trees/ Sat, 17 Oct 2020 00:00:00 +0000 https://data-se.netlify.app/2020/10/17/visualizing-decision-trees/ Load packages library(tidyverse) #remotes::install_github("grantmcdermott/parttree") library(parttree) library(rpart) library(rpart.plot) library(parsnip) library(titanic) library(tidyverse) Train learner Build the tree using parsnip with rpart as the model engine: set.seed(123) titanic_train$Survived = as.factor(titanic_train$Survived) ti_tree = decision_tree() %>% set_engine("rpart") %>% set_mode("classification") %>% fit(Survived ~ Pclass + Age, data = titanic_train) Plot the model partitions titanic_train %>% ggplot(aes(x=Pclass, y=Age)) + geom_jitter(aes(col=Survived), alpha=0.7) + geom_parttree(data = ti_tree, aes(fill=Survived), alpha = 0.1) + theme_minimal() Plot the tree Help me help you: Wie man ein R-Problem so formuliert, dass einem geholfen werden kann 
https://data-se.netlify.app/2020/09/23/help-me-help-you-wie-man-ein-r-problem-so-formuliert-dass-einem-geholfen-werden-kann/ Wed, 23 Sep 2020 00:00:00 +0000 https://data-se.netlify.app/2020/09/23/help-me-help-you-wie-man-ein-r-problem-so-formuliert-dass-einem-geholfen-werden-kann/ Help is at hand here, or is it? The term paper on data analysis with R is due tomorrow evening, and nothing works! Who hasn’t been there?! The grumpy lecturer has once again set the deadline far too tight, for whatever reason. What to do now? After 3 13 30 60 minutes of one’s own (unsuccessful) tinkering, one now wants to ask the lecturer for help. So one writes: “Dear Mr. Süß, R is not working, what should I do? Mean of the upper half of a Gaussian https://data-se.netlify.app/2020/07/22/mean-of-the-upper-half-of-a-gaussian/ Wed, 22 Jul 2020 00:00:00 +0000 https://data-se.netlify.app/2020/07/22/mean-of-the-upper-half-of-a-gaussian/ Load packages library(tidyverse) library(lsr) Motivation Recently, I listened to audio recordings of some lectures by the great Paul Meehl. There, he asked the students: what is the mean value of the upper half of a Gaussian distribution? Let’s explore that using simulation techniques. Simulation time Let’s draw some instances from a standard Normal distribution, $X$. n <- 1e05 x <- rnorm(n) Mean and SD in our sample are quite close to what can be expected: Randomization in presence of an interaction effect https://data-se.netlify.app/2020/07/07/randomization-in-presence-of-an-interaction-effect/ Tue, 07 Jul 2020 00:00:00 +0000 https://data-se.netlify.app/2020/07/07/randomization-in-presence-of-an-interaction-effect/ Load packages library(tidyverse) library(rockchalk) library(MASS) library(ggdag) Problem statement Assume that $X$ and $Y$ are correlated contingent on some third variable, $Z$. For simplicity, assume that, if $z=0$, then $r_0=0.7$, and if $z=1$, then $r_1=-0.7$. This is not a causal statement. 
Simulate data Let the sample size amount to $n=1000$. n <- 1e03 Group A, $z=0$: myR <- lazyCor(X = 0.7, d = 2) mySD <- c(1, 1) myCov <- lazyCov(Rho = myR, Sd = mySD) set. First grade math exercise https://data-se.netlify.app/2020/07/03/first-grade-math-exercise/ Fri, 03 Jul 2020 00:00:00 +0000 https://data-se.netlify.app/2020/07/03/first-grade-math-exercise/ Problem statement My son, being a first grader, recently struggled with this piece of math: Consider this system of equations: \[ a + b + c = 20\\ d + e + f = 14\\ g + h + i = 11\\ a + d + g = 15\\ b + e + h = 10\\ c + f + i = 20\\ a + e + i = 20\\ g + e + c = 10\] How to sort the labels of the legend in a ggplot-diagram https://data-se.netlify.app/2020/06/26/how-to-sort-the-labels-of-the-legend-in-a-ggplot-diagram/ Fri, 26 Jun 2020 00:00:00 +0000 https://data-se.netlify.app/2020/06/26/how-to-sort-the-labels-of-the-legend-in-a-ggplot-diagram/ Load packages library(tidyverse) library(forcats) library(hrbrthemes) What we want to achieve: barplot ggplot2-diagram where bars and legend labels are sorted Say we would like to plot frequencies, and would like to use ggplot2 for that purpose. How can we get a decent graph? This post shows some ways. Some data data(diamonds) A glimpse of the data glimpse(diamonds) #> Rows: 53,940 #> Columns: 10 #> $ carat <dbl> 0. Simulationsbasierte Inferenz – Kurzfassung https://data-se.netlify.app/2020/06/26/simulationsbasierte-inferenz-kurzfassung/ Fri, 26 Jun 2020 00:00:00 +0000 https://data-se.netlify.app/2020/06/26/simulationsbasierte-inferenz-kurzfassung/ Simulation-based inference Simulation-based inference (SBI) is a variant of inferential statistics in which population estimates are derived not from theoretical distributions (such as the normal distribution) but by re-enacting an experiment on the computer. This simplifies access to inferential statistics and makes parameter calculations possible (or 
more precise) that were not possible before (without computer simulations). Slides My slides for the short version of SBI can be found here (as an HTML version). The HTML slides can only be viewed online. Introduction to Statistics: A modeling-based approach -- Course Syllabus https://data-se.netlify.app/2020/06/19/introduction-to-statistics-a-modeling-based-approach-course-syllabus/ Fri, 19 Jun 2020 00:00:00 +0000 https://data-se.netlify.app/2020/06/19/introduction-to-statistics-a-modeling-based-approach-course-syllabus/ 1 Load packages 2 Course description 3 Course prerequisites 4 Learning objectives 5 Course Literature 6 Course logistics 7 UPFRONT student preparation 8 Didactic outline 9 Schedule 9.1 Overview on topics covered 9.2 Block 1: Explorative Data Analysis 9.2.1 Visualization 9.2.2 Data Wrangling 9.2.3 Exercises / Case study 9.3 Block 2: Statistical Modelling: Basic 9.3.1 Theory 9.3.2 Case study 9.4 Block 3: Statistical Modelling: Multiple Regression and interaction 9. Simulating data for a Gamma regression https://data-se.netlify.app/2020/06/17/simulating-data-for-a-gamma-regression/ Wed, 17 Jun 2020 00:00:00 +0000 https://data-se.netlify.app/2020/06/17/simulating-data-for-a-gamma-regression/ Load packages library(tidyverse) Intro A Gamma distribution is useful for modeling positive, right-skewed data such as waiting times; it is a continuous distribution. In this post, we’ll illustrate some properties of the Gamma distribution by simulating a toy example. Simulate data and define structural model Let $X$ be a discrete variable following a uniform distribution, with $x_i \in \{1,2,3\}$. set.seed(42) n <- 1000 X <- sample(x = c(1,2,3), size = n, replace = TRUE) hist(X) Let \(y_i = 0. Absolute vs. 
relative Covid cases in modelling https://data-se.netlify.app/2020/06/10/absolute-vs-relative-covid-cases-in-modelling/ Wed, 10 Jun 2020 00:00:00 +0000 https://data-se.netlify.app/2020/06/10/absolute-vs-relative-covid-cases-in-modelling/ Load packages library(tidyverse) library(mosaic) require(scales) library(directlabels) library(ggrepel) library(ggthemes) library(hrbrthemes) options(scipen = 8) Covid-19 growth rate We are in the decline, midst, wake, onset, SOMEWHERE in the Corona crisis. A lot of hasty, more or less useful research is being conducted. One of the circulating claims is: “The Corona growth rate in country X is higher than in country Y!” Let’s assume some doubling (growth) rate: Spell out your model explicitly https://data-se.netlify.app/2020/06/10/spell-out-your-model-explicitly/ Wed, 10 Jun 2020 00:00:00 +0000 https://data-se.netlify.app/2020/06/10/spell-out-your-model-explicitly/ Load packages library(tidyverse) library(hrbrthemes) library(MASS) library(moments) Why you should spell out your model explicitly Often, assumptions of widely used models, such as linear models, appear opaque. Why is heteroscedasticity important? Where is a list of the model assumptions I need to consider? As it turns out, there are straightforward answers to these (and similar) questions. The solution is to explicitly spell out your model. All “assumptions” can easily be read off from these model specifications. Distribution of residuals is of interest for linear models, not the distribution of y https://data-se.netlify.app/2020/06/09/distribution-of-residuals-is-of-interest-for-linear-models-not-the-distribution-of-y/ Tue, 09 Jun 2020 00:00:00 +0000 https://data-se.netlify.app/2020/06/09/distribution-of-residuals-is-of-interest-for-linear-models-not-the-distribution-of-y/ Load packages library(tidyverse) library(e1071) My $y$ is not distributed according to my wishes! Let $Y$ be a variable that we would like to model, for instance, Covid-19 cases. 
Now, there’s a widely held belief that my $Y$ must be distributed normally, or, in some cases, follow some other assumed distribution (maybe some long-tailed distribution). However, this belief is not (strictly) true. What a linear model assumes is that the residuals are distributed normally, not $Y$ itself. On a confidence interval myth https://data-se.netlify.app/2020/06/05/on-a-confidence-interval-myth/ Fri, 05 Jun 2020 00:00:00 +0000 https://data-se.netlify.app/2020/06/05/on-a-confidence-interval-myth/ Load packages library(tidyverse) library(mosaic) Setup data(flights, package = "nycflights13") A story about data Say we have a decent sample of $n=100$, and we would like to compute a standard, plain vanilla confidence interval (95% CI). For the sake of having a story, assume you are the boss of the NYC airports and you are investigating the 2013 “typical” arrival delays. OK, here we go. Get the sample: Simulating values according to some distribution https://data-se.netlify.app/2020/06/05/simulating-values-according-to-some-distribution/ Fri, 05 Jun 2020 00:00:00 +0000 https://data-se.netlify.app/2020/06/05/simulating-values-according-to-some-distribution/ Load packages library(tidyverse) library(mosaic) What’s a Monte Carlo simulation? A Monte Carlo Simulation is a numeric approach to solving difficult problems. Instead of having an analytic way of solving the problem, one just says “ok, let’s try it out and see what happens”. Coin flip distribution Simulating a single coin flip (a Bernoulli trial) can be achieved like this: rflip() #> #> Flipping 1 coin [ Prob(Heads) = 0. 
Simulation based inference for non-parametric tests, and a trick https://data-se.netlify.app/2020/06/05/sbi-nonparametric/ Fri, 05 Jun 2020 00:00:00 +0000 https://data-se.netlify.app/2020/06/05/sbi-nonparametric/ Load packages library(tidyverse) library(mosaic) Data data("tips", package = "reshape2") Non-parametric tests and simulation based inference Simulation-based inference (SBI) is an old tool that has seen a surge in research interest in recent years, probably due to the large amount of computational power at the hands of researchers. SBI is less prone to violations of assumptions, particularly distributional assumptions. This is because inference is not based on the idea that some variable follows a – for example – normal distribution. Chi-squared test using simulation based inference https://data-se.netlify.app/2020/06/04/chi-squared-test-using-simulation-based-inference/ Thu, 04 Jun 2020 00:00:00 +0000 https://data-se.netlify.app/2020/06/04/chi-squared-test-using-simulation-based-inference/ Load packages library(tidyverse) Simulation based inference Simulation based inference (SBI) is an elegant way of subsuming a wide array of statistical (inference) methods under one umbrella. In addition, it’s simple, thereby helping learners get to grips with it. Here’s a summary of the central ideas. However, this post does not aim at explaining simulation based inference, which is done elsewhere. Testing the association of two categorical variables One application of statistical tests – simulation based or classical – is testing the association of two categorical variables. 
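For orientation, a sketch of the quantities usually involved in such a test (standard definitions; the notation $O_{ij}$, $E_{ij}$, and $B$ is assumed here and may differ from the post): with observed cell counts $O_{ij}$ in a two-way table with row totals $n_{i\cdot}$, column totals $n_{\cdot j}$, and grand total $n$, the test statistic is \[ \chi^2 = \sum_i \sum_j \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad E_{ij} = \frac{n_{i\cdot}\, n_{\cdot j}}{n} \] In a simulation-based version, one would typically shuffle one of the two variables $B$ times, recompute the statistic each time, and estimate the p-value as the share of shuffled statistics at least as large as the observed one: \[ \hat{p} = \frac{1}{B} \sum_{b=1}^{B} \mathbb{1}\left(\chi^2_b \ge \chi^2_{\text{obs}}\right) \]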
When adding variable hurts – The collider bias https://data-se.netlify.app/2020/06/04/when-adding-variable-hurts-the-collider-bias/ Thu, 04 Jun 2020 00:00:00 +0000 https://data-se.netlify.app/2020/06/04/when-adding-variable-hurts-the-collider-bias/ Load packages library(tidyverse) library(conflicted) library(ggdag) library(broom) library(GGally) Motivation Assume there is some scientist with some theory. Her theory holds that X and Z are causes of Y. dag1 shows her DAG (i.e., her theory depicted as a causal diagram). Our scientist is concerned with the causal effect of X on Y, where X is a treatment variable (exposure) and Y is the dependent variable under scrutiny (outcome). Plot for mean comparison https://data-se.netlify.app/2020/06/02/plot-for-mean-comparison/ Tue, 02 Jun 2020 00:00:00 +0000 https://data-se.netlify.app/2020/06/02/plot-for-mean-comparison/ Load packages library(tidyverse) library(reshape2) # for data library(mosaic) library(sjmisc) library(skimr) Data setup data(tips) Aggregate data per group tips_aggr <- tips %>% group_by(smoker) %>% summarise(tip_avg = mean(tip), tip_md = median(tip), tip_sd = sd(tip), tip_iqr = IQR(tip)) tips_aggr #> # A tibble: 2 x 5 #> smoker tip_avg tip_md tip_sd tip_iqr #> <fct> <dbl> <dbl> <dbl> <dbl> #> 1 No 2.99 2.74 1.38 1.50 #> 2 Yes 3.01 3 1. Plotting a correlated bivariate Gaussian https://data-se.netlify.app/2020/05/30/plotting-a-correlated-bivariate-gaussian/ Sat, 30 May 2020 00:00:00 +0000 https://data-se.netlify.app/2020/05/30/plotting-a-correlated-bivariate-gaussian/ Load packages library(tidyverse) library(rockchalk) library(MASS) Defining the data myR <- lazyCor(X = 0.7, d = 2) mySD <- c(1, 1) myCov <- lazyCov(Rho = myR, Sd = mySD) myR #> [,1] [,2] #> [1,] 1.0 0.7 #> [2,] 0.7 1.0 mySD #> [1] 1 1 myCov #> [,1] [,2] #> [1,] 1.0 0.7 #> [2,] 0.7 1.0 Drawing from the multivariate normal Let’s draw 1000 cases. 
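The covariance matrix printed above can be checked by hand (a short worked example; the symbols $\rho$, $\sigma_1$, and $\sigma_2$ are notation assumed here): each covariance is the correlation times the two standard deviations, $\Sigma_{ij} = \rho_{ij}\,\sigma_i\,\sigma_j$. With $\rho = 0.7$ and $\sigma_1 = \sigma_2 = 1$: \[ \Sigma = \begin{pmatrix} \sigma_1^2 & \rho\,\sigma_1\sigma_2 \\ \rho\,\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix} = \begin{pmatrix} 1 & 0.7 \\ 0.7 & 1 \end{pmatrix} \] which is why myCov equals myR in this example.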
Various methods for plotting 3d bivariate Gaussians https://data-se.netlify.app/2020/05/30/various-methods-for-plotting-3d-bivariate-gaussians/ Sat, 30 May 2020 00:00:00 +0000 https://data-se.netlify.app/2020/05/30/various-methods-for-plotting-3d-bivariate-gaussians/ Load packages library(tidyverse) Motivation This post is a rather uncommented compilation of various methods of plotting 3D (bivariate) Gaussian distributions in R. I add the source for each method. Note that some methods (5, 6) open an interactive window, which is not supported here. I added a static version of the plot in those cases. Adjustment set exercise from Elwert 2013 https://data-se.netlify.app/2020/05/19/adjustment-set-exercise-from-elwert-2013/ Tue, 19 May 2020 00:00:00 +0000 https://data-se.netlify.app/2020/05/19/adjustment-set-exercise-from-elwert-2013/ Load packages library(tidyverse) library(ggdag) library(dagitty) Define DAG I’ve drawn the DAG in dagitty.net, that’s why the coordinates look weird. dag3_str <- ' dag { bb="-2.865,-5.146,2.956,4.896" U [latet, pos="2.456,-0.958"] X [exposure, pos="-2.365,-4.309"] Y [outcome, pos="-0.271,4.059"] Z1 [pos="-0.491,-1.925"] Z2 [pos="-0.915,1.269"] Z3 [pos="1.713,1.984"] U -> Z1 U -> Z3 X -> Z1 Z2 -> Y Z2 -> Z1 Z2 -> Z3 Z3 -> Y }' Then tidify: dag3 <- dagitty(dag3_str) dag3_tidy <- tidy_dagitty(dag3) dag3_tidy #> # A DAG with 6 nodes and 7 edges #> # #> # Exposure: X #> # Outcome: Y #> # #> # A tibble: 9 x 8 #> name x y direction to xend yend circular #> <chr> <dbl> <dbl> <fct> <chr> <dbl> <dbl> <lgl> #> 1 U 2. 
Plotting equivalence class for confounder triangle https://data-se.netlify.app/2020/05/19/plotting-equivalence-class-for-confounder-triangle/ Tue, 19 May 2020 00:00:00 +0000 https://data-se.netlify.app/2020/05/19/plotting-equivalence-class-for-confounder-triangle/ Load packages library(tidyverse) library(ggdag) library(dagitty) Define DAG dag1_str <- 'dag { C [pos = "2,2"] X [exposure, pos = "1,1"] Y [outcome, pos = "3,1"] C -> X C -> Y }' Plot DAGs First tidify: dag1 <- dagitty(dag1_str) dag1_tidy <- tidy_dagitty(dag1) dag1_tidy #> # A DAG with 3 nodes and 2 edges #> # #> # Exposure: X #> # Outcome: Y #> # #> # A tibble: 4 x 8 #> name x y direction to xend yend circular #> <chr> <int> <int> <fct> <chr> <int> <int> <lgl> #> 1 C 2 2 -> X 1 1 FALSE #> 2 C 2 2 -> Y 3 1 FALSE #> 3 X 1 1 <NA> <NA> NA NA FALSE #> 4 Y 3 1 <NA> <NA> NA NA FALSE Then plot: How to find the package of a R function https://data-se.netlify.app/2020/05/15/how-to-find-the-package-of-a-r-function/ Fri, 15 May 2020 00:00:00 +0000 https://data-se.netlify.app/2020/05/15/how-to-find-the-package-of-a-r-function/ Load packages library(tidyverse) Where does my function reside? Finding the package of a given R function can be a hassle. I am not aware of a quick built-in way in R to find the package of a function. 
That’s why I came up with my own function; check it out: Install package Speaking of packages of functions, here is the package where this function lives: library(devtools) install_github("sebastiansauer/prada") Example library(prada) find_funs("select") #> # A tibble: 11 x 3 #> package_name builtin_pckage loaded #> <chr> <lgl> <lgl> #> 1 BDgraph FALSE FALSE #> 2 dplyr FALSE TRUE #> 3 jmvcore FALSE FALSE #> 4 jqr FALSE FALSE #> 5 MASS TRUE FALSE #> 6 plotly FALSE FALSE #> 7 raster FALSE FALSE #> 8 rstatix FALSE FALSE #> 9 tidygraph FALSE FALSE #> 10 tidylog FALSE FALSE #> 11 VGAM FALSE FALSE find_funs("tidy") #> # A tibble: 14 x 3 #> package_name builtin_pckage loaded #> <chr> <lgl> <lgl> #> 1 broom FALSE FALSE #> 2 broom. Statistical power: Why small effects need big samples – An intuition https://data-se.netlify.app/2020/05/15/statistical-power-why-small-effects-need-big-samples-an-intuition/ Fri, 15 May 2020 00:00:00 +0000 https://data-se.netlify.app/2020/05/15/statistical-power-why-small-effects-need-big-samples-an-intuition/ Load packages library(tidyverse) Why small effects need big samples That’s a question that periodically comes up in class. Suppose someone is planning a study. As demanded by her teacher, she computes the needed sample size upfront. So the question arises: Given some to-be-achieved level of power (80%), some effect size, and some other details: How large does my sample need to be? Some students are puzzled by the fact that small effects need large samples. Crashkurs 'Umfrageforschung' https://data-se.netlify.app/2020/05/14/crashkurs-umfrageforschung/ Thu, 14 May 2020 00:00:00 +0000 https://data-se.netlify.app/2020/05/14/crashkurs-umfrageforschung/ An introduction to designing, conducting, and analyzing scientifically sound questionnaires Teaching and learning objectives Participants should be enabled to plan, conduct, and analyze a survey grounded in social science independently at a basic level. 
Besides the goal of competence, the goal of self-efficacy is central. Participants should experience that they are well able to reach this goal (in a basic variant), that is, experience themselves as self-efficacious. It is not a goal to convey more in-depth theoretical concepts. Simulating Berkson's paradox https://data-se.netlify.app/2020/04/16/simulation-berkson-s-paradox/ Thu, 16 Apr 2020 00:00:00 +0000 https://data-se.netlify.app/2020/04/16/simulation-berkson-s-paradox/ This post was inspired by this paper of Karsten Luebke and coauthors. library(ggdag) library(ggthemes) library(mosaic) We’ll stratify our sample into two groups: students (Studium) and non-students (kein Studium). Structural causal model First, we define the structure of our causal model. set.seed(42) # reproducibility N <- 1e03 IQ = rnorm(N) Fleiss = rnorm(N) Eignung = 1/2 * IQ + 1/2 * Fleiss + rnorm(N, 0, .1) That is, aptitude (Eignung) is a function of intelligence (IQ) and diligence (Fleiss), where the input variables have the same impact on the outcome variable (aptitude). Folien für den Workshop zur simulationsbasierten Inferenz, 2020-02-05 https://data-se.netlify.app/2020/02/02/folien-f%C3%BCr-den-workshop-zur-simulationsbasierten-inferenz-2020-02-05/ Sun, 02 Feb 2020 00:00:00 +0000 https://data-se.netlify.app/2020/02/02/folien-f%C3%BCr-den-workshop-zur-simulationsbasierten-inferenz-2020-02-05/ Workshop on simulation-based inference The slides for my workshop on simulation-based inference can be found here. The PDF version can be found here. The source code is here. The slides are licensed under CC-BY 4.0 DE. 
Cluster analysis and image size reduction https://data-se.netlify.app/2020/01/10/cluster-analysis-and-image-size-reduction/ Fri, 10 Jan 2020 00:00:00 +0000 https://data-se.netlify.app/2020/01/10/cluster-analysis-and-image-size-reduction/ Idea This post is a remake of this case study: https://fallstudien.netlify.com/fallstudie_bildanalyse/bildanalyse brought to you by Karsten Lübke. The main purpose is to replace the base R commands that Karsten used with a more tidyverse-friendly style. I think that’s easier (for me). We will compute a cluster analysis to find the typical RGB color per cluster. WARNING There’s still a bug in the code. That’s why the image at the end appears blurred. Pictogram waffle plot using emojifont https://data-se.netlify.app/2019/11/25/pictogram-waffle-plot-using-emojifont/ Mon, 25 Nov 2019 00:00:00 +0000 https://data-se.netlify.app/2019/11/25/pictogram-waffle-plot-using-emojifont/ Load packages library(tidyverse) library(emojifont) library(showtext) library(ggpubr) Pictogram waffle plot A pictogram may be defined as a (statistical) diagram using icons or similar “iconic” graphics to illustrate things. The waffle plot (see this post) is a nice place to combine waffles and pictograms. Originally, this post was inspired by HRBRMSTR’s waffle package, see this post, but I could not get it running. Maybe the easiest way is to work through an example (spoiler: see below for what we’re heading for). Correlation cannot be more extreme than +1/-1, proof using Cauchy-Schwarz inequality https://data-se.netlify.app/2019/11/19/correlation-cannot-be-more-extreme-than-1-1-proof-using-cauchy-schwartz-inequality/ Tue, 19 Nov 2019 00:00:00 +0000 https://data-se.netlify.app/2019/11/19/correlation-cannot-be-more-extreme-than-1-1-proof-using-cauchy-schwartz-inequality/ Load packages library(tidyverse) The correlation coefficient cannot exceed an absolute value of 1 This is well-known. But why is that the case? How can we prove it? 
This post gives one explanation using the Cauchy-Schwarz inequality. Here’s one version of the definition of correlation: \[ r = \frac{\sum(\Delta x \Delta y)}{\sqrt{\sum \Delta x^2} \sqrt{\sum \Delta y^2}} \] where $\Delta x$ and $\Delta y$ are the differences of $x_i$ and $\bar{x}$, that is: $\Delta x_i = x_i - \bar{x}$, and similarly for $\Delta y_i$. Plotting functions in 3d https://data-se.netlify.app/2019/11/19/plotting-functions-in-3d/ Tue, 19 Nov 2019 00:00:00 +0000 https://data-se.netlify.app/2019/11/19/plotting-functions-in-3d/ Load packages library(tidyverse) library(mosaic) library(plotly) Gimme a function Say, you have some function such as \[ f(x, z) = x^2+z^2 \] In more R-ish terms: f <- makeFun(x^2 + z^2 ~ x & z) And you would like to plot it. Observe that this function has two input (independent) variables, $x$ and $z$, plus one output (dependent) variable, $y$. The thing is, you’ll need to compute the output values $y$ for a number of input values, as defined by the function. Some intution on the Gaussian distribution formula https://data-se.netlify.app/2019/11/18/some-intution-on-the-gaussian-distribution-formula/ Mon, 18 Nov 2019 00:00:00 +0000 https://data-se.netlify.app/2019/11/18/some-intution-on-the-gaussian-distribution-formula/ Load packages library(tidyverse) library(mosaic) The Gaussian The ubiquitous Gaussian (aka normal) distribution is probably the most widely known distribution for stochastic processes (although maybe encountered about as frequently as a unicorn). Here it is in all its glory. gf_dist("norm") There are two typical ways to see why it may be considered “normal”: one uses the Galton Board, and the other builds on the Central Limit Theorem. While such considerations are great for understanding “where” the Gaussian distribution comes from, this post explores another direction of intuition. Most important asssumption in linear models ... 
and the second most https://data-se.netlify.app/2019/11/11/most-important-asssumption-in-linear-models/ Mon, 11 Nov 2019 00:00:00 +0000 https://data-se.netlify.app/2019/11/11/most-important-asssumption-in-linear-models/ Load packages library(tidyverse) library(mosaic) We are following here the advice of Gelman and Hill (2007, p. 46-47). Validity Quite obviously, the right predictors must be included in the model in order to learn something from the model. The “right” predictors means: avoiding the wrong ones, and including the correct ones. Easier said than done, particularly with a view to the causal inference aspects. Let’s turn to the next most important assumption. Some notes on data transformations for regression https://data-se.netlify.app/2019/11/11/some-notes-on-data-transformations-for-regression/ Mon, 11 Nov 2019 00:00:00 +0000 https://data-se.netlify.app/2019/11/11/some-notes-on-data-transformations-for-regression/ Load packages library(tidyverse) library(mosaic) Motivation What are data transformations good for? Why do we bother to transform variables for regression analysis? This post explores some nuances around these themes. Simulate an exponentially distributed association len <- 42 # 42 x values x <- rep(runif(len), 30) # each x value repeated 30 times y <- dexp(x) + rnorm(length(x), mean = 0, sd = .01) # add some noise Plot it: Some ways for plotting 3D linear models https://data-se.netlify.app/2019/10/21/some-ways-for-plotting-3d-linear-models/ Mon, 21 Oct 2019 00:00:00 +0000 https://data-se.netlify.app/2019/10/21/some-ways-for-plotting-3d-linear-models/ Load packages library(tidyverse) # data wrangling library(mosaic) # funplot library(plotly) # interactive plots library(scatterplot3d) # nomen est omen library(rsm) # 3d scatterplots Motivation Linear models are a standard way of predicting or explaining some data. Visualizing data is not only of didactic value but provides heuristic value too, as demonstrated by Anscombe’s Quartet.
Visualizing linear models in 2D is straightforward, but visualizing linear models with more than one predictor is much less so. P-values are uniformly distributed under the H0, a simulation https://data-se.netlify.app/2019/10/11/p-values-are-equally-distributed-under-the-h0/ Fri, 11 Oct 2019 00:00:00 +0000 https://data-se.netlify.app/2019/10/11/p-values-are-equally-distributed-under-the-h0/ Load packages library(tidyverse) library(mosaic) Motivation The p-value is a ubiquitous tool for gauging the plausibility of a null hypothesis. More specifically, the p-value indicates the probability of obtaining a test statistic at least as extreme as in the present data if the null hypothesis were true and the experiment were repeated an infinite number of times (under the same conditions except the data generating process). The distribution of the p-values depends on the strength of some effect (among other things). Simple proof that the correlation coefficient cannot exceed abs(1) https://data-se.netlify.app/2019/10/07/simple-proof-that-the-correlation-coefficient-cannot-exceed-abs-1/ Mon, 07 Oct 2019 00:00:00 +0000 https://data-se.netlify.app/2019/10/07/simple-proof-that-the-correlation-coefficient-cannot-exceed-abs-1/ Load packages library(tidyverse) library(MASS) Motivation It is well-known that the notorious (Pearson’s) correlation cannot exceed an absolute value of 1, that is \[ -1 \le r \le +1 \] or \[ |r| \le 1 \] However, proving this fact is less straightforward. A classical way of proving the above inequality is by using the Cauchy-Schwarz inequality. From a teacher’s perspective, the CS inequality may not be ideal, because the students may lack some knowledge necessary for appreciating this proof.
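One elementary argument that avoids Cauchy-Schwarz works with z-scores. This is only a sketch under stated assumptions, and not necessarily the route the full post takes. Assume $z_x$ and $z_y$ are standardized with the population standard deviation (dividing by $n$), so that $\frac{1}{n}\sum z_{x,i}^2 = \frac{1}{n}\sum z_{y,i}^2 = 1$ and $r = \frac{1}{n}\sum z_{x,i} z_{y,i}$. A mean of squares cannot be negative, hence \[ 0 \le \frac{1}{n}\sum (z_{x,i} \mp z_{y,i})^2 = \frac{1}{n}\sum z_{x,i}^2 \mp \frac{2}{n}\sum z_{x,i} z_{y,i} + \frac{1}{n}\sum z_{y,i}^2 = 2 \mp 2r \] Taking the upper (minus) sign gives $2 - 2r \ge 0$, that is $r \le 1$; taking the lower (plus) sign gives $2 + 2r \ge 0$, that is $r \ge -1$.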
Some algebraic properties of z-scores https://data-se.netlify.app/2019/10/07/some-algebraic-properties-of-z-scores/ Mon, 07 Oct 2019 00:00:00 +0000 https://data-se.netlify.app/2019/10/07/some-algebraic-properties-of-z-scores/ Load packages library(tidyverse) Motivation Z-scores (z-values) are a useful and widely employed tool to gauge and compare measurements. For instance, z-scores help to compare the relative position of some measurements with respect to their distributions. In this post, we will prove some basic (algebraic) properties of z-values. There’s nothing new to that, it’s just I’d like to have it neat and concise somewhere to quickly find it. I’ll add some explanation for the ease of reception. Looping over function arguments using purrr https://data-se.netlify.app/2019/09/28/looping-over-function-arguments-using-purrr/ Sat, 28 Sep 2019 00:00:00 +0000 https://data-se.netlify.app/2019/09/28/looping-over-function-arguments-using-purrr/ Load packages library(tidyverse) Problem statement Assume you have to call a function multiple times, but each time with (possibly) different arguments. Given enough repetitions, you will not want to repeat yourself. In other words, we would like to loop over function arguments, each round in the loop giving the respective argument’s value(s) to the function. One example would be to generate many random values but each with different mean and/or sd: Slides for my workshop on Markdown and Git https://data-se.netlify.app/2019/09/09/slides-for-my-workshop-on-markdown-and-git/ Mon, 09 Sep 2019 00:00:00 +0000 https://data-se.netlify.app/2019/09/09/slides-for-my-workshop-on-markdown-and-git/ Here are my slides for my Workshop on Markdown and Git (2019-09-16). Note that you need to be online to render the slides (due to heavy use of JS). The Rmd source code (master file) can be found here. The PDF version of the slides can be found here.
Computing rater accuracy across multiple raters and multiple criteria https://data-se.netlify.app/2019/08/27/computing-rater-accuracy-across-multiple-raters-and-multiple-criteria/ Tue, 27 Aug 2019 00:00:00 +0000 https://data-se.netlify.app/2019/08/27/computing-rater-accuracy-across-multiple-raters-and-multiple-criteria/ Load packages library(tidyverse) Background Computing inter-rater reliability is a well-known, albeit maybe not very frequent task in data analysis. If there’s only one criterion and two raters, the procedure is straightforward; Cohen’s Kappa is the most widely used coefficient for that purpose. It is more challenging to compare multiple raters on one criterion; Fleiss’ Kappa is one way to get a coefficient. If there are multiple criteria, one way is to compute the mean of multiple Fleiss’ coefficients. Performance measures for `caret` and `lm()` https://data-se.netlify.app/2019/08/02/performance-measures-for-caret-and-lm-r/ Fri, 02 Aug 2019 00:00:00 +0000 https://data-se.netlify.app/2019/08/02/performance-measures-for-caret-and-lm-r/ Recently, I ran into performance issues when fitting a linear model together with a resampling scheme and a tuning grid (via caret). The dataset was reasonably large - some 200k rows and approx. 20 columns (nycflights13 train). Still, I was surprised that my machine got stuck during the computation. Now I wonder whether I ran into memory constraints (16 GB on my machine), or whether some other stuff went wrong. Geoplotting - update to my MODAR-book https://data-se.netlify.app/2019/07/29/geoplotting-update-to-my-modar-book/ Mon, 29 Jul 2019 00:00:00 +0000 https://data-se.netlify.app/2019/07/29/geoplotting-update-to-my-modar-book/ In my book on modern data analysis using R, I show some basics of geoplotting. It seems that some software update for the package simple features broke my code. So, here’s an update.
Load packages and data library(tidyverse) library(viridis) library(sf) data(socec, package = "pradadata") data(wahlkreise_shp, package = "pradadata") Check data glimpse(socec) #> Observations: 316 #> Variables: 51 #> $ V01 <chr> "Schleswig-Holstein", "Schleswig-Holstein", "Schleswig-Holst… #> $ V02 <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 901, 12, 13, 14, 15, 16, … #> $ V03 <chr> "Flensburg – Schleswig", "Nordfriesland – Dithmarschen Nord"… #> $ V04 <int> 130, 197, 178, 163, 3, 92, 49, 95, 49, 126, 28, 1110, 132, 1… #> $ V05 <dbl> 2128. Slides (in German) for my talk on "Datenkompetenz für alle" at the R-User-Group Nürnberg July 2019 https://data-se.netlify.app/2019/07/17/slides-in-german-for-my-talk-on-datenkompetenz-f%C3%BCr-alle-at-the-r-user-group-n%C3%BCrnberg-july-2019/ Wed, 17 Jul 2019 00:00:00 +0000 https://data-se.netlify.app/2019/07/17/slides-in-german-for-my-talk-on-datenkompetenz-f%C3%BCr-alle-at-the-r-user-group-n%C3%BCrnberg-july-2019/ The slides (pdf) of my talk “Datenkompetenz für alle – Ein Werkstattbericht zum FOM-Statistik-Curriculum” can be found here. Collapse rows to eliminate NAs https://data-se.netlify.app/2019/07/03/collapse-rows-to-eliminate-nas/ Wed, 03 Jul 2019 00:00:00 +0000 https://data-se.netlify.app/2019/07/03/collapse-rows-to-eliminate-nas/ Load packages library(tidyverse) Starters Assume you have this data frame: x <- tribble( ~ colA, ~colB, ~colC, NA, 1, NA, 1, NA, 1 ) x #> # A tibble: 2 x 3 #> colA colB colC #> <dbl> <dbl> <dbl> #> 1 NA 1 NA #> 2 1 NA 1 But you want this one: y <- tribble( ~ colA, ~colB, ~colC, 1, 1, 1 ) y #> # A tibble: 1 x 3 #> colA colB colC #> <dbl> <dbl> <dbl> #> 1 1 1 1 That is, you’d like to collapse rows so that if there’s a NA in a column it is replaced by the value found in some other line. 
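The collapsing step described above can be sketched as follows. This is a minimal sketch, assuming that each column holds at most one distinct non-missing value; it is not necessarily the solution the full post arrives at. It keeps, for every column, the first value that is not NA.

```r
library(dplyr)
library(tibble)

# Toy data from the excerpt above
x <- tribble(
  ~colA, ~colB, ~colC,
  NA,    1,     NA,
  1,     NA,    1
)

# For each column, drop the NAs and keep the first remaining value.
# If a column is all NA, the result for that column is NA.
y <- x %>%
  summarise(across(everything(), ~ .x[!is.na(.x)][1]))

y
```

If the rows belong to different cases, putting a `group_by()` on an ID column in front of `summarise()` would collapse the rows per case instead of over the whole data frame.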
Generalized rowwise operations using purrr::pmap https://data-se.netlify.app/2019/07/03/generalized-rowwise-operations-using-purrr-pmap/ Wed, 03 Jul 2019 00:00:00 +0000 https://data-se.netlify.app/2019/07/03/generalized-rowwise-operations-using-purrr-pmap/ Load packages library(tidyverse) Rowwise operations are quite frequent in data analysis. The R language environment is particularly strong in columnwise operations. This is due to technical reasons, as data frames are internally built as column-by-column structures, hence columnwise operations are simple, rowwise ones more difficult. This post looks at a rather general way to compute rowwise statistics. Of course, numerous ways exist and there are quite a few tutorials around, notably by Jenny Bryan, and by Emil Hvitfeldt to name a few. Testing for equality rowwise https://data-se.netlify.app/2019/07/03/testing-for-equality-rowwise/ Wed, 03 Jul 2019 00:00:00 +0000 https://data-se.netlify.app/2019/07/03/testing-for-equality-rowwise/ Load packages library(tidyverse) Basic testing for equality Testing for equality is a very basic operation in computer (and data) science. There is a straightforward function in R to test for equality: identical(1, 1) #> [1] TRUE identical("A", "A") #> [1] TRUE identical(1, 2) #> [1] FALSE identical(1, NA) #> [1] FALSE However, this gets more complicated if we want to compare more than two elements. One way to achieve this is to compute the number of different items. Testing multiple vectors for equality https://data-se.netlify.app/2019/07/03/testing-multiple-vectors-for-equality/ Wed, 03 Jul 2019 00:00:00 +0000 https://data-se.netlify.app/2019/07/03/testing-multiple-vectors-for-equality/ Load packages library(tidyverse) Problem statement Assume we have some vectors (eg, 3), and we want to check if they are equal (the same elements in each vector). Assume further that we do not know in advance the number of vectors to check. Here’s some toy data.
a <- c(1, 2, 3, 4) b <- c(1, 2, 3, 5) c <- c(1, 3, 4, 5) The gist This solution is based on the code of Akrun from this SO post (slightly adapted). How to document a conference talk in citation manager software https://data-se.netlify.app/2019/06/28/how-to-document-a-conference-talk-in-citation-manager-software/ Fri, 28 Jun 2019 00:00:00 +0000 https://data-se.netlify.app/2019/06/28/how-to-document-a-conference-talk-in-citation-manager-software/ There are several popular citation manager software packages around. I used to work with Mendeley in class, but I stopped using it since it was acquired by El$sevier. Luckily there are good alternatives around, particularly Zotero. Zotero features a word processor plugin (MS Word, LibreOffice Writer), which is a must-have for many of us. The more technically inclined folks will use Bibtex. Good news is that Zotero syncs its library to Bibtex. Talk 'Data Science in Business' https://data-se.netlify.app/2019/05/10/talk-data-science-for-business/ Fri, 10 May 2019 00:00:00 +0000 https://data-se.netlify.app/2019/05/10/talk-data-science-for-business/ Talk “Intro to Data Science in Business” See here the slides (pdf) for the talk. Talk “Reviewing rapid prototype candidates” See here the slides (pdf) for the talk. Colophon CC-BY How to convert raw scores to different types of standardized scores https://data-se.netlify.app/2019/04/11/how-to-convert-raw-scores-to-different-types-of-standardized-scores/ Thu, 11 Apr 2019 00:00:00 +0000 https://data-se.netlify.app/2019/04/11/how-to-convert-raw-scores-to-different-types-of-standardized-scores/ A common undertaking in applied research settings such as in some areas of psychology is to convert a raw score into some type of standardized score such as z-scores. This post shows one way to accomplish that. Load packages library(tidyverse) Load some psychometric data data("extra", package = "pradadata") The data can be downloaded here.
The dataset shows some data on extraversion (the personality trait) items along with some correlates of extraversion. A stochastic problem by Warren Buffett solved with simulation https://data-se.netlify.app/2019/04/04/a-stochastic-problem-by-warren-buffet-solved-with-simulation/ Thu, 04 Apr 2019 00:00:00 +0000 https://data-se.netlify.app/2019/04/04/a-stochastic-problem-by-warren-buffet-solved-with-simulation/ This post presents a stochastic problem, with application to financial theory, taken from this magazine article. Some say the problem goes back to Warren Buffett. Thanks to my colleague Norman Markgraf, who pointed it out to me. Assume there are two coins. One is fair, one is loaded. The loaded coin has a bias of 60-40. Now, the question is: How many coin flips do you need to be “sure enough” (say, 95%) that you found the loaded coin? Reducing residual variance in modeling https://data-se.netlify.app/2019/03/26/reducing-residual-variance-in-modeling/ Tue, 26 Mar 2019 00:00:00 +0000 https://data-se.netlify.app/2019/03/26/reducing-residual-variance-in-modeling/ Modeling is a central part not only of statistical inquiry, but also of everyday human sense-making. We use models as metaphors for the world, in a broader sense. Of course, a model that explains the world better (than some other model) is to be preferred, all other things being equal. In this post, we demonstrate that a more “clever” statistical model reduces the residual variance. It should be noted that this “noise reduction” comes at a cost, however: The model gets more complex; there are more parameters in the model. An example of logistic regression https://data-se.netlify.app/2019/03/20/beispiel-f%C3%BCr-eine-logistische-regression/ Wed, 20 Mar 2019 00:00:00 +0000 https://data-se.netlify.app/2019/03/20/beispiel-f%C3%BCr-eine-logistische-regression/ What is it good for?
In short, logistic regression is a tool for predicting dichotomous (binary) events (based on a data set with some predictors). What does logistic regression tell us? If you want to predict, say, whether an e-mail is spam or not, it is useful to get a probability for each mail to be checked. Logistic regression might tell us: “A mail with these predictor values has a probability of X percent of being spam”. Slides of my talk at ECDA 2019: Modeling of AfD election success https://data-se.netlify.app/2019/03/16/slides-of-my-talk-at-ecda-2019-modeling-of-afd-election-success/ Sat, 16 Mar 2019 00:00:00 +0000 https://data-se.netlify.app/2019/03/16/slides-of-my-talk-at-ecda-2019-modeling-of-afd-election-success/ Slides of my talk at ECDA 2019 can be found here: http://data-se.netlify.com/slides/afd_ecda2019/afd-modeling-ECDA-2019.html#1. Note that you need to be online to render the slides. The (standalone) PDF version can be found here: http://data-se.netlify.com/slides/afd_ecda2019/afd-modeling-ECDA-2019.pdf How to mutate all columns of a data frame https://data-se.netlify.app/2019/03/13/how-to-mutate-all-columns-of-a-data-frame/ Wed, 13 Mar 2019 00:00:00 +0000 https://data-se.netlify.app/2019/03/13/how-to-mutate-all-columns-of-a-data-frame/ Say, you have a data frame with a number of columns, and you need to change every column in a similar way. A common example might be to standardize all (numeric) variables. How to do that in R? This post shows and explains an example using mutate_all() from the tidyverse. Let’s stick to the question “how to z-standardize all columns” for the sake of simplicity (and neglect that there are precooked solutions, for example from the superb package sjmisc by strengejacke).
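To make the z-standardization idea concrete, here is a minimal sketch using mutate_all(). The helper z_std() and the choice of three mtcars columns are my own assumptions for illustration; the full post may well proceed differently.

```r
library(dplyr)

# Hypothetical helper (not from the post): z-standardize one numeric vector
z_std <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# A small all-numeric data frame to play with
d <- mtcars %>% select(mpg, hp, wt)

# Apply the helper to every column; the column names stay the same
d_z <- d %>% mutate_all(z_std)

# With dplyr >= 1.0 the same can be written as:
# d %>% mutate(across(everything(), z_std))
```

After the transformation, each column should have a mean of (numerically) zero and a standard deviation of one, which is easy to verify with colMeans(d_z) and sapply(d_z, sd).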
Writing e-mails to lecturers https://data-se.netlify.app/2019/02/28/emails-schreiben-an-dozierende/ Thu, 28 Feb 2019 00:00:00 +0000 https://data-se.netlify.app/2019/02/28/emails-schreiben-an-dozierende/ Writing e-mails is an essential form of correspondence with its own strengths and weaknesses. In any case, it is ubiquitous. This post aims to give (my) students some guidance on how to write an e-mail to a lecturer. Of course, this is my view of things; other lecturers may prefer a different kind of e-mail. In the end, e-mails to lecturers are nothing but a form of business correspondence. Hence the corresponding rules apply; however, the academic world perhaps reserves a few subtleties (and liberties) that you should know if you want or have to write such mails. Ornaments with ggformula https://data-se.netlify.app/2019/02/12/ornaments-with-gformula/ Tue, 12 Feb 2019 00:00:00 +0000 https://data-se.netlify.app/2019/02/12/ornaments-with-gformula/ For some time now, there’s been a wrapper for ggplot2 available, bundled in the package ggformula. One nice thing is that it plays nicely with the popular R package mosaic. mosaic provides some useful functions for modeling along with a tamed and consistent syntax. In this post, we will discuss some “ornaments”, that is, some details of beautification of a plot. I confess that not everyone will deem it central, but in some cases it comes in handy to know how to “refine” a plot using ggformula. Online reaction time experiments using lab.js https://data-se.netlify.app/2019/01/29/online-reaction-time-experiments-using-lab-js/ Tue, 29 Jan 2019 00:00:00 +0000 https://data-se.netlify.app/2019/01/29/online-reaction-time-experiments-using-lab-js/ Collecting data over the internet used to be fancy, some twenty years or so ago.
Nowadays it can be considered standard, if not old school (collecting data using mobile apps is where the cool kids go at the moment). However, there’s one notable exception: Collecting reaction time data over the internet remained a challenge. The reason is simply a technological artefact: HTML response times may vary, and vary so much as to invalidate the signal from a behavioral reaction time study. Reading text files and Umlaute hassle https://data-se.netlify.app/2019/01/25/reading-text-files-and-umlaute-hassle/ Fri, 25 Jan 2019 00:00:00 +0000 https://data-se.netlify.app/2019/01/25/reading-text-files-and-umlaute-hassle/ Data is often stored as a plain text file. That’s good because it is a simple format. However, simplicity comes with a cost: Not all questions may have definite answers. The most common hassle when reading/importing text files is that the encoding scheme is unknown, aka wrong. This problem mostly occurs when, say, a Mac user stores a text file, where per default UTF-8 text encoding is applied. In contrast, on a Windows machine, a Windows encoding (often dubbed “latin1”, “Windows 1252” or “ISO-8859-1”) is the default. Poster: A Bayes model of AfD party success https://data-se.netlify.app/2019/01/24/poster-a-bayes-model-of-afd-party-success/ Thu, 24 Jan 2019 00:00:00 +0000 https://data-se.netlify.app/2019/01/24/poster-a-bayes-model-of-afd-party-success/ At the Dozentenmeeting 2019 of the FOM Hochschule, I presented a poster of an analysis of the AfD election success, based on a Bayes multi level regression. The poster can be downloaded here. Poster: Populism in German politicians https://data-se.netlify.app/2019/01/17/poster-populism-in-german-politicians/ Thu, 17 Jan 2019 00:00:00 +0000 https://data-se.netlify.app/2019/01/17/poster-populism-in-german-politicians/ At the Dozentenmeeting 2019 of the FOM Hochschule, I presented a poster of an analysis of populism in German politicians. The poster can be downloaded here.
An illustration of tidyverse’ gather/spread https://data-se.netlify.app/2019/01/15/an-illustration-of-tidyverse-gather-spread/ Tue, 15 Jan 2019 00:00:00 +0000 https://data-se.netlify.app/2019/01/15/an-illustration-of-tidyverse-gather-spread/ Frequently, datasets have to be reshaped before further analysis. One particularly important step is to transform a data frame from “wide” to “long” format. This is illustrated by the following diagram, taken from my new book on data analysis (Image licence: CC-BY-NC). A clean sessionInfo page https://data-se.netlify.app/2019/01/14/a-clean-sessioninfo-page/ Mon, 14 Jan 2019 00:00:00 +0000 https://data-se.netlify.app/2019/01/14/a-clean-sessioninfo-page/ When writing a technical or academic report, or even a presentation, it is sensible to make the (R) code in it reproducible. The same thing applies when asking for help at StackOverflow: you’ll be asked for a reprex. One aspect of making a report reproducible is to include details on the versions of the packages needed. The well-known command sessionInfo() provides the building blocks for that. However, the output of that function can feel verbose, and it consumes a lot of space. Barplots with mosaic https://data-se.netlify.app/2019/01/10/barplots-with-mosaic/ Thu, 10 Jan 2019 00:00:00 +0000 https://data-se.netlify.app/2019/01/10/barplots-with-mosaic/ Plotting barplots is a frequent endeavor for the analysis of qualitative data. Numerous methods for plotting barplots exist; the popular R package mosaic also provides methods. More recently, mosaic switched to a ggplot wrapper for plotting diagrams, that is gf_XXX(), packaged in ggformula. That implies that input data is expected to be tidy, because ggplot, a central member of the tidyverse, expects its input data to be tidy. Let’s check an example.
A short tutorial for the logistic regression https://data-se.netlify.app/2019/01/07/a-short-tutorial-for-the-logistic-regression/ Mon, 07 Jan 2019 00:00:00 +0000 https://data-se.netlify.app/2019/01/07/a-short-tutorial-for-the-logistic-regression/ Here’s a quick walk-through for a logistic regression in R. Setup library(tidyverse) library(reshape2) # dataset "tips" library(caret) library(mosaic) We’ll use the tips dataset: data(tips) Research question Assume we would like to predict if a person is female based on some predictor such as the amount of tip she/he gives. How many instances of each type of the outcome variable are in the data set? tally(~ sex, data = tips, format = "proportion") #> sex #> Female Male #> 0. Slides for the talk 'Papers publizieren' https://data-se.netlify.app/2019/01/04/folien-f%C3%BCr-vortrag-papiers-publizieren/ Fri, 04 Jan 2019 00:00:00 +0000 https://data-se.netlify.app/2019/01/04/folien-f%C3%BCr-vortrag-papiers-publizieren/ The slides for my talk “Papers publizieren” at the Dozententreffen 2019 of the FOM Hochschule can be found here. Why standard regression is not (so) adequate for regressing proportions https://data-se.netlify.app/2019/01/03/why-standard-regression-is-not-so-adequate-for-regressing-proportions/ Thu, 03 Jan 2019 00:00:00 +0000 https://data-se.netlify.app/2019/01/03/why-standard-regression-is-not-so-adequate-for-regressing-proportions/ Intro Professor Sweet is conducting some research to investigate the risk factors and drivers of student exam success. In a recent analysis he considers the variable “exam successfully passed” (vs. not passed) as the criterion (output) and the amount of time spent for preparation (aka study time) as predictor. Setup Please make sure that all packages are installed before proceeding. Except pradadata, all packages are on CRAN. [ Here’s] (https://github.
Force bibtex to show the exact date https://data-se.netlify.app/2018/12/29/force-bibtex-to-show-the-exact-date/ Sat, 29 Dec 2018 00:00:00 +0000 https://data-se.netlify.app/2018/12/29/force-bibtex-to-show-the-exact-date/ Citing (aka scientific citation) is quite straightforward in RMarkdown. However, there are some shortcomings. Primarily, as citations are rendered via Pandoc’s reference engine, bibtex is used as a standard. Though it is quite commonly used, bibtex has been, by and large, replaced by biblatex. biblatex is much more straightforward than bibtex (as text is formatted using latex and not bibtex, still making use of bibtex for the collection of references). Using BibLaTeX instead of Bibtex in Rmarkdown for finer control https://data-se.netlify.app/2018/12/28/using-biblatex-instead-of-bibtex-in-rmarkdown-for-finer-control/ Fri, 28 Dec 2018 00:00:00 +0000 https://data-se.netlify.app/2018/12/28/using-biblatex-instead-of-bibtex-in-rmarkdown-for-finer-control/ As a standard, bibtex is used as a citation-renderer in Pandoc’s Markdown, that is, in RMarkdown as well. bibtex is useful for a fair amount of citation tasks, but biblatex allows for finer control. For instance, multiple bibliographies for one document are possible. As another example, citing a newspaper article using bibtex left me scratching my head, as I wanted to have the exact day of the date (not only the year) cited. Generating mass reports using Rmarkdown in R https://data-se.netlify.app/2018/12/19/generating-mass-reports/ Wed, 19 Dec 2018 00:00:00 +0000 https://data-se.netlify.app/2018/12/19/generating-mass-reports/ Sometimes, one document must be recreated many times in a similar fashion. For instance, invoices to customers, grading schemes for students, progress reports in projects, and so on. In this post, I demonstrate one way to do that in R using RMarkdown.
Specifically, it is assumed that there’s a tabular data set, where each row refers to a document instance (e.g., a mail or report to one given person), and each column holds a variable to appear in each report (see examples below). Visualizing a multivariate normal distribution https://data-se.netlify.app/2018/12/13/visualizing-a-multivariate-normal-distribution/ Thu, 13 Dec 2018 00:00:00 +0000 https://data-se.netlify.app/2018/12/13/visualizing-a-multivariate-normal-distribution/ In R, it is quite straightforward to plot a normal distribution, e.g., using the package ggplot2 or plotly. Setup library(tidyverse) library(mvtnorm) library(plotly) library(MASS) Simulate multivariate normal data First, let’s define a covariance matrix $\Sigma$: sigma <- matrix(c(4,2,2,3), ncol = 2) sigma ## [,1] [,2] ## [1,] 4 2 ## [2,] 2 3 Then, simulate n observations from this covariance matrix; the means need to be defined, too. Visualizing a regression plane (two predictors) https://data-se.netlify.app/2018/12/13/visualizing-a-regression-plane-two-predictors/ Thu, 13 Dec 2018 00:00:00 +0000 https://data-se.netlify.app/2018/12/13/visualizing-a-regression-plane-two-predictors/ Plotting a “simple” regression (one predictor) is pretty straightforward in R.
Setup library(tidyverse) data(mtcars) library(mosaic) library(modelr) library(plotly) Define model lm1 <- lm(mpg ~ hp, data = mtcars) mtcars <- mtcars %>% mutate(lm1_pred = predict(lm1)) Plot One way: ggplot(mtcars) + aes(y = mpg, x = hp) + geom_point() + geom_lm() Another way: ggplot(mtcars) + aes(x = hp) + geom_point(aes(y = mpg)) + geom_point(aes(y = lm1_pred), color = "blue") + geom_line(aes(y = lm1_pred), color = "blue") Using the ggformula interface to ggplot2: Changing the default color scheme in ggplot2 https://data-se.netlify.app/2018/12/12/changing-the-default-color-scheme-in-ggplot2/ Wed, 12 Dec 2018 00:00:00 +0000 https://data-se.netlify.app/2018/12/12/changing-the-default-color-scheme-in-ggplot2/ UPDATE: see update below based on comments from nmarkgraf. UPDATE 2: I changed the theme to theme_minimal thanks to the comment from @neuwirthe. UPDATE 3: A more efficient way to plot a discrete scale using viridis. Thanks to flying sheep; see way 4 below The default color scheme in ggplot2 is suitable for many purposes, but, for instance, it is not suitable for b/w printing, and maybe not suitable for persons with limited color perception. New split-apply-combine variant in dplyr: group_split() https://data-se.netlify.app/2018/12/10/new-split-apply-combine-variant-in-dplyr-group-split/ Mon, 10 Dec 2018 00:00:00 +0000 https://data-se.netlify.app/2018/12/10/new-split-apply-combine-variant-in-dplyr-group-split/ UPDATE 2018-12-11 - I’m talking about the package DPLYR, not PURRR, as I had mistakenly written. There are many approaches for what is called the “split-apply-combine” approach (see this paper by Hadley Wickham). I recently thought about the best approach to use split-apply-combine approaches in R (see tweet, and this post). And I retweeted some criticism on the “present era” tidyverse approach (see this tweet), and check out the mentioned post by @coolbutuseless. 
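As a rough illustration of split-apply-combine with group_split(), consider the following sketch. The model mpg ~ hp on mtcars is my own choice of example, not taken from the post, and I assume dplyr >= 0.8, where group_split() was introduced.

```r
library(dplyr)
library(purrr)

# Split: one tibble per value of cyl (ordered 4, 6, 8)
# Apply: fit the same linear model within each subgroup
models <- mtcars %>%
  group_split(cyl) %>%
  map(~ lm(mpg ~ hp, data = .x))

# Combine: collect the hp slope of each model into one numeric vector
slopes <- map_dbl(models, ~ coef(.x)[["hp"]])

slopes
```

Note that group_split() returns a plain, unnamed list, so the group labels have to be carried along separately if they are needed later, for instance via sort(unique(mtcars$cyl)).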
Applying a function to each row of a data frame https://data-se.netlify.app/2018/12/07/applying-a-function-to-each-row-of-a-data-frame/ Fri, 07 Dec 2018 00:00:00 +0000 https://data-se.netlify.app/2018/12/07/applying-a-function-to-each-row-of-a-data-frame/ A typical and quite straightforward operation in R and the tidyverse is to apply a function on each column of a data frame (or on each element of a list, which is the same in that regard). However, the orthogonal question of “how to apply a function on each row” is much less covered. We will look at this question in this post, and explore some (of many) answers to this question. Coercing an index over a character vector https://data-se.netlify.app/2018/12/06/coercing-an-index-over-a-character-vector/ Thu, 06 Dec 2018 00:00:00 +0000 https://data-se.netlify.app/2018/12/06/coercing-an-index-over-a-character-vector/ Assume we have a vector (of type character) such as countries, names, or products. Each element is allowed to show up multiple times. Further assume that there is a rather large number of unique (different) elements. What we would like to achieve is to give each element a unique ID, where the ID ranges from 1 to k (k is the number of different elements). Of course there are different ways to achieve this goal; we’ll explore one or two. This blog has a DOI https://data-se.netlify.app/2018/12/06/this-blog-has-a-doi/ Thu, 06 Dec 2018 00:00:00 +0000 https://data-se.netlify.app/2018/12/06/this-blog-has-a-doi/ This blog has a DOI now: Plot many ggplot diagrams using nest() and map() https://data-se.netlify.app/2018/12/05/plot-many-ggplot-diagrams-using-nest-and-map/ Wed, 05 Dec 2018 00:00:00 +0000 https://data-se.netlify.app/2018/12/05/plot-many-ggplot-diagrams-using-nest-and-map/ At times, it is helpful to plot a number of related diagrams, such as a scatter plot for each subgroup. As always, there are a number of ways of doing so in R. Specifically, we will make use of ggplot2.
library(tidyverse) library(glue) data(mtcars) d <- mtcars %>% rownames_to_column(var = "car_names") Is d a tibble? is_tibble(d) #> [1] FALSE What is it? class(d) #> [1] "data.frame" Okay, let’s make a tibble out of it: What are the names of the cars with 4 cylinders? https://data-se.netlify.app/2018/12/03/what-are-the-names-of-the-cars-with-4-cylinders/ Mon, 03 Dec 2018 00:00:00 +0000 https://data-se.netlify.app/2018/12/03/what-are-the-names-of-the-cars-with-4-cylinders/ Recently, someone asked me this question in a workshop: “What are the names of the cars with 4 (6,8) cylinders?” (he referred to the mtcars data set). That was a workshop on the tidyverse, so the question is how to answer this question using tidyverse techniques. First, let’s load the usual culprits. library(tidyverse) library(purrrlyr) library(knitr) library(stringr) data(mtcars) d <- mtcars %>% rownames_to_column(var = "car_names") %>% as_tibble() # move row names to a column first; as_tibble() drops them d %>% head() %>% kable() car_names mpg cyl disp hp drat wt qsec vs am gear carb Mazda RX4 21. Image paths in Hugo/blogdown https://data-se.netlify.app/2018/11/28/image-paths-in-hugo-blogdown/ Wed, 28 Nov 2018 00:00:00 +0000 https://data-se.netlify.app/2018/11/28/image-paths-in-hugo-blogdown/ Images from R are instantly included into (R) markdown files, and the same applies to blogdown posts. See: x <- 1:10 plot(x) However, for external images - such as photos - things are more complicated. First, all is still fine, if an image is found on some URL/server on the internet: knitr::include_graphics("https://upload.wikimedia.org/wikipedia/commons/thumb/1/1b/R_logo.svg/310px-R_logo.svg.png") Of course, one can apply direct markdown syntax for including external images: ![](https://upload.wikimedia.org/wikipedia/commons/thumb/1/1b/R_logo.svg/310px-R_logo.svg.png){width=20%} Now assume we are in an R project that forms the basis of a blogdown blog.
Compute all pairwise differences in matrix https://data-se.netlify.app/2018/11/21/compute-all-pairwise-differences-in-matrix/ Wed, 21 Nov 2018 00:00:00 +0000 https://data-se.netlify.app/2018/11/21/compute-all-pairwise-differences-in-matrix/ A quite frequent task in many fields of applied math is to compute pairwise differences of elements in a matrix. Actually, it need not be a difference; a product is frequent, too. In this post, we explore some (base) R ways to achieve this. library(mosaic) library(gdata) library(tidyverse) Using outer() An elegant approach, using base R, is applying outer(). That’s useful if one has two vectors, and wants to compute the outer product: Slides for the “hands-on data exploration workshop” https://data-se.netlify.app/2018/11/12/slides-for-the-hands-on-data-exploration-workshop/ Mon, 12 Nov 2018 00:00:00 +0000 https://data-se.netlify.app/2018/11/12/slides-for-the-hands-on-data-exploration-workshop/ Find the slides for my workshop “hands-on data exploration using R” here: http://data-se.netlify.com/slides/hands-on-data-exploration/handson-data-workshop_2018-11-21.html. Note that the slides need access to the internet in order to be rendered correctly. Get a PDF of the slides here. Get the Rmd source code of the slides here. The workshop is delivered at the Data Natives Conference 2018 Berlin. Simple Examples with DiagrammeR https://data-se.netlify.app/2018/11/07/simple-examples-with-diagrammer/ Wed, 07 Nov 2018 00:00:00 +0000 https://data-se.netlify.app/2018/11/07/simple-examples-with-diagrammer/ UPDATE 2018-12-13: Based on a comment from @nmarkgraf, I added a section on how to export DiagrammeR diagrams.
Here are some examples of diagrams built with DiagrammeR: Setup library(tidyverse) library(DiagrammeR) library(DiagrammeRsvg) library(magick) DiagrammeR using grViz() Define the graph: g1 <- "digraph boxes_and_circles { graph [layout = circo, overlap = true] node [shape = circle, fixedsize = true, fontname = Helvetica, width = 1] Problem; Plan; Data; Analysis; Conclusion edge [color = grey] Problem -> Plan Plan -> Data Data -> Analysis Analysis -> Conclusion Conclusion -> Problem }" Print it to the screen: Plot columns repeatedly https://data-se.netlify.app/2018/11/02/plot-columns-repeatedly/ Fri, 02 Nov 2018 00:00:00 +0000 https://data-se.netlify.app/2018/11/02/plot-columns-repeatedly/ Suppose you have a data frame with a large number of columns, and you want to plot each column – say a histogram for each column. This post shows some ways of achieving this. Let’s take the mtcars dataset as an example. data(mtcars) We will use the tidyverse approach: library(tidyverse) Way 1 mtcars %>% select_if(is.numeric) %>% map2(., names(.), ~ {ggplot(data = tibble(.x), aes(x = .x)) + geom_histogram() + labs(x= .y)}) #> $mpg #> #> $cyl #> #> $disp #> #> $hp #> #> $drat #> #> $wt #> #> $qsec #> #> $vs #> #> $am #> #> $gear #> #> $carb Some explanations: OECD Wellbeing - Explorative Analyse https://data-se.netlify.app/2018/10/16/oecd-wellbeing-explorative-analyse/ Tue, 16 Oct 2018 00:00:00 +0000 https://data-se.netlify.app/2018/10/16/oecd-wellbeing-explorative-analyse/ In this post, we examine some aspects of exploratory data analysis for the oecd wellbeing dataset from 2016. Note: Sections marked as in-depth material are not relevant for the exam. Required packages A standard package for basic data analysis: library(mosaic) Load the dataset The dataset can be obtained here. DOI: https://doi.org/10.1787/data-00707-en. If the dataset is available locally (on your computer), you can load it in the usual manner.
To do so, enter the path to the dataset: OECD Wellbeing dataset (2016) https://data-se.netlify.app/2018/10/16/oecd-wellbeing-dataset-2016/ Tue, 16 Oct 2018 00:00:00 +0000 https://data-se.netlify.app/2018/10/16/oecd-wellbeing-dataset-2016/ Packages We will need the following packages in this post: library(mosaic) library(knitr) library(DT) The OECD wellbeing study series The OECD keeps measuring the wellbeing (and associated variables) among its member states. On the project website, the OECD states: In recent years, concerns have emerged regarding the fact that macro-economic statistics, such as GDP, don’t provide a sufficiently detailed picture of the living conditions that ordinary people experience. While these concerns were already evident during the years of strong growth and good economic performance that characterised the early part of the decade, the financial and economic crisis has further amplified them. Change standard theme of ggplot https://data-se.netlify.app/2018/10/10/change-standard-theme-of-ggplot/ Wed, 10 Oct 2018 00:00:00 +0000 https://data-se.netlify.app/2018/10/10/change-standard-theme-of-ggplot/ ggplot2 is customizable. Frankly, one can change a heap of details - probably not everything, but a lot. Of course, one can add a theme to the ggplot call in order to change the theme. However, a more catch-all approach would be to change the standard theme of ggplot itself. In this post, we’ll investigate this option.
Load some data and the right packages: data(mtcars) library(tidyverse) Here’s the standard theme of ggplot, let’s have a look at it Talk - Populism in tweets of German politicians (talk at DGPs 2018) https://data-se.netlify.app/2018/09/14/talk-populism-in-tweets-of-german-politicians-talk-at-dgps-2018/ Fri, 14 Sep 2018 00:00:00 +0000 https://data-se.netlify.app/2018/09/14/talk-populism-in-tweets-of-german-politicians-talk-at-dgps-2018/ The slides of my talk Populism in tweets of German politicians can be found here http://data-se.netlify.com/slides/populist-twitter/populist-twitter-dgps2018.html#1. Data, code, and more can be found at Github: https://github.com/sebastiansauer/polits_tweet_mining DataExploR: Typische Businessfragen mit R analysieren https://data-se.netlify.app/2018/09/12/dataexplor-typische-businessfragen-mit-r-analysieren/ Wed, 12 Sep 2018 00:00:00 +0000 https://data-se.netlify.app/2018/09/12/dataexplor-typische-businessfragen-mit-r-analysieren/ In this post, we look at a fairly common task in data analysis: the analysis of survey data. Surveys are commonplace in many organizations: one wants to know whether the customers are satisfied or what the employees think of the management. We will not cover all aspects of the analysis (there is a lot to do) but pick out a few central ones. Let’s first load some useful packages: Wenn Excel aufgibt: Datenvisualisierung kann zu komplex für Excel werden https://data-se.netlify.app/2018/09/11/wenn-excel-aufgibt-datenvisualisierung-kann-zu-komplex-f%C3%BCr-excel-werden/ Tue, 11 Sep 2018 00:00:00 +0000 https://data-se.netlify.app/2018/09/11/wenn-excel-aufgibt-datenvisualisierung-kann-zu-komplex-f%C3%BCr-excel-werden/ MS Excel is a popular tool for data analysis, including data visualization. There are some examples showing that other tools, such as R, can produce better-looking diagrams; see this post.
This post deals with a related question: Are there diagrams that cannot be created with Excel, or only with great effort? My answer is: Yes, there are. Let’s consider an example. Visualizing Bayesian models As background, we use an analysis (see Plotting a logistic regression - some considerations https://data-se.netlify.app/2018/09/03/plotting-a-logistic-regression-some-considerations/ Mon, 03 Sep 2018 00:00:00 +0000 https://data-se.netlify.app/2018/09/03/plotting-a-logistic-regression-some-considerations/ library(mosaic) data(tips, package = "reshape2") Recode sex: tips %>% mutate(sex_n = case_when( sex == "Female" ~ 0, sex == "Male" ~ 1 )) -> tips2 Fit model: glm1 <- glm(sex_n ~ total_bill, data = tips2, family = "binomial") Way 1 plotModel(glm1) Way 2 Add predictions to data frame: tips2 %>% mutate(pred = predict(glm1, newdata = tips, type = "response")) %>% mutate(predict_Male = pred > .5) -> tips3 Check values of predictions: Reproducible academic writing with RMarkdown - Talk at DGPs 2018 https://data-se.netlify.app/2018/09/03/reproducible-academic-writing-with-rmarkdown-talk-at-dgps-2018/ Mon, 03 Sep 2018 00:00:00 +0000 https://data-se.netlify.app/2018/09/03/reproducible-academic-writing-with-rmarkdown-talk-at-dgps-2018/ Talk at DGPs 2018. Get slides here: http://data-se.netlify.com/slides/rmd-writing/rmd-writing_dgps2018.html. Talk - Predictors of AfD party success in the 2017 elections. A Bayesian modeling approach https://data-se.netlify.app/2018/09/02/predictors-of-afd-party-success-in-the-2017-elections-a-bayesian-modeling-approach/ Sun, 02 Sep 2018 00:00:00 +0000 https://data-se.netlify.app/2018/09/02/predictors-of-afd-party-success-in-the-2017-elections-a-bayesian-modeling-approach/ Talk at DGPs 2018.
Get slides here http://data-se.netlify.com/slides/afd_dgps2018/afd_dgps2018.html Bayesian modeling of populist party success in German federal elections - A notebook from the lab https://data-se.netlify.app/2018/08/25/bayesian-modeling-of-populist-party-success-in-german-federal-elections/ Sat, 25 Aug 2018 00:00:00 +0000 https://data-se.netlify.app/2018/08/25/bayesian-modeling-of-populist-party-success-in-german-federal-elections/ Following up on an earlier post, we will model the voting success of the (most prominent) populist party, AfD, in the recent federal elections. This time, Bayesian modeling techniques will be used, drawing on the excellent textbook by McElreath. Note that this post is rather a notebook of my thinking, doing, and erring. I’ve made no efforts to hide scaffolding. I think it will be confusing to the uninitiated and the initiated alike … Binning and recoding with R - some recommendations https://data-se.netlify.app/2018/08/09/binning-and-recoding-with-r-some-recommendations/ Thu, 09 Aug 2018 00:00:00 +0000 https://data-se.netlify.app/2018/08/09/binning-and-recoding-with-r-some-recommendations/ Recoding means changing the levels of a variable, for instance changing “1” to “woman” and “2” to “man”. Binning means aggregating several variable levels into one, for instance aggregating the values from “1.00 meter” to “1.60 meter” into “small_size”. Both operations are frequently necessary in practical data analysis. In this post, we review some methods to accomplish these two tasks. Let’s load some example data: data(tips, package = "reshape2") Some packages: Finding NAs in multiples columns (per row) https://data-se.netlify.app/2018/08/09/finding-nas-in-multiples-columns-per-rows/ Thu, 09 Aug 2018 00:00:00 +0000 https://data-se.netlify.app/2018/08/09/finding-nas-in-multiples-columns-per-rows/ Assume you would like to check for missing data, not in one column only but in several columns.
First, data and some packages: data(mtcars) library(tidyverse) Then, let’s introduce some missing data: mtcars[c(1,2), 1] <- NA mtcars[c(1, 3:4), 2] <- NA Don’t check columns individually Of course, you do not want to repeat yourself, and check each column individually, like this: sum(is.na(mtcars[[1]])) #> [1] 2 sum(is.na(mtcars[, 1])) # same #> [1] 2 Nor would one want to check each row individually: Power calculation for the general linear model https://data-se.netlify.app/2018/07/24/power-calculation-for-the-general-linear-model/ Tue, 24 Jul 2018 00:00:00 +0000 https://data-se.netlify.app/2018/07/24/power-calculation-for-the-general-linear-model/ Before conducting an experiment, one should compute the power - or, preferably, estimate the precision of the expected results. There are numerous ways to achieve this; here’s one using the R package pwr. Package pwr library(pwr) The workhorse function here is pwr.f2.test. Note that f2 refers to the effect size $f^2$ (see here), defined as: \[f^2 = \frac{R^2}{1-R^2}\]. For details on the function, see its help page: help("pwr.f2.test") pwr.f2.test(u = NULL, v = NULL, f2 = NULL, sig. How to prepare data for a gantt diagram https://data-se.netlify.app/2018/07/05/how-to-prepare-data-for-a-gantt-diagram/ Thu, 05 Jul 2018 00:00:00 +0000 https://data-se.netlify.app/2018/07/05/how-to-prepare-data-for-a-gantt-diagram/ There’s the new cool world of project management - agile, scrumbling, cool. There’s the old sluggish way of project management using stuff like gantt diagrams. Let’s stick to the old world and come up with a gantt diagram. The gantt diagram itself is no big deal. Just some horizontal lines referring to dates. Somewhat more interesting is to populate a raw data frame in a way that allows for convenient plotting.
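The data preparation hinted at for the gantt diagram can be sketched in base R. This is only a minimal sketch: the task names and dates are made up for illustration, and the long-format layout is an assumption about what is convenient for plotting, not necessarily the approach taken in the post.

```r
# Hypothetical project plan: one row per task, with start and end dates.
tasks <- data.frame(
  task  = c("Plan", "Collect data", "Analyze", "Report"),
  start = as.Date(c("2018-07-01", "2018-07-08", "2018-07-22", "2018-08-05")),
  end   = as.Date(c("2018-07-07", "2018-07-21", "2018-08-04", "2018-08-12"))
)

# Duration of each task in days.
tasks$duration <- as.numeric(tasks$end - tasks$start)

# Long format: one row per task and date type (start or end).
tasks_long <- data.frame(
  task = rep(tasks$task, times = 2),
  type = rep(c("start", "end"), each = nrow(tasks)),
  date = c(tasks$start, tasks$end)
)
```

Each task then contributes two rows, so a horizontal line per task can be drawn from its start date to its end date, for example with ggplot2's geom_line().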
Work with bibtex bib files like a pro https://data-se.netlify.app/2018/07/05/work-with-bibtex-bib-files-like-a-pro/ Thu, 05 Jul 2018 00:00:00 +0000 https://data-se.netlify.app/2018/07/05/work-with-bibtex-bib-files-like-a-pro/ Recently, I had to curate a list of publications for our institution. What’s the point, one might ask? Let’s leave aside that a number of colleagues do not use citation management software to work with their publications. They just hack the citation, if and when needed, into some Word files. Done. Fair enough, unless someone tries to come up with a list of all the publications of that institution. In that case, the curator will need some structured data, otherwise he or she will end up copy-pasting for the rest of the day. How to cite "in press" using Bibtex https://data-se.netlify.app/2018/07/01/how-to-cite-in-press-using-bibtex/ Sun, 01 Jul 2018 00:00:00 +0000 https://data-se.netlify.app/2018/07/01/how-to-cite-in-press-using-bibtex/ Bibtex entry type for conference talks suitable for APA https://data-se.netlify.app/2018/06/26/bibtex-entry-type-for-conference-talks-suitable-for-apa/ Tue, 26 Jun 2018 00:00:00 +0000 https://data-se.netlify.app/2018/06/26/bibtex-entry-type-for-conference-talks-suitable-for-apa/ I’ve wondered how best to cite a talk given at a conference that is not “really” published, in the sense that there’s no ISBN or similar identifier. One can argue that it is not worth citing a non-identifiable source - I basically agree with that. However, for some reasons it may be helpful to cite it anyway. For example, one may have to document the talks being given. For that purpose, I found this bibtex entry type helpful: Easy way to convert factors zu numbers https://data-se.netlify.app/2018/06/22/easy-way-to-convert-factors-zu-numbers/ Fri, 22 Jun 2018 00:00:00 +0000 https://data-se.netlify.app/2018/06/22/easy-way-to-convert-factors-zu-numbers/ Converting factors to numbers in R can be frustrating.
Consider the following situation: We have some data, and try to convert a factor (sex in tips, see below) to a numeric variable: library(tidyverse) library(sjmisc) # for recoding data(tips, package = "reshape2") glimpse(tips) #> Observations: 244 #> Variables: 7 #> $ total_bill <dbl> 16.99, 10.34, 21.01, 23.68, 24.59, 25.29, 8.77, 26.... #> $ tip <dbl> 1.01, 1.66, 3.50, 3.31, 3.61, 4. Some musings on the logistic map https://data-se.netlify.app/2018/06/19/some-musings-on-the-logistic-map/ Tue, 19 Jun 2018 00:00:00 +0000 https://data-se.netlify.app/2018/06/19/some-musings-on-the-logistic-map/ The logistic map is a well-known and simple growth model that is defined by the iterative equation \[x_{t+1} = 4 r x_t (1 - x_t)\], where $r$ is a parameter that can be thought of as the fertility and reproduction rate of the population. The allowed values of $x$ range between 0 and 1 inclusive, where 0 means the population is extinct. The maximum of 1 can be interpreted as the ecological carrying capacity of the system. Visualizing mean values between two groups - the tidyverse way https://data-se.netlify.app/2018/06/10/visualizing-summary-statistics-the-tidyverse-way/ Sun, 10 Jun 2018 00:00:00 +0000 https://data-se.netlify.app/2018/06/10/visualizing-summary-statistics-the-tidyverse-way/ A frequent job in data visualization is to present summary statistics. In this post, I show one way to plot mean values between groups using the tidyverse approach in comparison to the mosaic way. library(tidyverse) data(mtcars) library(mosaic) library(knitr) library(sjmisc) library(sjPlot) Visualizing mean values between two groups First, let’s compute the mean hp for automatic cars (am == 0) vs. manual cars (am == 1).
mtcars %>% group_by(am) %>% summarise(hp_am = mean(hp)) -> hp_am Now just hand over this data frame of summarized data to ggplot: Playing around with geo mapping: combining demographic data with spatial data https://data-se.netlify.app/2018/05/28/playing-around-with-geo-mapping-combining-demographic-data-with-spatial-data/ Mon, 28 May 2018 00:00:00 +0000 https://data-se.netlify.app/2018/05/28/playing-around-with-geo-mapping-combining-demographic-data-with-spatial-data/ In this post, we will play around with some basic geo mapping. More precisely, we will explore some easy ways to plot a choropleth map. First, let’s load some geo data from Bundeswahlleiter, and combine it with some socio-demographic data from the same source. Preparation Let’s load some packages: library(tidyverse) ## Warning: package 'dplyr' was built under R version 3.5.1 library(sf) library(viridis) suppressPackageStartupMessages(library(googleVis)) Geo data: my_path_wahlkreise <- "~/Documents/datasets/geo_maps/btw17_geometrie_wahlkreise_shp/Geometrie_Wahlkreise_19DBT.shp" file. Playing around with dumbbell plots https://data-se.netlify.app/2018/05/23/playing-around-with-dumbbell-plots/ Wed, 23 May 2018 00:00:00 +0000 https://data-se.netlify.app/2018/05/23/playing-around-with-dumbbell-plots/ Dumbbell plots can be used to show differences between two groups. Bob Rudis demonstrated a beautiful application of such plots using ggplot2 board methods. In this post, I will explain or comment on his code, and make a few changes. First, load some packages. pacman::p_load(tidyverse, ggalt) Let’s make up some data. Tip: Make up some data conveniently in Excel, copy it to the clipboard, and then paste it as a tribble (see below) into R.
Playing around with dataviz: Comparing distributions between groups https://data-se.netlify.app/2018/05/18/playing-around-dataviz-comparing-distributions-between-groups/ Fri, 18 May 2018 00:00:00 +0000 https://data-se.netlify.app/2018/05/18/playing-around-dataviz-comparing-distributions-between-groups/ What’s a nice way to display distributional differences between a (larger) number of groups? Boxplots are one way to go. In addition, the raw data may be shown as dots, but should be de-emphasized. Third, a trend or big picture comparing the groups will make sense in some cases. Ok, based on this reasoning, let’s do some visualizing. Let’s load some data (movies), and the usual culprits of packages. Playing around with dataviz: Showing correlations https://data-se.netlify.app/2018/05/18/playing-around-with-dataviz-showing-correlations/ Fri, 18 May 2018 00:00:00 +0000 https://data-se.netlify.app/2018/05/18/playing-around-with-dataviz-showing-correlations/ In this post, we are looking into some ways of displaying the association between (two) quantitative variables, aka correlation. Our goal is to present a rich representation of the correlation. Let’s take the dataset flights as an example. data(flights, package = "nycflights13") library(tidyverse) ## Warning: package 'dplyr' was built under R version 3.5.1 library(viridis) flights %>% filter(arr_delay < 100, dep_delay < 100) %>% ggplot(aes(x = dep_delay, y = arr_delay, color = origin)) + geom_point(alpha = . Showcase of Viridis, maps, and ggcounty https://data-se.netlify.app/2018/05/18/showcase-of-viridis-maps-and-ggounty/ Fri, 18 May 2018 00:00:00 +0000 https://data-se.netlify.app/2018/05/18/showcase-of-viridis-maps-and-ggounty/ This post shows how easy it can be to build a visually pleasing plot. We will use hrbrmstr’s ggcounty, which is an R package at this Github repo. The graphics engine is, as in most of my plots, Hadley Wickham’s ggplot. All built on R.
Standing on shoulders… Disclaimer: This example draws heavily on hrbrmstr’s example on this page. All credit is due to Rudis, and those on whose work he built. Why is the sample mean a good point estimator of the population mean? A simulation and some thoughts. https://data-se.netlify.app/2018/05/18/why-is-the-sample-mean-a-good-point-estimator-of-the-population-mean-a-simulation-and-some-thoughts/ Fri, 18 May 2018 00:00:00 +0000 https://data-se.netlify.app/2018/05/18/why-is-the-sample-mean-a-good-point-estimator-of-the-population-mean-a-simulation-and-some-thoughts/ It is frequently stated that the sample mean is a good or even the best point estimator of the corresponding population value. But why is that? In this post we are trying to get an intuition by using simulation inference methods. Assume you played a coin-tossing game with someone at some dark corner. “Someone” throws the coin 10 times, and wins 8 times (the guy was betting on heads, but that’s only for the sake of the story). Convenient way to cite blog posts using Bibtex https://data-se.netlify.app/2018/04/11/convenient-way-to-cite-blog-posts-using-bibtex/ Wed, 11 Apr 2018 00:00:00 +0000 https://data-se.netlify.app/2018/04/11/convenient-way-to-cite-blog-posts-using-bibtex/ A great way of writing (scholarly) texts is using Markdown. Bibtex interacts nicely with Markdown, so one can easily cite literature. One question that came up for me a couple of times recently was how to cite blogs in Bibtex. I found this solution to be the most convenient: @misc{stats_test, Author = {Sebastian Sauer}, Date-Added = {2018-03-29 13:54:38 +0000}, Date-Modified = {2018-03-29 13:55:51 +0000}, Doi = {10.17605/OSF.IO/SJHUY}, Howpublished = {Data Set}, Month = {01}, Title = {Results from an exam in inferential statistics}, Year = {2017}} The important points are the @misc class, and the Howpublished field.
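As a sketch of the same idea from within R, the base function bibentry() (package utils) can generate a @misc entry with a howpublished field. The key, author, title, and URL below are placeholders chosen for illustration, not a real reference.

```r
# Build a @misc entry for a blog post; all field values are placeholders.
blog_ref <- bibentry(
  bibtype      = "Misc",
  key          = "example_blog_post",
  author       = person("Jane", "Doe"),
  title        = "An example blog post",
  year         = 2018,
  howpublished = "Blog post",
  url          = "https://example.com/blog/post"
)

# Convert the entry to BibTeX lines and print them.
bib_lines <- toBibtex(blog_ref)
print(bib_lines)
```

The printed lines can be pasted into a .bib file and cited from Markdown via the entry's key.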
One-way ANOVA power analysis https://data-se.netlify.app/2018/04/11/one-way-anova-power-analysis/ Wed, 11 Apr 2018 00:00:00 +0000 https://data-se.netlify.app/2018/04/11/one-way-anova-power-analysis/ Computing or estimating power is a very useful procedure in order to weigh the reliability of study results. One frequent procedure in inferential statistics is the ANOVA, with the simplest form being the one-way ANOVA. This post shows how to compute power for this test. What’s the effect size? The first thing to note is that there is no such thing as “power” - in the sense that a sample or a test would have “its power”. Parse libraries from R project https://data-se.netlify.app/2018/04/11/parse-libraries-from-r-project/ Wed, 11 Apr 2018 00:00:00 +0000 https://data-se.netlify.app/2018/04/11/parse-libraries-from-r-project/ After writing a larger R project, it may be of interest which packages have been used. As I did not find a ready-to-use package, a colleague of mine - Norman Markgraf - came up with a nice solution. In this post, I build on his solution to provide a function that suits my needs of today: @Norman: Thanks for your great idea! First, some libraries: library(tidyverse) library(bibtex) library(testthat) Then, here is some path of an R project where we want to parse all rmd files: Visualisation of interaction for the logistic regression https://data-se.netlify.app/2018/04/02/visualisation-of-interaction-for-logistic-regression/ Mon, 02 Apr 2018 00:00:00 +0000 https://data-se.netlify.app/2018/04/02/visualisation-of-interaction-for-logistic-regression/ In this post we are plotting an interaction for a logistic regression. Interaction per se is a concept that is difficult to grasp; for a GLM it may be even more difficult, especially for the interaction of continuous variables. Plotting helps to grasp more easily what a model tries to tell us. First, load some packages.
library(tidyverse) ## ── Attaching packages ── tidyverse 1.2.1 ── ## ✔ ggplot2 3.0.0 ✔ purrr 0. Why "n-1" in empirical variance? A simulation. https://data-se.netlify.app/2018/03/24/why-n-1-in-empirical-variance-a-simulation/ Sat, 24 Mar 2018 00:00:00 +0000 https://data-se.netlify.app/2018/03/24/why-n-1-in-empirical-variance-a-simulation/ It is well known that the variance of a sample tends to underestimate the population variance. For this reason, the empirical variance is defined as: $var_{emp} = \frac{\sum_i (x_i - \bar{x})^2}{n-1}$. But why $n-1$, why not just $n$, as intuition (of some) dictates? Put briefly, as the variance of a sample tends to underestimate the population variance, we have to inflate it artificially; that’s why we put a smaller number (the “n-1”) in the denominator, resulting in a larger value of the whole fraction. Beispiel zu Simpsons Paradox https://data-se.netlify.app/2018/03/16/beispiel-zu-simpsons-paradox/ Fri, 16 Mar 2018 00:00:00 +0000 https://data-se.netlify.app/2018/03/16/beispiel-zu-simpsons-paradox/ knitr::opts_chunk$set(echo = FALSE) In this post, we discuss an example of Simpson’s paradox. The focus is not on the R syntax but on an intuitive explanation of Simpson’s paradox. (The syntax can be found in similar form in this post.) Say you have to decide between two doctors (Dr. Arriba and Dr. Bajo) and wonder which one is “better”. By “better” you mean “higher cure rate”. Both doctors treat the same two diseases: Severitis and Nervosia maskulina.
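The paradox with the two doctors can be made concrete with a small base R sketch. The counts below are invented purely for illustration (they are not the numbers from the post), and treating Severitis as the harder-to-cure disease is likewise an assumption.

```r
# Hypothetical counts: patients treated and cured, per doctor and disease.
d <- data.frame(
  doctor  = c("Arriba", "Arriba", "Bajo", "Bajo"),
  disease = c("Nervosia maskulina", "Severitis",
              "Nervosia maskulina", "Severitis"),
  treated = c(20, 80, 80, 20),
  cured   = c(18, 40, 68, 8)
)

# Cure rate within each disease: Dr. Arriba is ahead for both diseases.
d$rate <- d$cured / d$treated

# Overall cure rate per doctor: Dr. Bajo is ahead.
overall <- aggregate(cbind(treated, cured) ~ doctor, data = d, FUN = sum)
overall$rate <- overall$cured / overall$treated
```

In this made-up setting, Dr. Arriba mostly treats the disease with the lower cure rate, which pulls the overall rate down even though the rate within each disease is higher.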
Tangible data of normal distributed data https://data-se.netlify.app/2018/03/16/tangible-data-of-normal-distributed-data/ Fri, 16 Mar 2018 00:00:00 +0000 https://data-se.netlify.app/2018/03/16/tangible-data-of-normal-distributed-data/ A classical example of a normally distributed variable is height. However, I kept looking for data on the mean and sd for some populations, such as Germany. Now I found some reliable-looking data here. We will not question whether the assumption of normality holds, we just assume it. In the source, we can read that in Germany, the adult male population has the following parameters: mean: 174cm Map students to presentation slots https://data-se.netlify.app/2018/03/11/map-students-to-presentation-slots/ Sun, 11 Mar 2018 00:00:00 +0000 https://data-se.netlify.app/2018/03/11/map-students-to-presentation-slots/ As a teacher, I not only teach but also assess the achievements of students. One example of a typical student assignment is a presentation. You know, powerpoint slides and stuff. For that purpose, I often need to map students to one of several time slots. Here’s the R code I use for that purpose. library(tidyverse) ## ── Attaching packages ── tidyverse 1.2.1 ── ## ✔ ggplot2 3.0.0 ✔ purrr 0. Intuition to Simpson's paradox https://data-se.netlify.app/2018/03/09/intuition-to-simpson-s-paradox/ Fri, 09 Mar 2018 00:00:00 +0000 https://data-se.netlify.app/2018/03/09/intuition-to-simpson-s-paradox/ Say, you have to choose between two doctors (Anna and Berta). To decide which one is better, you check their success rates. Suppose that they deal with two conditions (Coolities and Dummities). So let’s compare their success rate for each of the two conditions (and the total success rate): This is the proportion of healing (success) of the first doctor, Dr.
Anna for each of the two conditions: How to create columns in a dataframe in R https://data-se.netlify.app/2018/03/07/how-to-create-columns-in-a-dataframe-in-r/ Wed, 07 Mar 2018 00:00:00 +0000 https://data-se.netlify.app/2018/03/07/how-to-create-columns-in-a-dataframe-in-r/ Note that we will use this library for this post: library(dplyr) ## Warning: package 'dplyr' was built under R version 3.5.1 ## ## Attaching package: 'dplyr' ## The following objects are masked from 'package:stats': ## ## filter, lag ## The following objects are masked from 'package:base': ## ## intersect, setdiff, setequal, union By the way, loading mosaic will load dplyr, too. One of the major data wrangling activities (in R and elsewhere) is to create a new column in a data frame. Papers publizieren. Versuch einer Anleitung https://data-se.netlify.app/2018/01/25/papers-publizieren-versuch-einer-anleitung/ Thu, 25 Jan 2018 00:00:00 +0000 https://data-se.netlify.app/2018/01/25/papers-publizieren-versuch-einer-anleitung/ At https://sebastiansauer.github.io/Talks-ses/pubws.html#/ you can find the HTML slides of a talk of mine on how to publish papers (or at least try to). The source code is available in this Github repo. The talk is licensed under CC-BY. Simulate p-hacking - adding observations https://data-se.netlify.app/2018/01/24/simulate-p-hacking-adding-observations/ Wed, 24 Jan 2018 00:00:00 +0000 https://data-se.netlify.app/2018/01/24/simulate-p-hacking-adding-observations/ Let’s simulate p-values as a function of sample size. We assume that some researcher collects one data point, computes the p-value, and repeats until the p-value falls below some arbitrary threshold. Oh and yes, there is no real effect. For the sake of spending the budget, assume that our researcher collects a sample of size $n=100$.
This idea stems from this great article False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant; cf. Visualizing a logistic regression the easy way https://data-se.netlify.app/2018/01/23/visualizing-a-logistic-regression-the-easy-way/ Tue, 23 Jan 2018 00:00:00 +0000 https://data-se.netlify.app/2018/01/23/visualizing-a-logistic-regression-the-easy-way/ Let’s visualize a GLM (logistic regression). First, load some data: data(tips, package = "reshape2") Compute a glm: glm_tips <- glm(sex ~ tip, data = tips, family = "binomial") Plot the model using mosaic: library(mosaic) plotModel(glm_tips) The curve does not really look s-shaped (ogive), but that’s ok because the data do not suggest a strong trend. The plot is not very beautiful either, but hey - it’s quick to produce 😁. Zusammenhang von Lernen und Noten im Statistikunterricht https://data-se.netlify.app/2017/12/20/zusammenhang-von-lernen-und-noten-im-statistikunterricht/ Wed, 20 Dec 2017 00:00:00 +0000 https://data-se.netlify.app/2017/12/20/zusammenhang-von-lernen-und-noten-im-statistikunterricht/ Does studying lead to better grades? Personal experience and general consensus say yes; at the very least, studying the material does no harm and often helps to achieve good grades in an exam on that material. But what evidence, scientific evidence, is there for this? At our university, the FOM, we conducted a small study on this question. More precisely, we gave our students a statistics test and asked how much they had prepared for it. A p-value picture https://data-se.netlify.app/2017/11/29/a-p-value-picture/ Wed, 29 Nov 2017 00:00:00 +0000 https://data-se.netlify.app/2017/11/29/a-p-value-picture/ There is much ado, and much to say, about the p-value. Let me add one more point; actually not really from myself, but from Diez, Barr, and Cetinkaya-Rundel (2012), p.
189; good book in one is looking for “orthodox” statistics. library(tidyverse) ## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ── ## ✔ ggplot2 3.0.0 ✔ purrr 0.2.5 ## ✔ tibble 1.4.2 ✔ dplyr 0.7.6 ## ✔ tidyr 0.8.1 ✔ stringr 1.3.1 ## ✔ readr 1. Grundlagen des Textminings mit R https://data-se.netlify.app/2017/11/28/textmining-grundlagen/ Tue, 28 Nov 2017 00:00:00 +0000 https://data-se.netlify.app/2017/11/28/textmining-grundlagen/ Lernziele: - Sie kennen zentrale Ziele und Begriffe des Textminings. - Sie wissen, was ein 'tidy text dataframe' ist. - Sie können Worthäufigkeiten auszählen. - Sie können Worthäufigkeiten anhand einer Wordcloud visualisieren. In dieser Übung benötigte R-Pakete: library(tidyverse) # Datenjudo library(stringr) # Textverarbeitung library(tidytext) # Textmining library(lsa) # Stopwörter library(SnowballC) # Wörter trunkieren library(wordcloud) # Wordcloud anzeigen Bitte installieren Sie rechtzeitig alle Pakete, z.B. in RStudio über den Reiter Packages > Install. Grundlagen des Textminings mit R - Teil 2 https://data-se.netlify.app/2017/11/28/grundlagen-des-textminings-mit-r-teil-2/ Tue, 28 Nov 2017 00:00:00 +0000 https://data-se.netlify.app/2017/11/28/grundlagen-des-textminings-mit-r-teil-2/ In dieser Übung benötigte R-Pakete: library(tidyverse) # Datenjudo library(stringr) # Textverarbeitung library(tidytext) # Textmining library(lsa) # Stopwörter library(SnowballC) # Wörter trunkieren library(wordcloud) # Wordcloud anzeigen library(skimr) # Überblicksstatistiken Bitte installieren Sie rechtzeitig alle Pakete, z.B. in RStudio über den Reiter Packages … Install. 
Read in the data from the last post: osf_link <- paste0("https://osf.io/b35r7/?action=download") afd <- read_csv(osf_link) ## Rows: 96 Columns: 2 ## ── Column specification ──────────────────────────────────────────────────────── ## Delimiter: "," ## chr (1): content ## dbl (1): page ## ## ℹ Use `spec()` to retrieve the full column specification for this data. Image path for blogdown https://data-se.netlify.app/2017/11/28/image-path-for-blogdown/ Tue, 28 Nov 2017 00:00:00 +0000 https://data-se.netlify.app/2017/11/28/image-path-for-blogdown/ How to include external images in a Hugo post? Suppose we have a file img1.png in project1, i.e., project1/img1.png. Do this: Copy your folder with images to static/. Use this path in your blogdown post: /project1/img1.png. Mind the leading slash! Example time This code (on my machine) ![](/images/textmining/tidytext-crop.png){ width="20%" } renders this: Note the nice width option. Knitr way The knitr way works similarly: knitr::include_graphics("/images/textmining/tidytext-crop.png") Dummy variables and regression https://data-se.netlify.app/2017/11/27/dummy-variables-and-regression/ Mon, 27 Nov 2017 00:00:00 +0000 https://data-se.netlify.app/2017/11/27/dummy-variables-and-regression/ For modeling cause-effect relationships, linear regression is among the most commonly used methods. Take, for example, the idea that the Gross Domestic Product (GDP) drives religiosity. Of course, we should have a strong theory that defends this choice and this directionality. Without a convincing theory it may be argued that the causal relationship is the other way round or completely different (i.e., some third variable accounts for any association between GDP and religiosity). Interactive diagrams in lieu of shiny?
https://data-se.netlify.app/2017/11/27/interactive-diagrams-in-lieu-of-shiny/ Mon, 27 Nov 2017 00:00:00 +0000 https://data-se.netlify.app/2017/11/27/interactive-diagrams-in-lieu-of-shiny/ One frequent use of the Shiny server software is displaying interactive data diagrams. The advantage of Shiny is its great flexibility; much more than “just graphics” can be done. Basically, Shiny provides a flexible GUI for your R program. But if you are simply aiming at displaying or exploring some data interactively, a much simpler approach may do; there are some nice libraries available in R for that. My favorite stats text book https://data-se.netlify.app/2017/11/27/my-favorite-stats-text-book/ Mon, 27 Nov 2017 00:00:00 +0000 https://data-se.netlify.app/2017/11/27/my-favorite-stats-text-book/ Some thoughts on what my favorite applied stats text book would look like. As consumers, I have in mind, e.g., students in business fields such as MBA programs. My ideal applied stats text book: is case study oriented (“Assume you would like to predict which movie will score highest next year based on some movie characteristics you know”); makes use of recent data analytics techniques such as tree based methods (Random Forests) or Shrinkage models (Lasso). Compute effect sizes with R. A primer. https://data-se.netlify.app/2017/11/21/compute-effect-sizes-with-r-a-primer/ Tue, 21 Nov 2017 00:00:00 +0000 https://data-se.netlify.app/2017/11/21/compute-effect-sizes-with-r-a-primer/ A typical “cook book recipe” for doing data analysis in an applied stats course is: report descriptive statistics; plot some nice diagrams; test hypotheses; report effect sizes. Let’s have a quick glance at these steps. We will use the dataset flights of the package nycflights13. data(flights, package = "nycflights13") This post will be tidyverse-driven.
library(tidyverse) library(skimr) library(mosaic) Let’s compute some summaries: flights %>% select(arr_delay) %>% skim Data summary Name Piped data Number of rows 336776 Number of columns 1 _______________________ Column type frequency: numeric 1 ________________________ Group variables None Variable type: numeric Hello World, this is Blogdown https://data-se.netlify.app/2017/11/21/hello-world-this-is-blogdown/ Tue, 21 Nov 2017 00:00:00 +0000 https://data-se.netlify.app/2017/11/21/hello-world-this-is-blogdown/ My blog at https://sebastiansauer.github.io/posts/ has moved. It is now here! This is the new home of my blog. In the (unlikely) case you are asking yourself “Why did you move your blog?”, here is the answer. I was using Jekyll at Github pages, which is great as long as you do not have a lot of R in your posts. But I did have a lot of R in my posts. Great dataviz examples in rstats https://data-se.netlify.app/2017/11/20/great-dataviz-examples-in-rstats/ Mon, 20 Nov 2017 00:00:00 +0000 https://data-se.netlify.app/2017/11/20/great-dataviz-examples-in-rstats/ Here come some stunning examples of data visualizations, all built with R. The R code of each diagram is available at the source. Enjoy! #beautiful. UPDATE: I’ve included links to the R source! Plotting geo maps along with subplots in ggplot2 I like this one by Ilya Kashnitsky: Similarly, by the same author: Source Great work, @ikashnitsky! Circlize (Chord) diagrams Plotting associations in a circular form yields aesthetic diagrams, see the following examples How well does a sample estimate the population? https://data-se.netlify.app/2017/11/17/inference/ Fri, 17 Nov 2017 00:00:00 +0000 https://data-se.netlify.app/2017/11/17/inference/ Data You work at the airport supervisory authority of NYC. Cool job. library(nycflights13) data(flights) Load packages library(mosaic) Draw a sample The supervisory authority draws a sample of 100 flights and determines the “typical” delay.
set.seed(42) sample(flights$arr_delay, size = 100) -> flights_sample And let’s compute the typical statistics: favstats(~flights_sample, na.rm = TRUE) #> min Q1 median Q3 max mean sd n missing #> -51 -18.75 -5 11.75 150 0.4387755 31.1604 98 2 Would $n=3$ be enough? Some thoughts on tidyeval and environments in R https://data-se.netlify.app/2017/11/16/tidyeval_basense/ Thu, 16 Nov 2017 00:00:00 +0000 https://data-se.netlify.app/2017/11/16/tidyeval_basense/ The tidyeval framework is a rather new, and in parts complementary, framework for dealing with non-standard evaluation (NSE) in R. In short, NSE is about capturing some R code, withholding execution, maybe editing the code, and finally executing it later and/or somewhere else. This post borrows heavily from Edwin Thon’s great post, and this post by the same author. In addition, most of the knowledge is derived from Hadley Wickham’s book Advanced R. Yart - Yet Another Markdown Report Template https://data-se.netlify.app/2017/11/15/yart/ Wed, 15 Nov 2017 00:00:00 +0000 https://data-se.netlify.app/2017/11/15/yart/ It would be useful to have an RMarkdown template for typical (academic) reports such as class assignments and bachelor/master theses. The LaTeX class “report” provides a suitable format for that. This package provides a simple wrapper around this class built on the standard pandoc template. With thanks to Aaron Wolen: Yart, i.e., this package, leans on his earlier work in the pandoc-letter repository, and extends it for use from R via the rmarkdown package. Package 'pradadata' on Github - featuring social science data https://data-se.netlify.app/2017/11/07/pradadata/ Tue, 07 Nov 2017 00:00:00 +0000 https://data-se.netlify.app/2017/11/07/pradadata/ Recently, I’ve put a package on Github featuring some social science data sets. Some data came from official sites; my contribution was to clean ‘em up and make them comfortably accessible for automated queries (nice header lines, no special encoding, flat csvs….).
In other cases it’s unpublished data collected by friends, students of mine or myself. Let’s check its contents using a function by Maiasaura from this SO post. library(pradadata) lsp <- function (package, all. Populism in tweets of German politicians https://data-se.netlify.app/2017/11/01/afd01/ Wed, 01 Nov 2017 00:00:00 +0000 https://data-se.netlify.app/2017/11/01/afd01/ The last months (years? forever???) have seen a surge in populism and a rise in nationalism. Not only in Russia, the United States, and Turkey, but also in some EU countries the ghost of nationalism-populism seems to be marching and gaining ground. As to Germany, on September 24, 2017, the 19th German federal election took place. The newly founded alt-right AfD (Alternative für Deutschland) made a leap and moved into the Bundestag. Data, machine-friendly, of the 2017 German federal elections https://data-se.netlify.app/2017/10/30/de-elec-data/ Mon, 30 Oct 2017 00:00:00 +0000 https://data-se.netlify.app/2017/10/30/de-elec-data/ In September 2017, the 19th German Bundestag was elected. As of this writing, the parties are still busy sorting out whether they want to be part of the government, with whom, and maybe whether they even want to form a government at all. This post is about providing the data in machine-friendly form, and in English. All data presented in this post regarding this (and previous) elections are published by the Bundeswahlleiter. Mapping foreigner ratio to AfD election results in the German Wahlkreise https://data-se.netlify.app/2017/10/22/afd-map-foreigners/ Sun, 22 Oct 2017 00:00:00 +0000 https://data-se.netlify.app/2017/10/22/afd-map-foreigners/ In a previous post, we shed some light on the idea that populism - as manifested in AfD election results - is associated with socioeconomic deprivation, be it subjective or objective.
We found some supporting patterns in the data, although that hypothesis is far from being complete; i.e., most of the variance remained unexplained. In this post, we test the hypothesis that AfD election results are negatively associated with the proportion of foreign nationals in a Wahlkreis. Simple way to separate train and test sample in R https://data-se.netlify.app/2017/10/17/train-test/ Tue, 17 Oct 2017 00:00:00 +0000 https://data-se.netlify.app/2017/10/17/train-test/ For statistical modeling, it is typical to separate a train sample from a test sample. The training sample is used to build (“train”) the model, whereas the test sample is used to gauge the predictive quality of the model. There are many ways to split off a test sample from the train sample. One quite simple, tidyverse-oriented way is the following. First, load the tidyverse. Next, load some data. library(tidyverse) data(Affairs, package = "AER") Then, create an index vector of the length of your train sample, say 80% of the total sample size. Two R plots side by side in .Rmd files https://data-se.netlify.app/2017/10/12/two-plots-rmd/ Thu, 12 Oct 2017 00:00:00 +0000 https://data-se.netlify.app/2017/10/12/two-plots-rmd/ I kept wondering how to plot two R plots side by side (i.e., in one “row”) in a .Rmd chunk. Here’s a way, well actually a number of ways, some good, some … not.
library(tidyverse) library(gridExtra) library(grid) library(png) library(downloader) library(grDevices) data(mtcars) Plots from ggplot Say, you have two plots from ggplot2, and you would like to put them next to each other, side by side (not underneath each other): Mapping unemployment ratio to AfD election results in German Wahlkreise https://data-se.netlify.app/2017/10/10/afd-map/ Tue, 10 Oct 2017 00:00:00 +0000 https://data-se.netlify.app/2017/10/10/afd-map/ There is the idea that the alt-right German party AfD is followed by those who are deprived of chances, those who fear falling down the social ladder, and so on. Let’s test this hypothesis. No, I am not thinking of hypothesis testing, p-values, and stuff. Rather, let’s color a map of German election districts (Wahlkreise) according to whether the area is poor AND the AfD gained a lot of votes (and vice versa: the area is rich AND the AfD gained relatively few votes). Mapping unemployment rate to German district areas https://data-se.netlify.app/2017/10/09/unemp-map/ Mon, 09 Oct 2017 00:00:00 +0000 https://data-se.netlify.app/2017/10/09/unemp-map/ A choropleth map is a geographic map where statistical information is mapped to certain areas. Let’s plot such a choropleth map in this post. Packages library(sf) library(stringr) library(tidyverse) library(readxl) Geo data The best place to get German geo data is the “Bundesamt für Kartografie und Geodäsie (BKG)”. One may basically use the data for any purpose unless it is against the law. I downloaded the data on 2017-10-09. More specifically, we are looking at the “Verwaltungsgebiete” (vg), that is, the administrative areas of the country, i.e. Drawing a country map https://data-se.netlify.app/2017/10/06/chloromap/ Fri, 06 Oct 2017 00:00:00 +0000 https://data-se.netlify.app/2017/10/06/chloromap/ Let’s draw a map of Bavaria, a state of Germany, in this post.
Packages library(tidyverse) library(maptools) library(sf) library(RColorBrewer) library(ggmap) library(viridis) library(stringr) Data Let’s get the data first. Basically, we need two data files: the shape file, i.e., the geographic details of state borders and points of interest; and the semantic information on the points of interest, e.g., town names. Shape file The shape file can be downloaded from this source: http://www. Conferences 2018 - Business psychology and related fields https://data-se.netlify.app/2017/09/27/kongresse_2018/ Wed, 27 Sep 2017 00:00:00 +0000 https://data-se.netlify.app/2017/09/27/kongresse_2018/ Here you will find a selection of scientific conferences in 2018 from business psychology and adjacent fields. National conferences (in DACH) 64. GfA-Frühjahrskongress: Arbeit(s).Wissen.Schaf(f)t – Grundlage für Management & Kompetenzentwicklung, February 21-23 in Frankfurt am Main Organizer: FOM in Frankfurt Deadline for submissions: September 15, 2017 Jubiläumskongress 20 Jahre Wirtschaftspsychologie der Gesellschaft für angewandte Wirtschaftspsychologie (GWPs), March 8-10, 2018 in Wernigerode Organizer: Gesellschaft für angewandte Wirtschaftspsychologie (GWPs) Deadline for submissions: OPEN Some intriguing psychology papers (open access) https://data-se.netlify.app/2017/09/26/psy-paper-suggestions/ Tue, 26 Sep 2017 00:00:00 +0000 https://data-se.netlify.app/2017/09/26/psy-paper-suggestions/ This post presents a compilation of links to psychology papers; I have chosen papers I find intriguing, particularly for working in class. All papers are open access (or from open access repositories), which renders classroom work easier. The papers are collected from a broad range of topics but mostly with a focus on general interest. The perspective is an applied one; I have not tried to select based on methodological rigor.
Crash course in data analysis with R https://data-se.netlify.app/2017/09/12/r-crashkurs/ Tue, 12 Sep 2017 00:00:00 +0000 https://data-se.netlify.app/2017/09/12/r-crashkurs/ Welcome to the R crash course Not everyone loves data analysis and statistics… to the same degree! At least that is my experience from teaching 🔥. Crash courses in R are comparable to dance classes before a wedding: they have saved many a life, but they do not replace a semester at the Paris dance academy (note the comparison to teaching at university). This crash course is intended for students or beginners in data analysis who, in a short time, want to make a desperate attempt … er … want to gain a basic overview of data analysis. Different ways to count NAs over multiple columns https://data-se.netlify.app/2017/09/08/sum-isna/ Fri, 08 Sep 2017 00:00:00 +0000 https://data-se.netlify.app/2017/09/08/sum-isna/ There are a number of ways in R to count NAs (missing values). A common use case is to count the NAs over multiple columns, i.e., a whole dataframe. That’s basically the question “how many NAs are there in each column of my dataframe?” This post demonstrates some ways to answer this question. Way 1: using sapply A typical way (or classical way) in R to achieve some iteration is using apply and friends. Different ways to present summaries in ggplot2 https://data-se.netlify.app/2017/09/08/ggplot-summaries/ Fri, 08 Sep 2017 00:00:00 +0000 https://data-se.netlify.app/2017/09/08/ggplot-summaries/ A convenient and widely applicable visualization for comparing groups with respect to a metric variable is the boxplot. However, comparing means is often accompanied by t-tests, ANOVAs, and friends. Such tests test the mean, not the median, and hence the boxplot does not present the tested statistic. It would be better to align test and diagram. How can that be achieved using ggplot2? This post demonstrates some possibilities. First, let’s plot a boxplot.
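As a rough illustration of aligning test and diagram, here is a minimal sketch that overlays group means on a boxplot. It is not taken from the post itself: the mtcars data, the grouping by cyl, and the red points are assumptions chosen for illustration, and the `fun` argument of stat_summary() assumes ggplot2 version 3.3.0 or later (older versions use `fun.y`).

```r
library(ggplot2)

# Assumption: mtcars (ships with R) as toy data, cyl as the grouping variable
p <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot() +                          # shows median and quartiles
  stat_summary(fun = mean, geom = "point",  # adds the group means as points
               color = "red", size = 3) +
  labs(x = "cyl", y = "mpg")

# The group means that the red points display
group_means <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
print(group_means)
```

Printing `p` draws the plot; the red points mark the statistic that a t-test or ANOVA actually compares, while the box still shows the median.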
Replacing dplyr::do by purrr::map. Some considerations https://data-se.netlify.app/2017/09/05/purrr-map-no-do/ Tue, 05 Sep 2017 00:00:00 +0000 https://data-se.netlify.app/2017/09/05/purrr-map-no-do/ Hadley Wickham has announced that dplyr::do will be deprecated in favor of purrr::map. In a recent post, I made use of do, and some commenters pointed that out to me. In this post, I will show use cases of map, specifically as a replacement for do. map is for lists; read more about lists here. library(tidyverse) library(broom) We will use mtcars as a sample dataframe (boring, I know, but convenient). data(mtcars) Cor is a function that takes a dataframe as its input As in the last post, assume we would like to conduct a correlation test. Comparing the pipe with base methods https://data-se.netlify.app/2017/08/31/some-pipes/ Thu, 31 Aug 2017 00:00:00 +0000 https://data-se.netlify.app/2017/08/31/some-pipes/ Some say the pipe (#tidyverse) makes analyses in R easier. I agree. This post demonstrates some examples. Let’s take the mtcars dataset as an example. data(mtcars) ?mtcars Say, we would like to compute the correlation between gasoline consumption (mpg) and horsepower (hp). Base approach 1 cor(mtcars[, c("mpg", "hp")]) ## mpg hp ## mpg 1.0000000 -0.7761684 ## hp -0.7761684 1.0000000 We use the [-operator (function) to select the columns; note that df[, c(col1, col2)] sees dataframes as matrices, and spits out a dataframe, not a vector: Shading normal curve made easy https://data-se.netlify.app/2017/08/29/simple-shading/ Tue, 29 Aug 2017 00:00:00 +0000 https://data-se.netlify.app/2017/08/29/simple-shading/ Shading values/areas under the normal curve is a quite frequent task in, e.g., educational contexts. Thanks to Hadley in this post, I found this easy solution.
library(ggplot2)

```r
ggplot(NULL, aes(c(-3,3))) +
  geom_area(stat = "function", fun = dnorm, fill = "#00998a", xlim = c(-3, 0)) +
  geom_area(stat = "function", fun = dnorm, fill = "grey80", xlim = c(0, 3))
```

![plot of chunk unnamed-chunk-1](/images/2017-08-29/unnamed-chunk-1-1.png)

Simple, right? Some minor beautification:

```r
ggplot(NULL, aes(c(-3,3))) +
  geom_area(stat = "function", fun = dnorm, fill = "#00998a", xlim = c(-3, 1)) +
  geom_area(stat = "function", fun = dnorm, fill = "grey80", xlim = c(1, 3)) +
  labs(x = "z", y = "") +
  scale_y_continuous(breaks = NULL) +
  scale_x_continuous(breaks = 1)
```

Programming with dplyr: Part 03, working with strings https://data-se.netlify.app/2017/08/09/dplyr_strings/ Wed, 09 Aug 2017 00:00:00 +0000 https://data-se.netlify.app/2017/08/09/dplyr_strings/ More on programming with dplyr: converting quosures to strings In this post, we have programmed a simple function using dplyr’s programming capabilities based on tidyeval; for more intro to programming with dplyr, see here. In this post, we’ll go one step further and program a function where a quosure will be turned into a string. Why this? Because quite a number of functions out there accept strings as input parameters. Precipitation - It never rains in Southern Nuremberg (?). Working with dates/times. https://data-se.netlify.app/2017/08/01/weather/ Tue, 01 Aug 2017 00:00:00 +0000 https://data-se.netlify.app/2017/08/01/weather/ In this post, we will explore some date and time parsing. As an example, we will work with weather data provided by City of Nuremberg, Environmental and Meteorological Data.
We will need these packages: library(tidyverse) # data reading and wrangling library(lubridate) # working with dates/times First, let’s import some precipitation data: file_name <- "~/Downloads/export-sun-nuremberg--flugfeld--airport--precipitation-data--1-hour--individuell.csv" rain <- read_csv2(file_name, skip = 13, col_names = FALSE) ## Warning in rbind(names(probs), probs_f): number of columns of result is not ## a multiple of vector length (arg 1) ## Warning: 300 parsing failures. Programming with dplyr: Part 02, writing a function https://data-se.netlify.app/2017/07/06/prop_fav/ Thu, 06 Jul 2017 00:00:00 +0000 https://data-se.netlify.app/2017/07/06/prop_fav/ Recently, as of dplyr 0.6.0, a new way of dealing with NSE was introduced, called tidyeval. As with every topic that begs our attention, the question “why bother” is in order. One answer is “you’ll need this stuff if you want to lock dplyr verbs inside a function”. Once you like dplyr and friends, a natural second step is to use the ideas not only for interactive use, but for more “programming”-type work, i.e. Effect sizes for the Mann-Whitney U Test: an intuition https://data-se.netlify.app/2017/07/04/effsize_utest/ Tue, 04 Jul 2017 00:00:00 +0000 https://data-se.netlify.app/2017/07/04/effsize_utest/ The Mann-Whitney U-Test is a test with wide applicability, wider than the t-Test. Why is that? First, because the U-Test is applicable to ordinal data, and it can be argued that confining the metric level of a psychological variable to the ordinal level is a reasonable bet. Second, it is robust, more robust than the t-test, because it only considers ranks, not raw values. In addition, some say that the efficiency of the U-Test is very close to the t-Test (. A second look at grouping with dplyr https://data-se.netlify.app/2017/06/28/second_look_group_by/ Wed, 28 Jun 2017 00:00:00 +0000 https://data-se.netlify.app/2017/06/28/second_look_group_by/ The one basic idea of dplyr is that each function should focus on one job.
That’s why there are no functions such as compute_summaries_by_group_with_robust_variants(df). Rather, summarising and grouping are seen as different jobs which should be accomplished by different functions. And, in turn, that’s why group_by, the grouping function of dplyr, is of considerable importance: this function should do the grouping for any operation whatsoever. Let’s load all tidyverse libraries in one go: Programming with dplyr: Part 01, introduction https://data-se.netlify.app/2017/06/28/prog_dplyr_01/ Wed, 28 Jun 2017 00:00:00 +0000 https://data-se.netlify.app/2017/06/28/prog_dplyr_01/ As for [others], Hadley Wickham’s dplyr, and more generally, the tidyverse approach has considerably changed the way I do data analyses. Most notably, the pipe (coming from magrittr by Stefan Milton Bache, see here) has crept into nearly every analysis I do. That is, every analysis except for functions and other non-interactive stuff. In those programming contexts, the dplyr way does not work, due to its non-standard evaluation, or NSE for short. Preparation of extraversion survey data https://data-se.netlify.app/2017/06/24/extra_prep/ Sat, 24 Jun 2017 00:00:00 +0000 https://data-se.netlify.app/2017/06/24/extra_prep/ For teaching purposes and out of curiosity towards some psychometric questions, I have run a survey on extraversion here. The dataset has been published at OSF (DOI 10.17605/OSF.IO/4KGZH). The survey is based on a Google form, which in turn saves the data in a Google spreadsheet. Before the data can be analyzed, some preparation and makeup is in order. This post shows some general makeup, typical for survey data. Download the data and load packages Download the data from source (Google spreadsheets); the package gsheet provides an easy interface for that purpose.
Print csv-file tables as plots https://data-se.netlify.app/2017/06/22/tab2plot/ Thu, 22 Jun 2017 00:00:00 +0000 https://data-se.netlify.app/2017/06/22/tab2plot/ tl;dr Use this convenience function to print a dataframe as a png-plot: tab2grob(). Source the function here: https://sebastiansauer.github.io/Rcode/tab2grob.R Easiest way in R: source("https://sebastiansauer.github.io/Rcode/tab2grob.R") Printing csv-dataframes as ggplot plots Recently, I wanted to print dataframes not as normal tables, but as a png-plot. See: Why? Well, basically as a convenience function for colleagues who are not into using Markdown & friends. As I am preparing some stats stuff (see my new open access course material here) using RMarkdown, I wanted to prepare the materials ready for use in Powerpoint. Review of "The 7 Deadly Sins of Psychology" by Chris Chambers https://data-se.netlify.app/2017/06/22/seven-sins/ Thu, 22 Jun 2017 00:00:00 +0000 https://data-se.netlify.app/2017/06/22/seven-sins/ tl;dr: great book. Read. The “Seven Sins” is concerned with the validity of psychological research. Can we at all, or to what degree, be certain about the conclusions reached in psychological research? More recently, replication efforts have cast doubt on our confidence in psychological research (1). In a similar vein, a recent paper states that in many research areas, researchers mostly report “successes” in the sense that they report that their studies confirm their hypotheses - with Psychology leading in the proportion of supported hypotheses (2). Identifying the package of a function https://data-se.netlify.app/2017/06/12/finds_funs/ Mon, 12 Jun 2017 00:00:00 +0000 https://data-se.netlify.app/2017/06/12/finds_funs/ tl;dr Suppose you want to know which package(s) a given R function belongs to, say filter.
Here comes find_funs to help you: find_funs("filter") ## # A tibble: 4 x 3 ## package_name builtin_pckage loaded ## <chr> <lgl> <lgl> ## 1 base TRUE TRUE ## 2 dplyr FALSE TRUE ## 3 plotly FALSE FALSE ## 4 stats TRUE TRUE This function will search all installed packages for this function name. It will return all the package names that match the function name (i.e. Sorting the x-axis in bargraphs using ggplot2 https://data-se.netlify.app/2017/06/05/ordering-bars/ Mon, 05 Jun 2017 00:00:00 +0000 https://data-se.netlify.app/2017/06/05/ordering-bars/ Some time ago, I posted about how to plot frequencies using ggplot2. One point that remained untouched was how to sort the order of the bars. Let’s look at that issue here. First, let’s load some data. data(tips, package = "reshape2") And the usual culprits. library(tidyverse) library(scales) # for percentage scales First, let’s plot a standard plot, with bars unsorted. tips %>% count(day) %>% mutate(perc = n / nrow(tips)) -> tips2 ggplot(tips2, aes(x = day, y = perc)) + geom_bar(stat = "identity") Hang on, what could ‘unsorted’ possibly mean? mean and sd of z-values https://data-se.netlify.app/2017/05/26/z-values/ Fri, 26 May 2017 00:00:00 +0000 https://data-se.netlify.app/2017/05/26/z-values/ Edit: This post was updated, including two errors fixed - thanks to (private) comments from Norman Markgraf. z-values, aka values coming from a z-transformation, are a frequent creature in statistics land. Among their properties are the following: the mean is zero; the variance is one (and hence the sd is one). But why is that? How come these two properties are true? The goal of this post is to shed light on these two properties of z-values. Simple way of plotting normal/logistic/etc. curve https://data-se.netlify.app/2017/05/24/plotting_s-curve/ Wed, 24 May 2017 00:00:00 +0000 https://data-se.netlify.app/2017/05/24/plotting_s-curve/ Plotting a function is often helpful to better understand what’s going on.
Plotting curves in base R is simple by virtue of the function curve. But how to draw curves using ggplot2? That’s a little bit more complicated but can still be accomplished in 1-2 lines. library(ggplot2) Normal curve p <- ggplot(data = data.frame(x = c(-3, 3)), aes(x)) p + stat_function(fun = dnorm, n = 101) stat_function is some kind of parallel function to curve. Squares maximize area - a visualization https://data-se.netlify.app/2017/05/19/maximize_area/ Fri, 19 May 2017 00:00:00 +0000 https://data-se.netlify.app/2017/05/19/maximize_area/ An old story is that of the farmer with a fence of some given length, say 20m. Now this farmer wants to put up his fence so that he claims the largest piece of land possible. What width (w) and height (h) should he pick? Instead of a formal proof, let’s start with a visualization. First, we need some packages. library(tidyverse) library(gganimate) library(RColorBrewer) library(scales) library(knitr) Now, let’s make up several ways to split up a rectangular piece of land. A predictor's unique contribution - (visual) demonstration https://data-se.netlify.app/2017/05/17/storks/ Wed, 17 May 2017 00:00:00 +0000 https://data-se.netlify.app/2017/05/17/storks/ A well-known property of regression models is that they capture the unique contribution of a predictor. By “unique” we mean the effect of the predictor (on the criterion) if the other predictor(s) is/are held constant. A typical classroom example goes along the following lines. All about storks There’s a correlation between babies and storks. Counties with lots of storks enjoy large numbers of babies and v.v. However, I have children, I know the storks are not overly involved in that business, so says the teacher (polite laughter in the audience). Crash course in data analysis with R https://data-se.netlify.app/2017/05/16/crashkurs/ Tue, 16 May 2017 00:00:00 +0000 https://data-se.netlify.app/2017/05/16/crashkurs/ Not everyone loves data analysis and statistics… to the same degree.
At least that is my experience from teaching 🔥. Crash courses in R are comparable to crash courses in French - you can do it, but the maxim “If everything else fails” should apply. This crash course is intended for students or beginners in data analysis who, in a short time, want to make a desperate attempt … er … want to gain a basic overview of data analysis. Introductory books for data analysis https://data-se.netlify.app/2017/05/15/books/ Mon, 15 May 2017 00:00:00 +0000 https://data-se.netlify.app/2017/05/15/books/ One way to dig into some topic such as data analysis is just doing it, by trial and error. Another way is reading blogs; a fruitful avenue in my experience. However, the classical way of reading some good book is anything but outdated. Here are some recommendations of books I found helpful as a starter (books in English and German). R for Data Science Grolemund, G., & Wickham, H. (2016). R for Data Science. Plotting true random numbers https://data-se.netlify.app/2017/05/12/true_random/ Fri, 12 May 2017 00:00:00 +0000 https://data-se.netlify.app/2017/05/12/true_random/ knitr::opts_chunk$set(fig.align = "center", out.width = "70%", fig.asp = .61) Every now and then, random numbers come in handy to demonstrate some statistical behavior. Of course, well-known approaches are rnorm and friends. These functions are what is called pseudo random number generators, because they are not random at all, strictly speaking, but determined by some algorithm. An algorithm is a sort of creature that is 100% predictable once you know the input (and the details of the algorithm). Variance explained vs. variance blurred https://data-se.netlify.app/2017/05/05/explained_variance/ Fri, 05 May 2017 00:00:00 +0000 https://data-se.netlify.app/2017/05/05/explained_variance/ Frequently, someone says that some indicator variable X “explains” some proportion of the variance of some target variable, Y. What does this actually mean?
By “mean” I am trying to find some intuition that “clicks” rather than citing the (well-known) formulas. To start with, let’s load some packages and make up some random data. library(tidyverse) n_rows <- 100 set.seed(271828) df <- tibble( exp_clean = rnorm(n = n_rows, mean = 2, sd = 1), cntrl_clean = rnorm(n = n_rows, mean = 0, sd = 1), exp_noisy = exp_clean + rnorm(n = n_rows, mean = 0, sd = 3), cntrl_noisy = cntrl_clean + rnorm(n = n_rows, mean = 0, sd = 3), ID = 1:n_rows) Here, we drew 100 cases from the population of the “experimental group” (mu = 2) and 100 cases from the control group (mu = 0). This blog now has a DOI https://data-se.netlify.app/2017/05/04/doi_added/ Thu, 04 May 2017 00:00:00 +0000 https://data-se.netlify.app/2017/05/04/doi_added/ A DOI is a useful feature of any electronic document. What the ID number in your passport is to you, the DOI is to a document. It simply helps to make sure you address the “object” you want to address. Similarly, there may exist several “Joachim Zwiwwelkoecks” in this world (well, it may or may not be the case). However, if each of these persons gets his (or her) unique ID (could be a simple number), then we would in principle always be certain that we address the right person. Introduction to data analysis with the R package 'dplyr' - R User Group Nürnberg https://data-se.netlify.app/2017/04/27/datenanalyse_mit_dplyr/ Thu, 27 Apr 2017 00:00:00 +0000 https://data-se.netlify.app/2017/04/27/datenanalyse_mit_dplyr/ Data judo with dplyr Introduction Within the R landscape, the package dplyr has quickly become one of the most widely used packages; it provides an innovative concept of data analysis. dplyr is characterized by two ideas. The first idea is that only tables (“dataframes” or “tibbles”) are processed, no other data structures. These tables are passed on from function to function.
The focus on tables simplifies the analysis, since columns do not have to be processed individually or by means of loops. Tools for Academic Writing - Comparison https://data-se.netlify.app/2017/04/26/writing_tools/ Wed, 26 Apr 2017 00:00:00 +0000 https://data-se.netlify.app/2017/04/26/writing_tools/ Many tools exist for academic writing including the notorious W.O.R.D.; but many more are out there. Let’s have a look at those tools, and discuss what’s important (what we expect the tool to deliver, e.g., beautiful typesetting). Typical tools for academic writing MS Word: A “classical” choice, relied upon by myriads of white collar workers… I myself have used it extensively for academic writing; the main advantage being its simplicity, that is, well, everybody knows it, and knows more or less how to handle it. Covariance as correlation https://data-se.netlify.app/2017/04/25/cor_as_cov/ Tue, 25 Apr 2017 00:00:00 +0000 https://data-se.netlify.app/2017/04/25/cor_as_cov/ Correlation is one of the most widely used and well-known measures of the association (linear association, that is) of two variables. Perhaps less well-known is that the correlation is in principle analogous to the covariance. To see this, consider the formula of the covariance of two empirical datasets, $X$ and $Y$: $$COV(X,Y) = \frac{1}{n} \cdot \big( \sum (X_i -\bar{X}) \cdot (Y_i - \bar{Y}) \big) $$ In other words, the covariance of $X$ and $Y$, $COV(X,Y)$, is the average of the products of the deviations of each value from its respective mean. Plotting skewed distributions https://data-se.netlify.app/2017/04/19/skewed-distribs/ Wed, 19 Apr 2017 00:00:00 +0000 https://data-se.netlify.app/2017/04/19/skewed-distribs/ Let’s plot some skewed stuff, aehm, distributions! Actually, the point I - initially - wanted to make is that for skewed distributions, you should not use means. Or at least, be very aware that (arithmetic) means can be grossly misleading. But for today, let’s focus on drawing skewed distributions.
Some packages: library(tidyverse) library(fGarch) # for snorm Some skewed distributions include: “polluted” normal distributions, i.e., mixtures of two normals; exponential distributions; gamma distributions; beta distributions. One way to visualize them is to draw their curve, i.e. Error bars for interaction effects with nominal variables https://data-se.netlify.app/2017/04/18/moderator-errorbars/ Tue, 18 Apr 2017 00:00:00 +0000 https://data-se.netlify.app/2017/04/18/moderator-errorbars/ Moderator effects (i.e., interaction or synergy effects) are a topic of frequent interest in many branches of science. A lot of ink has been spilled over this topic (I have done so, too, e.g., here). However, in that post I did not show how to visualize error in the case of a nominal (categorical) independent variable and a categorical moderator. Luckily, visualization of this case is quite straightforward with ggplot2. First, some data and packages to be loaded: The effect of sample on p-values. A simulation. https://data-se.netlify.app/2017/04/13/pvalue_sample_size/ Thu, 13 Apr 2017 00:00:00 +0000 https://data-se.netlify.app/2017/04/13/pvalue_sample_size/ It is well-known that the notorious p-value is sensitive to sample size: The larger the sample, the more likely the p-value is to fall below the magic number of .05. Of course, the p-value is also a function of the effect size, e.g., the distance between two means and the respective variances. But still, the p-value tends to become significant in the face of large samples, and non-significant otherwise. Theoretically, quite simple and well understood. Three ways to dichotomize a variable https://data-se.netlify.app/2017/04/11/three_ways_recoding_cutting/ Tue, 11 Apr 2017 00:00:00 +0000 https://data-se.netlify.app/2017/04/11/three_ways_recoding_cutting/ Dichotomizing is also called dummy coding. It means: Take a variable with multiple different values (>2), and transform it so that the output variable has 2 different values.
Note that this “thing” can be understood as consisting of two different aspects: Recoding and cutting. Recoding means that value “a” becomes value “b” etc. Cutting means that a “rope” of numbers is cut into several shorter “ropes” (that’s why it is called cutting). Rowwise operations in dplyr https://data-se.netlify.app/2017/03/27/rowwise_dplyr/ Mon, 27 Mar 2017 00:00:00 +0000 https://data-se.netlify.app/2017/03/27/rowwise_dplyr/ R thinks columnwise, not rowwise, at least in standard dataframe operations. A typical rowwise operation is to compute row means or row sums, for example to compute person sum scores for psychometric analyses. One workaround, typical for R, is to use functions such as apply (and friends). However, dplyr offers a quite nice alternative: library(dplyr) mtcars %>% rowwise() %>% mutate(mymean=mean(c(cyl,mpg))) %>% select(cyl, mpg, mymean) ## Source: local data frame [32 x 3] ## Groups: <by row> ## ## # A tibble: 32 × 3 ## cyl mpg mymean ## <dbl> <dbl> <dbl> ## 1 6 21. Convert list to dataframe https://data-se.netlify.app/2017/03/08/convert_list_to_dataframe/ Wed, 08 Mar 2017 00:00:00 +0000 https://data-se.netlify.app/2017/03/08/convert_list_to_dataframe/ A handy function to iterate stuff is the function purrr::map. It takes a function and applies it to all elements of a given vector. This vector can be a data frame - which is a list, technically - or some other sort of list (normal atomic vectors are fine, too). However, purrr::map is designed to return lists (not dataframes). For example, if you pass mosaic::favstats to map, you will get some favorite statistics for some variable: How to avoid Github/merge conflicts with Rmd-files https://data-se.netlify.app/2017/03/06/avoid_merge_conflicts/ Mon, 06 Mar 2017 00:00:00 +0000 https://data-se.netlify.app/2017/03/06/avoid_merge_conflicts/ One nice feature of .Rmd files is that they can (quite) easily be combined with version control systems such as Git and GitHub.
However, in my experience, merge conflicts are not so uncommon. That raises the question of how to avoid merge conflicts when syncing with GitHub. Here’s a quick overview of what to do to avoid that hassle: Sync often. Hard wrap the lines to approx. 80 characters. Pull before you start to change the source files. Favorite R commands https://data-se.netlify.app/2017/03/05/lieblingsbefehle/ Sun, 05 Mar 2017 00:00:00 +0000 https://data-se.netlify.app/2017/03/05/lieblingsbefehle/ Here is a list of some of my “favorite R functions”; they play an important role (for me) in introductory statistics courses. The list may change :-) When I speak of a “table”, I mean both dataframes and tibbles. Assignment - <- With the assignment operator <- you can assign a value to objects: x <- 1 mtcars2 <- mtcars Select columns as a vector - $ With the operator $ you can select a column of a table. AfD Mining - basic text mining of the AfD party platform https://data-se.netlify.app/2017/02/21/textmining_afd_01/ Tue, 21 Feb 2017 00:00:00 +0000 https://data-se.netlify.app/2017/02/21/textmining_afd_01/ R packages needed for this post: library(stringr) # text processing library(tidytext) # text mining library(pdftools) # read PDF library(downloader) # download data # library(knitr) # HTML tables library(htmlTable) # HTML tables library(lsa) # stop words library(SnowballC) # stem words library(wordcloud) # show word cloud library(gridExtra) # combined plots library(dplyr) # data judo library(ggplot2) # visualization An introductory tutorial on text mining; the party platform of the party “Alternative für Deutschland” (AfD) is analyzed. Against the background of the growing support for right-wing populists and the great danger that emanates from this body of thought, a multifaceted analysis of the phenomenon of “right-wing populism” seems necessary to me.
Checklist for Data Cleansing https://data-se.netlify.app/2017/02/13/data_cleansing/ Mon, 13 Feb 2017 00:00:00 +0000 https://data-se.netlify.app/2017/02/13/data_cleansing/ What this post is about: Data cleansing in practice with R Data analysis, in practice, typically consists of several steps which can be subsumed as “preparing data” and “modeling data” (not considering communication here): (Inspired by this) Often, the first major part – “prepare” – is the most time consuming. This can be lamented since many analysts prefer the cool modeling aspects (since I want to show my math!). In practice, one rather has to get his (her) hands dirt… Create a sentiment dictionary https://data-se.netlify.app/2017/02/04/sentiment_dictionary/ Sat, 04 Feb 2017 00:00:00 +0000 https://data-se.netlify.app/2017/02/04/sentiment_dictionary/ In text analysis (text mining), sentiment analysis is a typical task. Of course, the quality of the sentiment analysis stands and falls with the quality of the dictionary used (which does not mean that one cannot founder on other cliffs as well). The purpose of this post is to read in a sentiment lexicon in German. For this, the sentiment lexicon from this source is used (CC-BY-NC-SA 3.0). Background can be found in this paper. The data can be downloaded from there. Dataset 'performance in stats test' https://data-se.netlify.app/2017/01/27/data_test_inference/ Fri, 27 Jan 2017 00:00:00 +0000 https://data-se.netlify.app/2017/01/27/data_test_inference/ This post shows data cleaning and preparation for a data set on a statistics test (NHST inference). Data is published under a CC-licence, see here. Data was collected from 2015 to 2017 in statistics courses at the FOM university in different places in Germany. Several colleagues helped to collect the data. Thanks a lot! Now let’s enjoy the outcome (and make it freely available to all). Raw N is 743.
The test consists of 40 items which are framed as propositions; students are asked to respond with either “true” or “false” to each item. Convert logit to probability https://data-se.netlify.app/2017/01/24/convert_logit2prob/ Tue, 24 Jan 2017 00:00:00 +0000 https://data-se.netlify.app/2017/01/24/convert_logit2prob/ Logistic regression may give a headache initially. While the structure and idea are the same as in “normal” regression, the interpretation of the b’s (i.e., the regression coefficients) can be more challenging. This post provides a convenience function for converting the output of the glm function to a probability. Or more generally, to convert logits (that’s what is spit out by glm) to a probability. Note1: The objective of this post is to explain the mechanics of logits. Gentle intro to 'R-squared equals squared r' https://data-se.netlify.app/2017/01/20/rsquared/ Fri, 20 Jan 2017 00:00:00 +0000 https://data-se.netlify.app/2017/01/20/rsquared/ It comes as no surprise that $R^2$ (“coefficient of determination”) equals $r^2$ in simple regression (predictor X, criterion Y), where $r(X,Y)$ is Pearson’s correlation coefficient. $R^2$ equals the fraction of explained variance in a simple regression. However, the statistical (mathematical) background is often less clear or buried in less intuitive formulas. The goal of this post is to offer a gentle explanation of why $R^2 = r^2$, where $r$ is $r(Y,\hat{Y})$ and $\hat{Y}$ are the predicted values. The two ggplot2-ways of plottings bars https://data-se.netlify.app/2017/01/20/two_ways_barplots_with_ggplot2/ Fri, 20 Jan 2017 00:00:00 +0000 https://data-se.netlify.app/2017/01/20/two_ways_barplots_with_ggplot2/ Bar plots, while not appropriate for means, are helpful for conveying impressions of frequencies, particularly relative frequencies, i.e., proportions.
Intuition: Bar plots and histograms alike can be thought of as piles of Lego pieces, put on top of each other, where each Lego piece represents (is) one observation. Tables of frequencies are often not insightful to the eye. Bar plots are often much more accessible and present the story more clearly. Case study (YACSDA) on practical data analysis with dplyr https://data-se.netlify.app/2017/01/18/fallstudie_flights/ Wed, 18 Jan 2017 00:00:00 +0000 https://data-se.netlify.app/2017/01/18/fallstudie_flights/ Case study in data analysis using R package dplyr in German language. Practical data analysis with dplyr The R package dplyr by Hadley Wickham is a star guest on the R stage; it is frequently discussed in the relevant forums. With dplyr you can “chop up” data, that is, reshape and prepare it (“to wrangle” in English); “practical data analysis” is perhaps a good label. Many introductions can be found online, e.g., here or here. This text is not intended as an introduction or explanation but as an exercise for trying out (newly acquired) skills in practical data analysis within a case study. I am unavailable for review https://data-se.netlify.app/2017/01/17/unavailable_for_review/ Tue, 17 Jan 2017 00:00:00 +0000 https://data-se.netlify.app/2017/01/17/unavailable_for_review/ Dear editorial team, Thanks for considering me for review. After some thought-meandering I came to the conclusion that traditional publishers - such as the present publisher of this journal - support a business model that I deem unfair and inappropriate for regular science and for the interests of science and scientists alike. That is, the fees are much too high, thereby sucking resources out of the science system and out of society which could otherwise be put to better use.
Congresses 2017 - business psychology and related fields https://data-se.netlify.app/2017/01/17/kongresstermine_2017/ Tue, 17 Jan 2017 00:00:00 +0000 https://data-se.netlify.app/2017/01/17/kongresstermine_2017/ Here you will find a selection of scientific congresses in 2017 from business psychology and neighboring fields. National congresses 2017 (in Germany) GWPs, March 2-4 in Darmstadt: symposium of the Gesellschaft für angewandte Wirtschaftspsychologie (GWPs); submission deadline: Nov 30, 2016. TeaP, March 26-29 in Dresden: Conference of Experimental Psychologists; submission deadline: Nov 15, 2016. DiffPsy, September 4-6 in Munich: working conference of the section Differential Psychology, Personality Psychology, and Psychological Assessment. Visualizing Interaction Effects with ggplot2 https://data-se.netlify.app/2017/01/17/vis_interaction_effects/ Tue, 17 Jan 2017 00:00:00 +0000 https://data-se.netlify.app/2017/01/17/vis_interaction_effects/ Moderator effects or interaction effects are a frequent topic of scientific endeavor. Put bluntly, such effects answer the question whether the input variable X (predictor or independent variable IV) has an effect on the output variable (dependent variable DV) Y with “it depends”. More precisely, it depends on a second variable, M (Moderator). More formally, a moderation effect can be summarized as follows: If the effect of X on Y depends on M, a moderator effect takes place. How to import a strange CSV https://data-se.netlify.app/2017/01/12/strange_csvs/ Thu, 12 Jan 2017 00:00:00 +0000 https://data-se.netlify.app/2017/01/12/strange_csvs/ A typical task in data analysis is to import CSV-formatted data. CSV is nothing more than a text file with data in rectangular form; rows stand for observations (e.g., persons), and columns represent variables (such as age). Columns are separated by a “separator”, often a comma. Hence the name “CSV” - “comma separated values”.
Note however that the separator can in principle be anything you like (e.g., “;” or a tabulator or “ ”). R does not start https://data-se.netlify.app/2017/01/11/r_startet_nicht/ Wed, 11 Jan 2017 00:00:00 +0000 https://data-se.netlify.app/2017/01/11/r_startet_nicht/ Help! My R does not start! My R does start, but does not do what I want. Surely it has conspired against me (once again). Probably only scrapping it will help… Before you resort to extremes, here are some tips that have proven useful. Solutions if R does not run (properly) AEG: Aus. Ein. Gut. (Off. On. Fine.) Restart the computer. Especially recommended after installing new software. If you see an error message that speaks of a missing package (e.g. Convert data frame from 'wide' to 'long' https://data-se.netlify.app/2017/01/06/facial_beauty/ Fri, 06 Jan 2017 00:00:00 +0000 https://data-se.netlify.app/2017/01/06/facial_beauty/ Thanks to my student Marie Halbich who took the pains to collect the data! At times, your data set will be in “wide” format, i.e., many columns in comparison to rows. For some analyses however, it is more suitable to have the data in “long” format. That is, many rows in comparison to columns. Let’s have a look at this data set, for example. d <- read.csv("https://sebastiansauer.github.io/data/facial_beauty_raw.csv") This is the data from a study tapping into the effect of computerized “beautification” of some faces on subjective “like”. YACSDA (case study) on the dataset 'Affairs' https://data-se.netlify.app/2017/01/05/yacsda_affairs/ Thu, 05 Jan 2017 00:00:00 +0000 https://data-se.netlify.app/2017/01/05/yacsda_affairs/ This YACSDA (Yet-another-case-study-on-data-analysis) is composed in German. Some typical data analytical steps are introduced. What does the frequency of affairs (infidelity) in marriages depend on? This question is to be examined using the dataset Affair.
This post presents, by way of example, some basic methods of practical data analysis in a small case study (YACSDA). Source of the data: http://statsmodels.sourceforge.net/0.5.0/datasets/generated/fair.html The dataset can also be found (in similar form) in the R package COUNT (https://cran.r-project.org/web/packages/COUNT/index.html). First, let’s load the dataset into R. Why is the variance additive? An intuition. https://data-se.netlify.app/2017/01/04/additivity_variance/ Wed, 04 Jan 2017 00:00:00 +0000 https://data-se.netlify.app/2017/01/04/additivity_variance/ The variance of some data can be defined in rough terms as the mean of the squared deviations from the mean. Let’s repeat that because it is important: Variance: Mean of squared deviations from the mean. An example helps to illustrate. Assume some class of students are forced to write an exam in a statistics class (OMG). Let’s say the grades range from 1 to 6, 1 being the best and 6 the worst. A Plain Markdown Post https://data-se.netlify.app/2016/12/30/hello-markdown/ Fri, 30 Dec 2016 00:00:00 +0000 https://data-se.netlify.app/2016/12/30/hello-markdown/ This is a post written in plain Markdown (*.md) instead of R Markdown (*.Rmd). The major differences are: You cannot run any R code in a plain Markdown document, whereas in an R Markdown document, you can embed R code chunks (```{r}); A plain Markdown post is rendered through Blackfriday, and an R Markdown document is compiled by rmarkdown and Pandoc. There are many differences in syntax between Blackfriday’s Markdown and Pandoc’s Markdown. Surviving on the Titanic - YACSDA for nominal data https://data-se.netlify.app/2016/12/22/titanic/ Thu, 22 Dec 2016 00:00:00 +0000 https://data-se.netlify.app/2016/12/22/titanic/ This YACSDA (Yet-another-case-study-on-data-analysis) is about an exemplary analysis of nominal data using the “classical” case of the sinking of the Titanic.
One question that suggests itself here is, put somewhat polemically: Can (could) one buy oneself out of death? Or, more neutrally: Does the survival rate depend on the class in which the passenger travels? This exercise is meant to illustrate some basic procedures of data analysis; the target audience is newcomers to data analysis (with basic knowledge of R). Load data First, we load the data. Munich rent prices: an exercise on the p-value https://data-se.netlify.app/2016/12/21/mietpreis_p-wert/ Wed, 21 Dec 2016 00:00:00 +0000 https://data-se.netlify.app/2016/12/21/mietpreis_p-wert/ You want to test the hypothesis (H0) that the mean rent price in Munich is 16.28€ (as the Münchner Merkur once claimed). To do so, you draw a sample of size n = 36. Assume an SD of 3€ in the population (the set of all rental apartments in Munich). Let alpha be 5%. The mean of your sample is 16.79€. Take as H1 the hypothesis that the true mean rent price is higher. Wanted: What is the z-value of the sample result? Some tricks on dplyr::filter https://data-se.netlify.app/2016/12/21/dplyr_filter/ Wed, 21 Dec 2016 00:00:00 +0000 https://data-se.netlify.app/2016/12/21/dplyr_filter/ The R package dplyr has some attractive features; some say this package revolutionized their workflow. At any rate, I like it a lot, and I think it is very helpful. In this post, I would like to share some useful (I hope) ideas (“tricks”) on filter, one function of dplyr. This function does what the name suggests: it filters rows (i.e., observations such as persons). The addressed rows will be kept; the rest of the rows will be dropped. Some thoughts on 'Dear stats curriculum developers' https://data-se.netlify.app/2016/12/08/stats_curriculum/ Thu, 08 Dec 2016 00:00:00 +0000 https://data-se.netlify.app/2016/12/08/stats_curriculum/ Recently, Andrew Gelman (@StatModeling at Twitter) published a post with this title - ““Dear Major Textbook Publisher”: A Rant”.
In essence, he discussed what a good introductory stats textbook should be like, and complained about the low quality of so many textbooks out there. As I am also guilty of being in the business of coming up with stats curricula for my students (mostly applied courses for business-type students), I discuss some thoughts for “stats curriculum developers” (like myself). Simulation of p-values https://data-se.netlify.app/2016/12/01/simu_p/ Thu, 01 Dec 2016 00:00:00 +0000 https://data-se.netlify.app/2016/12/01/simu_p/ Teaching or learning stats can be a challenging endeavor. In my experience, starting with concrete (as opposed to abstract) examples helps many a learner. What also helps (for me) is visualizing. As p-values are still part and parcel of probably any given stats curriculum, here is a convenient function to simulate p-values and to plot them. “Simulating p-values” amounts to drawing many samples from a given, specified population (e.g., µ=100, s=15, normally distributed). Pipe the Variance https://data-se.netlify.app/2016/11/30/pipe_variance/ Wed, 30 Nov 2016 00:00:00 +0000 https://data-se.netlify.app/2016/11/30/pipe_variance/ One idea of problem solving is, or should be, I think, that one should tackle problems of high complexity, but not too high. That sounds trivial; a cooler way to put it would be “as hard as possible, as easy as necessary”, which is basically the same thing. In software development including Rstats, a similar principle applies. Sounds theoretical, I admit. So here are some lines of code that have bitten me recently: obs <- c(1,2,3) pred <- c(1,2,4) monster <- 1 - (sum((obs - pred)^2))/(sum((obs - mean(obs))^2)) monster ## [1] 0. Some musings on the validation of Satow's Extraversion questionnaire https://data-se.netlify.app/2016/11/23/validation_extraversion_questionnaire/ Wed, 23 Nov 2016 00:00:00 +0000 https://data-se.netlify.app/2016/11/23/validation_extraversion_questionnaire/ Measuring personality traits is one of (the?)
bread-and-butter businesses of psychologists, at least for quantitatively oriented ones. Literally thousands of psychometric questionnaires exist. Measures abound. Extroversion, part of the Big Five personality theory approach, is one of the most widely used and extensively scrutinized questionnaire dimensions tapping into human personality. One rather new, but quite often used questionnaire is Satow’s (2012) B5T. The reason for the popularity of this instrument is that it runs under a CC-licence - in contrast to the old ducks, which coûtent cher. Preparing survey results data https://data-se.netlify.app/2016/11/19/preparing_survey_data/ Sat, 19 Nov 2016 00:00:00 +0000 https://data-se.netlify.app/2016/11/19/preparing_survey_data/ Analyzing survey results is a frequent endeavor (for some including me). Let’s not think about arguments whether and when surveys are useful or not (for some recent criticism see Briggs' book). Typically, respondents circle some option ranging from “don’t agree at all” to “completely agree” for each question (or “item”). Typically, four to six boxes are given where one is expected to tick one. In this tutorial, I will discuss some typical steps to prepare the data for subsequent analyses. Crash course on creating bar plots for survey data https://data-se.netlify.app/2016/11/13/crashkurs_barplots/ Sun, 13 Nov 2016 00:00:00 +0000 https://data-se.netlify.app/2016/11/13/crashkurs_barplots/ A quite common kind of data in business comes from employee surveys. These data then need to be prepared and presented graphically. This post gives some basic pointers for that. We assume basic knowledge of R :-) A more detailed description can be found, e.g., here. Load packages Don’t forget: A computer program (e.g., an R package) can only be loaded if it has been installed beforehand (but it is enough to install the program/R package once).
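The install-once, load-every-session distinction can be sketched as follows. This is a minimal sketch and not the post's own code; the helper name `load_or_install` is hypothetical and only serves as an illustration.

```r
# Hypothetical helper (not from the post): load a package,
# installing it first only if it is not yet available.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # needed only once per machine/library
  }
  library(pkg, character.only = TRUE)  # needed in every new R session
  invisible(TRUE)
}

# "stats" ships with R, so nothing gets installed here; it is only attached
load_or_install("stats")
```

For a package that is not yet installed, the same call would trigger `install.packages()` once and simply attach the package in later sessions.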
New bar stacking with ggplot 2.2.0 https://data-se.netlify.app/2016/11/13/improved_bar_stacking_ggplot2_220/ Sun, 13 Nov 2016 00:00:00 +0000 https://data-se.netlify.app/2016/11/13/improved_bar_stacking_ggplot2_220/ Recently, ggplot2 2.2.0 was released. Among other news, bar stacking was improved. Here is a short demonstration. Load libraries library(tidyverse) library(htmlTable) … and load data: data <- read.csv("https://osf.io/meyhp/?action=download") DOI for this piece of data is 10.17605/OSF.IO/4KGZH. The data consists of results of a survey on extraversion and associated behavior. Say, we would like to visualize the responses to the extraversion items (there are 10 of them). So, let’s see. First, compute a summary of the responses. Some thoughts (and simulation) on overfitting https://data-se.netlify.app/2016/11/13/overfitting_simulation/ Sun, 13 Nov 2016 00:00:00 +0000 https://data-se.netlify.app/2016/11/13/overfitting_simulation/ Overfitting is a common problem in data analysis. Some go as far as saying that “most of” published research is false (John Ioannidis); overfitting being one, maybe the central, problem behind it. In this post, we explore some aspects of the notion of overfitting. Assume we have 10 metric variables v (personality/health/behavior/gene indicator variables), and, say, 10 variables for splitting up subgroups (aged vs. young, female vs. male, etc.), so 10 dichotomous variables. Plotting survey results using `ggplot2` https://data-se.netlify.app/2016/11/12/plotting_surveys/ Sat, 12 Nov 2016 00:00:00 +0000 https://data-se.netlify.app/2016/11/12/plotting_surveys/ Plotting (and more generally, analyzing) survey results is a frequent endeavor in many business environments. Let’s not think about arguments whether and when surveys are useful (for some recent criticism see Briggs' book). Typically, respondents circle some option ranging from “don’t agree at all” to “completely agree” for each question (or “item”).
Typically, four to six boxes are given where one is expected to tick one. In this tutorial, I will discuss some barplot type visualizations; the presentation is based on ggplot2 (within the R environment). Horoscope study on the Barnum effect https://data-se.netlify.app/2016/11/09/horoskop-studie/ Wed, 09 Nov 2016 00:00:00 +0000 https://data-se.netlify.app/2016/11/09/horoskop-studie/ Many people believe in horoscopes. But why? One reason could be that horoscopes are simply good. What does good mean: they fit me but not other people (with other zodiac signs), and they say things that are useful. Another reason could be that they flatter us and are platitudes that everyone agrees with: “You are basically a great person, but sometimes a bit impatient” (oh yes, absolutely, fits perfectly!). “Today you will meet someone who could become a great love” (Sounds good! Some reflections on stochastic independence https://data-se.netlify.app/2016/11/08/stochastic_independence/ Tue, 08 Nov 2016 00:00:00 +0000 https://data-se.netlify.app/2016/11/08/stochastic_independence/ We are often interested in the question whether two variables are “associated”, “correlated” (I mean the normal English term) or “dependent”. What exactly, or rather in normal words, does that mean? Let’s look at an easy case. NOTE: The example has been updated to reflect a more tangible and sensible scenario (find the old one in the previous commit at Github). Titanic data For example, let’s look at survival rates of the Titanic disaster, to see whether the probability of survival (event A) depends on whether you embarked in 1st class (event B).
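The comparison of P(A) and P(A | B) described for the Titanic case can be sketched in a few lines of R. This is a minimal sketch that assumes the `Titanic` contingency table shipped with base R (package datasets), which may differ from the data file used in the post; the object names `tab`, `p_A`, and `p_A_given_B` are chosen here for illustration.

```r
# Built-in 4-way contingency table: Class x Sex x Age x Survived
data(Titanic)

# Collapse to a 2-way table of counts: Class x Survived
tab <- margin.table(Titanic, margin = c(1, 4))

# P(A): overall probability of survival
p_A <- sum(tab[, "Yes"]) / sum(tab)

# P(A | B): probability of survival given 1st class
p_A_given_B <- tab["1st", "Yes"] / sum(tab["1st", ])

round(c(p_A = p_A, p_A_given_B = p_A_given_B), 2)
```

If A and B were stochastically independent, P(A | B) would equal P(A); a clear gap between the two values indicates that survival and class are dependent.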
Bind lists to data frame for aggregating linear models results https://data-se.netlify.app/2016/11/04/bind_list_to_dataframe_lm/ Fri, 04 Nov 2016 00:00:00 +0000 https://data-se.netlify.app/2016/11/04/bind_list_to_dataframe_lm/ I found myself doing the following: I had a bunch of predictors, one (numeric) outcome, and wanted to run a simple regression for each of the predictors. Having a bunch of model results, I would like to have them bundled in one data frame. So, here is one way to do it. First, load some data. data(mtcars) str(mtcars) ## 'data.frame': 32 obs. of 11 variables: ## $ mpg : num 21 21 22. How to plot a 'percentage plot' with ggplot2 https://data-se.netlify.app/2016/11/03/percentage_plot_ggplot2_v2/ Thu, 03 Nov 2016 00:00:00 +0000 https://data-se.netlify.app/2016/11/03/percentage_plot_ggplot2_v2/ At times it is convenient to draw a frequency bar plot; at times we prefer not the bare frequencies but the proportions or the percentages per category. There are lots of ways of doing so; let’s look at some ggplot2 ways. First, let’s load some data. data(tips, package = "reshape2") And the typical libraries. library(dplyr) library(ggplot2) library(tidyr) library(scales) # for percentage scales Way 1 tips %>% count(day) %>% mutate(perc = n / nrow(tips)) -> tips2 ggplot(tips2, aes(x = day, y = perc)) + geom_bar(stat = "identity") Way 2 ggplot(tips, aes(x = day)) + geom_bar(aes(y = (. Different ways to set figure size in RMarkdown https://data-se.netlify.app/2016/11/02/figure_sizing_knitr/ Wed, 02 Nov 2016 00:00:00 +0000 https://data-se.netlify.app/2016/11/02/figure_sizing_knitr/ Markdown is thought of as a “lightweight” markup language, hence the name markdown. That’s why formatting options are scarce. However, there are some extensions, for instance brought by RMarkdown. One point of particular interest is the sizing of figures. Let’s look at some ways to size a figure with RMarkdown.
We take some data first: data(mtcars) names(mtcars) ## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" ## [11] "carb" Now let’s plot. CLES plot https://data-se.netlify.app/2016/10/17/cles-plot/ Mon, 17 Oct 2016 00:00:00 +0000 https://data-se.netlify.app/2016/10/17/cles-plot/ In data analysis, we often ask “Do these two groups differ in the outcome variable?” Asking this question, a tacit assumption may be that the grouping variable is the cause of the difference in the outcome variable. For example, assume the two groups are “treatment group” and “control group”, and the outcome variable is “pain reduction”. A typical approach would be to report the strength of the difference by means of Cohen’s d. Checking for NA with dplyr https://data-se.netlify.app/2016/10/16/nas-with-dplyr/ Sun, 16 Oct 2016 00:00:00 +0000 https://data-se.netlify.app/2016/10/16/nas-with-dplyr/ Often, we want to check for missing values (NAs). There are of course many ways to do so. dplyr provides a quite nice one. First, let’s load some data: library(readr) extra_file <- "https://raw.github.com/sebastiansauer/Daten_Unterricht/master/extra.csv" extra_df <- read_csv(extra_file) Note that extra_df is a data frame consisting of survey items regarding extraversion and related behavior. In case the dataframe is quite large (many columns), it is helpful to have some quick way. Here, we have 25 columns. Multiple ways to subsetting data frames in R https://data-se.netlify.app/2016/10/15/indexing-in-r/ Sat, 15 Oct 2016 00:00:00 +0000 https://data-se.netlify.app/2016/10/15/indexing-in-r/ Subsetting a data frame is an essential and frequently performed task. Here, some basic ideas are presented. Get some data first. str(mtcars) ## 'data.frame': 32 obs. of 11 variables: ## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ... ## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ... ## $ disp: num 160 160 108 258 360 .
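Since the summary of the subsetting post breaks off here, the following is a minimal sketch of some common base R ways to subset `mtcars`. It is not necessarily the selection of techniques the post itself presents; the object names `four_cyl`, `mpg_vec`, and `sub` are chosen for illustration.

```r
data(mtcars)

# 1. Positional indexing: rows 1 to 3, columns 1 and 2
mtcars[1:3, c(1, 2)]

# 2. Columns by name
mtcars[1:3, c("mpg", "cyl")]

# 3. Rows by logical condition (note the trailing comma: all columns)
four_cyl <- mtcars[mtcars$cyl == 4, ]

# 4. A single column as a plain vector
mpg_vec <- mtcars$mpg  # equivalent: mtcars[["mpg"]]

# 5. subset(): row condition and column selection in one call
sub <- subset(mtcars, cyl == 4 & mpg > 30, select = c(mpg, cyl, hp))
```

Bracket indexing is the most general form; `subset()` reads more naturally in interactive work but relies on non-standard evaluation, so bracket indexing tends to be the safer choice inside functions.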
How to read Github files into R easily https://data-se.netlify.app/2016/10/12/download-from-github/ Wed, 12 Oct 2016 00:00:00 +0000 https://data-se.netlify.app/2016/10/12/download-from-github/ Downloading a folder (repository) from Github as a whole The most direct way to get data from Github onto your computer or into R is to download the repository. That is, click the big green button: The big, green button saying “Clone or download”, click it and choose “download zip”. Of course, for those using Git and Github, it would be appropriate to clone the repository. And, although appearing more advanced, cloning has the definite advantage that you’ll enjoy the whole of the Github features. Simple (R-)Markdown template for 'Onepager-reports' etc. https://data-se.netlify.app/2016/10/05/template-onepager/ Wed, 05 Oct 2016 00:00:00 +0000 https://data-se.netlify.app/2016/10/05/template-onepager/ In my role as a teacher, I (have to) write a lot of marking feedback reports. My university provides a website to facilitate the process; that’s great. I have also been writing my reports with Pages, Word, or friends. But somewhat cooler, more attractive, and more reproducible would be using (a markup language such as) Markdown. Basically, that’s easy, but it would be of help to have a template that makes up a nice and nicely formatted report, like this: Using purrr to build a data frame of vectors (eg., from effect size statistics) https://data-se.netlify.app/2016/09/29/purrr-effsize/ Thu, 29 Sep 2016 00:00:00 +0000 https://data-se.netlify.app/2016/09/29/purrr-effsize/ I just tried to accomplish the following with R: Compute effect sizes for a variable between two groups. Actually, not one numeric variable but many. And compute not only one measure of effect size but several (d, lower/upper CI, CLES,…). So how to do that?
First, let’s load some data and some (tidyverse and effect size) packages: knitr::opts_chunk$set(echo = TRUE, cache = FALSE, message = FALSE) library(purrr) library(ggplot2) library(dplyr) library(broom) library(tibble) library(compute. Summary for multiple variables using purrr https://data-se.netlify.app/2016/09/28/summary-mult-cols-purrr/ Wed, 28 Sep 2016 00:00:00 +0000 https://data-se.netlify.app/2016/09/28/summary-mult-cols-purrr/ A frequent task in data analysis is to get a summary of a bunch of variables. Often, graphical summaries (diagrams) are wanted. However, at times numerical summaries are in order. How to get that in R? That’s the question of the present post. Of course, there are several ways. One way, using purrr, is the following. I liked it quite a bit, which is why I am showing it here. First, let’s load some data and some packages we will make use of. EDIT: Running multiple simple regressions with purrr https://data-se.netlify.app/2016/09/26/edit-multiple_lm_purrr_edit/ Mon, 26 Sep 2016 00:00:00 +0000 https://data-se.netlify.app/2016/09/26/edit-multiple_lm_purrr_edit/ EDIT based on comments/suggestions from @JonoCarroll Disqus profile and @tjmahr twitter profile. See below (last step; look for “EDIT”). Thanks for the input! 👍 reading time: 10 min. Hadley Wickham’s purrr has given a new look at handling data structures to the typical R user (some reasoning suggests that average users don’t exist, but that’s a different story). I just tried the following with purrr: Meditate about running a simple regression, FWIW Take a dataframe with candidate predictors and an outcome Throw one predictor at a time into the regression, where the outcome variable remains the same (i.
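The steps listed above can be sketched roughly as follows. This is a minimal illustration under my own assumptions (the `mtcars` data with `mpg` as outcome, and the hypothetical names `outcome`, `predictors`, `models`, `r2`, `results`); it is not necessarily the code used in the post, which is truncated here.

```r
# Sketch only: one simple regression per predictor, same outcome throughout.
# Data set, outcome, and object names are assumptions for illustration.
library(purrr)

data(mtcars)
outcome <- "mpg"
predictors <- setdiff(names(mtcars), outcome)

# Fit lm(mpg ~ predictor) once per predictor; keep predictor names as list names
models <- predictors %>%
  set_names() %>%
  map(~ lm(reformulate(.x, response = outcome), data = mtcars))

# Pull one statistic per model (here: R squared) into a named numeric vector
r2 <- map_dbl(models, ~ summary(.x)$r.squared)

# Bundle the results in one data frame
results <- data.frame(predictor = names(r2), r_squared = unname(r2))
```

Building the formula with `reformulate()` avoids pasting strings by hand, and the named list keeps track of which model belongs to which predictor.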
Running multiple simple regressions with purrr https://data-se.netlify.app/2016/09/23/multiple-lm-purrr2/ Fri, 23 Sep 2016 00:00:00 +0000 https://data-se.netlify.app/2016/09/23/multiple-lm-purrr2/ Hadley Wickham’s purrr has given a new look at handling data structures to the typical R user (some reasoning suggests that average users don’t exist, but that’s a different story). I just tried the following with purrr: Meditate about running a simple regression, FWIW Take a dataframe with candidate predictors and an outcome Throw one predictor at a time into the regression, where the outcome variable remains the same (i. Code example for plotting boxplots instead of mean bars https://data-se.netlify.app/2016/09/22/use-boxplots/ Thu, 22 Sep 2016 00:00:00 +0000 https://data-se.netlify.app/2016/09/22/use-boxplots/ At a recent psychology conference I had the impression that psychologists keep preferring to show mean values, but appear less interested in more detailed plots such as the boxplot. Plots like the boxplot are richer in information, but not more difficult to perceive. For those who would like to have an easy starter on how to visualize more informative plots (more than mean bars), here is a suggestion: # install.packages("Ecdat") library(Ecdat) # dataset on extramarital affairs data(Fair) str(Fair) ## 'data. How to promote open science? Some practical recommendations https://data-se.netlify.app/2016/09/22/openscience/ Thu, 22 Sep 2016 00:00:00 +0000 https://data-se.netlify.app/2016/09/22/openscience/ I just attended the biennial conference of the German society of psychology (DGPs) in Leipzig; open science was a central, albeit not undisputed topic; a lot of interesting related twitter discussion. image source: Felix Schönbrodt Interestingly, a strong voice of German scientists uttered their concerns about being scooped if/when sharing their data (during the official meeting of the society).
This being said (sad), the German research foundation (DFG) has updated its guidelines, now stressing (more strongly) that publicly funded projects should share their data, with the rationale that the data do not belong to the individual scientist but to the public, as the public funded it (I find that convincing). Case study on exploratory data analysis (YACSDA) with the 'TopGear' data set https://data-se.netlify.app/2016/09/14/yacsda_topgear/ Wed, 14 Sep 2016 00:00:00 +0000 https://data-se.netlify.app/2016/09/14/yacsda_topgear/ YACSDA in German language. In this case study (YACSDA: Yet another case study of data analysis), the TopGear data set is analyzed, mainly by graphical means. It is less a sweeping attempt to answer every interesting question (or to demonstrate every possible analysis tool) than a glimpse at simple exploratory methods. library(robustHD) ## Loading required package: perry ## Loading required package: parallel ## Loading required package: robustbase data(TopGear) # load data from package library(tidyverse) Numerical overview glimpse(TopGear) ## Observations: 297 ## Variables: 32 ## $ Maker <fctr> Alfa Romeo, Alfa Romeo, Aston Martin, Asto. Why Likert scales are (in general) not metric https://data-se.netlify.app/2016/09/07/likert-not-metric/ Wed, 07 Sep 2016 00:00:00 +0000 https://data-se.netlify.app/2016/09/07/likert-not-metric/ Likert scales are psychologists' bread-and-butter tool. Literally, thousands (!) of such “scales” (as they are called, rightfully or not) do exist. To get a feeling: The APA links to this database where 25,000 tests are listed (as stated by the website)! That is indeed an enormous number. Most of these psychological tests use so-called Likert scales (see this Wikipedia article).
For example: (Source: Wikipedia by Nicholas Smith) Given their widespread use, the question how useful such tests are has arisen many times; see here, here, or here. Why is SD(X) unequal to MAD(X)? https://data-se.netlify.app/2016/08/31/why-sd-is-unequal-to-mad/ Wed, 31 Aug 2016 00:00:00 +0000 https://data-se.netlify.app/2016/08/31/why-sd-is-unequal-to-mad/ It may seem bewildering that the standard deviation (sd) of a vector X is (generally) unequal to the mean absolute deviation from the mean (MAD) of X, i.e., $$sd(X) \ne MAD(X)$$. One could now argue this way: well, sd(X) involves computing the mean of the squared $$x_i$$, then taking the square root of this mean, thereby “coming back” to the initial size or dimension of x (i. Plot of mean with exact numbers using ggplot2 https://data-se.netlify.app/2016/08/30/plot_dot_means/ Tue, 30 Aug 2016 00:00:00 +0000 https://data-se.netlify.app/2016/08/30/plot_dot_means/ Often, both in academic research and more business-driven data analysis, we want to compare some (two in many cases) means. We will not discuss here that friends should not let friends plot barplots. Following the advice of Cleveland’s seminal book we will plot the means using dots, not bars. However, at times we do not simply want the diagram, but we (or someone else) are interested in the bare, plain, naked, exact numbers too. Shading multiple areas under normal curve https://data-se.netlify.app/2016/08/30/shade_normal_curve/ Tue, 30 Aug 2016 00:00:00 +0000 https://data-se.netlify.app/2016/08/30/shade_normal_curve/ When plotting a normal curve, it is often helpful to color (or shade) some segments. For example, often we might want to indicate whether an absolute value is greater than 2. How can we achieve this with ggplot2? Here is one way. First, load packages and define some constants. Specifically, we define mean, sd, and start/end (z-) value of the area we want to shade.
And your favorite color is defined. Simple way to plot a normal distribution with ggplot2 https://data-se.netlify.app/2016/08/30/normal_curve_ggplot2/ Tue, 30 Aug 2016 00:00:00 +0000 https://data-se.netlify.app/2016/08/30/normal_curve_ggplot2/ Plotting a normal distribution is something needed in a variety of situations: Explaining to students (or professors) the basics of statistics; convincing your clients that a t-Test is (not) the right approach to the problem, or pondering on the vicissitudes of life… If you like ggplot2, you may have wondered what the easiest way is to plot a normal curve with ggplot2? Here is one: library(cowplot) ## Loading required package: ggplot2 ## ## Attaching package: 'cowplot' ## The following object is masked from 'package:ggplot2': ## ## ggsave p1 <- ggplot(data = data. Why absolute correlation value (r) cannot exceed 1. An intuition. https://data-se.netlify.app/2016/08/28/why-abs-correlation-is-max-1/ Sun, 28 Aug 2016 00:00:00 +0000 https://data-se.netlify.app/2016/08/28/why-abs-correlation-is-max-1/ Pearson’s correlation is a well-known and widely used instrument to gauge the degree of linear association of two variables (see this post for an intuition on correlation). There are many formulas for correlation, but a short and easy one is this one: $$r = \varnothing(z_x z_y)$$. In words, $$r$$ can be seen as the average product of z-scores. In “raw values”, r is given by $$ r = \frac{\frac{1}{n}\sum{\Delta X \Delta Y}}{\sqrt{\frac{1}{n}\sum{\Delta X^2}} \sqrt{\frac{1}{n}\sum{\Delta Y^2}}} $$.
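The “average product of z-scores” reading can be checked numerically. The following is a small sketch under my own assumptions: the `mtcars` variables `wt` and `mpg` as example data, and a hypothetical helper `pop_sd()` that divides by n (as in the raw-value formula above) rather than by n - 1 as R's `sd()` does.

```r
# Sketch only: verify that r equals the mean product of z-scores.
# pop_sd() is a hypothetical helper; it uses 1/n, matching the formula above.
data(mtcars)
x <- mtcars$wt
y <- mtcars$mpg

pop_sd <- function(v) sqrt(mean((v - mean(v))^2))

# z-scores based on the 1/n standard deviation
z_x <- (x - mean(x)) / pop_sd(x)
z_y <- (y - mean(y)) / pop_sd(y)

# Average product of z-scores
r_manual <- mean(z_x * z_y)

# Compare with R's built-in Pearson correlation
agrees <- isTRUE(all.equal(r_manual, cor(x, y)))
```

The 1/n versus 1/(n - 1) choice cancels out in `cor()`, but it matters when computing the z-scores by hand: with `sd()`, the products would have to be summed and divided by n - 1 instead of averaged.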
The effect of a status symbol on success in online dating: an experimental study (data paper) https://data-se.netlify.app/2016/08/27/data_status_dating/ Sat, 27 Aug 2016 00:00:00 +0000 https://data-se.netlify.app/2016/08/27/data_status_dating/ This article has been published at The Winnower; it is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and redistribution in any medium, provided that the original author and source are credited. Data can be accessed here. Access the paper here. CITATION: Sebastian Sauer, Alexander Wolff, The effect of a status symbol on success in online dating: an experimental study (data paper), The Winnower 3:e147241. Multiple t-Tests with dplyr https://data-se.netlify.app/2016/08/18/multiple-t-tests-with-dplyr/ Thu, 18 Aug 2016 00:00:00 +0000 https://data-se.netlify.app/2016/08/18/multiple-t-tests-with-dplyr/ t-Test on multiple columns Suppose you have a data set where you want to perform a t-Test on multiple columns with some grouping variable. As an example, say you have a data frame where each column depicts the score on some test (1st, 2nd, 3rd assignment…). In each row is a different student. So you glance at the grading list (OMG!) of a teacher! How to do that in R? Probably, the most “natural” solution would be some lapply() call. Introduction to the measurement theory, and conjoint measurement theory https://data-se.netlify.app/2016/08/17/intro_measurement/ Wed, 17 Aug 2016 00:00:00 +0000 https://data-se.netlify.app/2016/08/17/intro_measurement/ What is measurement? Why should I care? Measurement is a basis of an empirical science. Imagine a geometer (a person measuring distances on the earth) with a measuring ruler made of rubber! Poor guy! Without proper measurement, even the smartest theory cannot be expected to be found, precisely because it cannot be measured. So, what exactly is measurement?
Measurement can be seen as tying numbers to empirical objects. But not in some arbitrary style. Looping through dataframe columns using purrr::map() https://data-se.netlify.app/2016/08/16/looping-purrr/ Tue, 16 Aug 2016 00:00:00 +0000 https://data-se.netlify.app/2016/08/16/looping-purrr/ Let’s get purrr. Recently, I ran across this issue: A data frame with many columns; I wanted to select all numeric columns and submit them to a t-test with some grouping variables. As this is a quite common task, and the purrr-approach (package purrr by @HadleyWickham) is quite elegant, I present the approach in this post. Let’s load the data, the Affairs data set, and some packages: data(Affairs, package = "AER") library(purrr) # functional programming library(dplyr) # dataframe wrangling library(ggplot2) # plotting library(tidyr) # reshaping df Don’t forget that the four packages need to be installed in the first place. Intuition on correlation https://data-se.netlify.app/2016/07/25/correlation-intuition/ Mon, 25 Jul 2016 00:00:00 +0000 https://data-se.netlify.app/2016/07/25/correlation-intuition/ reading time: 10 min. Pearson’s correlation (short: correlation) is one of statistics’ all time classics. With an age of about a century, it is some kind of grand dad of analytic tools – but an oldie who is still very busy! Formula, interpretation and application of correlation are well known. In some non-technical lay terms, correlation captures the (linear) degree of co-variation of two linear variables. For example: if tall people have large feet (and small people small feet), on average, we say that height and foot size are correlated. Practical data cleansing in R https://data-se.netlify.app/2016/07/24/data-cleansing/ Sun, 24 Jul 2016 00:00:00 +0000 https://data-se.netlify.app/2016/07/24/data-cleansing/ What is “data cleansing” about?
Data analysis, in practice, typically consists of several steps which can be subsumed as “prepare data” and “model data” (not considering communication here): (Inspired by this) Often, the first major part, “prepare”, is the most time consuming. This can be lamented since many analysts prefer the cool modeling aspects (since I want to show my math!). In practice, one rather has to get his (her) hands dirt… Yet another case study on data analysis (YACSDA) – extramarital affairs data set https://data-se.netlify.app/2016/07/23/affairs/ Sat, 23 Jul 2016 00:00:00 +0000 https://data-se.netlify.app/2016/07/23/affairs/ Ok, there are heaps of them on the net. Here comes my YACSDA. Maybe the only thing about it to mention is that it comes in German language. Analytical language: R (3.3) Purpose: Demonstrate basic exploratory and modeling techniques Packages used: dplyr, ggplot2 Data set: Affair; source R package COUNT Analytical topics covered: descriptive statistics, visualization, linear model, logistic linear model Reproducibility: Rmarkdown, knitr, github Code on Github Why metric scale level cannot be taken for granted https://data-se.netlify.app/2016/07/21/measurement-01/ Thu, 21 Jul 2016 00:00:00 +0000 https://data-se.netlify.app/2016/07/21/measurement-01/ One main business for psychologists is to examine questionnaire data. Extraversion, intelligence, attitudes… That’s a bread-and-butter job for (research) psychologists. Similarly, it is common to take the metric level of questionnaire data for granted. Well, not for the item level, it is said. But for the aggregated level, oh yes, that’s OK. Despite its popularity, the measurement basics of such practice are less clear. On which grounds can this comfortable practice be defended? What to read in summer (German) https://data-se.netlify.app/2016/07/20/what-to-read/ Wed, 20 Jul 2016 00:00:00 +0000 https://data-se.netlify.app/2016/07/20/what-to-read/ Below are some considerations on what to read in summertime.
In German language. Reading time: 10-15 min. Reading recommendations, summer 2016 What should I read? Summer, sun, sunshine: off to the south. The line “read, read, read, read” would, in my opinion, fit quite well into the song too. So here are a few reading recommendations. I expect two things from decent summer reading: First, that the art be entertaining. Second, that once the steam rises after reading, something remains besides the steam. Case study on data wrangling with dplyr (German) https://data-se.netlify.app/2016/07/18/nycflights13/ Mon, 18 Jul 2016 00:00:00 +0000 https://data-se.netlify.app/2016/07/18/nycflights13/ reading time (full): 30 min. Data Wrangling with dplyr is a popular activity in data science/ statistics. A number of tutorials are available, but not so many in German language. The data set analyzed is nycflights13::flights (R package). Available on CRAN. Ok, choosing this data set is not very creative, but, hey, quite nice data:) Thus, here is a case study in German language; code (R) is on Github. Intuition on Cohen's d https://data-se.netlify.app/2016/07/15/cohens-d-intuition/ Fri, 15 Jul 2016 00:00:00 +0000 https://data-se.netlify.app/2016/07/15/cohens-d-intuition/ reading time: 5-10 min. Cohen’s d is a widely known and extensively used measure of effect size. That is, d is used to gauge how strong an effect is (given the fact that the effect exists). For example, one way to estimate d is as follows: data(tips, package = "reshape2") library(compute.es) t1 <- t.test(tip ~ sex, data = tips) t1$statistic ## t ## -1.489536 table(tips$sex) ## ## Female Male ## 87 157 tes(t1$statistic, 87, 157) ## Mean Differences ES: ## ## d [ 95 %CI] = -0. How to add a logo to a slidify presentation https://data-se.netlify.app/2016/07/05/slidify-logo/ Tue, 05 Jul 2016 00:00:00 +0000 https://data-se.netlify.app/2016/07/05/slidify-logo/ reading time: 15-20 min.
Slidify is a cool tool to render HTML5 slide decks, see here, here or here for examples. Features include: reproducibility. You write your slide deck as you would write any other text, similar to Latex/Beamer. But you write using Markdown, which is easier and less clumsy. As you write plain text, you are free to use git. modern look. Just a website, nothing more. But with cool, modern features. Long vs. wide format, and gather() https://data-se.netlify.app/2016/07/04/gather-long-to-wide-format/ Mon, 04 Jul 2016 00:00:00 +0000 https://data-se.netlify.app/2016/07/04/gather-long-to-wide-format/ reading time: 10 min. A quite common task in data analysis is to change a dataset from wide to long format. For example, this is a dataset in wide format: It is called wide, as, well, it is wide – several columns side by side. For example, assume we have measured a number of predictors (here: predictor_1, predictor_2, predictor_3), and an outcome measure (here: outcome). In this case, each variable is dichotomous (either yes or no). Cross-tabulate multiple variables https://data-se.netlify.app/2016/07/03/cross-tabulate-multiple-variables/ Sun, 03 Jul 2016 00:00:00 +0000 https://data-se.netlify.app/2016/07/03/cross-tabulate-multiple-variables/ reading time: 15-20 min. Recently, I analyzed some data of a study where the efficacy of online psychotherapy was investigated. The investigator had assessed whether or not a participant suffered from some comorbidities (such as depression, anxiety, eating disorder…). I wanted to know whether each of these (10 or so) comorbidities was associated with the outcome (treatment success, yes vs. no). Of course, an easy solution would be to “half-manually” check the association, eg. Why have z-transformed values a mean of zero and a sd of 1?
https://data-se.netlify.app/2016/07/02/z-value-intuition/ Sat, 02 Jul 2016 00:00:00 +0000 https://data-se.netlify.app/2016/07/02/z-value-intuition/ z-transformation is a ubiquitous operation in data analysis. It is often quite practical. Example: Assume Dr Zack scored 42 points on a test (say, IQ). Average score is 40 in the relevant population, and SD is 1, let’s say. So Zack’s score is 2 points above average. 2 points equal 2 SDs in this example. We can thus safely infer that Zack is about 2 SDs above average (leaving measurement precision and other issues aside). About https://data-se.netlify.app/about/ Sun, 20 Nov 2011 00:00:00 +0000 https://data-se.netlify.app/about/ I blog about data science, particularly using R, and with an applied interest to social sciences. As a non-virtual person, I work as a professor at Ansbach University of applied sciences. Posts reflect mostly my current thinking; and posts are not immune to thought updates. With luck things get less wrong in the course of time. All opinions are my own. Faults are my own. Posts are organized as note books, as the crow flies, which is, as my thinking went. https://data-se.netlify.app/1/01/01/ Mon, 01 Jan 0001 00:00:00 +0000 https://data-se.netlify.app/1/01/01/ --- title: Eliminating a factor reduces variance author: ’’ date: ‘2018-12-10’ slug: eliminating-a-factor-reduces-variance draft: TRUE categories: - rstats tags: - tutorial - plotting --- A well-known measure to reduce variability and increase power in experimental (and observational) research design is to eliminate a factor that may influence the outcome variable. “Eliminating” a factor means, by and large, to hold it constant. Consider the following example. Say, an experiment is performed with two groups, and the experimental group shows higher values than the control group.
https://data-se.netlify.app/privacy/ Mon, 01 Jan 0001 00:00:00 +0000 https://data-se.netlify.app/privacy/ Privacy policy This privacy policy informs you about the nature, scope, and purpose of the processing of personal data (hereinafter “data”) within our online offering and the websites, functions, and content associated with it, as well as external online presences such as our social media profiles (hereinafter jointly referred to as the “online offering”). With regard to the terms used, such as “processing” or “controller”, we refer to the definitions in Art. 4 of the General Data Protection Regulation (GDPR). Controller Sebastian Sauer