Data Se
/
Recent content on Data Se
Hugo -- gohugo.io
en-us
Wed, 22 Jul 2020 00:00:00 +0000
Mean of the upper half of a Gaussian
/2020/07/22/mean-of-the-upper-half-of-a-gaussian/
Wed, 22 Jul 2020 00:00:00 +0000/2020/07/22/mean-of-the-upper-half-of-a-gaussian/Load packages library(tidyverse) library(lsr) Motivation Recently, I listened to the great Paul Meehl in audio recordings of some of his lectures. There, he asked the students
what’s the mean value of the upper half of a Gaussian distribution?
Let’s explore that using simulation techniques.
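As a rough preview of such a simulation, here is a minimal sketch in base R. The seed and the comparison with the analytic value \(\sqrt{2/\pi}\) are additions for illustration and not part of the original post.

```r
# Minimal sketch: mean of the upper half of a standard normal distribution.
set.seed(42)                      # arbitrary seed, for reproducibility only
n <- 1e05
x <- rnorm(n)                     # draws from N(0, 1)

upper_half <- x[x > median(x)]    # values above the sample median
sim_mean   <- mean(upper_half)    # simulated mean of the upper half

# Analytic value of E[X | X > 0] for a standard normal: sqrt(2 / pi)
analytic_mean <- sqrt(2 / pi)

c(simulated = sim_mean, analytic = analytic_mean)
```

The two numbers should agree to about two decimal places for a sample of this size.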
Simulation time Let’s draw some instances from a standard Normal distribution, \(X\).
n <- 1e05 x <- rnorm(n) Mean and SD in our sample are quite close to what can be expected:Randomization in presence of an interaction effect
/2020/07/07/randomization-in-presence-of-an-interaction-effect/
Tue, 07 Jul 2020 00:00:00 +0000/2020/07/07/randomization-in-presence-of-an-interaction-effect/Load packages library(tidyverse) library(rockchalk) library(MASS) library(ggdag) Problem statement Assume that \(X\) and \(Y\) are correlated contingent on some third variable, \(Z\). For simplicity, assume that, if \(z=0\), then \(r_0=0.7\), and if \(z=1\), then \(r_1=-0.7\). This is not a causal statement.
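To get a feeling for this setup, here is a minimal sketch. It assumes bivariate normal data and equal group sizes, and it uses `MASS::mvrnorm()` directly instead of the rockchalk helpers; `draw_group()` is a hypothetical helper name introduced only for this illustration.

```r
# Minimal sketch (assumptions: bivariate normal data, equal group sizes).
set.seed(42)
n <- 1e03

# Hypothetical helper: draw n pairs (x, y) whose sample correlation equals r
draw_group <- function(n, r) {
  sigma <- matrix(c(1, r, r, 1), nrow = 2)
  xy <- MASS::mvrnorm(n, mu = c(0, 0), Sigma = sigma, empirical = TRUE)
  data.frame(x = xy[, 1], y = xy[, 2])
}

group_a <- draw_group(n, r = 0.7)    # z = 0
group_b <- draw_group(n, r = -0.7)   # z = 1

r_a <- cor(group_a$x, group_a$y)
r_b <- cor(group_b$x, group_b$y)

# Ignoring z, the two opposite correlations cancel each other out
pooled   <- rbind(group_a, group_b)
r_pooled <- cor(pooled$x, pooled$y)

c(r_a = r_a, r_b = r_b, r_pooled = r_pooled)
```

With `empirical = TRUE` the sample moments match the specified ones, so the pooled correlation vanishes up to floating-point error.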
Simulate data Let the sample size amount to \(n=1000\).
n <- 1e03 Group A, \(z=0\):
myR <- lazyCor(X = 0.7, d = 2) mySD <- c(1, 1) myCov <- lazyCov(Rho = myR, Sd = mySD) set.First grade math exercise
/2020/07/03/first-grade-math-exercise/
Fri, 03 Jul 2020 00:00:00 +0000/2020/07/03/first-grade-math-exercise/Problem statement My son, being a first grader, recently struggled with this piece of math:
Consider this system of equations:
\[ a + b + c = 20\\ d + e + f = 14\\ g + h + i = 11\\ a + d + g = 15\\ b + e + h = 10\\ c + f + i = 20\\ a + e + i = 20\\ g + e + c = 10\]How to sort the labels of the legend in a ggplot-diagram
/2020/06/26/how-to-sort-the-labels-of-the-legend-in-a-ggplot-diagram/
Fri, 26 Jun 2020 00:00:00 +0000/2020/06/26/how-to-sort-the-labels-of-the-legend-in-a-ggplot-diagram/Load packages library(tidyverse) library(forcats) library(hrbrthemes) What we want to achieve: barplot ggplot2-diagram where bars and legend labels are sorted Say we would like to plot frequencies, and would like to use ggplot2 for that purpose. How can we get a decent graph? This post shows some ways.
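One possible way, sketched here under the assumption that sorting by frequency is what we are after: reorder the factor levels with `forcats::fct_infreq()` before plotting. The object names `diamonds_sorted` and `p` are made up for this illustration.

```r
library(ggplot2)
library(forcats)

data(diamonds, package = "ggplot2")

# Reorder the levels of `cut` by descending frequency; for a discrete
# variable, bars and legend entries both follow the level order.
diamonds_sorted <- diamonds
diamonds_sorted$cut <- fct_infreq(diamonds_sorted$cut)

p <- ggplot(diamonds_sorted, aes(x = cut, fill = cut)) +
  geom_bar()

levels(diamonds_sorted$cut)   # most frequent level comes first
```

Printing `p` draws the bar plot; the level order is set once in the data, so no extra scale tweaking is needed.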
Some data data(diamonds) A glimpse at the data glimpse(diamonds) #> Rows: 53,940 #> Columns: 10 #> $ carat <dbl> 0.
Simulation-based inference: short version
/2020/06/26/simulationsbasierte-inferenz-kurzfassung/
Fri, 26 Jun 2020 00:00:00 +0000/2020/06/26/simulationsbasierte-inferenz-kurzfassung/Simulation-based inference Simulation-based inference (SBI) is a variant of inferential statistics in which population estimates are not derived from theoretical distributions (such as the normal distribution) but by re-enacting an experiment on the computer. This simplifies access to inferential statistics and makes parameter computations possible (or more precise) that were previously (without computer simulations) not feasible.
Slides Here are my slides for the short version of SBI (as an HTML version). The HTML slides can only be viewed online.
Introduction to Statistics: A modeling-based approach -- Course Syllabus
/2020/06/19/introduction-to-statistics-a-modeling-based-approach-course-syllabus/
Fri, 19 Jun 2020 00:00:00 +0000/2020/06/19/introduction-to-statistics-a-modeling-based-approach-course-syllabus/Load packages Course description Models and modeling are of pivotal importance in many sciences, not only for providing an explanation of nature en miniature (theoretical models), but also for gauging how closely the empirical data at hand match the theoretical model. Translating a theoretical model into statistical language is called statistical modeling and provides the guiding principle in this introductory course. Regression models will be presented as a lingua franca of statistical modeling, and we will learn that many empirical questions can (comfortably) be analyzed using a regression framework.Simulating data for a Gamma regression
/2020/06/17/simulating-data-for-a-gamma-regression/
Wed, 17 Jun 2020 00:00:00 +0000/2020/06/17/simulating-data-for-a-gamma-regression/Load packages library(tidyverse) Intro A Gamma distribution is useful for modeling positive, right-skewed data such as waiting times; it is a continuous distribution.
In this post, we’ll illustrate some properties of the Gamma distribution by simulating a toy example.
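Before the toy example, here is a minimal sketch of two basic properties, strict positivity and right skew, using base R. The values of `shape` and `rate` are arbitrary and not taken from the post.

```r
set.seed(42)
n     <- 1e05
shape <- 2      # illustrative values, not taken from the post
rate  <- 0.5

y <- rgamma(n, shape = shape, rate = rate)

# Theoretical moments of a Gamma(shape, rate) distribution
theo_mean <- shape / rate
theo_var  <- shape / rate^2

c(sample_mean = mean(y), theoretical_mean = theo_mean)
c(sample_var  = var(y),  theoretical_var  = theo_var)

all(y > 0)            # Gamma draws are strictly positive
mean(y) > median(y)   # right skew pulls the mean above the median
```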
Simulate data and define structural model Let \(X\) be a discrete variable following a uniform distribution, with \(x_i \in \{1,2,3\}\).
set.seed(42) n <- 1000 X <- sample(x = c(1,2,3), size = n, replace = TRUE) hist(X) Let \(y_i = 0.Absolute vs. relative Covid cases in modelling
/2020/06/10/absolute-vs-relative-covid-cases-in-modelling/
Wed, 10 Jun 2020 00:00:00 +0000/2020/06/10/absolute-vs-relative-covid-cases-in-modelling/Load packages library(tidyverse) library(mosaic) require(scales) library(directlabels) library(ggrepel) library(ggthemes) library(hrbrthemes) options(scipen = 8) Covid-19 growth rate We are SOMEWHERE (decline, midst, wake, onset?) in the Corona crisis. A lot of hasty, more or less useful research is being conducted.
One of the circulating claims is: “The Corona growth rate in country X is higher than in country Y!”
Let’s assume some doubling (growth) rate:
double_rates <- 5 double_rate_chosen <- 5 # sample(double_rates, size = 1) Two countries with equal Covid-19 growth rate Consider two countries, A and B, with the same Covid-19 growth rate.
Spell out your model explicitly
/2020/06/10/spell-out-your-model-explicitly/
Wed, 10 Jun 2020 00:00:00 +0000/2020/06/10/spell-out-your-model-explicitly/Load packages library(tidyverse) library(hrbrthemes) library(MASS) library(moments) Why you should spell out your model explicitly Often, assumptions of widely used models, such as linear models, appear opaque. Why is heteroscedasticity important? Where is a list of the model assumptions I need to consider?
As it turns out, there are straightforward answers to these (and similar) questions. The solution is to explicitly spell out your model. All “assumptions” can easily be read off from these model specifications.
Distribution of residuals is of interest for linear models, not the distribution of y
/2020/06/09/distribution-of-residuals-is-of-interest-for-linear-models-not-the-distribution-of-y/
Tue, 09 Jun 2020 00:00:00 +0000/2020/06/09/distribution-of-residuals-is-of-interest-for-linear-models-not-the-distribution-of-y/Load packages library(tidyverse) My \(y\) is not distributed according to my wishes! Let \(Y\) be a variable that we would like to model, for instance, Covid-19 cases.
Now, there’s a widely held belief that my \(Y\) must be distributed normally, or, in some cases, follow some other assumed distribution (maybe some long-tailed distribution).
However, this belief is not (strictly) true. What a linear model assumes is that the residuals are distributed normally, not \(Y\) itself.
On a confidence-interval-myth
/2020/06/05/on-a-confidence-interval-myth/
Fri, 05 Jun 2020 00:00:00 +0000/2020/06/05/on-a-confidence-interval-myth/Load packages library(tidyverse) library(mosaic) Setup data(flights, package = "nycflights13") A story about data Say we have a decent sample of \(n=100\), and we would like to compute a standard, plain vanilla confidence interval (95% CI).
For the sake of having a story, assume you are the boss of the NYC airports and you are investigating the 2013 “typical” arrival delays.
OK, here we go.
Get the sample:Simulating values according to some distribution
/2020/06/05/simulating-values-according-to-some-distribution/
Fri, 05 Jun 2020 00:00:00 +0000/2020/06/05/simulating-values-according-to-some-distribution/Load packages library(tidyverse) library(mosaic) What’s a Monte Carlo simulation? A Monte Carlo Simulation is a numeric approach to solving difficult problems. Instead of having an analytic way of solving the problem, one just says “ok, let’s try it out and see what happens”.
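Here is a minimal sketch of that "try it out" idea in base R (not the mosaic functions used in the post). The question, how likely at least 8 heads in 10 fair coin flips are, is an example made up for illustration; the exact binomial probability is added only as a check.

```r
set.seed(42)
n_sim   <- 1e05   # number of simulated experiments
n_flips <- 10     # flips of a fair coin per experiment

# Number of heads in each simulated experiment
heads <- rbinom(n_sim, size = n_flips, prob = 0.5)

# Monte Carlo estimate: share of experiments with at least 8 heads
p_sim <- mean(heads >= 8)

# Exact binomial probability, for comparison
p_exact <- pbinom(7, size = n_flips, prob = 0.5, lower.tail = FALSE)

c(simulated = p_sim, exact = p_exact)
```

The simulated share approaches the exact value as `n_sim` grows, which is the whole point of the approach.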
Coin flip distribution Simulating a single coin flip (a Bernoulli trial) can be achieved like this:
rflip() #> #> Flipping 1 coin [ Prob(Heads) = 0.Simulation based inference for non-parametric tests, and a trick
/2020/06/05/sbi-nonparametric/
Fri, 05 Jun 2020 00:00:00 +0000/2020/06/05/sbi-nonparametric/Load packages library(tidyverse) library(mosaic) Data data("tips", package = "reshape2") Non-parametric tests and simulation based inference Simulation-based inference (SBI) is an old tool that has seen a surge in research interest in recent years, probably due to the large amount of computational power in the hands of researchers.
SBI is less prone to violations of assumptions, particularly distributional assumptions. This is because inference is not based on the idea that some variable follows a – for example – normal distribution.
Chi-squared test using simulation based inference
/2020/06/04/chi-squared-test-using-simulation-based-inference/
Thu, 04 Jun 2020 00:00:00 +0000/2020/06/04/chi-squared-test-using-simulation-based-inference/Load packages library(tidyverse) Simulation based inference Simulation based inference (SBI) is an elegant way of subsuming a wide array of statistical (inference) methods under one umbrella. In addition, it is simple, thereby helping learners get to grips with it.
Here’s a summary of the central ideas.
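One way to sketch these ideas in code is a permutation test: compute a test statistic, shuffle the data to mimic the null hypothesis, and see how often the shuffled statistic is at least as extreme. The toy data and the helper name `chisq_stat()` below are assumptions for illustration only, not the post's own code.

```r
set.seed(42)
n_per_group <- 100

# Simulated toy data (assumption): two binary variables that are associated
group   <- rep(c("a", "b"), each = n_per_group)
outcome <- c(rbinom(n_per_group, size = 1, prob = 0.8),
             rbinom(n_per_group, size = 1, prob = 0.3))

# Test statistic: chi-squared statistic of the 2 x 2 table
chisq_stat <- function(x, y) {
  unname(chisq.test(table(x, y), correct = FALSE)$statistic)
}

obs_stat <- chisq_stat(group, outcome)

# Null distribution: shuffling `outcome` breaks any association with `group`
null_stats <- replicate(1000, chisq_stat(group, sample(outcome)))

# Share of shuffled statistics at least as extreme as the observed one
p_value <- mean(null_stats >= obs_stat)
p_value
```

No theoretical chi-squared distribution is consulted here; the null distribution comes entirely from the shuffles.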
However, this post does not aim at explaining simulation based inference, which is done elsewhere.
Testing the association of two categorical variables One application of statistical tests – simulation based or classical – is testing the association of two categorical variables.When adding variable hurts – The collider bias
/2020/06/04/when-adding-variable-hurts-the-collider-bias/
Thu, 04 Jun 2020 00:00:00 +0000/2020/06/04/when-adding-variable-hurts-the-collider-bias/Load packages library(tidyverse) library(conflicted) library(ggdag) library(broom) library(GGally) Motivation Assume there is some scientist with some theory. Her theory holds that X and Z are causes of Y. dag1 shows her DAG (i.e., her theory depicted as a causal diagram). Our scientist is concerned with the causal effect of X on Y, where X is a treatment variable (exposure) and Y is the dependent variable under scrutiny (outcome).
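The post's title refers to collider bias, which this excerpt does not show in detail. As a generic illustration, and not the scientist's DAG from above, the following sketch assumes a toy setup where X has no effect on Y and a third variable (hypothetically called `c_var`) is a common effect of both. Adjusting for it then creates a spurious association.

```r
set.seed(42)
n <- 1e04

x     <- rnorm(n)              # treatment; has NO effect on y in this toy setup
y     <- rnorm(n)              # outcome, independent of x
c_var <- x + y + rnorm(n)      # collider: common effect of x and y

coef_unadjusted <- coef(lm(y ~ x))[["x"]]           # close to 0, as it should be
coef_adjusted   <- coef(lm(y ~ x + c_var))[["x"]]   # biased away from 0

c(unadjusted = coef_unadjusted, adjusted = coef_adjusted)
```

In this setup, adding a variable to the regression makes the estimate of the (null) effect worse, not better.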
See e.Plot for mean comparison
/2020/06/02/plot-for-mean-comparison/
Tue, 02 Jun 2020 00:00:00 +0000/2020/06/02/plot-for-mean-comparison/Load packages library(tidyverse) library(reshape2) # for data library(mosaic) library(sjmisc) library(skimr) Data setup data(tips) Aggregate data per group tips_aggr <- tips %>% group_by(smoker) %>% summarise(tip_avg = mean(tip), tip_md = median(tip), tip_sd = sd(tip), tip_iqr = IQR(tip)) tips_aggr #> # A tibble: 2 x 5 #> smoker tip_avg tip_md tip_sd tip_iqr #> <fct> <dbl> <dbl> <dbl> <dbl> #> 1 No 2.99 2.74 1.38 1.50 #> 2 Yes 3.01 3 1.Plotting a correlated bivariate Gaussian
/2020/05/30/plotting-a-correlated-bivariate-gaussian/
Sat, 30 May 2020 00:00:00 +0000/2020/05/30/plotting-a-correlated-bivariate-gaussian/Load packages library(tidyverse) library(rockchalk) library(MASS) Defining the data myR <- lazyCor(X = 0.7, d = 2) mySD <- c(1, 1) myCov <- lazyCov(Rho = myR, Sd = mySD) myR #> [,1] [,2] #> [1,] 1.0 0.7 #> [2,] 0.7 1.0 mySD #> [1] 1 1 myCov #> [,1] [,2] #> [1,] 1.0 0.7 #> [2,] 0.7 1.0 Drawing from the multivariate normal Let’s draw 1000 cases.Various methods for plotting 3d bivariate Gaussians
/2020/05/30/various-methods-for-plotting-3d-bivariate-gaussians/
Sat, 30 May 2020 00:00:00 +0000/2020/05/30/various-methods-for-plotting-3d-bivariate-gaussians/Load packages library(tidyverse) Motivation This post is a rather uncommented compilation of various methods of plotting 3D (bivariate) Gaussian distributions in R.
I add the source to each method.
Note that some methods (5, 6) open an interactive window, which is not supported here. I added a static version of the plot in those cases.
Method 1 Source: https://codegolf.Adjustment set exercise from Elwert 2013
/2020/05/19/adjustment-set-exercise-from-elwert-2013/
Tue, 19 May 2020 00:00:00 +0000/2020/05/19/adjustment-set-exercise-from-elwert-2013/Load packages library(tidyverse) library(ggdag) library(dagitty) Define DAG I’ve drawn the DAG in dagitty.net, that’s why the coordinates look weird.
dag3_str <- ' dag { bb="-2.865,-5.146,2.956,4.896" U [latent, pos="2.456,-0.958"] X [exposure, pos="-2.365,-4.309"] Y [outcome, pos="-0.271,4.059"] Z1 [pos="-0.491,-1.925"] Z2 [pos="-0.915,1.269"] Z3 [pos="1.713,1.984"] U -> Z1 U -> Z3 X -> Z1 Z2 -> Y Z2 -> Z1 Z2 -> Z3 Z3 -> Y }' Then tidify:
dag3 <- dagitty(dag3_str) dag3_tidy <- tidy_dagitty(dag3) dag3_tidy #> # A DAG with 6 nodes and 7 edges #> # #> # Exposure: X #> # Outcome: Y #> # #> # A tibble: 9 x 8 #> name x y direction to xend yend circular #> <chr> <dbl> <dbl> <fct> <chr> <dbl> <dbl> <lgl> #> 1 U 2.Plotting equivalence class for confounder triangle
/2020/05/19/plotting-equivalence-class-for-confounder-triangle/
Tue, 19 May 2020 00:00:00 +0000/2020/05/19/plotting-equivalence-class-for-confounder-triangle/Load packages library(tidyverse) library(ggdag) library(dagitty) Define DAG dag1_str <- 'dag { C [pos = "2,2"] X [exposure, pos = "1,1"] Y [outcome, pos = "3,1"] C -> X C -> Y }' Plot DAGs First tidify:
dag1 <- dagitty(dag1_str) dag1_tidy <- tidy_dagitty(dag1) dag1_tidy #> # A DAG with 3 nodes and 2 edges #> # #> # Exposure: X #> # Outcome: Y #> # #> # A tibble: 4 x 8 #> name x y direction to xend yend circular #> <chr> <int> <int> <fct> <chr> <int> <int> <lgl> #> 1 C 2 2 -> X 1 1 FALSE #> 2 C 2 2 -> Y 3 1 FALSE #> 3 X 1 1 <NA> <NA> NA NA FALSE #> 4 Y 3 1 <NA> <NA> NA NA FALSE Then plot:How to find the package of a R function
/2020/05/15/how-to-find-the-package-of-a-r-function/
Fri, 15 May 2020 00:00:00 +0000/2020/05/15/how-to-find-the-package-of-a-r-function/Load packages library(tidyverse) Where does my function reside? Finding the package of a given R function is some hassle. I am not aware of a quick built-in way in R to find the package of a function.
That’s why I came up with my own function, check it out:
Install package Speaking of packages of functions, this is the package where this function lives:
library(devtools) install_github("sebastiansauer/prada") Example library(prada) find_funs("select") #> # A tibble: 11 x 3 #> package_name builtin_pckage loaded #> <chr> <lgl> <lgl> #> 1 BDgraph FALSE FALSE #> 2 dplyr FALSE TRUE #> 3 jmvcore FALSE FALSE #> 4 jqr FALSE FALSE #> 5 MASS TRUE FALSE #> 6 plotly FALSE FALSE #> 7 raster FALSE FALSE #> 8 rstatix FALSE FALSE #> 9 tidygraph FALSE FALSE #> 10 tidylog FALSE FALSE #> 11 VGAM FALSE FALSE find_funs("tidy") #> # A tibble: 14 x 3 #> package_name builtin_pckage loaded #> <chr> <lgl> <lgl> #> 1 broom FALSE FALSE #> 2 broom.Statistical power: Why small effects need big samples – An intuition
/2020/05/15/statistical-power-why-small-effects-need-big-samples-an-intuition/
Fri, 15 May 2020 00:00:00 +0000/2020/05/15/statistical-power-why-small-effects-need-big-samples-an-intuition/Load packages library(tidyverse) Why small effects need big samples That’s a question that periodically comes up in class. Suppose someone is planning a study. As demanded by her teacher, she computes the needed sample size upfront. So the question arises: Given some to-be-achieved level of power (80%), some effect size, and some other details: How large does my sample need to be?
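To put some numbers on that question, here is a minimal sketch using base R's `power.t.test()`. The two-sample t-test setting and the effect sizes 0.2 and 0.8 (in SD units) are assumptions chosen for illustration; the post does not specify them.

```r
# Required sample size per group for a two-sample t-test at 80% power
n_small_effect <- power.t.test(delta = 0.2, sd = 1, sig.level = 0.05,
                               power = 0.80)$n
n_large_effect <- power.t.test(delta = 0.8, sd = 1, sig.level = 0.05,
                               power = 0.80)$n

# Round up, since we cannot sample fractions of a person
ceiling(c(small = n_small_effect, large = n_large_effect))
```

The small effect calls for a sample that is more than ten times as large as the one needed for the large effect.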
Some students are puzzled by the fact that small effects need large samples.
Crash course 'Survey research'
/2020/05/14/crashkurs-umfrageforschung/
Thu, 14 May 2020 00:00:00 +0000/2020/05/14/crashkurs-umfrageforschung/An introduction to designing, conducting, and analyzing scientifically sound questionnaires Learning objectives Participants should be enabled to plan, conduct, and analyze a survey grounded in social science methods, at a basic level and on their own. Besides the goal of competence, the goal of self-efficacy is central. Participants should experience that they are well able to reach this goal (at a basic level), that is, experience themselves as self-efficacious.
It is not a goal to convey more advanced theoretical concepts.
Simulating Berkson's paradox
/2020/04/16/simulation-berkson-s-paradox/
Thu, 16 Apr 2020 00:00:00 +0000/2020/04/16/simulation-berkson-s-paradox/This post was inspired by this paper of Karsten Luebke and coauthors.
We’ll stratify our sample into two groups: students (Studium) and non-students (kein Studium).
Structural causal model First, we define the structure of our causal model.
set.seed(42) # reproducibility N <- 1e03 IQ = rnorm(N) Fleiss = rnorm(N) Eignung = 1/2 * IQ + 1/2 * Fleiss + rnorm(N, 0, .1) That is, aptitude (Eignung) is a function of intelligence (IQ) and diligence (Fleiss), where the input variables have the same impact on the outcome variable (aptitude).
Slides for the workshop on simulation-based inference, 2020-02-05
/2020/02/02/folien-f%C3%BCr-den-workshop-zur-simulationsbasierten-inferenz-2020-02-05/
Sun, 02 Feb 2020 00:00:00 +0000/2020/02/02/folien-f%C3%BCr-den-workshop-zur-simulationsbasierten-inferenz-2020-02-05/ Workshop on simulation-based inference The slides for my workshop on simulation-based inference can be found here.
The PDF version can be found here.
The source code is here.
The slides are licensed under CC-BY 4.0 De.
Cluster analysis and image size reduction
/2020/01/10/cluster-analysis-and-image-size-reduction/
Fri, 10 Jan 2020 00:00:00 +0000/2020/01/10/cluster-analysis-and-image-size-reduction/Idea This post is a remake of this casestudy: https://fallstudien.netlify.com/fallstudie_bildanalyse/bildanalyse
brought to you by Karsten Lübke.
The main purpose is to replace the base R commands that Karsten used with a more tidyverse-friendly style. I think that’s easier (for me).
We will compute a cluster analysis to find the typical RGB color per cluster.
WARNING There’s still a bug in the code. That’s why the image at the end appears blurred.
Pictogram waffle plot using emojifont
/2019/11/25/pictogram-waffle-plot-using-emojifont/
Mon, 25 Nov 2019 00:00:00 +0000/2019/11/25/pictogram-waffle-plot-using-emojifont/Load packages library(tidyverse) library(emojifont) library(showtext) library(ggpubr) Pictogram waffle plot A pictogram may be defined as a (statistical) diagram using icons or similar “iconic” graphics to illustrate stuff. The waffle plot (see this post) is a nice object in which to combine waffles and pictograms. Originally, this post was inspired by HRBRMSTR’s waffle package, see this post, but I could not get it running.
Maybe the easiest way is to work through an example (spoiler: see below for what we’re heading for).
Correlation cannot be more extreme than +1/-1, proof using Cauchy-Schwarz inequality
/2019/11/19/correlation-cannot-be-more-extreme-than-1-1-proof-using-cauchy-schwartz-inequality/
Tue, 19 Nov 2019 00:00:00 +0000/2019/11/19/correlation-cannot-be-more-extreme-than-1-1-proof-using-cauchy-schwartz-inequality/Load packages library(tidyverse) The correlation coefficient cannot exceed an absolute value of 1 This is well-known. But why is that the case? How can we prove it? This post gives one explanation using the Cauchy-Schwarz inequality.
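A numerical check is no proof, but it may help to see the inequality at work. The sketch below uses arbitrary toy data; `dx` and `dy` are the deviations of each value from its mean. The Cauchy-Schwarz inequality says that the squared sum of products cannot exceed the product of the sums of squares, which is exactly what keeps the correlation within its bounds.

```r
set.seed(42)
x <- rnorm(50)
y <- 0.5 * x + rnorm(50)   # arbitrary toy data

dx <- x - mean(x)          # deviations from the mean
dy <- y - mean(y)

# Cauchy-Schwarz: (sum of products)^2 <= product of the sums of squares
lhs <- sum(dx * dy)^2
rhs <- sum(dx^2) * sum(dy^2)
lhs <= rhs

# Dividing the (unsquared) sum of products by sqrt(rhs) yields r, hence |r| <= 1
r <- sum(dx * dy) / sqrt(rhs)
all.equal(r, cor(x, y))
```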
Here’s one version of the definition of correlation:
\[ r = \frac{\sum(\Delta x \Delta y)}{\sqrt{\sum \Delta x^2} \sqrt{\sum \Delta y^2}} \]
where \(\Delta x\) and \(\Delta y\) are the differences of \(x_i\) and \(\bar{x}\), that is: \(\Delta x_i = x_i - \bar{x}\), and similarly for \(\Delta y_i\).Plotting functions in 3D in R
/2019/11/19/plotting-functions-in-3d-in-r/
Tue, 19 Nov 2019 00:00:00 +0000/2019/11/19/plotting-functions-in-3d-in-r/Load packages library(tidyverse) Plotting functions in 3d
/2019/11/19/plotting-functions-in-3d/
Tue, 19 Nov 2019 00:00:00 +0000/2019/11/19/plotting-functions-in-3d/Load packages library(tidyverse) library(mosaic) library(plotly) Gimme a function Say, you have some function such as
\[ f(x, z) = x^2+z^2 \]
In more R-ish:
f <- makeFun(x^2 + z^2 ~ x & z) And you would like to plot it.
Observe that this function has two input (independent) variables, \(x\) and \(z\), plus one output (dependent) variable, \(y\).
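A minimal base R sketch of plotting such a function: evaluate it on a grid of input values and hand the resulting matrix to `persp()`. A plain R function stands in for `makeFun()` here, and the grid ranges and viewing angles are arbitrary choices.

```r
# Plain R function instead of mosaic::makeFun(), to keep the sketch dependency-free
f <- function(x, z) x^2 + z^2

x_grid <- seq(-2, 2, length.out = 30)   # grid ranges are arbitrary choices
z_grid <- seq(-2, 2, length.out = 30)

# Evaluate f at every (x, z) combination: a 30 x 30 matrix of y values
y <- outer(x_grid, z_grid, f)

# Perspective plot of the surface
persp(x_grid, z_grid, y, theta = 30, phi = 25,
      xlab = "x", ylab = "z", zlab = "y")
```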
The thing is, you’ll need to compute a number of output values for \(y\), as defined by the function.
Some intuition on the Gaussian distribution formula
/2019/11/18/some-intution-on-the-gaussian-distribution-formula/
Mon, 18 Nov 2019 00:00:00 +0000/2019/11/18/some-intution-on-the-gaussian-distribution-formula/Load packages library(tidyverse) library(mosaic) The Gaussian The ubiquitous Gaussian (aka normal) distribution is probably the most widely known distribution for stochastic processes (although maybe as frequently encountered as a unicorn).
Here it is in all its glory.
gf_dist("norm") There are two typical explanations of why it may be considered “normal”: one uses the Galton Board, and the other builds on the Central Limit Theorem. While such considerations are great for understanding “where” the Gaussian distribution comes from, this post explores another direction of intuition.
Most important assumption in linear models ... and the second most
/2019/11/11/most-important-asssumption-in-linear-models/
Mon, 11 Nov 2019 00:00:00 +0000/2019/11/11/most-important-asssumption-in-linear-models/Load packages library(tidyverse) library(mosaic) We are following here the advice of Gelman and Hill (2007).
Validity Quite obviously, the right predictors must be included in the model in order to learn something from the model. The “right” predictors means: avoiding the wrong ones, and including the correct ones. Easier said than done, particularly with a look to the causal inference aspects. Let’s turn to the next most important assumption.Some notes on data transformations for regression
/2019/11/11/some-notes-on-data-transformations-for-regression/
Mon, 11 Nov 2019 00:00:00 +0000/2019/11/11/some-notes-on-data-transformations-for-regression/Load packages library(tidyverse) library(mosaic) Motivation What are data transformations good for? Why do we bother to transform variables for regression analysis? This post explores some nuances around these themes.
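One common answer, sketched here under simplified assumptions: a transformation can turn a curved relation into a linear one, so that a plain linear model fits. The toy data below follow an exponential decay with a little additive noise; sample size and noise level are arbitrary.

```r
set.seed(42)
n <- 500
x <- runif(n)
y <- exp(-x) + rnorm(n, mean = 0, sd = 0.01)   # curved relation plus a little noise

fit_raw <- lm(y ~ x)        # straight line fitted to a curved relation
fit_log <- lm(log(y) ~ x)   # after the log transform the relation is linear

# On the log scale the slope recovers the decay rate of the exponential
slope_log <- coef(fit_log)[["x"]]
slope_log                   # close to -1
```

Plotting the residuals of `fit_raw` against `x` would show the leftover curvature that the log transform removes.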
Simulate an exponentially distributed association len <- 42 # 42 x values x <- rep(runif(len), 30) # each x value repeated 30 times y <- dexp(x) + rnorm(length(x), mean = 0, sd = .01) # add some noise Plot it:
Some ways for plotting 3D linear models
/2019/10/21/some-ways-for-plotting-3d-linear-models/
Mon, 21 Oct 2019 00:00:00 +0000/2019/10/21/some-ways-for-plotting-3d-linear-models/Load packages library(tidyverse) library(mosaic) library(plotly) library(scatterplot3d) library(rsm) Motivation Linear models are a standard way of predicting or explaining some data. Visualizing data is not only of didactical value but provides heuristical value too, as demonstrated by Anscombe’s Quartet.
Visualizing linear models in 2D is straightforward, but visualizing linear models with more than one predictor is much less so. The aim of this post is to demonstrate some ways to visualize linear models with more than one predictor, using popular R packages.
P-values are uniformly distributed under the H0, a simulation
/2019/10/11/p-values-are-equally-distributed-under-the-h0/
Fri, 11 Oct 2019 00:00:00 +0000/2019/10/11/p-values-are-equally-distributed-under-the-h0/Load packages library(tidyverse) library(mosaic) Motivation The p-value is a ubiquitous tool for gauging the plausibility of a Null hypothesis. More specifically, the p-value indicates the probability of obtaining a test statistic at least as extreme as in the present data if the Null hypothesis were true and the experiment were repeated an infinite number of times (under the same conditions except the data generating process).
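The claim in the title can be sketched with a few lines of base R. The setup below (a two-sample t-test, 30 observations per group, 2000 repetitions) is an assumption for illustration and not necessarily the design used in the post.

```r
set.seed(42)
n_sim <- 2000   # number of simulated experiments
n_obs <- 30     # observations per group

# Both groups come from the same distribution, so the H0 is true by construction
p_values <- replicate(n_sim, t.test(rnorm(n_obs), rnorm(n_obs))$p.value)

hist(p_values, breaks = 20)   # roughly flat under the H0
mean(p_values < 0.05)         # share of "significant" results, near 0.05
```

A flat histogram means every p-value is about equally likely when there is no effect, so about 5% of the experiments come out "significant" by chance alone.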
The distribution of the p-values depends on the strength of some effect (among other things).Simple proof that the correlation coefficient cannot exceed abs(1)
/2019/10/07/simple-proof-that-the-correlation-coefficient-cannot-exceed-abs-1/
Mon, 07 Oct 2019 00:00:00 +0000/2019/10/07/simple-proof-that-the-correlation-coefficient-cannot-exceed-abs-1/Load packages library(tidyverse) library(MASS) Motivation It is well-known that the notorious (Pearson’s) correlation cannot exceed an absolute value of 1, that is
\[ -1 \le r \le +1 \]
or
\[ |r| \le 1 \]
However, proving this fact is less straightforward. A classical way of proving the above inequality is by using the Cauchy-Schwarz inequality. From a teacher’s perspective, the CS inequality may not be ideal, because the students may lack some knowledge necessary for appreciating this proof.
Some algebraic properties of z-scores
/2019/10/07/some-algebraic-properties-of-z-scores/
Mon, 07 Oct 2019 00:00:00 +0000/2019/10/07/some-algebraic-properties-of-z-scores/Load packages library(tidyverse) Motivation Z-scores (z-values) are a useful and widely employed tool to gauge and compare measurements. For instance, z-scores help to compare the relative position of some measurements with respect to their distributions. In this post, we will prove some basic (algebraic) properties of z-values. There’s nothing new to that, it’s just I’d like to have it neat and concise somewhere to quickly find it. I’ll add some explanation for the ease of reception.Looping over function arguments using purrr
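The excerpt does not list the properties in question. Presumably two of the best-known ones are among them: z-scores have a mean of 0 and a standard deviation of 1. Here is a minimal numerical sketch with arbitrary toy data.

```r
set.seed(42)
x <- rnorm(100, mean = 50, sd = 10)   # arbitrary toy measurements

z <- (x - mean(x)) / sd(x)            # z-transformation

mean(z)   # 0, up to floating-point error
sd(z)     # 1, up to floating-point error
```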
/2019/09/28/looping-over-function-arguments-using-purrr/
Sat, 28 Sep 2019 00:00:00 +0000/2019/09/28/looping-over-function-arguments-using-purrr/Load packages library(tidyverse) Problem statement Assume you have to call a function multiple times, but each time with (possibly) different arguments. Given enough repetitions, you will not want to repeat yourself.
In other words, we would like to loop over function arguments, each round of the loop passing the respective argument’s value(s) to the function.
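A minimal sketch of this idea, assuming `purrr::pmap()` as the looping tool and `rnorm()` as the function to be called; the argument values are made up for illustration.

```r
library(purrr)

set.seed(42)

# One list element per argument of rnorm(); position i across the elements
# defines the arguments of the i-th call
args <- list(
  n    = c(5, 10, 15),
  mean = c(0, 10, 100),
  sd   = c(1, 2, 3)
)

# pmap() walks over the argument vectors in parallel and calls rnorm()
# once per position, matching list names to argument names
samples <- pmap(args, rnorm)

lengths(samples)   # one sample per call, with 5, 10, and 15 values
```

Because the list names match the argument names of `rnorm()`, no wrapper function is needed.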
One example would be to generate many random values but each with different mean and/or sd:Slides for my workshop on Markdown and Git
/2019/09/09/slides-for-my-workshop-on-markdown-and-git/
Mon, 09 Sep 2019 00:00:00 +0000/2019/09/09/slides-for-my-workshop-on-markdown-and-git/Here are my slides for my Workshop on Markdown and Git (2019-09-16). Note that you need to be online to render the slides (due to heavy use of JS).
The Rmd source code (master file) can be found here.
The PDF version of the slides can be found here.Computing rater accuracy across multiple raters and multiple criteria
/2019/08/27/computing-rater-accuracy-across-multiple-raters-and-multiple-criteria/
Tue, 27 Aug 2019 00:00:00 +0000/2019/08/27/computing-rater-accuracy-across-multiple-raters-and-multiple-criteria/Load packages library(tidyverse) Background Computing inter-rater reliability is a well-known, albeit maybe not very frequent task in data analysis. If there’s only one criterion and two raters, the procedure is straightforward; Cohen’s Kappa is the most widely used coefficient for that purpose. It is more challenging to compare multiple raters on one criterion; Fleiss’ Kappa is one way to get a coefficient. If there are multiple criteria, one way is to compute the mean of multiple Fleiss’ coefficients.
Performance measures for `caret` and `lm()`
/2019/08/02/performance-measures-for-caret-and-lm-r/
Fri, 02 Aug 2019 00:00:00 +0000/2019/08/02/performance-measures-for-caret-and-lm-r/Recently, I ran into a performance issue when fitting a linear model together with a resampling scheme and a tuning grid (via caret). The dataset was reasonably large - some 200k rows and approx. 20 columns (nycflights13 train). Still, I was surprised that my machine got stuck during the computation. Now I wonder whether I ran into memory constraints (16 GB on my machine), or whether some other stuff went wrong.
Load packages library(tidyverse) library(caret) library(stringr) Load data data("flights", package = "nycflights13") glimpse(flights) #> Observations: 336,776 #> Variables: 19 #> $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2… #> $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… #> $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… #> $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558,… #> $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600,… #> $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, … #> $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753… #> $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745… #> $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3,… #> $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "… #> $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, … #> $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN",… #> $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", … #> $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", … #> $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, … #> $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944,… #> $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6… #> $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0,… #> $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-0… flights2 <- flights %>% select(-year) %>% drop_na() Any NAs?Geoplotting - update to my MODAR-book
/2019/07/29/geoplotting-update-to-my-modar-book/
Mon, 29 Jul 2019 00:00:00 +0000/2019/07/29/geoplotting-update-to-my-modar-book/In my book on modern data analysis using R, I show some basics of geoplotting. It seems that some software update for the package simple features broke my code. So, here’s an update.
Load packages and data library(tidyverse) library(viridis) library(sf) data(socec, package = "pradadata") data(wahlkreise_shp, package = "pradadata") Check data glimpse(socec) #> Observations: 316 #> Variables: 51 #> $ V01 <chr> "Schleswig-Holstein", "Schleswig-Holstein", "Schleswig-Holst… #> $ V02 <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 901, 12, 13, 14, 15, 16, … #> $ V03 <chr> "Flensburg – Schleswig", "Nordfriesland – Dithmarschen Nord"… #> $ V04 <int> 130, 197, 178, 163, 3, 92, 49, 95, 49, 126, 28, 1110, 132, 1… #> $ V05 <dbl> 2128.Slides (in German) for my talk on "Datenkompetenz für alle" at the R-User-Group Nürnberg July 2019
/2019/07/17/slides-in-german-for-my-talk-on-datenkompetenz-f%C3%BCr-alle-at-the-r-user-group-n%C3%BCrnberg-july-2019/
Wed, 17 Jul 2019 00:00:00 +0000/2019/07/17/slides-in-german-for-my-talk-on-datenkompetenz-f%C3%BCr-alle-at-the-r-user-group-n%C3%BCrnberg-july-2019/The slides (pdf) of my talk “Datenkompetenz für alle – Ein Werkstattbericht zum FOM-Statistik-Curriculum” can be found here.Collapse rows to eliminate NAs
/2019/07/03/collapse-rows-to-eliminate-nas/
Wed, 03 Jul 2019 00:00:00 +0000/2019/07/03/collapse-rows-to-eliminate-nas/Load packages library(tidyverse) Starters Assume you have this data frame:
x <- tribble( ~ colA, ~colB, ~colC, NA, 1, NA, 1, NA, 1 ) x #> # A tibble: 2 x 3 #> colA colB colC #> <dbl> <dbl> <dbl> #> 1 NA 1 NA #> 2 1 NA 1 But you want this one:
y <- tribble( ~ colA, ~colB, ~colC, 1, 1, 1 ) y #> # A tibble: 1 x 3 #> colA colB colC #> <dbl> <dbl> <dbl> #> 1 1 1 1 That is, you’d like to collapse rows so that if there’s a NA in a column it is replaced by the value found in some other line.Generalized rowwise operations using purrr::pmap
/2019/07/03/generalized-rowwise-operations-using-purrr-pmap/
Wed, 03 Jul 2019 00:00:00 +0000/2019/07/03/generalized-rowwise-operations-using-purrr-pmap/Load packages library(tidyverse) Rowwise operations are quite frequent in data analysis. The R language environment is particularly strong in column-wise operations. This is due to technical reasons: data frames are internally built as column-by-column structures, hence column-wise operations are simple, rowwise ones more difficult.
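A minimal sketch of a rowwise computation, assuming `purrr::pmap_dbl()` (as the title suggests) and a made-up toy data frame; the rowwise maximum is just an example statistic.

```r
library(purrr)

d <- data.frame(a = c(1, 4), b = c(2, 5), c = c(3, 6))   # toy data

# A data frame is a list of columns, so pmap_dbl() calls the function once
# per row, passing that row's values as (named) arguments
row_max <- pmap_dbl(d, function(...) max(...))
row_max
```

Swapping `max()` for any other function of the row's values gives other rowwise statistics with the same pattern.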
This post looks at a rather general way to compute rowwise statistics. Of course, numerous ways exist and there are quite a few tutorials around, notably by Jenny Bryan and by Emil Hvitfeldt, to name a few.Testing for equality rowwise
/2019/07/03/testing-for-equality-rowwise/
Wed, 03 Jul 2019 00:00:00 +0000/2019/07/03/testing-for-equality-rowwise/Load packages library(tidyverse) Basic testing for equality Testing for equality is a very basic operation in computer (and data) science. There is a straightforward function in R to test for equality:
identical(1, 1) #> [1] TRUE identical("A", "A") #> [1] TRUE identical(1, 2) #> [1] FALSE identical(1, NA) #> [1] FALSE However, this gets more complicated if we want to compare more than two elements. One way to achieve this is to count the number of distinct items.Testing multiple vectors for equality
/2019/07/03/testing-multiple-vectors-for-equality/
Wed, 03 Jul 2019 00:00:00 +0000/2019/07/03/testing-multiple-vectors-for-equality/Load packages library(tidyverse) Problem statement Assume we have some vectors (e.g., 3), and we want to check whether they are equal (the same elements in each vector). Assume further that we do not know in advance the number of vectors to check.
Here’s some toy data.
a <- c(1, 2, 3, 4) b <- c(1, 2, 3, 5) c <- c(1, 3, 4, 5) The gist This solution is based on the code of Akrun from this SO post (slightly adapted).
sum(reduce(map2(list(a,b,c), list(a), `==`), `&`)) #> [1] 1 Explanation Let’s break that in handy pieces to get a grip on it.How to document a conference talk in citation manager software
/2019/06/28/how-to-document-a-conference-talk-in-citation-manager-software/
Fri, 28 Jun 2019 00:00:00 +0000/2019/06/28/how-to-document-a-conference-talk-in-citation-manager-software/There are several popular citation manager software packages around. I used to work with Mendeley in class, but I stopped using it after it was acquired by El$sevier. Luckily there are good alternatives around, particularly Zotero. Zotero features a word processor plugin (MS Word, LibreOffice Writer), which is a must-have for many of us. The more technically inclined folks will use Bibtex. The good news is that Zotero syncs its library to Bibtex.Talk 'Data Science in Business'
/2019/05/10/talk-data-science-for-business/
Fri, 10 May 2019 00:00:00 +0000/2019/05/10/talk-data-science-for-business/Talk “Intro to Data Science in Business” See here the slides (pdf) for the talk.
Talk “Reviewing rapid prototype candidates” See here the slides (pdf) for the talk.
Colophon CC-BY
How to convert raw scores to different types of standardized scores
/2019/04/11/how-to-convert-raw-scores-to-different-types-of-standardized-scores/
Thu, 11 Apr 2019 00:00:00 +0000/2019/04/11/how-to-convert-raw-scores-to-different-types-of-standardized-scores/A common undertaking in applied research settings such as in some areas of psychology is to convert a raw score into some type of standardized score such as z-scores.
This post shows one way to accomplish that.
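A minimal sketch of such a conversion in base R, assuming a small vector of made-up raw scores (not the `extra` data used later in the post): z-scores come first, and the other scales are linear transformations of z. The percentile ranks additionally assume approximate normality.

```r
# Hypothetical raw scores; not taken from the `extra` data set.
raw <- c(12, 15, 9, 20, 14)

# z-scores: centre on the sample mean, scale by the sample SD.
z <- (raw - mean(raw)) / sd(raw)

# Linear transformations of z yield other common scales.
t_score  <- 50 + 10 * z    # T-scores: mean 50, SD 10
iq_score <- 100 + 15 * z   # IQ-type scores: mean 100, SD 15

# Percentile ranks, assuming the scores are approximately normal.
perc_rank <- pnorm(z) * 100

round(data.frame(raw, z, t_score, iq_score, perc_rank), 2)
```

Note that `sd()` uses the n - 1 denominator; with norm data from a reference population, one would plug in the population mean and SD instead.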
Load packages library(tidyverse) Load some psychometric data data("extra", package = "pradadata") The data can be downloaded here.
The dataset shows some data on extraversion (the personality trait) items along with some correlates of extraversion.A stochastic problem by Warren Buffet solved with simulation
/2019/04/04/a-stochastic-problem-by-warren-buffet-solved-with-simulation/
Thu, 04 Apr 2019 00:00:00 +0000/2019/04/04/a-stochastic-problem-by-warren-buffet-solved-with-simulation/This post presents a stochastic problem, with application to financial theory taken from this magazine article. Some say the problem goes back to Warren Buffett. Thanks to my colleague Norman Markgraf, who pointed it out to me.
Assume there are two coins. One is fair, one is loaded. The loaded coin has a bias of 60-40. Now, the question is: How many coin flips do you need to be “sure enough” (say, 95%) that you found the loaded coin?Reducing residual variance in modeling
/2019/03/26/reducing-residual-variance-in-modeling/
Tue, 26 Mar 2019 00:00:00 +0000/2019/03/26/reducing-residual-variance-in-modeling/Modeling is a central part not only of statistical inquiry, but also of everyday human sense-making. We use models as metaphors for the world, in a broader sense. Of course, a model that explains the world better (than some other model) is to be preferred, all other things being equal. In this post, we demonstrate that a more “clever” statistical model reduces the residual variance. It should be noted that this “noise reduction” comes at a cost, however: The model gets more complex; there are more parameters in the model.Example of a logistic regression
/2019/03/20/beispiel-f%C3%BCr-eine-logistische-regression/
Wed, 20 Mar 2019 00:00:00 +0000/2019/03/20/beispiel-f%C3%BCr-eine-logistische-regression/What is it good for? In short, logistic regression is a tool for predicting dichotomous (two-valued) events (based on a data set with some predictors).
What does logistic regression tell us? If, for example, one wants to predict whether an e-mail is spam or not, it is useful to get a probability for each mail to be checked. Logistic regression might tell us: “A mail with these predictor values has a probability of X percent of being spam”.Slides of my talk at ECDA 2019: Modeling of AfD election success
/2019/03/16/slides-of-my-talk-at-ecda-2019-modeling-of-afd-election-success/
Sat, 16 Mar 2019 00:00:00 +0000/2019/03/16/slides-of-my-talk-at-ecda-2019-modeling-of-afd-election-success/Slides of my talk at ECDA 2019 can be found here: http://data-se.netlify.com/slides/afd_ecda2019/afd-modeling-ECDA-2019.html#1.
Note that you need to be online to render the slides.
The (standalone) PDF version can be found here: http://data-se.netlify.com/slides/afd_ecda2019/afd-modeling-ECDA-2019.pdfHow to mutate all columns of a data frame
/2019/03/13/how-to-mutate-all-columns-of-a-data-frame/
Wed, 13 Mar 2019 00:00:00 +0000/2019/03/13/how-to-mutate-all-columns-of-a-data-frame/Say, you have a data frame with a number of columns, and you need to change every column in a similar way. A common example might be to standardize all (numeric) variables. How to do that in R? This post shows and explains an example using mutate_all() from the tidyverse.
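As a minimal sketch, assuming a hypothetical toy data frame `d` with numeric columns only and a helper `z_std()` defined here for illustration, `mutate_all()` applies one function to every column:

```r
library(dplyr)

# Hypothetical toy data frame with numeric columns only.
d <- tibble(x = c(1, 2, 3, 4), y = c(10, 20, 30, 50))

# Helper (illustrative name): subtract the mean, divide by the SD.
z_std <- function(v) (v - mean(v)) / sd(v)

# mutate_all() applies the function to every column of the data frame.
d_z <- d %>% mutate_all(z_std)
d_z
```

In more recent dplyr versions, `mutate(across(everything(), z_std))` expresses the same operation; `mutate_all()` still works but is marked as superseded.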
Let’s stick to the question “how to z-standardize all columns” for the sake of simplicity (and neglect that there are precooked solutions, for example from the superb package sjmisc by strengejacke.Writing emails to lecturers
/2019/02/28/emails-schreiben-an-dozierende/
Thu, 28 Feb 2019 00:00:00 +0000/2019/02/28/emails-schreiben-an-dozierende/Writing e-mails is an essential form of correspondence with its own strengths and weaknesses. In any case, it is ubiquitous. This post is meant to give (my) students some guidance on how to write an e-mail to lecturers. Of course, this is my view of things; other lecturers may prefer a different kind of e-mail.
Ultimately, e-mails to lecturers are nothing but a form of business correspondence. Hence the corresponding rules apply; however, the academic world may reserve a few subtleties (and liberties) that one should know if one wants or has to write such mails.Ornaments with ggformula
/2019/02/12/ornaments-with-gformula/
Tue, 12 Feb 2019 00:00:00 +0000/2019/02/12/ornaments-with-gformula/For some time now, there has been a wrapper for ggplot2 available, bundled in the package ggformula. One nice thing is that it plays nicely with the popular R package mosaic. mosaic provides some useful functions for modeling along with a tamed and consistent syntax. In this post, we will discuss some “ornaments”, that is, some details of beautification of a plot. I confess that not everyone will deem it central, but in some cases it comes in handy to know how to “refine” a plot using ggformula.Online reaction time experiments using lab.js
/2019/01/29/online-reaction-time-experiments-using-lab-js/
Tue, 29 Jan 2019 00:00:00 +0000/2019/01/29/online-reaction-time-experiments-using-lab-js/Collecting data over the internet used to be fancy, some twenty years or so ago. Nowadays it can be considered standard, if not old school (collecting data using mobile apps is where the cool kids go at the moment). However, there’s one notable exception: Collecting reaction time data over the internet remained a challenge. The reason is simply a technological artefact: HTML response times may vary so much as to invalidate the signal from a behavioral reaction time study.Reading text files and Umlaute hassle
/2019/01/25/reading-text-files-and-umlaute-hassle/
Fri, 25 Jan 2019 00:00:00 +0000/2019/01/25/reading-text-files-and-umlaute-hassle/Data is often stored as a plain text file. That’s good because it is a simple format. However, simplicity comes at a cost: Not all questions may have definite answers. The most common hassle when reading/importing text files is that the encoding scheme is unknown, aka wrong. This problem mostly occurs when, say, a Mac user stores a text file, where UTF-8 text encoding is applied by default. In contrast, on a Windows machine, Windows encoding (often dubbed “latin1”, “Windows 1252” or “ISO-8859-1”) is the default.Poster: A Bayes model of AfD party success
/2019/01/24/poster-a-bayes-model-of-afd-party-success/
Thu, 24 Jan 2019 00:00:00 +0000/2019/01/24/poster-a-bayes-model-of-afd-party-success/At the Dozentenmeeting 2019 of the FOM Hochschule, I presented a poster of an analysis of the AfD election success, based on a Bayes multi level regression.
The poster can be downloaded here.Poster: Populism in German politicians
/2019/01/17/poster-populism-in-german-politicians/
Thu, 17 Jan 2019 00:00:00 +0000/2019/01/17/poster-populism-in-german-politicians/At the Dozentenmeeting 2019 of the FOM Hochschule, I presented a poster of an analysis of populism in German politicians.
The poster can be downloaded here.An illustration of tidyverse’ gather/spread
/2019/01/15/an-illustration-of-tidyverse-gather-spread/
Tue, 15 Jan 2019 00:00:00 +0000/2019/01/15/an-illustration-of-tidyverse-gather-spread/Frequently, datasets have to be reshaped before further analysis. One particularly important step is to transform a data frame from “wide” to “long” format. This is illustrated by the following diagram, taken from my new book on data analysis (Image licence: CC-BY-NC).A clean sessionInfo page
/2019/01/14/a-clean-sessioninfo-page/
Mon, 14 Jan 2019 00:00:00 +0000/2019/01/14/a-clean-sessioninfo-page/When writing a technical or academic report, or even a presentation, it is sensible to make the (R) code in it reproducible. The same applies when asking for help at StackOverflow: you’ll be asked for a reprex.
One aspect of rendering a report reproducible is to include details on the versions of the packages needed. The well-known command sessionInfo() provides the building blocks for that. However, the output of that function can feel verbose, and it consumes a lot of space.Barplots with mosaic
/2019/01/10/barplots-with-mosaic/
Thu, 10 Jan 2019 00:00:00 +0000/2019/01/10/barplots-with-mosaic/Plotting barplots is a frequent endeavor for the analysis of qualitative data. Numerous methods for plotting barplots exist; the popular R package mosaic also provides methods.
More recently, mosaic switched to a ggplot wrapper for plotting diagrams, that is gf_XXX(), packaged in ggformula. That implies that input data is expected to be tidy, because ggplot, a central member of the tidyverse, expects its input data to be tidy.
Let’s check an example.A short tutorial for the logistic regression
/2019/01/07/a-short-tutorial-for-the-logistic-regression/
Mon, 07 Jan 2019 00:00:00 +0000/2019/01/07/a-short-tutorial-for-the-logistic-regression/Here’s a quick walk-through for a logistic regression in R.
Setup library(tidyverse) library(reshape2) # dataset "tips" library(caret) library(mosaic) We’ll use the tips dataset:
data(tips) Research question Assume we would like to predict whether a person is female based on some predictor such as the amount of tip she/he gives.
How many instances of each type of the outcome variable are in the data set?
tally(~ sex, data = tips, format = "proportion") #> sex #> Female Male #> 0.Slides for the talk 'Papers publizieren'
/2019/01/04/folien-f%C3%BCr-vortrag-papiers-publizieren/
Fri, 04 Jan 2019 00:00:00 +0000/2019/01/04/folien-f%C3%BCr-vortrag-papiers-publizieren/The slides for my talk “Papers publizieren” (publishing papers) at the Dozententreffen 2019 of the FOM Hochschule can be found here.Why standard regression is not (so) adequate for regressing proportions
/2019/01/03/why-standard-regression-is-not-so-adequate-for-regressing-proportions/
Thu, 03 Jan 2019 00:00:00 +0000/2019/01/03/why-standard-regression-is-not-so-adequate-for-regressing-proportions/Intro Professor Sweet is conducting some research to investigate the risk factors and drivers of student exam success. In a recent analysis he considers the variable “exam successfully passed” (vs. not passed) as the criterion (output) and the amount of time spent on preparation (aka study time) as the predictor.
Setup Please make sure that all packages are installed before proceeding. Except pradadata, all packages are on CRAN. [ Here’s] (https://github.Force bibtex to show the exact date
/2018/12/29/force-bibtex-to-show-the-exact-date/
Sat, 29 Dec 2018 00:00:00 +0000/2018/12/29/force-bibtex-to-show-the-exact-date/Citing (aka scientific citation) is quite straightforward in RMarkdown. However, there are some shortcomings. Primarily, as citations are rendered via Pandoc’s reference engine, bibtex is used as a standard. Though it is quite commonly used, bibtex has been, by and large, replaced by biblatex. biblatex is much more straightforward than bibtex (as text is formatted using latex and not bibtex, still making use of bibtex for the collection of references).Using BibLaTeX instead of Bibtex in Rmarkdown for finer control
/2018/12/28/using-biblatex-instead-of-bibtex-in-rmarkdown-for-finer-control/
Fri, 28 Dec 2018 00:00:00 +0000/2018/12/28/using-biblatex-instead-of-bibtex-in-rmarkdown-for-finer-control/As a standard, bibtex is used as a citation renderer in Pandoc’s Markdown, that is, in RMarkdown as well. bibtex is useful for a fair number of citation tasks, but biblatex allows for finer control. For instance, multiple bibliographies for one document are possible.
For instance, citing a newspaper article using bibtex left me scratching my head, as I wanted to have the exact day of the date (not only the year) cited.Generating mass reports using Rmarkdown in R
/2018/12/19/generating-mass-reports/
Wed, 19 Dec 2018 00:00:00 +0000/2018/12/19/generating-mass-reports/Sometimes, one document must be recreated many times in a similar fashion. For instance, invoices to customers, grading schemes for students, progress reports in projects, and so on. In this post, I demonstrate one way to do that in R using RMarkdown.
Specifically, it is assumed that there’s a tabular data set, where each row refers to a document instance (e.g., a mail or report to one given person), and each column holds the variables to appear in each report (see examples below).Visualizing a multivariate normal distribution
/2018/12/13/visualizing-a-multivariate-normal-distribution/
Thu, 13 Dec 2018 00:00:00 +0000/2018/12/13/visualizing-a-multivariate-normal-distribution/In R, it is quite straight forward to plot a normal distribution, eg., using the package ggplot2 or plotly.
Setup library(tidyverse) library(mvtnorm) library(plotly) library(MASS) Simulate multivariate normal data First, let’s define a covariance matrix \(\Sigma\):
sigma <- matrix(c(4,2,2,3), ncol = 2) sigma ## [,1] [,2] ## [1,] 4 2 ## [2,] 2 3 Then, simulate n observations from this covariance matrix; the means need to be defined, too.Visualizing a regression plane (two predictors)
/2018/12/13/visualizing-a-regression-plane-two-predictors/
Thu, 13 Dec 2018 00:00:00 +0000/2018/12/13/visualizing-a-regression-plane-two-predictors/Plotting a “simple” regression (one predictor) is pretty straightforward in R.
Setup library(tidyverse) data(mtcars) library(mosaic) library(modelr) library(plotly) Define model lm1 <- lm(mpg ~ hp, data = mtcars) mtcars <- mtcars %>% mutate(lm1_pred = predict(lm1)) Plot One way:
ggplot(mtcars) + aes(y = mpg, x = hp) + geom_point() + geom_lm() Another way:
ggplot(mtcars) + aes(x = hp) + geom_point(aes(y = mpg)) + geom_point(aes(y = lm1_pred), color = "blue") + geom_line(aes(y = lm1_pred), color = "blue") Using the ggformula interface to ggplot2:Changing the default color scheme in ggplot2
/2018/12/12/changing-the-default-color-scheme-in-ggplot2/
Wed, 12 Dec 2018 00:00:00 +0000/2018/12/12/changing-the-default-color-scheme-in-ggplot2/UPDATE: see update below based on comments from nmarkgraf.
UPDATE 2: I changed the theme to theme_minimal thanks to the comment from @neuwirthe.
UPDATE 3: A more efficient way to plot a discrete scale using viridis. Thanks to flying sheep; see way 4 below
The default color scheme in ggplot2 is suitable for many purposes, but, for instance, it is not suitable for b/w printing, and maybe not suitable for persons with limited color perception.New split-apply-combine variant in dplyr: group_split()
/2018/12/10/new-split-apply-combine-variant-in-dplyr-group-split/
Mon, 10 Dec 2018 00:00:00 +0000/2018/12/10/new-split-apply-combine-variant-in-dplyr-group-split/UPDATE 2018-12-11 - I’m talking about the package DPLYR, not PURRR, as I had mistakenly written.
There are many approaches for what is called the “split-apply-combine” approach (see this paper by Hadley Wickham).
I recently thought about the best approach to use split-apply-combine approaches in R (see tweet, and this post).
And I retweeted some criticism on the “present era” tidyverse approach (see this tweet), and check out the mentioned post by @coolbutuseless.Applying a function to each row of a data frame
/2018/12/07/applying-a-function-to-each-row-of-a-data-frame/
Fri, 07 Dec 2018 00:00:00 +0000/2018/12/07/applying-a-function-to-each-row-of-a-data-frame/A typical and quite straight forward operation in R and the tidyverse is to apply a function on each column of a data frame (or on each element of a list, which is the same for that regard).
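To make the column-wise case concrete, here is a minimal sketch using `purrr::map_dbl()` on the built-in `mtcars` data; the choice of `mean` as the function is only an example.

```r
library(purrr)
data(mtcars)

# A data frame is a list of columns, so map_dbl() visits each column once
# and returns one number per column, named after the column.
col_means <- map_dbl(mtcars, mean)
col_means[c("mpg", "hp")]
```

Because the iteration unit is the list element, and the list elements of a data frame are its columns, no extra work is needed for this direction.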
However, the orthogonal question of “how to apply a function on each row” is discussed much less often. We will look at this question in this post, and explore some (of many) answers to it.Coercing an index over a character vector
/2018/12/06/coercing-an-index-over-a-character-vector/
Thu, 06 Dec 2018 00:00:00 +0000/2018/12/06/coercing-an-index-over-a-character-vector/Assume we have a vector (of type character) such as countries, names, or products. Each element is allowed to show up multiple times. Further assume that there is a rather large number of unique (different) elements. What we would like to achieve is to give each element a unique ID, where the ID ranges from 1 to k (k is the number of different elements).
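Two base R possibilities, sketched on a hypothetical vector `x` (these are not necessarily the ways the full post explores): `match()` against the unique values numbers the elements in order of first appearance, while the integer codes of a factor number them in sorted order.

```r
# Hypothetical character vector with repeated elements.
x <- c("Spain", "Chile", "Spain", "Japan", "Chile", "Spain")

# IDs in order of first appearance:
# position of each element among the unique values.
id_first <- match(x, unique(x))
id_first
#> [1] 1 2 1 3 2 1

# IDs in sorted order of the unique values,
# taken from the integer codes of a factor.
id_alpha <- as.integer(factor(x))
id_alpha
#> [1] 3 1 3 2 1 3
```

In both cases the IDs range from 1 to k, where k is `length(unique(x))`.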
Of course there are different ways to achieve this goal, we’ll explore one or two.This blog has a DOI
/2018/12/06/this-blog-has-a-doi/
Thu, 06 Dec 2018 00:00:00 +0000/2018/12/06/this-blog-has-a-doi/This blog has a DOI now:Plot many ggplot diagrams using nest() and map()
/2018/12/05/plot-many-ggplot-diagrams-using-nest-and-map/
Wed, 05 Dec 2018 00:00:00 +0000/2018/12/05/plot-many-ggplot-diagrams-using-nest-and-map/At times, it is helpful to plot multiple related diagrams, such as a scatter plot for each subgroup. As always, there are a number of ways of doing so in R. Specifically, we will make use of ggplot2.
library(tidyverse) library(glue) data(mtcars) d <- mtcars %>% rownames_to_column(var = "car_names") Is d a tibble?
is_tibble(d) #> [1] FALSE What is it?
class(d) #> [1] "data.frame" Okay, let’s make a tibble out of it:What are the names of the cars with 4 cylinders?
/2018/12/03/what-are-the-names-of-the-cars-with-4-cylinders/
Mon, 03 Dec 2018 00:00:00 +0000/2018/12/03/what-are-the-names-of-the-cars-with-4-cylinders/Recently, someone asked me this question in a workshop: “What are the names of the cars with 4 (6,8) cylinders?” (he referred to the mtcars data set). That was a workshop on the tidyverse, so the question is how to answer this using tidyverse techniques.
First, let’s load the usual culprits.
library(tidyverse) library(purrrlyr) library(knitr) library(stringr) data(mtcars) # convert row names first: as_tibble() drops them in current tibble versions d <- mtcars %>% rownames_to_column(var = "car_names") %>% as_tibble() d %>% head() %>% kable() car_names mpg cyl disp hp drat wt qsec vs am gear carb Mazda RX4 21.Image paths in Hugo/blogdown
/2018/11/28/image-paths-in-hugo-blogdown/
Wed, 28 Nov 2018 00:00:00 +0000/2018/11/28/image-paths-in-hugo-blogdown/Images from R are instantly included into (R) markdown files, and the same applies for blogdown posts.
See:
x <- 1:10 plot(x) However, for external images - such as photos - things are more complicated. First, all is still fine, if an image is found on some URL/server on the internet:
knitr::include_graphics("https://upload.wikimedia.org/wikipedia/commons/thumb/1/1b/R_logo.svg/310px-R_logo.svg.png") Of course, one can apply direct markdown syntax for including external images:
![](https://upload.wikimedia.org/wikipedia/commons/thumb/1/1b/R_logo.svg/310px-R_logo.svg.png){width=20%} Now assume we are in an R project that gives the base for a blogdown blog.Compute all pairwise differences in matrix
/2018/11/21/compute-all-pairwise-differences-in-matrix/
Wed, 21 Nov 2018 00:00:00 +0000/2018/11/21/compute-all-pairwise-differences-in-matrix/A quite frequent task in many fields of applied math is to compute pairwise differences of elements in a matrix. Actually, it need not be a difference; a product is frequent, too. In this post, we explore some (base) R ways to achieve this.
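As a small sketch of what “all pairwise differences” means, `outer()` with `FUN = "-"` builds the full difference matrix of a hypothetical vector `x`:

```r
# Hypothetical input vector.
x <- c(1, 4, 9)

# Element [i, j] of the result is x[i] - x[j].
diffs <- outer(x, x, FUN = "-")
diffs
#>      [,1] [,2] [,3]
#> [1,]    0   -3   -8
#> [2,]    3    0   -5
#> [3,]    8    5    0

# Leaving out FUN gives the default, the pairwise products.
prods <- outer(x, x)
```

The difference matrix is antisymmetric with a zero diagonal, so the upper or lower triangle alone already holds every distinct pair.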
library(mosaic) library(gdata) library(tidyverse) Using outer() An elegant approach, using base R, is applying outer(). That’s useful if one has two vectors, and wants to compute the outer product:Slides for the „hands-on data exploration workshop"
/2018/11/12/slides-for-the-hands-on-data-exploration-workshop/
Mon, 12 Nov 2018 00:00:00 +0000/2018/11/12/slides-for-the-hands-on-data-exploration-workshop/Find the slides for my workshop “hands-on data exploration using R” here: http://data-se.netlify.com/slides/hands-on-data-exploration/handson-data-workshop_2018-11-21.html.
Note that the slides need access to the internet, in order to be rendered correctly.
: Get PDF of slides here
: Get Rmd source code of slides here
The workshop is delivered at the Data Natives Conference 2018 Berlin.Simple Examples with DiagrammeR
/2018/11/07/simple-examples-with-diagrammer/
Wed, 07 Nov 2018 00:00:00 +0000/2018/11/07/simple-examples-with-diagrammer/UPDATE 2018-12-13: Based on a comment from @nmarkgraf, I added a section on how to export diagrammeR diagrams.
Here are some examples of diagrams built with DiagrammeR:
Setup library(tidyverse) library(DiagrammeR) library(DiagrammeRsvg) library(magick) DiagrammeR using grViz() Define the graph:
g1 <- "digraph boxes_and_circles { graph [layout = circo, overlap = true] node [shape = circle, fixedsize = true, fontname = Helvetica, width = 1] Problem; Plan; Data; Analysis; Conclusion edge [color = grey] Problem -> Plan Plan -> Data Data -> Analysis Analysis -> Conclusion Conclusion -> Problem }" Print it to the screen:Plot columns repeatedly
/2018/11/02/plot-columns-repeatedly/
Fri, 02 Nov 2018 00:00:00 +0000/2018/11/02/plot-columns-repeatedly/Suppose you have a large number of columns of a dataframe, and you want to plot each column – say a histogram for each column.
This post shows some ways of achieving this.
Let’s take the mtcars dataset as an example.
data(mtcars) We will use the tidyverse approach:
library(tidyverse) Way 1 mtcars %>% select_if(is.numeric) %>% map2(., names(.), ~ {ggplot(data = tibble(.x), aes(x = .x)) + geom_histogram() + labs(x = .y)}) #> $mpg #> #> $cyl #> #> $disp #> #> $hp #> #> $drat #> #> $wt #> #> $qsec #> #> $vs #> #> $am #> #> $gear #> #> $carb Some explanations:OECD Wellbeing - exploratory analysis
/2018/10/16/oecd-wellbeing-explorative-analyse/
Tue, 16 Oct 2018 00:00:00 +0000/2018/10/16/oecd-wellbeing-explorative-analyse/In this post, we examine some aspects of exploratory data analysis for the data set oecd wellbeing from 2016.
Note: Sections marked as in-depth material (“Vertiefung”) are not relevant for the exam.
Required packages A standard package for basic data analysis:
library(mosaic) Load the data set The data set can be obtained here.
Doi: https://doi.org/10.1787/data-00707-en.
If the data set is available locally (on your computer), you can load it in the usual manner. To do so, enter the path to the data set:
oecd <- read.OECD Wellbeing dataset (2016)
/2018/10/16/oecd-wellbeing-dataset-2016/
Tue, 16 Oct 2018 00:00:00 +0000/2018/10/16/oecd-wellbeing-dataset-2016/Packages We will need the following packages in this post:
library(mosaic) library(knitr) library(DT) The OECD wellbeing study series The OECD keeps measuring the wellbeing (and associated variables) among its member states.
On the project website, the OECD states:
In recent years, concerns have emerged regarding the fact that macro-economic statistics, such as GDP, don’t provide a sufficiently detailed picture of the living conditions that ordinary people experience. While these concerns were already evident during the years of strong growth and good economic performance that characterised the early part of the decade, the financial and economic crisis has further amplified them.Change standard theme of ggplot
/2018/10/10/change-standard-theme-of-ggplot/
Wed, 10 Oct 2018 00:00:00 +0000/2018/10/10/change-standard-theme-of-ggplot/ggplot2 is customizable. Frankly, one can change a heap of details - not everything probably, but a lot. Of course, one can add a theme to the ggplot call, in order to change the theme. However, a more catch-all approach would be to change the standard theme of ggplot itself. In this post, we’ll investigate this option.
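One way to change the default, sketched here under the assumption that `theme_minimal()` is the desired look, is `ggplot2::theme_set()`. It replaces the session-wide default theme and invisibly returns the previous one, so the change can be undone.

```r
library(ggplot2)

# theme_set() swaps the session-wide default theme and
# invisibly returns the one that was active before.
old_theme <- theme_set(theme_minimal())

# Plots built from now on pick up the new default,
# without an explicit `+ theme_minimal()`.
p <- ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point()

# Restore the previous default when done.
theme_set(old_theme)
```

Placing the `theme_set()` call in a setup chunk of a report applies the theme to every plot in that document.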
Load some data and the right packages:
data(mtcars) library(tidyverse) Here’s the standard theme of ggplot, let’s have a look at itTalk - Populism in tweets of German politicians (talk at DGPs 2018)
/2018/09/14/talk-populism-in-tweets-of-german-politicians-talk-at-dgps-2018/
Fri, 14 Sep 2018 00:00:00 +0000/2018/09/14/talk-populism-in-tweets-of-german-politicians-talk-at-dgps-2018/The slides of my talk Populism in tweets of German politicians
can be found here http://data-se.netlify.com/slides/populist-twitter/populist-twitter-dgps2018.html#1.
Data, code, and more can be found at Github: https://github.com/sebastiansauer/polits_tweet_miningDataExploR: Analyzing typical business questions with R
/2018/09/12/dataexplor-typische-businessfragen-mit-r-analysieren/
Wed, 12 Sep 2018 00:00:00 +0000/2018/09/12/dataexplor-typische-businessfragen-mit-r-analysieren/In this post, we examine a rather frequent question in data analysis: the analysis of survey data. Surveys are common in many organizations: one wants to know whether customers are satisfied or what employees think of the management. We will not consider all aspects of the analysis (there is a lot to do), but pick out a few central ones.
Let’s first load a few useful packages:When Excel gives up: data visualization can become too complex for Excel
/2018/09/11/wenn-excel-aufgibt-datenvisualisierung-kann-zu-komplex-f%C3%BCr-excel-werden/
Tue, 11 Sep 2018 00:00:00 +0000/2018/09/11/wenn-excel-aufgibt-datenvisualisierung-kann-zu-komplex-f%C3%BCr-excel-werden/MS Excel is a popular tool for data analysis, including data visualization. There are some examples showing that other tools, such as R, can lead to better-looking diagrams; see this post. This post is about a related question: Are there diagrams that cannot be created with Excel, or only with great effort?
My answer is: Yes, there are. Let’s consider an example.
Visualizing Bayesian models As background, we use an analysis (cf.Plotting a logistic regression - some considerations
/2018/09/03/plotting-a-logistic-regression-some-considerations/
Mon, 03 Sep 2018 00:00:00 +0000/2018/09/03/plotting-a-logistic-regression-some-considerations/library(mosaic) data(tips, package = "reshape2") Recode sex:
tips %>% mutate(sex_n = case_when( sex == "Female" ~ 0, sex == "Male" ~ 1 )) -> tips2 Fit model:
glm1 <- glm(sex_n ~ total_bill, data = tips2, family = "binomial") Way 1 plotModel(glm1) Way 2 Add predictions to data frame:
tips2 %>% mutate(pred = predict(glm1, newdata = tips, type = "response")) %>% mutate(predict_Male = pred > .5) -> tips3 Check values of predictions:Reproducible academic writing with RMarkdown - Talk at DGPs 2018
/2018/09/03/reproducible-academic-writing-with-rmarkdown-talk-at-dgps-2018/
Mon, 03 Sep 2018 00:00:00 +0000/2018/09/03/reproducible-academic-writing-with-rmarkdown-talk-at-dgps-2018/Talk at DGPs 2018.
Get slides here: http://data-se.netlify.com/slides/rmd-writing/rmd-writing_dgps2018.html.Talk - Predictors of AfD party success in the 2017 elections. A Bayesian modeling approach
/2018/09/02/predictors-of-afd-party-success-in-the-2017-elections-a-bayesian-modeling-approach/
Sun, 02 Sep 2018 00:00:00 +0000/2018/09/02/predictors-of-afd-party-success-in-the-2017-elections-a-bayesian-modeling-approach/Talk at DGPs 2018.
Get slides here http://data-se.netlify.com/slides/afd_dgps2018/afd_dgps2018.htmlBayesian modeling of populist party success in German federal elections - A notebook from the lab
/2018/08/25/bayesian-modeling-of-populist-party-success-in-german-federal-elections/
Sat, 25 Aug 2018 00:00:00 +0000/2018/08/25/bayesian-modeling-of-populist-party-success-in-german-federal-elections/Following up on an earlier post, we will model the voting success of the (most prominent) populist party, AfD, in the recent federal elections. This time, Bayesian modeling techniques will be used, drawing on the excellent textbook by McElreath.
Note that this post is rather a notebook of my thinking, doing, and erring. I’ve made no efforts to hide scaffolding. I think it will be confusing to the uninitiated and the initiated alike …Binning and recoding with R - some recommendations
/2018/08/09/binning-and-recoding-with-r-some-recommendations/
Thu, 09 Aug 2018 00:00:00 +0000/2018/08/09/binning-and-recoding-with-r-some-recommendations/Recoding means changing the levels of a variable, for instance changing “1” to “woman” and “2” to “man”. Binning means aggregating several variable levels into one, for instance aggregating the values from “1.00 meter” to “1.60 meter” into “small_size”.
Both operations are frequently necessary in practical data analysis. In this post, we review some methods to accomplish these two tasks.
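A minimal base R sketch of the two operations, using made-up vectors: recoding via lookup in a named vector, binning via `cut()`. The bin boundaries above 1.60 and the labels `medium_size` and `large_size` are assumptions added for illustration.

```r
# Hypothetical toy data.
sex_code <- c(1, 2, 2, 1)
height_m <- c(1.55, 1.72, 1.95, 1.60)

# Recoding: look up each code in a named vector.
lookup <- c("1" = "woman", "2" = "man")
sex_label <- unname(lookup[as.character(sex_code)])
sex_label
#> [1] "woman" "man"   "man"   "woman"

# Binning: cut() maps intervals to labels;
# intervals are right-closed by default, so 1.60 still counts as small.
size <- cut(height_m,
            breaks = c(1.00, 1.60, 1.80, Inf),
            labels = c("small_size", "medium_size", "large_size"))
as.character(size)
#> [1] "small_size"  "medium_size" "large_size"  "small_size"
```

`cut()` returns a factor whose levels follow the order of the breaks, which is usually the order one wants in tables and plots.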
Let’s load some example data:
data(tips, package = "reshape2") Some packages:Finding NAs in multiples columns (per row)
/2018/08/09/finding-nas-in-multiples-columns-per-rows/
Thu, 09 Aug 2018 00:00:00 +0000/2018/08/09/finding-nas-in-multiples-columns-per-rows/Assume you would like to check for missing data, not just in one column but in several columns.
First, data and some packages:
data(mtcars) library(tidyverse) Then, let’s introduce some missing data:
mtcars[c(1,2), 1] <- NA mtcars[c(1, 3:4), 2] <- NA Don’t check columns individually Of course, you do not want to repeat yourself, and check each column individually, like this:
sum(is.na(mtcars[[1]])) #> [1] 2 sum(is.na(mtcars[, 1])) # same #> [1] 2 Nor would one like to check each row individually:Power calculation for the general linear model
/2018/07/24/power-calculation-for-the-general-linear-model/
Tue, 24 Jul 2018 00:00:00 +0000/2018/07/24/power-calculation-for-the-general-linear-model/Before conducting an experiment, one should compute the power - or, preferably, estimate the precision of the expected results. There are numerous ways to achieve this; here’s one using the R package pwr.
Package pwr library(pwr) The workhorse function here is pwr.f2.test. Note that f2 refers to the effect size \(f^2\) (see here), defined as:
\[f^2 = \frac{R^2}{1-R^2}\].
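To make the formula concrete, here is a small worked example. The value \(R^2 = 0.13\), the numerator degrees of freedom `u = 2`, and the target power of 0.8 are assumptions chosen for illustration only.

```r
# Convert an assumed R^2 into the effect size f^2
r2 <- 0.13
f2 <- r2 / (1 - r2)
round(f2, 3)

# The conversion can be inverted: R^2 = f^2 / (1 + f^2)
r2_back <- f2 / (1 + f2)

# If the pwr package is installed, f2 can be passed on to pwr.f2.test();
# leaving v = NULL asks for the denominator degrees of freedom.
if (requireNamespace("pwr", quietly = TRUE)) {
  print(pwr::pwr.f2.test(u = 2, f2 = f2, sig.level = 0.05, power = 0.8))
}
```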
For details of the function, see its help page:
help("pwr.f2.test") pwr.f2.test(u = NULL, v = NULL, f2 = NULL, sig.How to prepare data for a gantt diagram
/2018/07/05/how-to-prepare-data-for-a-gantt-diagram/
Thu, 05 Jul 2018 00:00:00 +0000/2018/07/05/how-to-prepare-data-for-a-gantt-diagram/There’s the new cool world of project management - agile, scrumbling, cool. There’s the old sluggish way of project management using stuff like Gantt diagrams. Let’s stick to the old world and come up with a Gantt diagram.
The Gantt diagram itself is no big deal. Just some horizontal lines referring to dates. Somewhat more interesting is to populate a raw data frame in a way that allows for convenient plotting.Work with bibtex bib files like a pro
/2018/07/05/work-with-bibtex-bib-files-like-a-pro/
Thu, 05 Jul 2018 00:00:00 +0000/2018/07/05/work-with-bibtex-bib-files-like-a-pro/Recently, I had to curate a list of publications for our institution. Where’s the point? One might ask. Let’s leave aside that a number of colleagues do not use citation management software to work with their publications. They just hack the citation, if and when needed, into some Word files. Done. Fair enough, unless someone tries to come up with a list of all the publications of that institution. In that case, the curator will need some structured data, otherwise he or she will end up copy-pasting the rest of the day.How to cite "in press" using Bibtex
/2018/07/01/how-to-cite-in-press-using-bibtex/
Sun, 01 Jul 2018 00:00:00 +0000/2018/07/01/how-to-cite-in-press-using-bibtex/Bibtex entry type for conference talks suitable for APA
/2018/06/26/bibtex-entry-type-for-conference-talks-suitable-for-apa/
Tue, 26 Jun 2018 00:00:00 +0000/2018/06/26/bibtex-entry-type-for-conference-talks-suitable-for-apa/I’ve wondered how to best cite a talk given at a conference that is not “really” published in the sense that there’s no ISBN or similar identifier.
One can argue that it is not worth citing a non-identifiable source - I basically agree with that. However, for some reasons it may be helpful to cite it anyway. For example, one may have to document the talks being given.
For that purpose, I found this bibtex entry type helpful:Easy way to convert factors to numbers
/2018/06/22/easy-way-to-convert-factors-zu-numbers/
Fri, 22 Jun 2018 00:00:00 +0000/2018/06/22/easy-way-to-convert-factors-zu-numbers/Converting factors to numbers in R can be frustrating. Consider the following situation: We have some data, and try to convert a factor (sex in tips, see below) to a numeric variable:
library(tidyverse) library(sjmisc) # for recoding data(tips, package = "reshape2") glimpse(tips) #> Observations: 244 #> Variables: 7 #> $ total_bill <dbl> 16.99, 10.34, 21.01, 23.68, 24.59, 25.29, 8.77, 26.... #> $ tip <dbl> 1.01, 1.66, 3.50, 3.31, 3.61, 4.71, 2.Some musings on the logistic map
/2018/06/19/some-musings-on-the-logistic-map/
Tue, 19 Jun 2018 00:00:00 +0000/2018/06/19/some-musings-on-the-logistic-map/The logistic map is a well-known and simple growth model that is defined by the iterative equation
\[x_{t+1} = 4rx_t(1-x_t)\],
where \(r\) is a parameter that can be thought of as a fertility and reproduction rate of the population. The allowed values of \(x\) range between 0 and 1 inclusive, where 0 means the population is extinct. The maximum of 1 can be interpreted as the ecological carrying capacity of the system.Visualizing mean values between two groups - the tidyverse way
/2018/06/10/visualizing-summary-statistics-the-tidyverse-way/
Sun, 10 Jun 2018 00:00:00 +0000/2018/06/10/visualizing-summary-statistics-the-tidyverse-way/A frequent job in data visualization is to present summary statistics. In this post, I show one way to plot mean values between groups using the tidyverse approach in comparison to the mosaic way.
library(tidyverse) data(mtcars) library(mosaic) library(knitr) library(sjmisc) library(sjPlot) Visualizing mean values between two groups First, let’s compute the mean hp for automatic cars (am == 0) vs. manual cars (am == 1).
mtcars %>% group_by(am) %>% summarise(hp_am = mean(hp)) -> hp_am Now just hand over this data frame of summarized data to ggplot:Playing around with geo mapping: combining demographic data with spatial data
/2018/05/28/playing-around-with-geo-mapping-combining-demographic-data-with-spatial-data/
Mon, 28 May 2018 00:00:00 +0000/2018/05/28/playing-around-with-geo-mapping-combining-demographic-data-with-spatial-data/In this post, we will play around with some basic geo mapping. More precisely, we will explore some easy ways to plot a choropleth map.
First, let’s load some geo data from Bundeswahlleiter, and combine it with some socio demographic data from the same source.
Preparation Let’s load some packages:
library(tidyverse) ## Warning: package 'dplyr' was built under R version 3.5.1 library(sf) library(viridis) suppressPackageStartupMessages(library(googleVis)) Geo data:
my_path_wahlkreise <- "~/Documents/datasets/geo_maps/btw17_geometrie_wahlkreise_shp/Geometrie_Wahlkreise_19DBT.shp" file.exists(my_path_wahlkreise) ## [1] TRUE socio demographic data:Playing around with dumbbell plots
/2018/05/23/playing-around-with-dumbbell-plots/
Wed, 23 May 2018 00:00:00 +0000/2018/05/23/playing-around-with-dumbbell-plots/Dumbbell plots can be used to show differences between two groups. Bob Rudis demonstrated a beautiful application of such plots using ggplot2 board methods.
In this post, I will explain and comment on his code, and make a few changes.
First, load some packages.
pacman::p_load(tidyverse, ggalt) Let’s make up some data. Tip: Make up some data conveniently in Excel, copy it to the clipboard, and then paste it as tribble (see below) into R.Playing around with dataviz: Comparing distributions between groups
/2018/05/18/playing-around-dataviz-comparing-distributions-between-groups/
Fri, 18 May 2018 00:00:00 +0000/2018/05/18/playing-around-dataviz-comparing-distributions-between-groups/What’s a nice way to display distributional differences between a (larger) number of groups? Boxplots are one way to go. In addition, the raw data may be shown as dots, but should be de-emphasized. Third, a trend or big picture comparing the groups will make sense in some cases.
Ok, based on this reasoning, let’s do some visualizing. Let’s load some data (movies), and the usual culprits of packages.
library(tidyverse) ## Warning: package 'dplyr' was built under R version 3.Playing around with dataviz: Showing correlations
/2018/05/18/playing-around-with-dataviz-showing-correlations/
Fri, 18 May 2018 00:00:00 +0000/2018/05/18/playing-around-with-dataviz-showing-correlations/In this post, we are looking into some ways of displaying the association between (two) quantitative variables, aka correlation. Our goal is to present a rich representation of the correlation.
Let’s take the dataset flights as an example.
data(flights, package = "nycflights13") library(tidyverse) ## Warning: package 'dplyr' was built under R version 3.5.1 library(viridis) flights %>% filter(arr_delay < 100, dep_delay < 100) %>% ggplot(aes(x = dep_delay, y = arr_delay, color = origin)) + geom_point(alpha = .Showcase of Viridis, maps, and ggcounty
/2018/05/18/showcase-of-viridis-maps-and-ggounty/
Fri, 18 May 2018 00:00:00 +0000/2018/05/18/showcase-of-viridis-maps-and-ggounty/This post shows how easy it can be to build a visually pleasing plot. We will use hrbrmstr’s ggcounty, which is an R package at this Github repo. The graphics engine is, as in most of my plots, Hadley Wickham’s ggplot. All built on R. Standing on shoulders…
Disclaimer: This example heavily draws on hrbrmstr’s example on this page. All credit is due to Rudis, and those on whose work he built upon.Why is the sample mean a good point estimator of the population mean? A simulation and some thoughts.
/2018/05/18/why-is-the-sample-mean-a-good-point-estimator-of-the-population-mean-a-simulation-and-some-thoughts/
Fri, 18 May 2018 00:00:00 +0000/2018/05/18/why-is-the-sample-mean-a-good-point-estimator-of-the-population-mean-a-simulation-and-some-thoughts/It is frequently stated that the sample mean is a good or even the best point estimator of the corresponding population value. But why is that? In this post we are trying to get an intuition by using simulation-based inference methods.
Assume you played at tossing coins with someone in some dark corner. “Someone” tosses the coin 10 times, and wins 8 times (the guy was betting on heads, but that’s only for the sake of the story).Convenient way to cite blog posts using Bibtex
/2018/04/11/convenient-way-to-cite-blog-posts-using-bibtex/
Wed, 11 Apr 2018 00:00:00 +0000/2018/04/11/convenient-way-to-cite-blog-posts-using-bibtex/Writing (scholarly) texts - a great way is using Markdown. Bibtex interacts nicely with Markdown, so one can easily cite literature.
One question that came up for me a couple of times recently was how to cite blogs in Bibtex?
I found this solution to be the most convenient:
@misc{stats_test, Author = {Sebastian Sauer}, Date-Added = {2018-03-29 13:54:38 +0000}, Date-Modified = {2018-03-29 13:55:51 +0000}, Doi = {10.17605/OSF.IO/SJHUY}, Howpublished = {Data Set}, Month = {01}, Title = {Results from an exam in inferential statistics}, Year = {2017}} The important points are the @misc class, and the Howpublished field.One-way ANOVA power analysis
/2018/04/11/one-way-anova-power-analysis/
Wed, 11 Apr 2018 00:00:00 +0000/2018/04/11/one-way-anova-power-analysis/Computing or estimating power is a very useful procedure in order to weigh the reliability of study results.
One frequent procedure in inferential statistics is the ANOVA, with the simplest form being the one-way ANOVA. This post shows how to compute power for this test.
What’s the effect size? The first thing to note is that there is no such thing as “power” - in the sense that a sample or a test would have “its power”.Parse libraries from R project
/2018/04/11/parse-libraries-from-r-project/
Wed, 11 Apr 2018 00:00:00 +0000/2018/04/11/parse-libraries-from-r-project/Having written a larger R project, it may be of interest which packages have been used. As I did not find a ready-to-use package, a colleague of mine - Norman Markgraf - came up with a nice solution. In this post, I build on his solution to provide a function that suits my needs of today:
@Norman: Thanks for your great idea!
First, some libraries:
library(tidyverse) library(bibtex) library(testthat) Then, here is some path of an R project where we want to parse all rmd files:Visualisation of interaction for the logistic regression
/2018/04/02/visualisation-of-interaction-for-logistic-regression/
Mon, 02 Apr 2018 00:00:00 +0000/2018/04/02/visualisation-of-interaction-for-logistic-regression/In this post we are plotting an interaction for a logistic regression. Interaction per se is a concept that is difficult to grasp; for a GLM it may be even more difficult, especially for an interaction of continuous variables. Plotting helps to grasp more easily what a model tries to tell us.
First, load some packages.
library(tidyverse) ## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ── ## ✔ ggplot2 3.0.0 ✔ purrr 0.2.5 ## ✔ tibble 1.Why "n-1" in empirical variance? A simulation.
/2018/03/24/why-n-1-in-empirical-variance-a-simulation/
Sat, 24 Mar 2018 00:00:00 +0000/2018/03/24/why-n-1-in-empirical-variance-a-simulation/It is well-known that the empirical variance underestimates the population variance. Specifically, the empirical variance is defined as: \(var_{emp} = \frac{\sum_i (x_i - \bar{x})^2}{n-1}\). But why \(n-1\), why not just \(n\), as intuition (of some) dictates? Put briefly, as the variance of a sample tends to underestimate the population variance, we have to inflate it artificially, to enlarge it; that’s why we put a smaller number (the “n-1”) in the denominator, resulting in a larger value of the whole fraction.An example of Simpson's paradox
/2018/03/16/beispiel-zu-simpsons-paradox/
Fri, 16 Mar 2018 00:00:00 +0000/2018/03/16/beispiel-zu-simpsons-paradox/In this post we discuss an example of Simpson’s paradox. The focus is not on the R syntax, but on an intuitive explanation of Simpson’s paradox. (The syntax can be found in a similar form in this post.)
Say you have to choose between two doctors (Dr. Arriba and Dr. Bajo) and wonder which one is “better”. By “better” you mean “higher cure rate”.
The two doctors treat the same two diseases: Severitis and Nervosia maskulina.Tangible data of normal distributed data
/2018/03/16/tangible-data-of-normal-distributed-data/
Fri, 16 Mar 2018 00:00:00 +0000/2018/03/16/tangible-data-of-normal-distributed-data/A classical example for a normally distributed variable is height. However, I kept on looking for data as to the mean and sd for some populations, such as Germany. Now I found some reliable-looking data here.
We will not question whether the assumption of normality holds, we just assume it.
In the source, we can read that in Germany, the adult male population has the following parameters:
mean: 174cmMap students to presentation slots
/2018/03/11/map-students-to-presentation-slots/
Sun, 11 Mar 2018 00:00:00 +0000/2018/03/11/map-students-to-presentation-slots/As a teacher, I not only teach but also assess the achievements of students. One example of a typical student assignment is a presentation. You know, powerpoint slides and stuff.
For that purpose, I often need to map students to one of several time slots. Here’s the R code I use for that purpose.
library(tidyverse) ## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ── ## ✔ ggplot2 3.0.0 ✔ purrr 0.2.5 ## ✔ tibble 1.Intuition to Simpson's paradox
/2018/03/09/intuition-to-simpson-s-paradox/
Fri, 09 Mar 2018 00:00:00 +0000/2018/03/09/intuition-to-simpson-s-paradox/Say, you have to choose between two doctors (Anna and Berta). To decide which one is better, you check their success rates. Suppose that they deal with two conditions (Coolities and Dummities). So let’s compare their success rate for each of the two conditions (and the total success rate):
This is the proportion of healing (success) of the first doctor, Dr. Anna for each of the two conditions:
Coolities: 7 out of 8 patients are healed from Coolities Dummities: 1 out of 2 patients are healed from Dummities This is the proportion of healing (success) of the first doctor, Dr.How to create columns in a dataframe in R
/2018/03/07/how-to-create-columns-in-a-dataframe-in-r/
Wed, 07 Mar 2018 00:00:00 +0000/2018/03/07/how-to-create-columns-in-a-dataframe-in-r/Note that we will use this library for this post:
library(dplyr) ## Warning: package 'dplyr' was built under R version 3.5.1 ## ## Attaching package: 'dplyr' ## The following objects are masked from 'package:stats': ## ## filter, lag ## The following objects are masked from 'package:base': ## ## intersect, setdiff, setequal, union By the way, loading mosaic will load dplyr too.
One of the major data wrangling activities (in R and elsewhere) is to create a new column in a data frame.Publishing papers. An attempt at a guide
/2018/01/25/papers-publizieren-versuch-einer-anleitung/
Thu, 25 Jan 2018 00:00:00 +0000/2018/01/25/papers-publizieren-versuch-einer-anleitung/At https://sebastiansauer.github.io/Talks-ses/pubws.html#/ you can find the HTML slides of a talk of mine on how to publish papers (or at least try to).
The source code can be found in this Github repo.
The talk is licensed under CC-BY.Simulate p-hacking - adding observations
/2018/01/24/simulate-p-hacking-adding-observations/
Wed, 24 Jan 2018 00:00:00 +0000/2018/01/24/simulate-p-hacking-adding-observations/Let’s simulate p-values as a function of sample size. We assume that some researcher collects one data point, computes the p-value, and repeats until the p-value falls below some arbitrary threshold. Oh and yes, there is no real effect. For the sake of spending the budget, assume that our researcher collects a sample size of \(n=100\).
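The procedure just described can be sketched as follows. This is not the post's own code: the one-sample t-test, the starting size `n_min = 3` (a t-test needs more than one observation), the threshold `alpha = 0.05`, and the number of simulated studies `n_sims` are assumptions made for illustration.

```r
set.seed(42)

# One simulated "study": add one observation at a time under a true null,
# test after each addition, and stop as soon as p falls below alpha.
peek_until_significant <- function(n_max = 100, n_min = 3, alpha = 0.05) {
  x <- rnorm(n_max)                    # no real effect: true mean is 0
  for (n in n_min:n_max) {
    p <- t.test(x[1:n])$p.value        # test the first n observations
    if (p < alpha) return(TRUE)        # "significant" result found by peeking
  }
  FALSE                                # budget spent, nothing "found"
}

n_sims <- 200
false_positive_rate <- mean(replicate(n_sims, peek_until_significant()))
false_positive_rate
```

With repeated peeking, the share of "significant" studies usually ends up well above the nominal 5%, which is the point of the simulation.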
This idea stems from this great article False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant; cf.Visualizing a logistic regression the easy way
/2018/01/23/visualizing-a-logistic-regression-the-easy-way/
Tue, 23 Jan 2018 00:00:00 +0000/2018/01/23/visualizing-a-logistic-regression-the-easy-way/Let’s visualize a GLM (logistic regression).
First, load some data:
data(tips, package = "reshape2") Compute a glm:
glm_tips <- glm(sex ~ tip, data = tips, family = "binomial") Plot the model using mosaic:
library(mosaic) ## Warning: package 'dplyr' was built under R version 3.5.1 plotModel(glm_tips) The curve does not really look s-shaped (ogive) but that’s ok because the data do not suggest a strong trend. The plot is not very beautiful either, but hey - it’s quick to produce 😁.Association between studying and grades in statistics class
/2017/12/20/zusammenhang-von-lernen-und-noten-im-statistikunterricht/
Wed, 20 Dec 2017 00:00:00 +0000/2017/12/20/zusammenhang-von-lernen-und-noten-im-statistikunterricht/Does studying lead to better grades? Personal experience and general consensus agree; at least studying the material does no harm and often helps to achieve good grades in an exam on that material. But what evidence, scientific evidence, is there for this? At our university, the FOM, we conducted a small study on this question. More precisely, we gave our students a statistics test and asked how much they had prepared for this test.A p-value picture
/2017/11/29/a-p-value-picture/
Wed, 29 Nov 2017 00:00:00 +0000/2017/11/29/a-p-value-picture/Much ado about, and much to say about, the p-value. Let me add one more point; actually not really from myself, but from Diez, Barr, and Cetinkaya-Rundel (2012), p. 189; a good book if one is looking for “orthodox” statistics.
library(tidyverse) ## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ── ## ✔ ggplot2 3.0.0 ✔ purrr 0.2.5 ## ✔ tibble 1.4.2 ✔ dplyr 0.7.6 ## ✔ tidyr 0.8.1 ✔ stringr 1.3.1 ## ✔ readr 1.Basics of text mining with R
/2017/11/28/textmining-grundlagen/
Tue, 28 Nov 2017 00:00:00 +0000/2017/11/28/textmining-grundlagen/Learning objectives:
- You know the central goals and concepts of text mining. - You know what a 'tidy text dataframe' is. - You can count word frequencies. - You can visualize word frequencies using a word cloud. R packages needed in this exercise:
library(tidyverse) # data judo library(stringr) # text processing library(tidytext) # text mining library(lsa) # stop words library(SnowballC) # stemming words library(wordcloud) # display word cloud Please install all packages in good time, e.g., in RStudio via the Packages > Install tab.Basics of text mining with R - Part 2
/2017/11/28/grundlagen-des-textminings-mit-r-teil-2/
Tue, 28 Nov 2017 00:00:00 +0000/2017/11/28/grundlagen-des-textminings-mit-r-teil-2/R packages needed in this exercise:
library(tidyverse) # data judo library(stringr) # text processing library(tidytext) # text mining library(lsa) # stop words library(SnowballC) # stemming words library(wordcloud) # display word cloud library(skimr) # summary statistics Please install all packages in good time, e.g., in RStudio via the Packages … Install tab.
## ## Attaching package: 'knitr' ## The following object is masked from 'package:skimr': ## ## kable Read in the data from the last post:
osf_link <- paste0("https://osf.io/b35r7/?action=download") afd <- read_csv(osf_link) ## Parsed with column specification: ## cols( ## page = col_double(), ## content = col_character() ## ) From wide to long:Image path for blogdown
/2017/11/28/image-path-for-blogdown/
Tue, 28 Nov 2017 00:00:00 +0000/2017/11/28/image-path-for-blogdown/How to include external images in a Hugo post?
Suppose we have a file img1.png in project1, i.e., project1/img1.png. Do this:
Copy your folder with images to static/. Use this path in your blogdown post: /project1/img1.png. Mind the leading slash! Example time This code (on my machine) ![](/images/textmining/tidytext-crop.png){ width="20%" }
renders this:
Note the nice width option.
Knitr way The knitr way works similarly:
knitr::include_graphics("/images/textmining/tidytext-crop.png") Dummy variables and regression
/2017/11/27/dummy-variables-and-regression/
Mon, 27 Nov 2017 00:00:00 +0000/2017/11/27/dummy-variables-and-regression/For modeling cause-effect relationships, linear regression is among the most typically used methods.
Take, for example, the idea that the Gross Domestic Product (GDP) drives religiosity. Of course, we should have a strong theory that defends this choice and this directionality. Without a convincing theory it may be argued that the causal relationship is the other way round or completely different (i.e., some third variable accounts for any association between GDP and religiosity).Interactive diagrams in lieu of shiny?
/2017/11/27/interactive-diagrams-in-lieu-of-shiny/
Mon, 27 Nov 2017 00:00:00 +0000/2017/11/27/interactive-diagrams-in-lieu-of-shiny/One frequent use of the Shiny server software is displaying interactive data diagrams. The pro of using Shiny is the great flexibility; much more than “just graphics” can be done. Basically Shiny provides a flexible GUI for your R program. But if you are simply aiming at displaying or exploring some data interactively, a much simpler approach may do it for you; there are some nice libraries available in R for that.My favorite stats text book
/2017/11/27/my-favorite-stats-text-book/
Mon, 27 Nov 2017 00:00:00 +0000/2017/11/27/my-favorite-stats-text-book/Some thoughts on what my favorite applied stats text book would look like. I am thinking of, e.g., business fields such as MBA programs as consumers.
My ideal applied stats text book is case study oriented (“Assume you would like to predict which movie will score highest next year based on some movie characteristics you know”)
makes use of recent data analytics techniques such as tree based methods (Random Forests) or Shrinkage models (Lasso)Compute effect sizes with R. A primer.
/2017/11/21/compute-effect-sizes-with-r-a-primer/
Tue, 21 Nov 2017 00:00:00 +0000/2017/11/21/compute-effect-sizes-with-r-a-primer/A typical “cook book recipe” for doing data analysis in an applied stats course is:
- report descriptive statistics
- plot some nice diagrams
- test hypotheses
- report effect sizes
Let’s have a quick glance at these steps. We will use the dataset flights of the package nycflights13.
data(flights, package = "nycflights13") This post will be tidyverse-driven.
library(tidyverse) library(skimr) library(mosaic) Let’s compute some summaries:
flights %>% select(arr_delay) %>% skim #> Skim summary statistics #> n obs: 336776 #> n variables: 1 #> #> Variable type: numeric #> variable missing complete n mean sd p0 p25 p50 p75 p100 #> arr_delay 9430 327346 336776 6.Hello World, this is Blogdown
/2017/11/21/hello-world-this-is-blogdown/
Tue, 21 Nov 2017 00:00:00 +0000/2017/11/21/hello-world-this-is-blogdown/My blog at https://sebastiansauer.github.io/posts/ has moved. It is now here! This is the new home of my blog. In (the unlikely) case you are asking yourself “Why did you move your blog?”, here is the answer.
I was using Jekyll at Github pages which is great as long as you do not have a lot of R in your posts. But I did have a lot of R in my posts.Great dataviz examples in rstats
/2017/11/20/great-dataviz-examples-in-rstats/
Mon, 20 Nov 2017 00:00:00 +0000/2017/11/20/great-dataviz-examples-in-rstats/Here come some stunning examples of data visualizations, all built with R. R code of each diagram is available at the source. Enjoy! #beautiful.
UPDATE: I’ve included links to the R source!
Plotting geo maps along with subplots in ggplot2 I like this one by Ilya Kashnitsky:
Similarly, by the same author:
Source
Great work, @ikashnitsky!
Circlize (Chord) diagrams Plotting association in a circular form yields aesthetic examples of diagrams, see the following examples How well does a sample estimate the population?
/2017/11/17/inference/
Fri, 17 Nov 2017 00:00:00 +0000/2017/11/17/inference/Data You work at the airport authority of NYC. Cool job.
library(nycflights13) data(flights) Load packages library(mosaic) Draw a sample The authority draws a sample of 100 flights and determines the “typical” delay.
set.seed(42) sample(flights$arr_delay, size = 100) -> flights_sample And let’s compute the typical summary statistics:
favstats(~flights_sample, na.rm = TRUE) #> min Q1 median Q3 max mean sd n missing #> -51 -18.75 -5 11.75 150 0.4387755 31.1604 98 2 Would $n=3$ be sufficient?Some thoughts on tidyeval and environments in R
/2017/11/16/tidyeval_basense/
Thu, 16 Nov 2017 00:00:00 +0000/2017/11/16/tidyeval_basense/The tidyeval framework is a rather new, and in parts complementary, framework for dealing with non-standard evaluation (NSE) in R. In short, NSE is about capturing some R code, withholding execution, maybe editing the code, and finally executing it later and/or somewhere else.
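The capture, edit, and execute steps can be illustrated with plain base R. Note that this sketch deliberately uses `quote()` and `eval()` rather than the tidyeval functions themselves; the variable names and values are made up for illustration.

```r
# Capture: quote() stores the expression instead of evaluating it
expr <- quote(x + y)

# Edit: a call behaves like a list, so the function being called
# can be swapped from `+` to `*`
expr[[1]] <- as.name("*")

# Execute later and somewhere else: evaluate in an environment of our choosing
env <- new.env()
assign("x", 3, envir = env)
assign("y", 4, envir = env)
result <- eval(expr, envir = env)

deparse(expr)
result
```

Neither `x` nor `y` needs to exist when the expression is captured; they are only looked up at evaluation time, in `env`.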
This post borrows heavily from Edwin Thon’s great post, and this post by the same author.
In addition, most of the knowledge is derived from Hadley Wickham’s book Advanced R.Yart - Yet Another Markdown Report Template
/2017/11/15/yart/
Wed, 15 Nov 2017 00:00:00 +0000/2017/11/15/yart/It would be useful to have an RMarkdown template for typical (academic) reports such as class assignments and bachelor/master theses. The LaTeX class “report” provides a suitable format for that. This package provides a simple wrapper around this class built on the standard pandoc template.
Thanks to Yart, ie, this package leans on earlier work by Aaron Wolen in his pandoc-letter repository, and extends it for use from R via the rmarkdown package.Package 'pradadata' on Github - feature social science data
/2017/11/07/pradadata/
Tue, 07 Nov 2017 00:00:00 +0000/2017/11/07/pradadata/Recently, I’ve put a package on Github featuring some social science data sets. Some data came from official sites; my contribution was to clear ‘em up, and render them comfortably accessible for automatic inquiry (nice header lines, no special encoding, flat csvs….). In other cases it’s unpublished data collected by friends, students of mine or myself.
Let’s check its contents using a function by Maiasaura from this SO post.
library(pradadata) lsp <- function (package, all.Populism in tweets of German politicians
/2017/11/01/afd01/
Wed, 01 Nov 2017 00:00:00 +0000/2017/11/01/afd01/The last months (years? since ever???) have seen a surge in populism and a rise in nationalism. Not only in Russia, the United States, Turkey, but also in some EU countries the ghost of nationalism-populism seems to be marching and gaining ground.
As to Germany, on September 24, 2017, the 19th German federal election took place. The newly founded alt-right AfD (Alternative für Deutschland) has made a leap and moved into the Bundestag.Data, machine-friendly, of the 2017 German federal elections
/2017/10/30/de-elec-data/
Mon, 30 Oct 2017 00:00:00 +0000/2017/10/30/de-elec-data/In September 2017, the 19th German Bundestag was elected. As of this writing, the parties are still busy sorting out whether they want to be part of the government, with whom, and maybe whether they even want to form a government at all. This post is about providing the data in machine-friendly form, and in the English language.
All data presented in this post regarding this (and previous) elections are published by the Bundeswahlleiter.Mapping foreigner ratio to AfD election results in the German Wahlkreise
/2017/10/22/afd-map-foreigners/
Sun, 22 Oct 2017 00:00:00 +0000/2017/10/22/afd-map-foreigners/In a previous post, we have shed some light on the idea that populism - as manifested in AfD election results - is associated with socioeconomic deprivation, be it subjective or objective. We found some supporting pattern in the data, although that hypothesis is far from being complete; ie., most of the variance remained unexplained.
In this post, we test the hypothesis that AfD election results are negatively associated with the proportion of foreign nationals in a Wahlkreis.Simple way to separate train and test sample in R
/2017/10/17/train-test/
Tue, 17 Oct 2017 00:00:00 +0000/2017/10/17/train-test/For statistical modeling, it is typical to separate a train sample from a test sample. The training sample is used to build (“train”) the model, whereas the test sample is used to gauge the predictive quality of the model.
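A minimal base-R sketch of such a split, assuming the built-in `mtcars` data as a stand-in and an 80/20 ratio (the tidyverse-oriented way described in the post itself may differ):

```r
set.seed(42)                                   # make the random split reproducible
data(mtcars)

n <- nrow(mtcars)
train_idx <- sample(n, size = floor(0.8 * n))  # row numbers of the train sample

train <- mtcars[train_idx, ]                   # used to build the model
test <- mtcars[-train_idx, ]                   # held out to gauge predictive quality

nrow(train)
nrow(test)
```

Negative indexing guarantees that train and test sample do not overlap and together cover all rows.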
There are many ways to split off a test sample from the train sample. One quite simple, tidyverse-oriented way, is the following.
First, load the tidyverse. Next, load some data.
library(tidyverse) data(Affairs, package = "AER") Then, create an index vector of the length of your train sample, say 80% of the total sample size.Two R plots side by side in .Rmd-Files
/2017/10/12/two-plots-rmd/
Thu, 12 Oct 2017 00:00:00 +0000/2017/10/12/two-plots-rmd/I kept wondering how to plot two R plots side by side (i.e., in one “row”) in a .Rmd chunk. Here’s a way, well actually a number of ways, some good, some … not.
library(tidyverse) library(gridExtra) library(grid) library(png) library(downloader) library(grDevices) data(mtcars) Plots from ggplot Say, you have two plots from ggplot2, and you would like to put them next to each other, side by side (not underneath each other):Two R plots side by side in a Rmd-File - UPDATE
/2017/10/12/two-r-plots-side-by-sind-in-a-rmd-file/
Thu, 12 Oct 2017 00:00:00 +0000/2017/10/12/two-r-plots-side-by-sind-in-a-rmd-file/UPDATE 2018-12-03
Thanks to a comment by Katharina Hees and Joyce, I now know how to plot two images side by side in an Rmd file.
I kept wondering how to plot two R plots side by side (i.e., in one “row”) in a .Rmd chunk. Here’s a way, well actually a number of ways, some good, some … not.
library(tidyverse) library(gridExtra) library(grid) library(png) library(downloader) library(grDevices) data(mtcars) Plots from ggplot Say, you have two plots from ggplot2, and you would like to put them next to each other, side by side (not underneath each other):Mapping unemployment ratio to AfD election results in German Wahlkreise
/2017/10/10/afd-map/
Tue, 10 Oct 2017 00:00:00 +0000/2017/10/10/afd-map/There is the idea that the alt-right German party AfD is followed by those who are deprived of chances, those who fear falling down the social ladder, and so on. Let’s test this hypothesis. No, I am not thinking of hypothesis testing, p-values, and stuff. Rather, let’s color a map of German election districts (Wahlkreise) according to whether the area is poor AND the AfD gained a lot of votes (and vice versa: the area is rich AND the AfD gained relatively few votes).Mapping unemployment rate to German district areas
/2017/10/09/unemp-map/
Mon, 09 Oct 2017 00:00:00 +0000/2017/10/09/unemp-map/A choropleth map is a geographic map where statistical information is mapped to certain areas. Let’s plot such a choropleth map in this post.
Packages library(sf) library(stringr) library(tidyverse) library(readxl) Geo data The best place to get German geo data is the “Bundesamt für Kartografie und Geodäsie (BKG)”. One may basically use the data for any purpose unless it is against the law. I downloaded the data on 2017-10-09. More specifically, we are looking at the “Verwaltungsgebiete” (vg), that is, the administrative areas of the country, ie.Drawing a country map
/2017/10/06/chloromap/
Fri, 06 Oct 2017 00:00:00 +0000/2017/10/06/chloromap/Let’s draw a map of Bavaria, a state of Germany, in this post.
Packages library(tidyverse) library(maptools) library(sf) library(RColorBrewer) library(ggmap) library(viridis) library(stringr) Data Let’s get the data first. Basically, we need two data files:
the shape file, i.e., the geographic details of state borders and points of interest; the semantic information on the points of interest, e.g., town names. Shape file The shape file can be downloaded from this source: http://www.Conferences 2018 - Business Psychology and Related Fields
/2017/09/27/kongresse_2018/
Wed, 27 Sep 2017 00:00:00 +0000/2017/09/27/kongresse_2018/Here you will find a selection of scientific conferences in 2018 from business psychology and adjacent fields.
National conferences (in DACH) 64. GfA-Frühjahrskongress: Arbeit(s).Wissen.Schaf(f)t – Grundlage für Management & Kompetenzentwicklung, 21-23 February in Frankfurt am Main
Organizer: FOM in Frankfurt
Submission deadline: 15 September 2017
[Jubiläumskongress 20 Jahre Wirtschaftspsychologie]() of the Gesellschaft für angewandte Wirtschaftspsychologie (GWPs), 8-10 March 2018 in Wernigerode
Organizer: Gesellschaft für angewandte Wirtschaftspsychologie (GWPs) Submission deadline: OPENSome intriguing psychology papers (open access)
/2017/09/26/psy-paper-suggestions/
Tue, 26 Sep 2017 00:00:00 +0000/2017/09/26/psy-paper-suggestions/This post presents a compilation of links to psychology papers; I have chosen papers I find intriguing, particularly for working in class. All papers are open access (or from open access repositories), which renders classroom work easier. The papers are collected from a broad range of topics but mostly with a focus on general interest. The perspective is an applied one; I have not tried to select based on methodological rigor.Crash course: data analysis with R
/2017/09/12/r-crashkurs/
Tue, 12 Sep 2017 00:00:00 +0000/2017/09/12/r-crashkurs/Welcome to the R crash course Not everybody loves data analysis and statistics… to the same degree! At least that is my experience from teaching 🔥. Crash courses on R are comparable to dance classes before the wedding: they have saved many a life, but they do not replace a semester at the Paris dance academy (note the comparison to teaching at university).
This crash course is meant for students or beginners in data analysis who, in a short time, want to make a desperate attempt … er … gain a basic overview of data analysis.Different ways to count NAs over multiple columns
/2017/09/08/sum-isna/
Fri, 08 Sep 2017 00:00:00 +0000/2017/09/08/sum-isna/There are a number of ways in R to count NAs (missing values). A common use case is to count the NAs over multiple columns, ie., a whole dataframe. That’s basically the question “how many NAs are there in each column of my dataframe”? This post demonstrates some ways to answer this question.
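A minimal sketch of the question on a made-up data frame; the data and the names `df`, `na_counts` and `na_counts_alt` are assumptions for illustration, not taken from the post:

```r
# Toy data frame with some missing values (made up for illustration)
df <- data.frame(
  a = c(1, NA, 3),
  b = c(NA, NA, "x"),
  c = c(TRUE, FALSE, TRUE)
)

# is.na() flags missing values, sum() counts the TRUEs,
# and sapply() repeats that for every column
na_counts <- sapply(df, function(col) sum(is.na(col)))
na_counts

# colSums() on the logical matrix returned by is.na() gives the same counts
na_counts_alt <- colSums(is.na(df))
na_counts_alt
```

Both variants return one named count per column, so the result can be inspected or sorted directly.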
Way 1: using sapply A typical way (or classical way) in R to achieve some iteration is using apply and friends.Different ways to present summaries in ggplot2
/2017/09/08/ggplot-summaries/
Fri, 08 Sep 2017 00:00:00 +0000/2017/09/08/ggplot-summaries/A convenient and widely applicable visualization for comparing groups with respect to a metric variable is the boxplot. However, comparing means is often accompanied by t-tests, ANOVAs, and friends. Such tests test the mean, not the median, and hence the boxplot does not present the tested statistic. It would be better to align test and diagram. How can that be achieved using ggplot2? This post demonstrates some possibilities.
First, let’s plot a boxplot.Replacing dplyr::do by purrr::map. Some considerations
/2017/09/05/purrr-map-no-do/
Tue, 05 Sep 2017 00:00:00 +0000/2017/09/05/purrr-map-no-do/Hadley Wickham has announced that dplyr::do will be deprecated in favor of purrr::map. In a recent post, I made use of do, so some commenters informed me about that. In this post, I will show use cases of map, specifically as a replacement for do. map is for lists; read more about lists here.
library(tidyverse) library(broom) We will use mtcars as a sample dataframe (boring, I know, but convenient).Comparing the pipe with base methods
/2017/08/31/some-pipes/
Thu, 31 Aug 2017 00:00:00 +0000/2017/08/31/some-pipes/Some say, the pipe (#tidyverse) makes analyses in R easier. I agree. This post demonstrates some examples.
Let’s take the mtcars dataset as an example.
data(mtcars) ?mtcars Say, we would like to compute the correlation between gasoline consumption (mpg) and horsepower (hp).
Base approach 1 cor(mtcars[, c("mpg", "hp")]) ## mpg hp ## mpg 1.0000000 -0.7761684 ## hp -0.7761684 1.0000000 We use the [-operator (function) to select the columns; note that df[, c(col1, col2)] sees dataframes as matrices, and spits out a dataframe, not a vector:Shading normal curve made easy
/2017/08/29/simple-shading/
Tue, 29 Aug 2017 00:00:00 +0000/2017/08/29/simple-shading/Shading values/areas under the normal curve is a quite frequent task in, e.g., educational contexts. Thanks to Hadley in this post, I found this easy solution.
library(ggplot2)
ggplot(NULL, aes(c(-3,3))) + geom_area(stat = "function", fun = dnorm, fill = "#00998a", xlim = c(-3, 0)) + geom_area(stat = "function", fun = dnorm, fill = "grey80", xlim = c(0, 3)) Simple, right?
Some minor beautification:
ggplot(NULL, aes(c(-3,3))) + geom_area(stat = "function", fun = dnorm, fill = "#00998a", xlim = c(-3, 1)) + geom_area(stat = "function", fun = dnorm, fill = "grey80", xlim = c(1, 3)) + labs(x = "z", y = "") + scale_y_continuous(breaks = NULL) + scale_x_continuous(breaks = 1) And some other quantiles:Programming with dplyr: Part 03, working with strings
/2017/08/09/dplyr_strings/
Wed, 09 Aug 2017 00:00:00 +0000/2017/08/09/dplyr_strings/More on programming with dplyr: converting quosures to strings In this post, we have programmed a simple function using dplyr’s programming capabilities based on tidyeval; for more intro to programming with dplyr, see here.
In this post, we’ll go one step further and program a function where a quosure will be turned into a string. Why this? Because quite a number of functions out there accept strings as input parameters.Precipitation - It never rains in Southern Nuremberg (?). Working with dates/times.
/2017/08/01/weather/
Tue, 01 Aug 2017 00:00:00 +0000/2017/08/01/weather/In this post, we will explore some date and time parsing. As an example, we will work with weather data provided by City of Nuremberg, Environmental and Meteorological Data.
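As a small warm-up, here is a sketch of date-time parsing in base R (not lubridate, which the post itself uses). The timestamps and their day.month.year format are made up for illustration and are not the format of the Nuremberg export:

```r
# Hypothetical timestamps in a day.month.year hour:minute format
raw <- c("01.08.2017 13:00", "01.08.2017 14:00", "02.08.2017 09:00")

# Parse with an explicit format and time zone to avoid locale surprises
stamps <- as.POSIXct(raw, format = "%d.%m.%Y %H:%M", tz = "UTC")

# Extract components, e.g. for grouping by day or hour later on
day_of_month <- as.integer(format(stamps, "%d"))
hour_of_day  <- as.integer(format(stamps, "%H"))

# Differences between time stamps are computed with difftime()
elapsed_h <- as.numeric(difftime(stamps[3], stamps[1], units = "hours"))
```

Stating the format string explicitly is the main design choice here: anything that does not match becomes `NA` instead of being silently misread.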
We will need these packages:
library(tidyverse) # data reading and wrangling library(lubridate) # working with dates/times First, let’s import some precipitation data:
file_name <- "~/Downloads/export-sun-nuremberg--flugfeld--airport--precipitation-data--1-hour--individuell.csv" rain <- read_csv2(file_name, skip = 13, col_names = FALSE) ## Warning in rbind(names(probs), probs_f): number of columns of result is not ## a multiple of vector length (arg 1) ## Warning: 300 parsing failures.Programming with dplyr: Part 02, writing a function
/2017/07/06/prop_fav/
Thu, 06 Jul 2017 00:00:00 +0000/2017/07/06/prop_fav/Recently, with dplyr >= 0.6.0, a new way of dealing with NSE was introduced, called tidyeval. As with every topic that begs our attention, the question “why bother” is in order. One answer is “you’ll need this stuff if you want to lock dplyr verbs inside a function”. Once you like dplyr and friends, a natural second step is to use the ideas not only for interactive use, but for more “programming” type, ie.Effect sizes for the Mann-Whitney U Test: an intuition
/2017/07/04/effsize_utest/
Tue, 04 Jul 2017 00:00:00 +0000/2017/07/04/effsize_utest/The Mann-Whitney U-Test is a test with wide applicability, wider than the t-Test. Why is that? First, the U-Test is applicable to ordinal data, and it can be argued that confining the metric level of a psychological variable to the ordinal level is a reasonable bet. Second, it is robust, more robust than the t-test, because it only considers ranks, not raw values. In addition, some say that the efficiency of the U-Test is very close to the t-Test (.A second look to grouping with dplyr
/2017/06/28/second_look_group_by/
Wed, 28 Jun 2017 00:00:00 +0000/2017/06/28/second_look_group_by/The one basic idea of dplyr is that each function should focus on one job. That’s why there are no functions such as compute_sumamries_by_group_with_robust_variants(df). Rather, summarising and grouping are seen as different jobs which should be accomplished by different functions. And, in turn, that’s why group_by, the grouping function of dplyr, is of considerable importance: this function should do the grouping for each operation whatsoever.
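A minimal sketch of that division of labor, assuming the dplyr package is installed; the choice of mtcars and the column names `mean_mpg` and `n` are illustrative assumptions:

```r
library(dplyr)

# group_by() only does the grouping; summarise() only does the summarising
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), n = n())

by_cyl
```

Swapping `summarise()` for another verb such as `mutate()` or `filter()` keeps the same grouping, which is the point of treating grouping as a separate job.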
Let’s load all tidyverse libraries in one go:Programming with dplyr: Part 01, introduction
/2017/06/28/prog_dplyr_01/
Wed, 28 Jun 2017 00:00:00 +0000/2017/06/28/prog_dplyr_01/As for [others], Hadley Wickham’s dplyr, and more generally the tidyverse approach, has considerably changed the way I do data analyses. Most notably, the pipe (coming from magrittr by Stefan Milton Bache, see here) has crept into nearly every analysis I do.
That is, every analysis except for functions and other non-interactive stuff. In those programming contexts, the dplyr way does not work, due to its non-standard evaluation, or NSE for short.Preparation of extraversion survey data
/2017/06/24/extra_prep/
Sat, 24 Jun 2017 00:00:00 +0000/2017/06/24/extra_prep/For teaching purposes and out of curiosity towards some psychometric questions, I have run a survey on extraversion here. The dataset has been published at OSF (DOI 10.17605/OSF.IO/4KGZH). The survey is based on a Google form, which in turn saves the data in a Google spreadsheet. Before the data can be analyzed, some preparation and makeup is in order. This post shows some general makeup, typical for survey data.
Download the data and load packages Download the data from source (Google spreadsheets); the package gsheet provides an easy interface for that purpose.Print csv-file tables as plots
/2017/06/22/tab2plot/
Thu, 22 Jun 2017 00:00:00 +0000/2017/06/22/tab2plot/tl;dr Use this convenience function to print a dataframe as a png-plot: tab2grob().
Source the function here: https://sebastiansauer.github.io/Rcode/tab2grob.R
Easiest way in R:
source("https://sebastiansauer.github.io/Rcode/tab2grob.R") Printing csv-dataframes as ggplot plots Recently, I wanted to print dataframes not as normal tables, but as a png-plot. See:
Why? Well, basically as a convenience function for colleagues who are not into using Markdown & friends. As I am preparing some stats stuff (see my new open access course material here) using RMarkdown, I wanted to prepare the materials ready for using in Powerpoint.Review of "The 7 Deadly Sins of Psychology" by Chris Chambers
/2017/06/22/seven-sins/
Thu, 22 Jun 2017 00:00:00 +0000/2017/06/22/seven-sins/tl;dr: great book. Read.
The “Seven Sins” is concerned with the validity of psychological research. Can we at all, or to what degree, be certain about the conclusions reached in psychological research? More recently, replication efforts have cast doubt on our confidence in psychological research (1). In a similar vein, a recent paper states that in many research areas, researchers mostly report “successes” in the sense that they report that their studies confirm their hypotheses - with Psychology leading in the proportion of supported hypotheses (2).Identifying the package of a function
/2017/06/12/finds_funs/
Mon, 12 Jun 2017 00:00:00 +0000/2017/06/12/finds_funs/tl;dr Suppose you want to know which package(s) a given R function belongs to, say filter. Here comes find_funs to help you:
find_funs("filter") ## # A tibble: 4 x 3 ## package_name builtin_pckage loaded ## <chr> <lgl> <lgl> ## 1 base TRUE TRUE ## 2 dplyr FALSE TRUE ## 3 plotly FALSE FALSE ## 4 stats TRUE TRUE This function will search all installed packages for this function name.Sorting the x-axis in bargraphs using ggplot2
/2017/06/05/ordering-bars/
Mon, 05 Jun 2017 00:00:00 +0000/2017/06/05/ordering-bars/Some time ago, I posted about how to plot frequencies using ggplot2. One point that remained untouched was how to sort the order of the bars. Let’s look at that issue here.
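The core idea can be sketched in base R before any plotting: the order of the bars follows the order of the factor levels, so sorting the bars means re-leveling the factor. The `days` vector below is made up for illustration and is not the tips data:

```r
# Made-up categorical data (not the tips dataset)
days <- c("Sat", "Sun", "Sat", "Thur", "Sat", "Sun", "Fri", "Sat", "Sun", "Thur")

# Frequencies per category; table() orders them alphabetically
day_counts <- table(days)

# Re-level the factor by decreasing frequency
day_sorted <- factor(days, levels = names(sort(day_counts, decreasing = TRUE)))
levels(day_sorted)

# Plotting functions follow the level order, e.g. base R's barplot()
barplot(table(day_sorted))
```

ggplot2 honors factor level order in the same way, so a re-leveled factor on the x-axis yields sorted bars.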
First, let’s load some data.
data(tips, package = "reshape2") And the usual culprits.
library(tidyverse) library(scales) # for percentage scales First, let’s plot a standard plot, with bars *un*sorted.
tips %>% count(day) %>% mutate(perc = n / nrow(tips)) -> tips2 ggplot(tips2, aes(x = day, y = perc)) + geom_bar(stat = "identity") Hang on, what could ‘unsorted’ possibly mean?mean and sd of z-values
/2017/05/26/z-values/
Fri, 26 May 2017 00:00:00 +0000/2017/05/26/z-values/Edit: This post was updated, including two errors fixed - thanks to (private) comments from Norman Markgraf.
z-values, aka values coming from a z-transformation, are a frequent creature in statistics land. Among their properties are the following:
mean is zero; variance is one (and hence sd is one). But why is that? How come these two properties hold? The goal of this post is to shed light on these two properties of z-values.Simple way of plotting normal/logistic/etc. curve
/2017/05/24/plotting_s-curve/
Wed, 24 May 2017 00:00:00 +0000/2017/05/24/plotting_s-curve/Plotting a function is often helpful to better understand what’s going on. Plotting curves in R base is simple by virtue of function curve. But how to draw curves using ggplot2?
That’s a little bit more complicated but can still be accomplished in 1-2 lines.
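For comparison, the base R route mentioned above can be sketched as follows; the axis ranges and labels are arbitrary choices for illustration:

```r
# Base R: curve() evaluates a function on a grid of x values and draws it
res <- curve(dnorm, from = -3, to = 3, n = 101, ylab = "density")

# The same call works for an S-shaped (logistic) curve
curve(plogis, from = -6, to = 6, ylab = "probability")

# curve() invisibly returns the evaluated points as a list with x and y
str(res)
```

The returned `x`/`y` list is handy when the same points should be reused in another plotting system.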
library(ggplot2) Normal curve p <- ggplot(data = data.frame(x = c(-3, 3)), aes(x)) p + stat_function(fun = dnorm, n = 101) stat_function is some kind of parallel function to curve.Squares maximize area - a visualization
/2017/05/19/maximize_area/
Fri, 19 May 2017 00:00:00 +0000/2017/05/19/maximize_area/An old story is that of the farmer with a fence of some given length, say 20m. Now this farmer wants to put up his fence so that he claims the largest piece of land possible. What width (w) and height (h) should he pick?
Instead of a formal proof, let’s start with a visualization.
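Before plotting, a quick numeric check of the fence problem can be sketched in base R. The grid of candidate widths is an arbitrary choice for illustration:

```r
perimeter <- 20                 # fence length in meters, as in the story

w <- seq(0.5, 9.5, by = 0.5)    # candidate widths
h <- perimeter / 2 - w          # height implied by 2 * (w + h) = perimeter
area <- w * h

# Pick the candidate with the largest area
best <- which.max(area)
c(width = w[best], height = h[best], area = area[best])
```

On this grid the largest area belongs to the candidate with equal width and height, i.e. the square.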
First, we need some packages.
library(tidyverse) library(gganimate) library(RColorBrewer) library(scales) library(knitr) Now, let’s make up several ways to split up a rectangular piece of land.A predictor's unique contribution - (visual) demonstration
/2017/05/17/storks/
Wed, 17 May 2017 00:00:00 +0000/2017/05/17/storks/A well-known property of regression models is that they capture the unique contribution of a predictor. By “unique” we mean the effect of the predictor (on the criterion) if the other predictor(s) is/are held constant. A typical classroom example goes along the following lines.
All about storks There’s a correlation between babies and storks. Counties with lots of storks enjoy large numbers of babies and v.v.
However, I have children, I know the storks are not overly involved in that business, so says the teacher (polite laughter in the audience).Crash course: data analysis with R
/2017/05/16/crashkurs/
Tue, 16 May 2017 00:00:00 +0000/2017/05/16/crashkurs/Not everybody loves data analysis and statistics… to the same degree. At least that is my experience from teaching 🔥. Crash courses on R are comparable to crash courses in French - one can do them, but the maxim “If everything else fails” should apply.
This crash course is meant for students or beginners in data analysis who, in a short time, want to make a desperate attempt … er … gain a basic overview of data analysis.Introductory books for data analysis
/2017/05/15/books/
Mon, 15 May 2017 00:00:00 +0000/2017/05/15/books/One way to dig into some topic such as data analysis is just doing it, by trial and error. Another way is reading blogs; a fruitful avenue in my experience. However, the classical way of reading some good book is anything but outdated.
Here are some recommendations of books I found helpful as a starter (books in English and German).
R for Data Science Grolemund, G., & Wickham, H. (2016). R for Data Science.Plotting true random numbers
/2017/05/12/true_random/
Fri, 12 May 2017 00:00:00 +0000/2017/05/12/true_random/knitr::opts_chunk$set(fig.align = "center", out.width = "70%", fig.asp = .61) Every now and then, random numbers come in handy to demonstrate some statistical behavior. Of course, well-known approaches are rnorm and friends. These functions are what is called pseudo random number generators, because they are not random at all, strictly speaking, but determined by some algorithm. An algorithm is a sort of creature that is 100% predictable once you know the input (and the details of the algorithm).Deriving the logits for logistic regression
/2017/05/06/deriving-the-logits-for-logistic-regression/
Sat, 06 May 2017 00:00:00 +0000/2017/05/06/deriving-the-logits-for-logistic-regression/Logistic regression is an incredibly useful tool, partly because binary outcomes are so frequent in life (“she loves me - she doesn’t love me”), and partly because we can make use of well-known “normal” regression instruments.
But the formula of logistic regression appears opaque to many (beginners or those with not so much math background).
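For reference, the formula in question in common textbook notation, for a single predictor $x$; the symbols $p$, $b_0$ and $b_1$ are notational assumptions here and may differ from those used in the post:

$$p = \frac{e^{b_0 + b_1 x}}{1 + e^{b_0 + b_1 x}} \quad \Longleftrightarrow \quad \ln\left(\frac{p}{1-p}\right) = b_0 + b_1 x$$

The left-hand form gives the probability $p$ of the outcome; the right-hand form shows that the logit, i.e. the log of the odds $p/(1-p)$, is linear in $x$.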
Let’s try to shed some light on the formula by discussing some accessible explanation on how to derive the formula.Variance explained vs. variance blurred
/2017/05/05/explained_variance/
Fri, 05 May 2017 00:00:00 +0000/2017/05/05/explained_variance/Frequently, someone says that some indicator variable X “explains” some proportion of the variance of some target variable, Y. What does this actually mean? By “mean” I am trying to find some intuition that “clicks” rather than citing the (well-known) formulas.
To start with, let’s load some packages and make up some random data.
library(tidyverse) n_rows <- 100 set.seed(271828) df <- tibble( exp_clean = rnorm(n = n_rows, mean = 2, sd = 1), cntrl_clean = rnorm(n = n_rows, mean = 0, sd = 1), exp_noisy = exp_clean + rnorm(n = n_rows, mean = 0, sd = 3), cntrl_noisy = cntrl_clean + rnorm(n = n_rows, mean = 0, sd = 3), ID = 1:n_rows) Here, we drew 100 cases from the population of the “experimental group” (mu = 2) and 100 cases from the control group (mu = 0).This blog now has a DOI
/2017/05/04/doi_added/
Thu, 04 May 2017 00:00:00 +0000/2017/05/04/doi_added/A DOI is a useful feature of any electronic document. What the ID number in your passport is to you, the DOI is to a document. It simply helps to make sure you address the “object” you want to address.
Similarly, there may exist several “Joachim Zwiwwelkoecks” in this world (well, it may or may not be the case). However, if each of these persons gets his (or her) unique ID (could be a simple number), then we would in principle always be certain that we address the right person.Introduction to data analysis with the R package 'dplyr' - R User Group Nürnberg
/2017/04/27/datenanalyse_mit_dplyr/
Thu, 27 Apr 2017 00:00:00 +0000/2017/04/27/datenanalyse_mit_dplyr/Data judo with dplyr Introduction Within the R landscape, the package dplyr has quickly become one of the most widespread packages; it provides an innovative concept of data analysis. dplyr is characterized by two ideas. The first idea is that only tables (“dataframes” or “tibbles”) are processed, no other data structures. These tables are passed on from function to function. The focus on tables simplifies the analysis, since columns need not be handled individually or by means of loops.Tools for Academic Writing - Comparison
/2017/04/26/writing_tools/
Wed, 26 Apr 2017 00:00:00 +0000/2017/04/26/writing_tools/Many tools exist for academic writing including the notorious W.O.R.D.; but many more are out there. Let’s have a look at those tools, and discuss what’s important (what we expect the tool to deliver, eg., beautiful typesetting).
Typical tools for academic writing MS Word: A “classical” choice, relied upon by myriads of white collar workers… I myself have used it extensively for academic writing; the main advantage being its simplicity, that is, well, everybody knows it, and knows more or less how to handle it.Covariance as correlation
/2017/04/25/cor_as_cov/
Tue, 25 Apr 2017 00:00:00 +0000/2017/04/25/cor_as_cov/Correlation is one of the most widely used and well-known measures of the association (linear association, that is) of two variables.
Perhaps less well-known is that the correlation is in principle analogous to the covariance.
To see this, consider the formula of the covariance of two empirical datasets, $X$ and $Y$:
$$COV(X,Y) = \frac{1}{n} \cdot \big( \sum (X_i -\bar{X}) \cdot (Y_i - \bar{Y}) \big) $$
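The formula can be checked numerically with a few lines of base R. The vectors `x` and `y` are made-up toy data; note that R's built-in `cov()` divides by n - 1 rather than n, so it has to be rescaled for the comparison:

```r
# Made-up toy data
x <- c(2, 4, 6, 8)
y <- c(1, 3, 2, 6)
n <- length(x)

# Covariance as in the formula: average product of the deviations from the means
cov_n <- sum((x - mean(x)) * (y - mean(y))) / n

# Built-in cov() uses the divisor n - 1, so rescale it to the divisor n
cov_builtin <- cov(x, y) * (n - 1) / n

# Dividing the covariance by both standard deviations (divisor n) gives the correlation
sd_x <- sqrt(mean((x - mean(x))^2))
sd_y <- sqrt(mean((y - mean(y))^2))
r_manual <- cov_n / (sd_x * sd_y)

c(cov_n = cov_n, cov_builtin = cov_builtin, r_manual = r_manual, r_builtin = cor(x, y))
```

The divisor cancels in the last step, which is why `r_manual` agrees with `cor()` regardless of whether n or n - 1 is used consistently.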
In other words, the covariance of $X$ and $Y$, $COV(X,Y)$, is the average of the products of the deviations of each value from its respective mean.Plotting skewed distributions
/2017/04/19/skewed-distribs/
Wed, 19 Apr 2017 00:00:00 +0000/2017/04/19/skewed-distribs/Let’s plot some skewed stuff, aehm, distributions!
Actually, the point I - initially - wanted to make is that for skewed distributions, don’t use means. Or at least, be very aware that (arithmetic) means can be grossly misleading. But for today, let’s focus on drawing skewed distributions.
Some packages:
library(tidyverse) library(fGarch) # for snorm Some skewed distributions include:
“polluted” normal distributions, i.e., mixtures of two normals; exponential distributions; gamma distributions; beta distributions. One way to visualize them is to draw their curve, ie.Error bars for interaction effects with nominal variables
/2017/04/18/moderator-errorbars/
Tue, 18 Apr 2017 00:00:00 +0000/2017/04/18/moderator-errorbars/Moderator effects (i.e., interaction or synergy effects) are a topic of frequent interest in many branches of science. A lot of ink has been spilled over this topic (I did so too, e.g., here).
However, in that post I did not show how to visualize error in the case of a nominal (categorical) independent variable and a categorical moderator.
Luckily, visualization of this case is quite straightforward with ggplot2.
First, some data and packages to be loaded:The effect of sample on p-values. A simulation.
/2017/04/13/pvalue_sample_size/
Thu, 13 Apr 2017 00:00:00 +0000/2017/04/13/pvalue_sample_size/It is well-known that the notorious p-value is sensitive to sample size: The larger the sample, the more likely the p-value is to fall below the magic number of .05.
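A small simulation sketch of this claim in base R. The true mean difference of 0.1 SD, the sample sizes, and the seed are arbitrary assumptions for illustration and are not taken from the post:

```r
set.seed(42)

sample_sizes <- c(10, 100, 1000, 10000, 100000)  # cases per group
true_diff <- 0.1                                 # small, fixed effect (in SD units)

# For each sample size, draw two groups and keep the p-value of a t-test
p_values <- sapply(sample_sizes, function(n) {
  g1 <- rnorm(n, mean = 0, sd = 1)
  g2 <- rnorm(n, mean = true_diff, sd = 1)
  t.test(g1, g2)$p.value
})

data.frame(n = sample_sizes, p = p_values)
```

The effect is held constant throughout; only the sample size changes, so any systematic drift in the p-values is attributable to n.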
Of course, the p-value is also a function of the effect size, e.g., the distance between two means and the respective variances. But still, the p-value tends to become significant in the face of large samples, and non-significant otherwise.Three ways to dichotomize a variable
/2017/04/11/three_ways_recoding_cutting/
Tue, 11 Apr 2017 00:00:00 +0000/2017/04/11/three_ways_recoding_cutting/Dichotomizing is also called dummy coding. It means: Take a variable with multiple different values (>2), and transform it so that the output variable has 2 different values.
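Two common base R routes to such a two-valued variable can be sketched as follows. They are not necessarily the three ways shown in the post, and the median split of `hp` in mtcars is an arbitrary example:

```r
hp <- mtcars$hp
cutoff <- median(hp)

# Route 1: ifelse() recodes each value according to a condition
hp_ifelse <- ifelse(hp > cutoff, "high", "low")

# Route 2: cut() slices the number line into intervals;
# with the default right = TRUE the intervals are (-Inf, cutoff] and (cutoff, Inf)
hp_cut <- cut(hp, breaks = c(-Inf, cutoff, Inf), labels = c("low", "high"))

table(hp_ifelse, hp_cut)
```

`ifelse()` returns a character vector while `cut()` returns a factor, which matters if the order of the categories should be preserved in later tables or plots.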
Note that this “thing” can be understood as consisting of two different aspects: Recoding and cutting. Recoding means that value “a” becomes value “b” etc. Cutting means that a “rope” of numbers is cut into several shorter “ropes” (that’s why it is called cutting).Rowwise operations in dplyr
/2017/03/27/rowwise_dplyr/
Mon, 27 Mar 2017 00:00:00 +0000/2017/03/27/rowwise_dplyr/R thinks columnwise, not rowwise, at least in standard dataframe operations. A typical rowwise operation is to compute row means or row sums, for example to compute person sum scores for psychometric analyses.
One workaround, typical for R, is to use functions such as apply (and friends).
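A minimal sketch of that base R workaround; the choice of the columns cyl and mpg is arbitrary and only for illustration:

```r
# apply() with MARGIN = 1 runs the function once per row
row_mean_apply <- apply(mtcars[, c("cyl", "mpg")], 1, mean)

# rowMeans() is the specialized (and faster) shortcut for this particular case
row_mean_fast <- rowMeans(mtcars[, c("cyl", "mpg")])

head(row_mean_apply, 3)
```

`apply()` accepts any function, whereas `rowMeans()` and `rowSums()` cover only the two most common cases.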
However, dplyr offers a quite nice alternative:
library(dplyr) mtcars %>% rowwise() %>% mutate(mymean=mean(c(cyl,mpg))) %>% select(cyl, mpg, mymean) ## Source: local data frame [32 x 3] ## Groups: <by row> ## ## # A tibble: 32 × 3 ## cyl mpg mymean ## <dbl> <dbl> <dbl> ## 1 6 21.Convert list to dataframe
/2017/03/08/convert_list_to_dataframe/
Wed, 08 Mar 2017 00:00:00 +0000/2017/03/08/convert_list_to_dataframe/A handy function to iterate stuff is the function purrr::map. It takes a function and applies it to all elements of a given vector. This vector can be a data frame - which is a list, technically - or some other sort of list (normal atomic vectors are fine, too).
However, purrr::map is designed to return lists (not dataframes). For example, if you pass mosaic::favstats to map, you will get some favorite statistics for some variable:How to avoid Github/merge conflicts with Rmd-files
/2017/03/06/avoid_merge_conflicts/
Mon, 06 Mar 2017 00:00:00 +0000/2017/03/06/avoid_merge_conflicts/One nice feature of .Rmd files is that they can (quite) easily be combined with version control systems such as git and GitHub. However, in my experience, merge conflicts are not so uncommon. That raises the question of how to avoid merge conflicts when syncing with GitHub.
Here’s a quick overview of what to do to avoid that hassle:
Sync often. Hard wrap the lines to approx. 80 characters. Pull before you start to change the source files.Favorite R commands
/2017/03/05/lieblingsbefehle/
Sun, 05 Mar 2017 00:00:00 +0000/2017/03/05/lieblingsbefehle/Here is a list of some of my “favorite R functions”; they play an important role (for me) in introductory statistics courses. The list may change :-)
When I speak of a “table”, I mean both dataframes and tibbles.
Assignment - <- With the assignment operator <- one can assign a value to objects:
x <- 1 mtcars2 <- mtcars Selecting columns as a vector - $ With the operator $ one can select a column of a table.AfD mining - basic text mining of the AfD party platform
/2017/02/21/textmining_afd_01/
Tue, 21 Feb 2017 00:00:00 +0000/2017/02/21/textmining_afd_01/R packages needed for this post:
library(stringr) # text processing library(tidytext) # text mining library(pdftools) # read PDFs library(downloader) # download data # library(knitr) # HTML tables library(htmlTable) # HTML tables library(lsa) # stop words library(SnowballC) # stem words library(wordcloud) # show word cloud library(gridExtra) # combined plots library(dplyr) # data judo library(ggplot2) # visualization An introductory tutorial on text mining; the party platform of the party “Alternative für Deutschland” (AfD) is analyzed. Given the growing support for right-wing populists and the great danger that emanates from this body of thought, a multifaceted analysis of the phenomenon of “right-wing populism” seems necessary to me.Checklist for Data Cleansing
/2017/02/13/data_cleansing/
Mon, 13 Feb 2017 00:00:00 +0000/2017/02/13/data_cleansing/What this post is about: Data cleansing in practice with R Data analysis, in practice, typically consists of several steps which can be subsumed under “prepare data” and “model data” (not considering communication here):
(Inspired by this)
Often, the first major part – “prepare” – is the most time consuming. This can be lamented since many analysts prefer the cool modeling aspects (since I want to show my math!Creating a sentiment dictionary
/2017/02/04/sentiment_dictionary/
Sat, 04 Feb 2017 00:00:00 +0000/2017/02/04/sentiment_dictionary/In text analysis (text mining), sentiment analysis is a typical activity. Of course, the quality of the sentiment analysis stands and falls with the quality of the dictionary used (which does not mean that one cannot founder on other cliffs as well).
The purpose of this post is to read in a sentiment lexicon in German.
For this, the sentiment lexicon from this source is used (CC-BY-NC-SA 3.0). Background can be found in this paper. The data can be downloaded from there.Dataset 'performance in stats test'
/2017/01/27/data_test_inference/
Fri, 27 Jan 2017 00:00:00 +0000/2017/01/27/data_test_inference/This post shows data cleaning and preparation for a data set on a statistics test (NHST inference). Data is published under a CC-licence, see here.
Data was collected from 2015 to 2017 in statistics courses at the FOM university in different places in Germany. Several colleagues helped to collect the data. Thanks a lot! Now let’s enjoy the outcome (and make it freely available to all).
Raw N is 743. The test consists of 40 items which are framed as propositions; students are asked to respond with either “true” or “false” to each item.Convert logit to probability
/2017/01/24/convert_logit2prob/
Tue, 24 Jan 2017 00:00:00 +0000/2017/01/24/convert_logit2prob/Logistic regression may give a headache initially. While the structure and idea are the same as in “normal” regression, the interpretation of the b’s (i.e., the regression coefficients) can be more challenging.
This post provides a convenience function for converting the output of the glm function to a probability. Or more generally, to convert logits (that’s what is spit out by glm) to a probability.
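A minimal sketch of what such a conversion can look like. The function name `logit2prob` and the example model `am ~ hp` on mtcars are assumptions for illustration and not necessarily the post's own code:

```r
# Hypothetical helper: logit -> odds -> probability
logit2prob <- function(logit) {
  odds <- exp(logit)     # undo the log
  odds / (1 + odds)      # odds to probability
}

# Example model: probability of a manual transmission as a function of horsepower
fit <- glm(am ~ hp, data = mtcars, family = binomial)

# predict(type = "link") returns logits; convert them to probabilities
probs <- logit2prob(predict(fit, type = "link"))

# Base R's plogis() does the same conversion
head(cbind(by_hand = probs, plogis = plogis(predict(fit, type = "link"))), 3)
```

A logit of 0 corresponds to odds of 1 and hence a probability of 0.5, which is a quick sanity check for any such helper.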
Note1: The objective of this post is to explain the mechanics of logits.Gentle intro to 'R-squared equals squared r'
/2017/01/20/rsquared/
Fri, 20 Jan 2017 00:00:00 +0000/2017/01/20/rsquared/It comes as no surprise that $$R^2$$ (“coefficient of determination”) equals $$r^2$$ in simple regression (predictor X, criterion Y), where $$r(X,Y)$$ is Pearson’s correlation coefficient. $$R^2$$ equals the fraction of explained variance in a simple regression. However, the statistical (mathematical) background is often less clear or buried in less intuitive formulas.
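The equality is easy to verify numerically. A sketch in base R, with `mpg ~ hp` from mtcars as an arbitrary example model:

```r
# Simple regression with one predictor
fit <- lm(mpg ~ hp, data = mtcars)

# R squared as reported by the model summary
r_squared <- summary(fit)$r.squared

# Pearson correlation between predictor and criterion
r <- cor(mtcars$mpg, mtcars$hp)

# Correlation between observed and predicted values
r_fitted <- cor(mtcars$mpg, fitted(fit))

c(r_squared = r_squared, r_sq_xy = r^2, r_sq_fitted = r_fitted^2)
```

All three numbers coincide in simple regression; with several predictors only the version based on the fitted values still equals $$R^2$$.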
The goal of this post is to offer a gentle explanation why
$$R^2 = r^2$$,
where $$r$$ is $$r(Y,\hat{Y})$$ and $$\hat{Y}$$ are the predicted values.The two ggplot2-ways of plottings bars
/2017/01/20/two_ways_barplots_with_ggplot2/
Fri, 20 Jan 2017 00:00:00 +0000/2017/01/20/two_ways_barplots_with_ggplot2/Bar plots, while not appropriate for means, are helpful for conveying impressions of frequencies, particularly relative frequencies, i.e., proportions.
Intuition: Bar plots and histograms alike can be thought of as piles of Lego pieces, stacked on top of each other, where each Lego piece represents (is) one observation.
Tables of frequencies are often not insightful to the eye. Bar plots are often much more accessible and present the story more clearly.Case study (YACSDA) in practical data analysis with dplyr
/2017/01/18/fallstudie_flights/
Wed, 18 Jan 2017 00:00:00 +0000/2017/01/18/fallstudie_flights/Case study in data analysis using R package dplyr in German language.
Practical data analysis with dplyr The R package dplyr by Hadley Wickham is a star guest on the R stage; it is frequently discussed in the relevant forums. With dplyr one can “chop up” data - reshape and prepare it (“to wrangle” in English); “practical data analysis” is perhaps a good label. Many introductions can be found online, e.g. here or here.
This text is not meant as an introduction or explanation, but as an exercise for trying out (newly acquired) skills in practical data analysis within a case study.I am unavailable for review
/2017/01/17/unavailable_for_review/
Tue, 17 Jan 2017 00:00:00 +0000/2017/01/17/unavailable_for_review/Dear editorial team,
Thanks for considering me for review. After some thought-meandering I came to the conclusion that traditional publishers - such as the present publisher of this journal - support a business model that I deem unfair and inappropriate for regular science and for the interests of science and scientists alike. That is, the fees are much too high, thereby sucking resources out of the science system and out of society which could otherwise be put to better use.Conferences 2017 - Business Psychology and Related Fields
/2017/01/17/kongresstermine_2017/
Tue, 17 Jan 2017 00:00:00 +0000/2017/01/17/kongresstermine_2017/Here you will find a selection of scientific conferences in 2017 from business psychology and adjacent fields.
National conferences 2017 (in Germany) GWPS, 2-4 March in Darmstadt
Symposium of the Gesellschaft für angewandte Wirtschaftspsychologie (GWPs)
Submission deadline: 30 Nov 2016
TeaP, 26-29 March in Dresden
Conference of Experimental Psychologists Submission deadline: 15 Nov 2016
DiffPsy, 4-6 September in Munich
Workshop of the section Differential Psychology, Personality Psychology and Psychological AssessmentVisualizing Interaction Effects with ggplot2
/2017/01/17/vis_interaction_effects/
Tue, 17 Jan 2017 00:00:00 +0000/2017/01/17/vis_interaction_effects/Moderator effects, or interaction effects, are a frequent topic of scientific endeavor. Put bluntly, such effects answer the question of whether the input variable X (predictor or independent variable, IV) has an effect on the output variable Y (dependent variable, DV) with "it depends". More precisely, it depends on a second variable, M (moderator).
More formally, a moderation effect can be summarized as follows:
If the effect of X on Y depends on M, a moderator effect takes place.
How to import a strange CSV
/2017/01/12/strange_csvs/
Thu, 12 Jan 2017 00:00:00 +0000/2017/01/12/strange_csvs/A typical task in data analysis is to import CSV-formatted data. CSV is nothing more than a text file with data in rectangular form; rows stand for observations (e.g., persons), and columns represent variables (such as age). Columns are separated by a "separator", often a comma. Hence the name "CSV" - "comma separated values". Note however that the separator can in principle be anything you like (e.g., ";" or tabulator or " ").
R does not start
/2017/01/11/r_startet_nicht/
Wed, 11 Jan 2017 00:00:00 +0000/2017/01/11/r_startet_nicht/Help! My R won't start! Or my R does start, but does not do what I want. Surely it has conspired against me (once again). Probably the only thing left is to scrap it… Before you resort to extremes, here are some tips that have proven useful.
Solutions when R does not run (properly) AEG: Aus. Ein. Gut. (Off. On. Fine.) Restart your computer. Recommended especially after installing new software.
If you see an error message that mentions a missing package (e.
Convert data frame from 'wide' to 'long'
/2017/01/06/facial_beauty/
Fri, 06 Jan 2017 00:00:00 +0000/2017/01/06/facial_beauty/Thanks to my student Marie Halbich who took the pains to collect the data!
At times, your data set will be in "wide" format, i.e., many columns in comparison to rows. For some analyses, however, it is more suitable to have the data in "long" format, that is, many rows in comparison to columns.
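As a minimal sketch of what such a conversion looks like, assuming the tidyr package (version 1.0.0 or later, for `pivot_longer()`) and a small made-up data frame; the column names `id`, `face_1`, and `face_2` are hypothetical and not taken from the data set used in this post:

```r
library(tidyr)

# Hypothetical toy data in wide format: one row per person, one column per rated face
wide <- data.frame(
  id     = 1:3,
  face_1 = c(4, 2, 5),
  face_2 = c(3, 3, 1)
)

# Stack the rating columns: one row per person-face combination
long <- pivot_longer(
  wide,
  cols      = c(face_1, face_2),
  names_to  = "face",
  values_to = "rating"
)

long
```

The wide table has three rows and three columns; the long one has six rows, one per single rating.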
Let’s have a look at this data set, for example.
d <- read.csv("https://sebastiansauer.github.io/data/facial_beauty_raw.csv")
This is the data from a study tapping into the effect of computerized "beautification" of some faces on subjective "like".
YACSDA (case study) on the 'Affairs' data set
/2017/01/05/yacsda_affairs/
Thu, 05 Jan 2017 00:00:00 +0000/2017/01/05/yacsda_affairs/This YACSDA (Yet-another-case-study-on-data-analysis) is composed in German. Some typical data analytical steps are introduced.
What does the frequency of extramarital affairs in marriages depend on? This question is examined using the Affair data set.
This post presents, by way of example, some basic methods of practical data analysis within a small case study (YACSDA).
Source of the data: http://statsmodels.sourceforge.net/0.5.0/datasets/generated/fair.html
The data set can also be found (in similar form) in the R package COUNT (https://cran.r-project.org/web/packages/COUNT/index.html).
First, let's load the data set into R.
Why is the variance additive? An intuition.
/2017/01/04/additivity_variance/
Wed, 04 Jan 2017 00:00:00 +0000/2017/01/04/additivity_variance/The variance of some data can be defined in rough terms as the mean of the squared deviations from the mean.
Let’s repeat that because it is important:
Variance: Mean of squared deviations from the mean.
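To make the definition concrete, here is a small sketch in R with made-up grades (the numbers are an assumption for illustration only). It computes the variance literally as the mean of the squared deviations:

```r
# Hypothetical grades of six students (1 = best, 6 = worst)
grades <- c(1, 2, 3, 4, 5, 6)

deviations <- grades - mean(grades)  # deviation of each grade from the mean
variance   <- mean(deviations^2)     # mean of the squared deviations
variance

# Note: var() divides by n - 1 rather than n, so its result is slightly larger
var(grades)
```

The difference between the two results comes only from the denominator (n versus n - 1); the idea of averaging squared deviations is the same.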
An example helps to illustrate. Assume a class of students is forced to write an exam in a statistics class (OMG). Let's say the grades range from 1 to 6, 1 being the best and 6 the worst.
A Plain Markdown Post
/2016/12/30/hello-markdown/
Fri, 30 Dec 2016 00:00:00 +0000/2016/12/30/hello-markdown/This is a post written in plain Markdown (*.md) instead of R Markdown (*.Rmd). The major differences are:
You cannot run any R code in a plain Markdown document, whereas in an R Markdown document, you can embed R code chunks (```{r}); A plain Markdown post is rendered through Blackfriday, and an R Markdown document is compiled by rmarkdown and Pandoc. There are many differences in syntax between Blackfriday's Markdown and Pandoc's Markdown.
Surviving the Titanic - YACSDA for nominal data
/2016/12/22/titanic/
Thu, 22 Dec 2016 00:00:00 +0000/2016/12/22/titanic/This YACSDA (Yet-another-case-study-on-data-analysis) is an exemplary analysis of nominal data based on the "classic" case of the sinking of the Titanic. One question that suggests itself here, put somewhat polemically: can (could) one buy one's way out of death? Or, more neutrally: does the survival rate depend on the class in which the passenger travels?
This exercise is meant to illustrate some basic procedures of data analysis; the target audience is beginners in data analysis (with basic knowledge of R).
Munich rents: an exercise on the p-value
/2016/12/21/mietpreis_p-wert/
Wed, 21 Dec 2016 00:00:00 +0000/2016/12/21/mietpreis_p-wert/You want to test the hypothesis (H0) that the mean rent in Munich is 16.28 € (as the Münchner Merkur once claimed). To do so, you draw a sample of size n = 36. Assume an SD of 3 € in the population (the set of all rental apartments in Munich). Let alpha be 5%. The mean of your sample is 16.79 €. As H1, take the hypothesis that the true mean rent is higher.
Some tricks on dplyr::filter
/2016/12/21/dplyr_filter/
Wed, 21 Dec 2016 00:00:00 +0000/2016/12/21/dplyr_filter/The R package dplyr has some attractive features; some say this package revolutionized their workflow. At any rate, I like it a lot, and I think it is very helpful.
In this post, I would like to share some useful (I hope) ideas ("tricks") on filter, one function of dplyr. This function does what the name suggests: it filters rows (i.e., observations such as persons). The addressed rows will be kept; the rest of the rows will be dropped.
Some thoughts on 'Dear stats curriculum developers'
/2016/12/08/stats_curriculum/
Thu, 08 Dec 2016 00:00:00 +0000/2016/12/08/stats_curriculum/Recently, Andrew Gelman (@StatModeling on Twitter) published a post with this title - ""Dear Major Textbook Publisher": A Rant".
In essence, he discussed what a good intro stats textbook should look like, and complained about the low quality of so many textbooks out there.
As I am also guilty of coming up with stats curricula for my students (mostly applied courses for business students), I discuss some thoughts for "stats curriculum developers" (like myself).
Simulation of p-values
/2016/12/01/simu_p/
Thu, 01 Dec 2016 00:00:00 +0000/2016/12/01/simu_p/Teaching or learning stats can be a challenging endeavor. In my experience, starting with concrete (as opposed to abstract) examples helps many a learner. What also helps (for me) is visualizing.
As p-values are still part and parcel of probably any given stats curriculum, here is a convenient function to simulate p-values and to plot them.
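The following is a minimal sketch of such a simulation, not the function from this post. It assumes a normal population with mean 100 and SD 15 and a one-sample t-test; the helper name `simulate_p` and its defaults are hypothetical:

```r
set.seed(42)  # for reproducibility

# Hypothetical helper: draw n_samples samples of size n from a normal population
# and return the p-value of a one-sample t-test against mu0 for each sample
simulate_p <- function(n_samples = 1000, n = 30, mu = 100, sigma = 15, mu0 = 100) {
  replicate(n_samples, t.test(rnorm(n, mean = mu, sd = sigma), mu = mu0)$p.value)
}

p_values <- simulate_p()

# If H0 is true (mu equals mu0), p-values are approximately uniform on [0, 1]
hist(p_values, breaks = 20, main = "Simulated p-values under H0", xlab = "p")
mean(p_values < 0.05)  # share of "significant" results, expected to be near 0.05
```

Setting `mu` to a value different from `mu0` shows how the distribution of p-values piles up near zero when H0 is false.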
"Simulating p-values" amounts to drawing many samples from a given, specified population (e.g., µ=100, s=15, normally distributed).
Pipe the Variance
/2016/11/30/pipe_variance/
Wed, 30 Nov 2016 00:00:00 +0000/2016/11/30/pipe_variance/One idea of problem solving is, or should be, I think, that one should tackle problems of high complexity, but not too high. That sounds trivial; a cooler way to put it would be "as hard as possible, as easy as necessary", which is basically the same thing.
In software development, including Rstats, a similar principle applies. Sounds theoretical, I admit. So here are some lines of code that have bitten me recently:
obs <- c(1,2,3)
pred <- c(1,2,4)
monster <- 1 - (sum((obs - pred)^2))/(sum((obs - mean(obs))^2))
monster
## [1] 0.
Some musings on the validation of Satow's Extraversion questionnaire
/2016/11/23/validation_extraversion_questionnaire/
Wed, 23 Nov 2016 00:00:00 +0000/2016/11/23/validation_extraversion_questionnaire/Measuring personality traits is one of (the?) bread-and-butter businesses of psychologists, at least for quantitatively oriented ones. Literally thousands of psychometric questionnaires exist. Measures abound. Extroversion, part of the Big Five personality theory approach, is one of the most widely used and extensively scrutinized constructs tapping into human personality.
One rather new, but quite often used, questionnaire is Satow's (2012) B5T. The reason for the popularity of this instrument is that it runs under a CC-licence - in contrast to the old ducks, which coûtent cher.
Preparing survey results data
/2016/11/19/preparing_survey_data/
Sat, 19 Nov 2016 00:00:00 +0000/2016/11/19/preparing_survey_data/Analyzing survey results is a frequent endeavor (for some including me). Let’s not think about arguments whether and when surveys are useful or not (for some recent criticism see Briggs’ book).
Typically, respondents circle some option ranging from “don’t agree at all” to “completely agree” for each question (or “item”). Typically, four to six boxes are given where one is expected to tick one.
In this tutorial, I will discuss some typical steps to prepare the data for subsequent analyses.
Crash course on creating bar plots for survey data
/2016/11/13/crashkurs_barplots/
Sun, 13 Nov 2016 00:00:00 +0000/2016/11/13/crashkurs_barplots/A fairly common kind of data in business comes from employee surveys. Such data then need to be prepared and presented graphically. This post gives some basic pointers for that. We assume basic knowledge of R :-)
A more detailed description can be found, e.g., here.
Load packages Don't forget: a computer program (e.g., an R package) can only be loaded if it has been installed beforehand (but it is enough to install the program/R package once).
New bar stacking with ggplot 2.2.0
/2016/11/13/improved_bar_stacking_ggplot2_220/
Sun, 13 Nov 2016 00:00:00 +0000/2016/11/13/improved_bar_stacking_ggplot2_220/Recently, ggplot2 2.2.0 was released. Among other news, the stacking of bar plots was improved. Here is a short demonstration.
Load libraries
library(tidyverse) library(htmlTable) … and load data:
data <- read.csv("https://osf.io/meyhp/?action=download") DOI for this piece of data is 10.17605/OSF.IO/4KGZH.
The data consists of results of a survey on extraversion and associated behavior.
Say we would like to visualize the responses to the extraversion items (there are 10 of them).
So, let's see.
Some thoughts (and simulation) on overfitting
/2016/11/13/overfitting_simulation/
Sun, 13 Nov 2016 00:00:00 +0000/2016/11/13/overfitting_simulation/Overfitting is a common problem in data analysis. Some go as far as saying that "most of" published research is false (John Ioannidis), overfitting being one, maybe the central, problem behind it. In this post, we explore some aspects of the notion of overfitting.
Assume we have 10 metric variables v (personality/health/behavior/gene indicator variables), and, say, 10 variables for splitting up subgroups (aged vs. young, female vs. male, etc.), so 10 dichotomous variables.
Plotting survey results using `ggplot2`
/2016/11/12/plotting_surveys/
Sat, 12 Nov 2016 00:00:00 +0000/2016/11/12/plotting_surveys/Plotting (and more generally, analyzing) survey results is a frequent endeavor in many business environments. Let’s not think about arguments whether and when surveys are useful (for some recent criticism see Briggs’ book).
Typically, respondents circle some option ranging from “don’t agree at all” to “completely agree” for each question (or “item”). Typically, four to six boxes are given where one is expected to tick one.
In this tutorial, I will discuss some barplot type visualizations; the presentation is based on ggplot2 (within the R environment).
Horoscope study on the Barnum effect
/2016/11/09/horoskop-studie/
Wed, 09 Nov 2016 00:00:00 +0000/2016/11/09/horoskop-studie/Many people believe in horoscopes. But why? One reason could be that horoscopes are simply good. What does good mean? They fit me but not other people (with other star signs), and they say things that are useful.
Another reason could be that they flatter us and are platitudes that everyone agrees with: "You are basically a great person, but sometimes a bit impatient" (oh yes, absolutely, fits exactly!). "Today you will meet someone who could become a great love" (Sounds good!
Some reflections on stochastic independence
/2016/11/08/stochastic_independence/
Tue, 08 Nov 2016 00:00:00 +0000/2016/11/08/stochastic_independence/We are often interested in the question whether two variables are "associated", "correlated" (I mean the normal English term) or "dependent". What exactly, or rather in plain words, does that mean? Let's look at an easy case.
NOTE: The example has been updated to reflect a more tangible and sensible scenario (find the old one in the previous commit at Github).
Titanic data For example, let's look at survival rates of the Titanic disaster, to see whether the probability of survival (event A) depends on whether you embarked in 1st class (event B).
Bind lists to data frame for aggregating linear models results
/2016/11/04/bind_list_to_dataframe_lm/
Fri, 04 Nov 2016 00:00:00 +0000/2016/11/04/bind_list_to_dataframe_lm/I found myself doing the following: I had a bunch of predictors, one (numeric) outcome, and wanted to run a simple regression for each of the predictors. Having a bunch of model results, I would like to have them bundled in one data frame.
So, here is one way to do it.
First, load some data.
data(mtcars) str(mtcars) ## 'data.frame': 32 obs. of 11 variables: ## $ mpg : num 21 21 22.How to plot a 'percentage plot' with ggplot2
/2016/11/03/percentage_plot_ggplot2_v2/
Thu, 03 Nov 2016 00:00:00 +0000/2016/11/03/percentage_plot_ggplot2_v2/At times it is convenient to draw a frequency bar plot; at times we prefer not the bare frequencies but the proportions or the percentages per category. There are lots of ways of doing so; let's look at some ggplot2 ways.
First, let’s load some data.
data(tips, package = "reshape2") And the typical libraries.
library(dplyr) library(ggplot2) library(tidyr) library(scales) # for percentage scales Way 1 tips %>% count(day) %>% mutate(perc = n / nrow(tips)) -> tips2 ggplot(tips2, aes(x = day, y = perc)) + geom_bar(stat = "identity") Way 2 ggplot(tips, aes(x = day)) + geom_bar(aes(y = (.Different ways to set figure size in RMarkdown
/2016/11/02/figure_sizing_knitr/
Wed, 02 Nov 2016 00:00:00 +0000/2016/11/02/figure_sizing_knitr/Markdown is thought of as a "lightweight" markup language, hence the name markdown. That's why formatting options are scarce. However, there are some extensions, for instance those brought by RMarkdown.
One point of particular interest is the sizing of figures. Let's look at some ways to size a figure with RMarkdown.
We take some data first:
data(mtcars)
names(mtcars)
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
Now let's plot.
CLES plot
/2016/10/17/cles-plot/
Mon, 17 Oct 2016 00:00:00 +0000/2016/10/17/cles-plot/In data analysis, we often ask “Do these two groups differ in the outcome variable”? Asking this question, a tacit assumption may be that the grouping variable is the cause of the difference in the outcome variable. For example, assume the two groups are “treatment group” and “control group”, and the outcome variable is “pain reduction”.
A typical approach would be to report the strength of the difference using Cohen's d.
Checking for NA with dplyr
/2016/10/16/nas-with-dplyr/
Sun, 16 Oct 2016 00:00:00 +0000/2016/10/16/nas-with-dplyr/Often, we want to check for missing values (NAs). There are of course many ways to do so. dplyr provides a quite nice one.
First, let’s load some data:
library(readr) extra_file <- "https://raw.github.com/sebastiansauer/Daten_Unterricht/master/extra.csv" extra_df <- read_csv(extra_file) Note that extra is a data frame consisting of survey items regarding extraversion and related behavior.
In case the dataframe is quite largish (many columns) it is helpful to have some quick way. Here, we have 25 columns.Multiple ways to subsetting data frames in R
/2016/10/15/indexing-in-r/
Sat, 15 Oct 2016 00:00:00 +0000/2016/10/15/indexing-in-r/Subsetting a data frame is an essential and frequently performed task. Here, some basic ideas are presented.
Get some data first.
str(mtcars) ## 'data.frame': 32 obs. of 11 variables: ## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ... ## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ... ## $ disp: num 160 160 108 258 360 .How to read Github files into R easily
/2016/10/12/download-from-github/
Wed, 12 Oct 2016 00:00:00 +0000/2016/10/12/download-from-github/Downloading a folder (repository) from Github as a whole The most direct way to get data from Github to your computer/ into R, is to download the repository. That is, click the big green button:
The big, green button saying “Clone or download”, click it and choose “download zip”.
Of course, for those using Git and Github, it would be appropriate to clone the repository. And, although appearing more advanced, cloning has the definite advantage that you'll enjoy the whole of the Github features.
Simple (R-)Markdown template for 'Onepager-reports' etc.
/2016/10/05/template-onepager/
Wed, 05 Oct 2016 00:00:00 +0000/2016/10/05/template-onepager/In my role as a teacher, I (have to) write a lot of marking feedback reports. My university provides a website to facilitate the process, that’s great. I have also been writing my reports with Pages, Word, or friends. But somewhat cooler, more attractive, and more reproducible would be using (a markup language such as) Markdown. Basically, that’s easy, but it would be of help to have a template that makes up a nice and nicely formatted report, like this:Using purrr to build a data frame of vectors (eg., from effect size statistics)
/2016/09/29/purrr-effsize/
Thu, 29 Sep 2016 00:00:00 +0000/2016/09/29/purrr-effsize/I just tried to accomplish the following with R: Compute effect sizes for a variable between two groups. Actually, not one numeric variable but many. And compute not only one measure of effect size but several (d, lower/upper CI, CLES,…).
So how to do that?
First, let’s load some data and some (tidyverse and effect size) packages:
knitr::opts_chunk$set(echo = TRUE, cache = FALSE, message = FALSE) library(purrr) library(ggplot2) library(dplyr) library(broom) library(tibble) library(compute.Summary for multiple variables using purrr
/2016/09/28/summary-mult-cols-purrr/
Wed, 28 Sep 2016 00:00:00 +0000/2016/09/28/summary-mult-cols-purrr/A frequent task in data analysis is to get a summary of a bunch of variables. Often, graphical summaries (diagrams) are wanted. However, at times numerical summaries are in order. How to get that in R? That’s the question of the present post.
Of course, there are several ways. One way, using purrr, is the following. I liked it quite a bit, which is why I am showing it here.
First, let’s load some data and some packages we will make use of.EDIT: Running multiple simple regressions with purrr
/2016/09/26/edit-multiple_lm_purrr_edit/
Mon, 26 Sep 2016 00:00:00 +0000/2016/09/26/edit-multiple_lm_purrr_edit/EDIT based on comments/suggestions from @JonoCarroll Disqus profile and @tjmahr twitter profile. See below (last step; look for "EDIT").
Thanks for the input! 👍
reading time: 10 min.
Hadley Wickham's purrr has given a new look at handling data structures to the typical R user (some reasoning suggests that average users don't exist, but that's a different story).
I just tried the following with purrr:
- Meditate about running a simple regression, FWIW
- Take a dataframe with candidate predictors and an outcome
- Throw one predictor at a time into the regression, where the outcome variable remains the same (i.
Running multiple simple regressions with purrr
/2016/09/23/multiple-lm-purrr2/
Fri, 23 Sep 2016 00:00:00 +0000/2016/09/23/multiple-lm-purrr2/Hadley Wickham’s purrr has given a new look at handling data structures to the typical R user (some reasoning suggests that average users don’t exist, but that’s a different story).
I just tried the following with purrr:
- Meditate about running a simple regression, FWIW
- Take a dataframe with candidate predictors and an outcome
- Throw one predictor at a time into the regression, where the outcome variable remains the same (i.
Code example for plotting boxplots instead of mean bars
/2016/09/22/use-boxplots/
Thu, 22 Sep 2016 00:00:00 +0000/2016/09/22/use-boxplots/At a recent psychology conference I had the impression that psychologists keep preferring to show mean values, but appear less interested in more detailed plots such as the boxplot. Plots like the boxplot are richer in information, but not more difficult to perceive.
For those who would like to have an easy starter on how to visualize more informative plots (more than mean bars), here is a suggestion:
# install.packages("Ecdat")
library(Ecdat) # dataset on extramarital affairs
data(Fair)
str(Fair)
## 'data.
How to promote open science? Some practical recommendations
/2016/09/22/openscience/
Thu, 22 Sep 2016 00:00:00 +0000/2016/09/22/openscience/I just attended the biennial conference of the German society of psychology (DGPs) in Leipzig; open science was a central, albeit not undisputed topic; a lot of interesting related twitter discussion.
image source: Felix Schönbrodt
Interestingly, a strong voice of German scientists uttered their concerns about being scooped if/when sharing their data (during the official meeting of the society). This being said (sad), the German research foundation (DFG) has updated its guidelines, now stressing (more strongly) that publicly funded projects should share their data, with the rationale that the data do not belong to the individual scientist but to the public, as the public funded it (I find that convincing).
Case study on exploratory data analysis (YACSDA) with the 'TopGear' data set
/2016/09/14/yacsda_topgear/
Wed, 14 Sep 2016 00:00:00 +0000/2016/09/14/yacsda_topgear/YACSDA in German.
In this case study (YACSDA: Yet another case study of data analysis), the TopGear data set is analyzed, mainly by graphical means. It is less a sweeping attempt to answer every possible interesting question (or to demonstrate every possible analysis tool) than a glimpse of simple exploratory methods.
library(robustHD)
## Loading required package: perry
## Loading required package: parallel
## Loading required package: robustbase
data(TopGear) # load data from package
library(tidyverse)
Numerical overview
glimpse(TopGear)
## Observations: 297
## Variables: 32
## $ Maker <fctr> Alfa Romeo, Alfa Romeo, Aston Martin, Asto.
Why Likert scales are (in general) not metric
/2016/09/07/likert-not-metric/
Wed, 07 Sep 2016 00:00:00 +0000/2016/09/07/likert-not-metric/Likert scales are psychologists’ bread-and-butter tool. Literally, thousands (!) of such “scales” (as they are called, rightfully or not) do exist. To get a feeling: The APA links to this database where 25,000 tests are listed (as stated by the website)! That is indeed an enormous number.
Most of these psychological tests use so called Likert scales (see this Wikipedia article). For example:
(Source: Wikipedia by Nicholas Smith)
Given their widespread use, the question how useful such tests are has arisen many times; see here, here, or here.Why is SD(X) unequal to MAD(X)?
/2016/08/31/why-sd-is-unequal-to-mad/
Wed, 31 Aug 2016 00:00:00 +0000/2016/08/31/why-sd-is-unequal-to-mad/It may seem bewildering that the standard deviation (sd) of a vector X is (generally) unequal to the mean absolute deviation from the mean (MAD) of X, i.e.
$$sd(X) \ne MAD(X)$$.
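A quick numerical check illustrates the inequality. This is a sketch with made-up data; both statistics are computed by hand, dividing by n, so that they differ only in squaring versus taking absolute values:

```r
x <- c(1, 2, 3, 4, 10)  # made-up data

dev <- x - mean(x)

sd_x  <- sqrt(mean(dev^2))  # square root of the mean squared deviation (dividing by n)
mad_x <- mean(abs(dev))     # mean absolute deviation from the mean

c(sd = sd_x, MAD = mad_x)
```

Note that R's built-in `mad()` computes the (scaled) median absolute deviation, which is a different statistic, so it is deliberately not used here.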
One could now argue this way: well, sd(X) involves computing the mean of the squared $$x_i$$, then taking the square root of this mean, thereby “coming back” to the initial size or dimension of x (i.Plot of mean with exact numbers using ggplot2
/2016/08/30/plot_dot_means/
Tue, 30 Aug 2016 00:00:00 +0000/2016/08/30/plot_dot_means/Often, both in academic research and more business-driven data analysis, we want to compare some (two in many cases) means. We will not discuss here that friends should not let friends plot barplots. Following the advice of Cleveland's seminal book we will plot the means using dots, not bars.
However, at times we do not simply want the diagram, but we (or someone else) are interested in the bare, plain, naked, exact numbers too.
Shading multiple areas under normal curve
/2016/08/30/shade_normal_curve/
Tue, 30 Aug 2016 00:00:00 +0000/2016/08/30/shade_normal_curve/When plotting a normal curve, it is often helpful to color (or shade) some segments. For example, often we might want to indicate whether an absolute value is greater than 2.
How can we achieve this with ggplot2? Here is one way.
First, load packages and define some constants. Specifically, we define mean, sd, and start/end (z-) value of the area we want to shade. And your favorite color is defined.Simple way to plot a normal distribution with ggplot2
/2016/08/30/normal_curve_ggplot2/
Tue, 30 Aug 2016 00:00:00 +0000/2016/08/30/normal_curve_ggplot2/Plotting a normal distribution is something needed in a variety of situations: explaining to students (or professors) the basics of statistics; convincing your clients that a t-Test is (not) the right approach to the problem; or pondering on the vicissitudes of life…
If you like ggplot2, you may have wondered what the easiest way is to plot a normal curve with ggplot2.
Here is one:
library(cowplot) ## Loading required package: ggplot2 ## ## Attaching package: 'cowplot' ## The following object is masked from 'package:ggplot2': ## ## ggsave p1 <- ggplot(data = data.Why absolute correlation value (r) cannot exceed 1. An intuition.
/2016/08/28/why-abs-correlation-is-max-1/
Sun, 28 Aug 2016 00:00:00 +0000/2016/08/28/why-abs-correlation-is-max-1/Pearson’s correlation is a well-known and widely used instrument to gauge the degree of linear association of two variables (see this post for an intuition on correlation).
There are many formulas for correlation, but a short and easy one is this one:
$$r = \varnothing(z_x z_y)$$.
In words, $$r$$ can be seen as the average product of z-scores.
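This can be checked numerically. The sketch below uses simulated data (an assumption for illustration) and a hypothetical helper `z()` that standardizes with the population SD, i.e., dividing by n, which is what makes the plain average of the products equal to r:

```r
set.seed(1)
x <- rnorm(50)
y <- 0.5 * x + rnorm(50)

# Hypothetical helper: z-scores based on the population SD (dividing by n)
z <- function(v) (v - mean(v)) / sqrt(mean((v - mean(v))^2))

r_z <- mean(z(x) * z(y))  # average product of z-scores

all.equal(r_z, cor(x, y))
```

With `scale()` or `sd()`, which divide by n - 1, one would have to sum the products and divide by n - 1 instead of taking the plain mean.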
In “raw values”, r is given by
$$ r = \frac{\frac{1}{n}\sum{\Delta X \Delta Y}}{\sqrt{\frac{1}{n}\sum{\Delta X^2}} \sqrt{\frac{1}{n}\sum{\Delta Y^2}}} $$.The effect of a status symbol on success in online dating: an experimental study (data paper)
/2016/08/27/data_status_dating/
Sat, 27 Aug 2016 00:00:00 +0000/2016/08/27/data_status_dating/This article has been published at The Winnower, it is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and redistribution in any medium, provided that the original author and source are credited.
Data can be accessed here.
Access the paper here.
CITATION: Sebastian Sauer, Alexander Wolff, The effect of a status symbol on success in online dating: an experimental study (data paper), The Winnower 3:e147241.Multiple t-Tests with dplyr
/2016/08/18/multiple-t-tests-with-dplyr/
Thu, 18 Aug 2016 00:00:00 +0000/2016/08/18/multiple-t-tests-with-dplyr/t-Test on multiple columns Suppose you have a data set where you want to perform a t-Test on multiple columns with some grouping variable. As an example, say you have a data frame where each column depicts the score on some test (1st, 2nd, 3rd assignment…). In each row is a different student. So you glance at the grading list (OMG!) of a teacher!
How to do that in R?
Introduction to the measurement theory, and conjoint measurement theory
/2016/08/17/intro_measurement/
Wed, 17 Aug 2016 00:00:00 +0000/2016/08/17/intro_measurement/What is measurement? Why should I care?
Measurement is a basis of an empirical science. Imagine a geometer (a person measuring distances on the earth) with a measuring rule made of rubber! Poor guy! Without proper measurement, even the smartest theory cannot be expected to be found, precisely because it cannot be measured.
So, what exactly is measurement? Measurement can be seen as tying numbers to empirical objects. But not in some arbitrary style.
Looping through dataframe columns using purrr::map()
/2016/08/16/looping-purrr/
Tue, 16 Aug 2016 00:00:00 +0000/2016/08/16/looping-purrr/Let’s get purrr. Recently, I ran across this issue: A data frame with many columns; I wanted to select all numeric columns and submit them to a t-test with some grouping variables.
As this is a quite common task, and the purrr-approach (package purrr by @HadleyWickham) is quite elegant, I present the approach in this post.
Let’s load the data, the Affairs data set, and some packages:
data(Affairs, package = "AER") library(purrr) # functional programming library(dplyr) # dataframe wrangling library(ggplot2) # plotting library(tidyr) # reshaping df Don’t forget that the four packages need to be installed in the first place.Intuition on correlation
/2016/07/25/correlation-intuition/
Mon, 25 Jul 2016 00:00:00 +0000/2016/07/25/correlation-intuition/reading time: 10 min.
Pearson’s correlation (short: correlation) is one of statistics’ all time classics. With an age of about a century, it is some kind of grand dad of analytic tools – but an oldie who is still very busy!
Formula, interpretation and application of correlation are well known.
In some non-technical lay terms, correlation captures the (linear) degree of co-variation of two variables. For example: if tall people have large feet (and small people small feet), on average, we say that height and foot size are correlated.
Practical data cleansing in R
/2016/07/24/data-cleansing/
Sun, 24 Jul 2016 00:00:00 +0000/2016/07/24/data-cleansing/What is “data cleansing” about?
Data analysis, in practice, consists typically of some different steps which can be subsumed as “preparing data” and “model data” (not considering communication here):
(Inspired by this)
Often, the first major part — “prepare” — is the most time consuming. This can be lamented since many analysts prefer the cool modeling aspects (since I want to show my math!). In practice, one rather has to get his (her) hands dirt…Yet another case study on data analysis (YACSDA) – extramarital affairs data set
/2016/07/23/affairs/
Sat, 23 Jul 2016 00:00:00 +0000/2016/07/23/affairs/Ok, there are heaps of them on the net. Here comes my YACSDA. Maybe the only thing about it to mention is that it comes in German language.
Analytical language: R (3.3) Purpose: Demonstrate basic exploratory and modeling techniques Packages used: dplyr, ggplot2 Data set: Affair; source R package COUNT Analytical topics covered: descriptive statistics, visualization, linear model, logistic linear model Reproducibility: Rmarkdown, knitr, github Code on Github
Why metric scale level cannot be taken for granted
/2016/07/21/measurement-01/
Thu, 21 Jul 2016 00:00:00 +0000/2016/07/21/measurement-01/One main business for psychologists is to examine questionnaire data. Extraversion, intelligence, attitudes… That's a bread-and-butter job for (research) psychologists.
Similarly, it is common to take the metric level of questionnaire data for granted. Well, not for the item level, it is said. But for the aggregated level, oh yes, that’s OK.
Despite its popularity, the measurement basics of such practice are less clear. On which grounds can this comfortable practice be defended?What to read in summer (German)
/2016/07/20/what-to-read/
Wed, 20 Jul 2016 00:00:00 +0000/2016/07/20/what-to-read/Below are some thoughts on what to read in summer. The post is in German.
Lesezeit/reading time: 10-15 min.
Reading recommendations, summer 2016
What should I read? Summer, sun, sunshine, off to the south. The line “read, read, read, read” would, in my opinion, also fit quite well into that song. So here are a few reading recommendations. I expect two things of decent summer reading: first, that the art be entertaining. Second, that once the steam has risen after reading, something remains besides the steam.
Case study on data wrangling with dplyr (German)
/2016/07/18/nycflights13/
Mon, 18 Jul 2016 00:00:00 +0000/2016/07/18/nycflights13/reading time (full): 30 min.
Data wrangling with dplyr is a popular activity in data science/statistics. A number of tutorials are available, but not so many in German.
The data set analyzed is nycflights13::flights (R package), available on CRAN. OK, choosing this data set is not very creative, but, hey, quite nice data :)
Thus, here is a case study in German; the code is on Github.
Intuition on Cohen's d
/2016/07/15/cohens-d-intuition/
Fri, 15 Jul 2016 00:00:00 +0000/2016/07/15/cohens-d-intuition/reading time: 5-10 min.
Cohen’s d is a widely known and extensively used measure of effect size. That is, d is used to gauge how strong an effect is (given the fact that the effect exists). For example, one way to estimate d is as follows:
data(tips, package = "reshape2")
library(compute.es)
t1 <- t.test(tip ~ sex, data = tips)
t1$statistic
##         t
## -1.489536
table(tips$sex)
##
## Female   Male
##     87    157
tes(t1$statistic, 87, 157)
## Mean Differences ES:
##
##  d [ 95 %CI] = -0.

How to add a logo to a slidify presentation
/2016/07/05/slidify-logo/
Tue, 05 Jul 2016 00:00:00 +0000/2016/07/05/slidify-logo/reading time: 15-20 min.
Slidify is a cool tool to render HTML5 slide decks, see here, here or here for examples.
Features include:
reproducibility. You write your slide deck as you would write any other text, similar to LaTeX/Beamer. But you write using Markdown, which is easier and less clumsy. As you write plain text, you are free to use git.
modern look. Just a website, nothing more. But with cool, modern features.
Long vs. wide format, and gather()
/2016/07/04/gather-long-to-wide-format/
Mon, 04 Jul 2016 00:00:00 +0000/2016/07/04/gather-long-to-wide-format/reading time: 10 min.
A quite common task in data analysis is to change a dataset from wide to long format.
For example, this is a dataset in wide format:
It is called wide because, well, it is wide: several columns side by side.
For example, assume we have measured a number of predictors (here: predictor_1, predictor_2, predictor_3) and an outcome measure (here: outcome). In this case, each variable is dichotomous (either yes or no).
Cross-tabulate multiple variables
/2016/07/03/cross-tabulate-multiple-variables/
Sun, 03 Jul 2016 00:00:00 +0000/2016/07/03/cross-tabulate-multiple-variables/reading time: 15-20 min.
Recently, I analyzed some data of a study where the efficacy of online psychotherapy was investigated. The investigator had assessed whether or not a participant suffered from some comorbidities (such as depression, anxiety, eating disorder…).
I wanted to know whether each of these (10 or so) comorbidities was associated with the outcome (treatment success, yes vs. no).
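As a rough illustration of the question just posed, the following R sketch cross-tabulates several comorbidity indicators against the outcome in one go. The data are simulated and the column names (success, depression, anxiety, eating_disorder) are assumptions made for this example; this is not necessarily the approach taken in the post.

```r
# Simulated stand-in data; variable names are illustrative assumptions,
# not the variables of the actual study.
set.seed(123)
n <- 200
d <- data.frame(
  success         = sample(c("yes", "no"), n, replace = TRUE),
  depression      = sample(c("yes", "no"), n, replace = TRUE),
  anxiety         = sample(c("yes", "no"), n, replace = TRUE),
  eating_disorder = sample(c("yes", "no"), n, replace = TRUE)
)

# Every column except the outcome is treated as a comorbidity indicator
comorbidities <- setdiff(names(d), "success")

# Build one comorbidity-by-outcome table per indicator
tabs <- lapply(comorbidities, function(v) {
  table(d[[v]], d$success, dnn = c(v, "success"))
})
names(tabs) <- comorbidities

# Inspect a single table
tabs$depression
```

Each element of tabs is an ordinary contingency table, so it can be passed on to functions such as prop.table() or chisq.test().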
Of course, an easy solution would be to “half-manually” check the association, e.g.
Why have z-transformed values a mean of zero and a sd of 1?
/2016/07/02/z-value-intuition/
Sat, 02 Jul 2016 00:00:00 +0000/2016/07/02/z-value-intuition/z-transformation is a ubiquitous operation in data analysis. It is often quite practical.
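The claim in the title can be checked numerically with a few lines of R. This is a minimal sketch on simulated scores; the sample size, mean and SD used to generate x are arbitrary assumptions.

```r
# Hypothetical raw test scores (values are arbitrary assumptions)
set.seed(1)
x <- rnorm(100, mean = 40, sd = 5)

# z-transformation: subtract the mean, then divide by the standard deviation
z <- (x - mean(x)) / sd(x)

mean(z)  # zero, up to floating-point error
sd(z)    # one, up to floating-point error
```

Subtracting the mean shifts the center to zero, and dividing by the SD rescales the spread to one, whatever the mean and SD of the raw scores were.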
Example: Assume Dr Zack scored 42 points on a test (say, IQ). The average score is 40 in the relevant population, and the SD is 1, let’s say. So Zack’s score is 2 points above average. 2 points equal 2 SDs in this example. We can thus safely infer that Zack is about 2 SDs above average (leaving measurement precision and other issues aside).
About
/about/
Sun, 20 Nov 2011 00:00:00 +0000/about/I blog about data science, particularly using R, and with an applied interest in the social sciences.
As a non-virtual person, I work as a professor at FOM University of Applied Sciences.
Posts mostly reflect my current thinking, and posts are not immune to thought updates. With luck, things get less wrong in the course of time. All opinions are my own. Faults are my own. Posts are organized as notebooks, as the crow flies, which is, as my thinking went.
/1/01/01/
Mon, 01 Jan 0001 00:00:00 +0000/1/01/01/
---
title: Eliminating a factor reduces variance
author: ''
date: '2018-12-10'
slug: eliminating-a-factor-reduces-variance
draft: TRUE
categories:
  - rstats
tags:
  - tutorial
  - plotting
---
A well-known way to reduce variability and increase power in experimental (and observational) research designs is to eliminate a factor that may influence the outcome variable.
“Eliminating” a factor means, by and large, holding it constant.
Consider the following example. Say an experiment is performed with two groups, and the experimental group shows higher values than the control group.
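A small simulation can make the idea concrete. The sketch below is an illustration under assumed numbers, not the post's own code: the nuisance factor, the effect sizes (0.5 for the treatment, 2 for the nuisance factor) and the sample size are all made up for this example.

```r
# Minimal simulation sketch; all names and effect sizes are assumptions.
set.seed(42)
n <- 1000

# Hypothetical nuisance factor with two levels that shifts the outcome
nuisance <- rep(c(0, 1), each = n / 2)

# Treatment indicator, balanced within each level of the nuisance factor
group <- rep(c(0, 1), times = n / 2)

# Outcome: treatment effect 0.5, nuisance effect 2, unit-variance noise
y <- 0.5 * group + 2 * nuisance + rnorm(n)

# Outcome variance in the control group, both nuisance levels pooled
var_pooled <- var(y[group == 0])

# Same group, but with the nuisance factor held constant at level 0
var_constant <- var(y[group == 0 & nuisance == 0])

c(pooled = var_pooled, held_constant = var_constant)
```

With the nuisance factor held constant, the outcome variance within a group shrinks to roughly the noise variance, so the same group difference stands out against less variability.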
/privacy/
Mon, 01 Jan 0001 00:00:00 +0000/privacy/Privacy policy This privacy policy informs you about the nature, scope and purpose of the processing of personal data (hereinafter “data”) within our online offering and the websites, features and content associated with it, as well as external online presences such as our social media profiles (hereinafter collectively referred to as the “online offering”). With regard to the terms used, such as “processing” or “controller”, we refer to the definitions in Art. 4 of the General Data Protection Regulation (GDPR).