Example for Meng's 2018 article on big data bias

1 Load packages

library(tidyverse)  # data wrangling

2 Motivation

My colleague, Karsten Lübke, first grade statistician, pointed me out to a paper …

In 2018, the statistican Meng wrote a paper about biases in big data see here. In a nutshell, he argues that non-random samples will be worse when data is larger. Yes, you heard correctly: Big data is bad data, if not randomly drawn. That’s bold a claim.

3 Computing the effective sample size in the 2016’ US federal elections

There seem to have been approx. 200 Mio. voters in the 2016 US elections, according to this source; let’s take that value for \(N\).

Let’s follow Meng and take 2 Mio as the sample size \(n\) of a survey tapping into the question “who you gonna vote for?”.

Let’s further assume that Trump voters are a bit … shyer than Clinton’s voters. Say, a given Trump voter is (on average) 0.01% more reluctant to disclose her opinion (that she prefers Trump over Clinton); call this value \(d\) (delta). Note that that is only a very tiny correlation. Very, very tiny.

n <- 2e06
N <- 2e08
delta <- .001
n_eff <- 4 * (n/N)^2 * (1/delta)^2
n_eff
#> [1] 400

We would end up at a sample size of \(n=400\)! That’s flippingly small.

4 Conclusion

I’ve not yet digested Meng’s paper in detail, but it appears well-founded. Given that the conclusions are solid, it means that “big data is bad data”, in the most cases. The problem is not big data per se but rather the fact that it is not random data. All else beside random data is … not really garbish, but of much, much lower quality (and hence information) compared to a real random sample. Of course this problem applies to all other kind of observational research as well.

5 Further reading

See this post for an easy to read introduction.

6 Reproducibility

#> ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 4.0.2 (2020-06-22)
#>  os       macOS Catalina 10.15.7      
#>  system   x86_64, darwin17.0          
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  ctype    en_US.UTF-8                 
#>  tz       Europe/Berlin               
#>  date     2020-12-09                  
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
#>  package     * version date       lib source        
#>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.0.0)
#>  backports     1.2.0   2020-11-02 [1] CRAN (R 4.0.2)
#>  blogdown      0.21    2020-10-11 [1] CRAN (R 4.0.2)
#>  bookdown      0.21    2020-10-13 [1] CRAN (R 4.0.2)
#>  broom         0.7.2   2020-10-20 [1] CRAN (R 4.0.2)
#>  callr         3.5.1   2020-10-13 [1] CRAN (R 4.0.2)
#>  cellranger    1.1.0   2016-07-27 [1] CRAN (R 4.0.0)
#>  cli           2.2.0   2020-11-20 [1] CRAN (R 4.0.2)
#>  codetools     0.2-16  2018-12-24 [2] CRAN (R 4.0.2)
#>  colorspace    2.0-0   2020-11-11 [1] CRAN (R 4.0.2)
#>  crayon        1.3.4   2017-09-16 [1] CRAN (R 4.0.0)
#>  DBI           1.1.0   2019-12-15 [1] CRAN (R 4.0.0)
#>  dbplyr        2.0.0   2020-11-03 [1] CRAN (R 4.0.2)
#>  desc          1.2.0   2018-05-01 [1] CRAN (R 4.0.0)
#>  devtools      2.3.2   2020-09-18 [1] CRAN (R 4.0.2)
#>  digest        0.6.27  2020-10-24 [1] CRAN (R 4.0.2)
#>  dplyr       * 1.0.2   2020-08-18 [1] CRAN (R 4.0.2)
#>  ellipsis      0.3.1   2020-05-15 [1] CRAN (R 4.0.0)
#>  evaluate      0.14    2019-05-28 [1] CRAN (R 4.0.0)
#>  fansi         0.4.1   2020-01-08 [1] CRAN (R 4.0.0)
#>  forcats     * 0.5.0   2020-03-01 [1] CRAN (R 4.0.0)
#>  fs            1.5.0   2020-07-31 [1] CRAN (R 4.0.2)
#>  generics      0.1.0   2020-10-31 [1] CRAN (R 4.0.2)
#>  ggplot2     * 3.3.2   2020-06-19 [1] CRAN (R 4.0.0)
#>  glue          1.4.2   2020-08-27 [1] CRAN (R 4.0.2)
#>  gtable        0.3.0   2019-03-25 [1] CRAN (R 4.0.0)
#>  haven         2.3.1   2020-06-01 [1] CRAN (R 4.0.0)
#>  hms           0.5.3   2020-01-08 [1] CRAN (R 4.0.0)
#>  htmltools     0.5.0   2020-06-16 [1] CRAN (R 4.0.0)
#>  httr          1.4.2   2020-07-20 [1] CRAN (R 4.0.2)
#>  jsonlite      1.7.1   2020-09-07 [1] CRAN (R 4.0.2)
#>  knitr         1.30    2020-09-22 [1] CRAN (R 4.0.2)
#>  lifecycle     0.2.0   2020-03-06 [1] CRAN (R 4.0.0)
#>  lubridate     1.7.9.2 2020-11-13 [1] CRAN (R 4.0.2)
#>  magrittr      2.0.1   2020-11-17 [1] CRAN (R 4.0.2)
#>  memoise       1.1.0   2017-04-21 [1] CRAN (R 4.0.0)
#>  modelr        0.1.8   2020-05-19 [1] CRAN (R 4.0.0)
#>  munsell       0.5.0   2018-06-12 [1] CRAN (R 4.0.0)
#>  pillar        1.4.7   2020-11-20 [1] CRAN (R 4.0.2)
#>  pkgbuild      1.1.0   2020-07-13 [1] CRAN (R 4.0.2)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.0.0)
#>  pkgload       1.1.0   2020-05-29 [1] CRAN (R 4.0.0)
#>  prettyunits   1.1.1   2020-01-24 [1] CRAN (R 4.0.0)
#>  processx      3.4.5   2020-11-30 [1] CRAN (R 4.0.2)
#>  ps            1.4.0   2020-10-07 [1] CRAN (R 4.0.2)
#>  purrr       * 0.3.4   2020-04-17 [1] CRAN (R 4.0.0)
#>  R6            2.5.0   2020-10-28 [1] CRAN (R 4.0.2)
#>  Rcpp          1.0.5   2020-07-06 [1] CRAN (R 4.0.2)
#>  readr       * 1.4.0   2020-10-05 [1] CRAN (R 4.0.2)
#>  readxl        1.3.1   2019-03-13 [1] CRAN (R 4.0.0)
#>  remotes       2.2.0   2020-07-21 [1] CRAN (R 4.0.2)
#>  reprex        0.3.0   2019-05-16 [1] CRAN (R 4.0.0)
#>  rlang         0.4.9   2020-11-26 [1] CRAN (R 4.0.2)
#>  rmarkdown     2.5     2020-10-21 [1] CRAN (R 4.0.2)
#>  rprojroot     2.0.2   2020-11-15 [1] CRAN (R 4.0.2)
#>  rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.0.2)
#>  rvest         0.3.6   2020-07-25 [1] CRAN (R 4.0.2)
#>  scales        1.1.1   2020-05-11 [1] CRAN (R 4.0.0)
#>  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.0.0)
#>  stringi       1.5.3   2020-09-09 [1] CRAN (R 4.0.2)
#>  stringr     * 1.4.0   2019-02-10 [1] CRAN (R 4.0.0)
#>  testthat      3.0.0   2020-10-31 [1] CRAN (R 4.0.2)
#>  tibble      * 3.0.4   2020-10-12 [1] CRAN (R 4.0.2)
#>  tidyr       * 1.1.2   2020-08-27 [1] CRAN (R 4.0.2)
#>  tidyselect    1.1.0   2020-05-11 [1] CRAN (R 4.0.0)
#>  tidyverse   * 1.3.0   2019-11-21 [1] CRAN (R 4.0.0)
#>  usethis       1.6.3   2020-09-17 [1] CRAN (R 4.0.2)
#>  vctrs         0.3.5   2020-11-17 [1] CRAN (R 4.0.2)
#>  withr         2.3.0   2020-09-22 [1] CRAN (R 4.0.2)
#>  xfun          0.19    2020-10-30 [1] CRAN (R 4.0.2)
#>  xml2          1.3.2   2020-04-23 [1] CRAN (R 4.0.0)
#>  yaml          2.2.1   2020-02-01 [1] CRAN (R 4.0.0)
#> 
#> [1] /Users/sebastiansaueruser/Rlibs
#> [2] /Library/Frameworks/R.framework/Versions/4.0/Resources/library