1 Load packages
library(tidyverse) # data wrangling
library(easystats) # provides data_read() (via datawizard)
library(digest) # hashes
2 Motivation
Importing a CSV file can yield slightly different results depending on which function is used for the import. The question is whether the data itself is constant across import methods, which is a necessary condition for a reliable analysis. If different import functions can produce different data, then the import function must at least be known for the analysis to be reproducible.
In this post, we will examine the effect of importing data using different functions.
3 Data
We’ll use the penguins data set.
data_url <- "https://vincentarelbundock.github.io/Rdatasets/csv/palmerpenguins/penguins.csv"
4 Method 1: read.csv
read.csv is a function from base R.
Let’s try it.
d1 <- read.csv(data_url)
head(d1)
#> X species island bill_length_mm bill_depth_mm flipper_length_mm
#> 1 1 Adelie Torgersen 39.1 18.7 181
#> 2 2 Adelie Torgersen 39.5 17.4 186
#> 3 3 Adelie Torgersen 40.3 18.0 195
#> 4 4 Adelie Torgersen NA NA NA
#> 5 5 Adelie Torgersen 36.7 19.3 193
#> 6 6 Adelie Torgersen 39.3 20.6 190
#> body_mass_g sex year
#> 1 3750 male 2007
#> 2 3800 female 2007
#> 3 3250 female 2007
#> 4 NA <NA> 2007
#> 5 3450 female 2007
#> 6 3650 male 2007
5 Method 2: read_csv
d2 <- read_csv(data_url)
head(d2)
#> # A tibble: 6 × 9
#> ...1 species island bill_length_mm bill_dept…¹ flipp…² body_…³ sex year
#> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
#> 1 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
#> 2 2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007
#> 3 3 Adelie Torgersen 40.3 18 195 3250 fema… 2007
#> 4 4 Adelie Torgersen NA NA NA NA <NA> 2007
#> 5 5 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007
#> 6 6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
#> # … with abbreviated variable names ¹bill_depth_mm, ²flipper_length_mm,
#> # ³body_mass_g
6 Method 3: data_read
d3 <- data_read(data_url)
head(d3)
#> V1 species island bill_length_mm bill_depth_mm flipper_length_mm
#> 1 1 Adelie Torgersen 39.1 18.7 181
#> 2 2 Adelie Torgersen 39.5 17.4 186
#> 3 3 Adelie Torgersen 40.3 18.0 195
#> 4 4 Adelie Torgersen NA NA NA
#> 5 5 Adelie Torgersen 36.7 19.3 193
#> 6 6 Adelie Torgersen 39.3 20.6 190
#> body_mass_g sex year
#> 1 3750 male 2007
#> 2 3800 female 2007
#> 3 3250 female 2007
#> 4 NA <NA> 2007
#> 5 3450 female 2007
#> 6 3650 male 2007
7 First glimpse
glimpse(d1)
#> Rows: 344
#> Columns: 9
#> $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
#> $ species <chr> "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "A…
#> $ island <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
#> $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
#> $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
#> $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
#> $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
#> $ sex <chr> "male", "female", "female", NA, "female", "male", "f…
#> $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
glimpse(d2)
#> Rows: 344
#> Columns: 9
#> $ ...1 <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
#> $ species <chr> "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "A…
#> $ island <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
#> $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
#> $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
#> $ flipper_length_mm <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
#> $ body_mass_g <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
#> $ sex <chr> "male", "female", "female", NA, "female", "male", "f…
#> $ year <dbl> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
glimpse(d3)
#> Rows: 344
#> Columns: 9
#> $ V1 <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
#> $ species <chr> "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "A…
#> $ island <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
#> $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
#> $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
#> $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
#> $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
#> $ sex <chr> "male", "female", "female", NA, "female", "male", "f…
#> $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
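Nothing striking catches the eye. Apart from the name of the first column, the only visible difference is that some columns (flipper_length_mm, body_mass_g, year) are integer in d1 and d3 but double in d2.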
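The column types shown by glimpse() can also be compared side by side. The following is a minimal sketch, not part of the original analysis: it assumes a tiny inline CSV (`csv_text`, made up for illustration) instead of the penguins URL so that it runs offline, and it assumes readr 2.0 or later for reading literal data via `I()`.

```r
# Tiny stand-in for the penguins CSV (illustrative values only)
csv_text <- "species,flipper_length_mm,bill_length_mm
Adelie,181,39.1
Adelie,186,39.5
Adelie,NA,NA"

a <- read.csv(text = csv_text)                             # base R
b <- readr::read_csv(I(csv_text), show_col_types = FALSE)  # readr

# One row per column, one type per import function
types <- data.frame(
  column   = names(a),
  read.csv = vapply(a, typeof, character(1)),
  read_csv = vapply(b, typeof, character(1)),
  row.names = NULL
)
types
```

In this sketch, whole numbers should come back as integer from read.csv() and as double from read_csv(), which mirrors the difference visible in the glimpse() output.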
8 Hashes
A hash is like a fingerprint of a digital object: it is (quasi) unique. Let’s compute the hashes of the data sets. Note that we should exclude the first column, as its name is set differently by each import function.
d1 <- d1 %>% select(-1)
d2 <- d2 %>% select(-1)
d3 <- d3 %>% select(-1)
To get the hash value of an object, we can use the function digest().
d1_hash <-
d1 %>%
digest()
d1_hash
#> [1] "1a3544902d7b1bc28121806bbe580883"
d2_hash <-
d2 %>%
digest()
d2_hash
#> [1] "3e0caf37ed36f86d754459a75c4f98b3"
d3_hash <-
d3 %>%
digest()
d3_hash
#> [1] "566b675fd32ac2705a875505f895469a"
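To get a feeling for how sensitive digest() is, here is a small sketch on toy objects (the vectors and data frames below are made up for illustration, not taken from the penguins data). It suggests that the hash reflects how an object is stored, not only which values it holds.

```r
library(digest)

# Same numbers, stored once as integer and once as double
x_int <- c(181L, 186L, 195L)
x_dbl <- c(181, 186, 195)

all(x_int == x_dbl)             # the values agree element-wise
identical(x_int, x_dbl)         # but the objects are not identical
digest(x_int) == digest(x_dbl)  # and neither are their hashes

# Same column, once in a data.frame and once in a tibble
df  <- data.frame(x = x_dbl)
tbl <- tibble::as_tibble(df)
digest(df) == digest(tbl)       # the class attribute differs, so the hash differs
```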
9 Not exactly identical
As the hashes (fingerprints) differ, we can conclude that the objects are not exactly identical. However, the differences may stem from subtle variations such as attributes, column types, or the class of the data frame.
Let’s focus on the data instead.
10 Data comparison
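By converting each data frame to a matrix, we get rid of possible data frame overhead such as class attributes, leaving the pure data. Note that as.matrix() turns a data frame with mixed column types into a character matrix, so the distinction between integer and double columns is dropped as well.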
d1_matrix <-
d1 %>%
as.matrix()
d2_matrix <-
d2 %>%
as.matrix()
d3_matrix <-
d3 %>%
as.matrix()
Let’s check the attributes of the matrices:
d1_matrix %>% attributes()
#> $dim
#> [1] 344 8
#>
#> $dimnames
#> $dimnames[[1]]
#> NULL
#>
#> $dimnames[[2]]
#> [1] "species" "island" "bill_length_mm"
#> [4] "bill_depth_mm" "flipper_length_mm" "body_mass_g"
#> [7] "sex" "year"
d2_matrix %>% attributes()
#> $dim
#> [1] 344 8
#>
#> $dimnames
#> $dimnames[[1]]
#> NULL
#>
#> $dimnames[[2]]
#> [1] "species" "island" "bill_length_mm"
#> [4] "bill_depth_mm" "flipper_length_mm" "body_mass_g"
#> [7] "sex" "year"
d3_matrix %>% attributes()
#> $dim
#> [1] 344 8
#>
#> $dimnames
#> $dimnames[[1]]
#> NULL
#>
#> $dimnames[[2]]
#> [1] "species" "island" "bill_length_mm"
#> [4] "bill_depth_mm" "flipper_length_mm" "body_mass_g"
#> [7] "sex" "year"
Identical.
Now let’s check the hashes of the matrices.
d1_matrix_hash <- d1_matrix %>% digest()
d2_matrix_hash <- d2_matrix %>% digest()
d3_matrix_hash <- d3_matrix %>% digest()
d1_matrix_hash
#> [1] "08adb3f15d6ca8edbb2978795a2d7eba"
d2_matrix_hash
#> [1] "08adb3f15d6ca8edbb2978795a2d7eba"
d3_matrix_hash
#> [1] "08adb3f15d6ca8edbb2978795a2d7eba"
Identical.
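One plausible reason why the matrix hashes agree while the data frame hashes do not: as.matrix() on a data frame that contains a character column returns a character matrix, so integer and double columns end up as the same strings. A minimal sketch with two toy data frames (`df_int` and `df_dbl` are made up for illustration):

```r
# Same values, but `mass` is integer in one data frame and double in the other
df_int <- data.frame(species = c("Adelie", "Gentoo"), mass = c(3750L, 5000L))
df_dbl <- data.frame(species = c("Adelie", "Gentoo"), mass = c(3750, 5000))

identical(df_int, df_dbl)   # FALSE: the column types differ

m_int <- as.matrix(df_int)
m_dbl <- as.matrix(df_dbl)

typeof(m_int)               # "character": everything was converted to strings
identical(m_int, m_dbl)     # TRUE: the type difference is gone
```

This also means the matrix comparison checks the formatted values, not the stored numeric types.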
11 Conclusion
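We can conclude that the values are identical across the three import methods (leaving the first column aside). The differing data frame hashes appear to reflect how the data are stored, such as column types and data frame class, rather than the values themselves.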
Note that no random numbers were involved in this analysis.
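As a cross-check that does not go through a character matrix, the columns can be compared pairwise with all.equal(), which treats integer and double values as equal when the numbers match. The helper `same_values()` below is hypothetical (it is not part of any package used in this post), and the two toy data frames are made up for illustration.

```r
# Toy data: same values, different storage types, with some missing values
a <- data.frame(species = c("Adelie", "Gentoo", NA),
                body_mass_g = c(3750L, NA, 5000L))
b <- data.frame(species = c("Adelie", "Gentoo", NA),
                body_mass_g = c(3750, NA, 5000))

# Hypothetical helper: TRUE per column if the values match, ignoring int vs. dbl
same_values <- function(x, y) {
  stopifnot(identical(names(x), names(y)))
  vapply(names(x),
         function(nm) isTRUE(all.equal(x[[nm]], y[[nm]])),
         logical(1))
}

identical(a, b)     # FALSE: strict comparison also looks at the types
same_values(a, b)   # one logical value per column
```

Applied to the imported data, this would be called as same_values(d1, d2), provided both data frames have the same column names.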
12 Reproducibility
#> ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.2.1 (2022-06-23)
#> os macOS Big Sur ... 10.16
#> system x86_64, darwin17.0
#> ui X11
#> language (EN)
#> collate en_US.UTF-8
#> ctype en_US.UTF-8
#> tz Europe/Berlin
#> date 2023-01-19
#> pandoc 2.19.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.2.0)
#> backports 1.4.1 2021-12-13 [1] CRAN (R 4.2.0)
#> bayestestR * 0.13.0 2022-09-18 [1] CRAN (R 4.2.0)
#> bit 4.0.5 2022-11-15 [1] CRAN (R 4.2.0)
#> bit64 4.0.5 2020-08-30 [1] CRAN (R 4.2.0)
#> blogdown 1.16 2022-12-13 [1] CRAN (R 4.2.0)
#> bookdown 0.31 2022-12-13 [1] CRAN (R 4.2.0)
#> broom 1.0.2 2022-12-15 [1] CRAN (R 4.2.0)
#> bslib 0.4.2 2022-12-16 [1] CRAN (R 4.2.0)
#> cachem 1.0.6 2021-08-19 [1] CRAN (R 4.2.0)
#> callr 3.7.3 2022-11-02 [1] CRAN (R 4.2.0)
#> cellranger 1.1.0 2016-07-27 [1] CRAN (R 4.2.0)
#> cli 3.6.0 2023-01-09 [1] CRAN (R 4.2.0)
#> coda 0.19-4 2020-09-30 [1] CRAN (R 4.2.0)
#> codetools 0.2-18 2020-11-04 [2] CRAN (R 4.2.1)
#> colorout * 1.2-2 2022-06-13 [1] local
#> colorspace 2.0-3 2022-02-21 [1] CRAN (R 4.2.0)
#> correlation * 0.8.3 2022-10-09 [1] CRAN (R 4.2.0)
#> crayon 1.5.2 2022-09-29 [1] CRAN (R 4.2.1)
#> curl 4.3.3 2022-10-06 [1] CRAN (R 4.2.0)
#> data.table 1.14.6 2022-11-16 [1] CRAN (R 4.2.0)
#> datawizard * 0.6.5 2022-12-14 [1] CRAN (R 4.2.0)
#> DBI 1.1.3 2022-06-18 [1] CRAN (R 4.2.0)
#> dbplyr 2.2.1 2022-06-27 [1] CRAN (R 4.2.0)
#> devtools 2.4.5 2022-10-11 [1] CRAN (R 4.2.1)
#> digest * 0.6.31 2022-12-11 [1] CRAN (R 4.2.0)
#> dplyr * 1.0.10 2022-09-01 [1] CRAN (R 4.2.0)
#> easystats * 0.6.0 2022-11-29 [1] CRAN (R 4.2.1)
#> effectsize * 0.8.2 2022-10-31 [1] CRAN (R 4.2.0)
#> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.2.0)
#> emmeans 1.8.3 2022-12-06 [1] CRAN (R 4.2.0)
#> estimability 1.4.1 2022-08-05 [1] CRAN (R 4.2.0)
#> evaluate 0.19 2022-12-13 [1] CRAN (R 4.2.0)
#> fansi 1.0.3 2022-03-24 [1] CRAN (R 4.2.0)
#> fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.2.0)
#> forcats * 0.5.2 2022-08-19 [1] CRAN (R 4.2.0)
#> fs 1.5.2 2021-12-08 [1] CRAN (R 4.2.0)
#> gargle 1.2.1 2022-09-08 [1] CRAN (R 4.2.0)
#> generics 0.1.3 2022-07-05 [1] CRAN (R 4.2.0)
#> ggplot2 * 3.4.0 2022-11-04 [1] CRAN (R 4.2.0)
#> glue 1.6.2 2022-02-24 [1] CRAN (R 4.2.0)
#> googledrive 2.0.0 2021-07-08 [1] CRAN (R 4.2.0)
#> googlesheets4 1.0.1 2022-08-13 [1] CRAN (R 4.2.0)
#> gtable 0.3.1 2022-09-01 [1] CRAN (R 4.2.0)
#> haven 2.5.1 2022-08-22 [1] CRAN (R 4.2.0)
#> hms 1.1.2 2022-08-19 [1] CRAN (R 4.2.0)
#> htmltools 0.5.4 2022-12-07 [1] CRAN (R 4.2.0)
#> htmlwidgets 1.6.1 2023-01-07 [1] CRAN (R 4.2.0)
#> httpuv 1.6.8 2023-01-12 [1] CRAN (R 4.2.0)
#> httr 1.4.4 2022-08-17 [1] CRAN (R 4.2.0)
#> insight * 0.18.8 2022-11-24 [1] CRAN (R 4.2.0)
#> jquerylib 0.1.4 2021-04-26 [1] CRAN (R 4.2.0)
#> jsonlite 1.8.4 2022-12-06 [1] CRAN (R 4.2.0)
#> knitr 1.41 2022-11-18 [1] CRAN (R 4.2.0)
#> later 1.3.0 2021-08-18 [1] CRAN (R 4.2.0)
#> lattice 0.20-45 2021-09-22 [2] CRAN (R 4.2.1)
#> lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.2.0)
#> lubridate 1.9.0 2022-11-06 [1] CRAN (R 4.2.0)
#> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.2.0)
#> MASS 7.3-58.1 2022-08-03 [1] CRAN (R 4.2.0)
#> Matrix 1.5-3 2022-11-11 [1] CRAN (R 4.2.0)
#> memoise 2.0.1 2021-11-26 [1] CRAN (R 4.2.0)
#> mime 0.12 2021-09-28 [1] CRAN (R 4.2.0)
#> miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 4.2.0)
#> modelbased * 0.8.6 2023-01-13 [1] CRAN (R 4.2.1)
#> modelr 0.1.10 2022-11-11 [1] CRAN (R 4.2.0)
#> multcomp 1.4-20 2022-08-07 [1] CRAN (R 4.2.0)
#> munsell 0.5.0 2018-06-12 [1] CRAN (R 4.2.0)
#> mvtnorm 1.1-3 2021-10-08 [1] CRAN (R 4.2.0)
#> parameters * 0.20.1 2023-01-11 [1] CRAN (R 4.2.0)
#> performance * 0.10.2 2023-01-12 [1] CRAN (R 4.2.0)
#> pillar 1.8.1 2022-08-19 [1] CRAN (R 4.2.0)
#> pkgbuild 1.4.0 2022-11-27 [1] CRAN (R 4.2.0)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.2.0)
#> pkgload 1.3.2 2022-11-16 [1] CRAN (R 4.2.0)
#> prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.2.0)
#> processx 3.8.0 2022-10-26 [1] CRAN (R 4.2.0)
#> profvis 0.3.7 2020-11-02 [1] CRAN (R 4.2.0)
#> promises 1.2.0.1 2021-02-11 [1] CRAN (R 4.2.0)
#> ps 1.7.2 2022-10-26 [1] CRAN (R 4.2.0)
#> purrr * 1.0.1 2023-01-10 [1] CRAN (R 4.2.0)
#> R6 2.5.1 2021-08-19 [1] CRAN (R 4.2.0)
#> Rcpp 1.0.9 2022-07-08 [1] CRAN (R 4.2.0)
#> readr * 2.1.3 2022-10-01 [1] CRAN (R 4.2.0)
#> readxl 1.4.1 2022-08-17 [1] CRAN (R 4.2.0)
#> remotes 2.4.2 2021-11-30 [1] CRAN (R 4.2.0)
#> report * 0.5.5 2022-08-22 [1] CRAN (R 4.2.0)
#> reprex 2.0.2 2022-08-17 [1] CRAN (R 4.2.0)
#> rlang 1.0.6 2022-09-24 [1] CRAN (R 4.2.0)
#> rmarkdown 2.19 2022-12-15 [1] CRAN (R 4.2.0)
#> rstudioapi 0.14 2022-08-22 [1] CRAN (R 4.2.0)
#> rvest 1.0.3 2022-08-19 [1] CRAN (R 4.2.0)
#> sandwich 3.0-2 2022-06-15 [1] CRAN (R 4.2.0)
#> sass 0.4.4 2022-11-24 [1] CRAN (R 4.2.0)
#> scales 1.2.1 2022-08-20 [1] CRAN (R 4.2.0)
#> see * 0.7.4 2022-11-26 [1] CRAN (R 4.2.0)
#> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.2.0)
#> shiny 1.7.4 2022-12-15 [1] CRAN (R 4.2.0)
#> stringi 1.7.12 2023-01-11 [1] CRAN (R 4.2.0)
#> stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.2.0)
#> survival 3.5-0 2023-01-09 [1] CRAN (R 4.2.0)
#> TH.data 1.1-1 2022-04-26 [1] CRAN (R 4.2.0)
#> tibble * 3.1.8 2022-07-22 [1] CRAN (R 4.2.0)
#> tidyr * 1.2.1 2022-09-08 [1] CRAN (R 4.2.0)
#> tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.2.0)
#> tidyverse * 1.3.2 2022-07-18 [1] CRAN (R 4.2.0)
#> timechange 0.2.0 2023-01-11 [1] CRAN (R 4.2.0)
#> tzdb 0.3.0 2022-03-28 [1] CRAN (R 4.2.0)
#> urlchecker 1.0.1 2021-11-30 [1] CRAN (R 4.2.0)
#> usethis 2.1.6 2022-05-25 [1] CRAN (R 4.2.0)
#> utf8 1.2.2 2021-07-24 [1] CRAN (R 4.2.0)
#> vctrs 0.5.1 2022-11-16 [1] CRAN (R 4.2.0)
#> vroom 1.6.0 2022-09-30 [1] CRAN (R 4.2.0)
#> withr 2.5.0 2022-03-03 [1] CRAN (R 4.2.0)
#> xfun 0.36 2022-12-21 [1] CRAN (R 4.2.0)
#> xml2 1.3.3 2021-11-30 [1] CRAN (R 4.2.0)
#> xtable 1.8-4 2019-04-21 [1] CRAN (R 4.2.0)
#> yaml 2.3.6 2022-10-18 [1] CRAN (R 4.2.0)
#> zoo 1.8-11 2022-09-17 [1] CRAN (R 4.2.0)
#>
#> [1] /Users/sebastiansaueruser/Rlibs
#> [2] /Library/Frameworks/R.framework/Versions/4.2/Resources/library
#>
#> ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────