Differences when importing a CSV file with different functions

1 Load packages

library(tidyverse)  # data wrangling
library(easystats)
library(digest)  # hashes

2 Motivation

Importing a CSV file can yield slightly different results depending on which function is used for the import. The question is whether the data itself is constant across import methods, which is a necessary condition for a reliable analysis. If different functions can produce different data, a reproducible analysis must at least state which import function was used.

In this post, we will examine the effect of importing data using different functions.

3 Data

We’ll use the penguins data set.

data_url <- "https://vincentarelbundock.github.io/Rdatasets/csv/palmerpenguins/penguins.csv"

4 Method 1: read.csv

read.csv is a function from Base R.

Let’s try it.

d1 <- read.csv(data_url)
head(d1)
#>   X species    island bill_length_mm bill_depth_mm flipper_length_mm
#> 1 1  Adelie Torgersen           39.1          18.7               181
#> 2 2  Adelie Torgersen           39.5          17.4               186
#> 3 3  Adelie Torgersen           40.3          18.0               195
#> 4 4  Adelie Torgersen             NA            NA                NA
#> 5 5  Adelie Torgersen           36.7          19.3               193
#> 6 6  Adelie Torgersen           39.3          20.6               190
#>   body_mass_g    sex year
#> 1        3750   male 2007
#> 2        3800 female 2007
#> 3        3250 female 2007
#> 4          NA   <NA> 2007
#> 5        3450 female 2007
#> 6        3650   male 2007

5 Method 2: read_csv

read_csv is a function from the readr package, which is loaded with the tidyverse.

d2 <- read_csv(data_url)
head(d2)
#> # A tibble: 6 × 9
#>    ...1 species island    bill_length_mm bill_dept…¹ flipp…² body_…³ sex    year
#>   <dbl> <chr>   <chr>              <dbl>       <dbl>   <dbl>   <dbl> <chr> <dbl>
#> 1     1 Adelie  Torgersen           39.1        18.7     181    3750 male   2007
#> 2     2 Adelie  Torgersen           39.5        17.4     186    3800 fema…  2007
#> 3     3 Adelie  Torgersen           40.3        18       195    3250 fema…  2007
#> 4     4 Adelie  Torgersen           NA          NA        NA      NA <NA>   2007
#> 5     5 Adelie  Torgersen           36.7        19.3     193    3450 fema…  2007
#> 6     6 Adelie  Torgersen           39.3        20.6     190    3650 male   2007
#> # … with abbreviated variable names ¹​bill_depth_mm, ²​flipper_length_mm,
#> #   ³​body_mass_g

6 Method 3: data_read

data_read is a function from the datawizard package, which is loaded with easystats.

d3 <- data_read(data_url)
head(d3)
#>   V1 species    island bill_length_mm bill_depth_mm flipper_length_mm
#> 1  1  Adelie Torgersen           39.1          18.7               181
#> 2  2  Adelie Torgersen           39.5          17.4               186
#> 3  3  Adelie Torgersen           40.3          18.0               195
#> 4  4  Adelie Torgersen             NA            NA                NA
#> 5  5  Adelie Torgersen           36.7          19.3               193
#> 6  6  Adelie Torgersen           39.3          20.6               190
#>   body_mass_g    sex year
#> 1        3750   male 2007
#> 2        3800 female 2007
#> 3        3250 female 2007
#> 4          NA   <NA> 2007
#> 5        3450 female 2007
#> 6        3650   male 2007

7 First glimpse

glimpse(d1)
#> Rows: 344
#> Columns: 9
#> $ X                 <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
#> $ species           <chr> "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "A…
#> $ island            <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
#> $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
#> $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
#> $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
#> $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
#> $ sex               <chr> "male", "female", "female", NA, "female", "male", "f…
#> $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
glimpse(d2)
#> Rows: 344
#> Columns: 9
#> $ ...1              <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
#> $ species           <chr> "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "A…
#> $ island            <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
#> $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
#> $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
#> $ flipper_length_mm <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
#> $ body_mass_g       <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
#> $ sex               <chr> "male", "female", "female", NA, "female", "male", "f…
#> $ year              <dbl> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
glimpse(d3)
#> Rows: 344
#> Columns: 9
#> $ V1                <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
#> $ species           <chr> "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "A…
#> $ island            <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
#> $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
#> $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
#> $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
#> $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
#> $ sex               <chr> "male", "female", "female", NA, "female", "male", "f…
#> $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

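Nothing striking at first sight. The data frames have the same dimensions and the same values in the rows shown. The visible differences are the name of the first column (X, ...1, V1) and the column types: read_csv stores whole numbers as <dbl>, whereas read.csv and data_read store them as <int>.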
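
The column types printed by glimpse() can also be tabulated side by side, which makes them easier to scan. The following is a minimal sketch in base R. The data frames `toy_a` and `toy_b` are made-up stand-ins for two imports of the same file, and `col_types_table()` is a hypothetical helper, not a function from any package.

```r
# Made-up stand-ins for two imports of the same file
toy_a <- data.frame(id = 1:3,
                    mass = c(3750L, 3800L, 3250L),
                    sex = c("male", "female", "female"),
                    stringsAsFactors = FALSE)
toy_b <- data.frame(id = c(1, 2, 3),
                    mass = c(3750, 3800, 3250),
                    sex = c("male", "female", "female"),
                    stringsAsFactors = FALSE)

# Hypothetical helper: one row per column, one column per data frame.
# Assumes all data frames have the same column names in the same order.
col_types_table <- function(...) {
  dfs <- list(...)
  sapply(dfs, function(df) vapply(df, typeof, character(1)))
}

types <- col_types_table(a = toy_a, b = toy_b)
types

# Keep only the columns whose storage type differs
differing <- types[types[, "a"] != types[, "b"], , drop = FALSE]
differing
```

Under the same assumption about column names, the helper could be called as `col_types_table(d1 = d1, d2 = d2, d3 = d3)` once the first column has been given a common name or dropped.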

8 Hashes

A hash is like a fingerprint of a digital object: it is (quasi) unique. Let’s compute the hashes of the data sets. Note that we should exclude the first column, as its name is set differently by each function.

d1 <- d1 %>% select(-1)
d2 <- d2 %>% select(-1)
d3 <- d3 %>% select(-1)

To get the hash value of some objects, we can use the function digest().

d1_hash <- 
d1 %>% 
  digest()

d1_hash
#> [1] "1a3544902d7b1bc28121806bbe580883"
d2_hash <- 
d2 %>% 
  digest()

d2_hash
#> [1] "3e0caf37ed36f86d754459a75c4f98b3"
d3_hash <- 
d3 %>% 
  digest()

d3_hash
#> [1] "566b675fd32ac2705a875505f895469a"

9 Not exactly identical

As the hashes (fingerprints) differ, we can conclude that the objects are not exactly identical. However, the differences may stem from subtle variations such as attributes, classes, or column types of the data frames rather than from the values themselves.
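
To illustrate why a differing hash does not by itself imply differing values, here is a small sketch with made-up vectors. It relies only on digest() from the digest package. The object names and the attribute name "note" are invented for illustration.

```r
library(digest)

x_int <- 1:3          # whole numbers stored as integer
x_dbl <- c(1, 2, 3)   # the same numbers stored as double

# The values are equal ...
same_values <- all(x_int == x_dbl)
# ... but the objects are not identical, so neither are their hashes
same_object <- identical(x_int, x_dbl)
same_hash <- digest(x_int) == digest(x_dbl)

# An extra attribute changes the hash as well, with the values untouched
x_attr <- x_dbl
attr(x_attr, "note") <- "imported"
same_hash_attr <- digest(x_dbl) == digest(x_attr)

c(same_values = same_values, same_object = same_object,
  same_hash = same_hash, same_hash_attr = same_hash_attr)
```

Only `same_values` is TRUE. The hash reflects how an object is stored, including its type and attributes, not just the numbers it holds.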

Let’s focus on the data instead.

10 Data comparison

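By converting the data frames to matrices, we strip away the data frame overhead (classes and attributes) and keep only the values. Note that as.matrix() returns a character matrix here because the data frames contain character columns, so differences in column type (<int> vs. <dbl>) are dropped as well.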

d1_matrix <-
  d1 %>% 
  as.matrix()

d2_matrix <-
  d2 %>% 
  as.matrix()

d3_matrix <-
  d3 %>% 
  as.matrix()
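
A toy example may help to see what this conversion does. According to the documentation of as.matrix(), a data frame with a character column is turned into a character matrix, and the numeric columns are converted with format(). The sketch below uses made-up data frames (`toy_int`, `toy_dbl`) that differ only in the storage type of one column.

```r
# Two made-up data frames: same values, different storage type for `mass`
toy_int <- data.frame(sex = c("male", "female"),
                      mass = c(3750L, 3800L),
                      stringsAsFactors = FALSE)
toy_dbl <- data.frame(sex = c("male", "female"),
                      mass = c(3750, 3800),
                      stringsAsFactors = FALSE)

# As data frames they are not identical (integer vs. double)
identical(toy_int, toy_dbl)

# As matrices every cell is a character string
m_int <- as.matrix(toy_int)
m_dbl <- as.matrix(toy_dbl)
typeof(m_int)

# ... and the type difference is gone
identical(m_int, m_dbl)
```

The matrix comparison is therefore a comparison of the formatted values, which is what we want when asking whether the imported content agrees regardless of type.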

Let’s check the attributes of the matrices:

d1_matrix %>% attributes()
#> $dim
#> [1] 344   8
#> 
#> $dimnames
#> $dimnames[[1]]
#> NULL
#> 
#> $dimnames[[2]]
#> [1] "species"           "island"            "bill_length_mm"   
#> [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
#> [7] "sex"               "year"
d2_matrix %>% attributes()
#> $dim
#> [1] 344   8
#> 
#> $dimnames
#> $dimnames[[1]]
#> NULL
#> 
#> $dimnames[[2]]
#> [1] "species"           "island"            "bill_length_mm"   
#> [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
#> [7] "sex"               "year"
d3_matrix %>% attributes()
#> $dim
#> [1] 344   8
#> 
#> $dimnames
#> $dimnames[[1]]
#> NULL
#> 
#> $dimnames[[2]]
#> [1] "species"           "island"            "bill_length_mm"   
#> [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
#> [7] "sex"               "year"

Identical.

Now let’s check the hashes of the matrices.

d1_matrix_hash <- d1_matrix %>% digest()
d2_matrix_hash <- d2_matrix %>% digest()
d3_matrix_hash <- d3_matrix %>% digest()
d1_matrix_hash
#> [1] "08adb3f15d6ca8edbb2978795a2d7eba"
d2_matrix_hash
#> [1] "08adb3f15d6ca8edbb2978795a2d7eba"
d3_matrix_hash
#> [1] "08adb3f15d6ca8edbb2978795a2d7eba"

Identical.
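
The matrix comparison works on character strings. As a complementary check that keeps numeric columns numeric, one could harmonise integer and double columns and then compare the data frames directly. This is a minimal sketch with made-up toy data. The function `to_plain()` is a hypothetical helper, and the result for d1, d2, and d3 was not computed as part of this post.

```r
# Made-up stand-ins for two imports of the same file
toy_int <- data.frame(sex = c("male", "female"),
                      mass = c(3750L, 3800L),
                      stringsAsFactors = FALSE)
toy_dbl <- data.frame(sex = c("male", "female"),
                      mass = c(3750, 3800),
                      stringsAsFactors = FALSE)

# Hypothetical helper: convert integer columns to double and rebuild a
# plain data.frame, which drops tibble classes and extra attributes
to_plain <- function(df) {
  cols <- lapply(df, function(x) if (is.integer(x)) as.numeric(x) else x)
  as.data.frame(cols, stringsAsFactors = FALSE)
}

# Strict comparison: every value must match exactly
strict_equal <- identical(to_plain(toy_int), to_plain(toy_dbl))

# Tolerant comparison: allows tiny floating point differences
near_equal <- isTRUE(all.equal(to_plain(toy_int), to_plain(toy_dbl)))

c(strict_equal = strict_equal, near_equal = near_equal)
```

For the toy data both checks return TRUE. With real imports, the tolerant check is the safer choice if the functions might parse decimal numbers slightly differently.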

11 Conclusion

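We can conclude that the values are identical across the three import functions (leaving the first column aside). The differing hashes of the data frames stem from how the data is stored, such as column types, classes, and attributes, not from the values.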

Note that no random numbers were involved in this analysis.

12 Reproducibility

#> ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.1 (2022-06-23)
#>  os       macOS Big Sur ... 10.16
#>  system   x86_64, darwin17.0
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Europe/Berlin
#>  date     2023-01-19
#>  pandoc   2.19.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
#>  package       * version  date (UTC) lib source
#>  assertthat      0.2.1    2019-03-21 [1] CRAN (R 4.2.0)
#>  backports       1.4.1    2021-12-13 [1] CRAN (R 4.2.0)
#>  bayestestR    * 0.13.0   2022-09-18 [1] CRAN (R 4.2.0)
#>  bit             4.0.5    2022-11-15 [1] CRAN (R 4.2.0)
#>  bit64           4.0.5    2020-08-30 [1] CRAN (R 4.2.0)
#>  blogdown        1.16     2022-12-13 [1] CRAN (R 4.2.0)
#>  bookdown        0.31     2022-12-13 [1] CRAN (R 4.2.0)
#>  broom           1.0.2    2022-12-15 [1] CRAN (R 4.2.0)
#>  bslib           0.4.2    2022-12-16 [1] CRAN (R 4.2.0)
#>  cachem          1.0.6    2021-08-19 [1] CRAN (R 4.2.0)
#>  callr           3.7.3    2022-11-02 [1] CRAN (R 4.2.0)
#>  cellranger      1.1.0    2016-07-27 [1] CRAN (R 4.2.0)
#>  cli             3.6.0    2023-01-09 [1] CRAN (R 4.2.0)
#>  coda            0.19-4   2020-09-30 [1] CRAN (R 4.2.0)
#>  codetools       0.2-18   2020-11-04 [2] CRAN (R 4.2.1)
#>  colorout      * 1.2-2    2022-06-13 [1] local
#>  colorspace      2.0-3    2022-02-21 [1] CRAN (R 4.2.0)
#>  correlation   * 0.8.3    2022-10-09 [1] CRAN (R 4.2.0)
#>  crayon          1.5.2    2022-09-29 [1] CRAN (R 4.2.1)
#>  curl            4.3.3    2022-10-06 [1] CRAN (R 4.2.0)
#>  data.table      1.14.6   2022-11-16 [1] CRAN (R 4.2.0)
#>  datawizard    * 0.6.5    2022-12-14 [1] CRAN (R 4.2.0)
#>  DBI             1.1.3    2022-06-18 [1] CRAN (R 4.2.0)
#>  dbplyr          2.2.1    2022-06-27 [1] CRAN (R 4.2.0)
#>  devtools        2.4.5    2022-10-11 [1] CRAN (R 4.2.1)
#>  digest        * 0.6.31   2022-12-11 [1] CRAN (R 4.2.0)
#>  dplyr         * 1.0.10   2022-09-01 [1] CRAN (R 4.2.0)
#>  easystats     * 0.6.0    2022-11-29 [1] CRAN (R 4.2.1)
#>  effectsize    * 0.8.2    2022-10-31 [1] CRAN (R 4.2.0)
#>  ellipsis        0.3.2    2021-04-29 [1] CRAN (R 4.2.0)
#>  emmeans         1.8.3    2022-12-06 [1] CRAN (R 4.2.0)
#>  estimability    1.4.1    2022-08-05 [1] CRAN (R 4.2.0)
#>  evaluate        0.19     2022-12-13 [1] CRAN (R 4.2.0)
#>  fansi           1.0.3    2022-03-24 [1] CRAN (R 4.2.0)
#>  fastmap         1.1.0    2021-01-25 [1] CRAN (R 4.2.0)
#>  forcats       * 0.5.2    2022-08-19 [1] CRAN (R 4.2.0)
#>  fs              1.5.2    2021-12-08 [1] CRAN (R 4.2.0)
#>  gargle          1.2.1    2022-09-08 [1] CRAN (R 4.2.0)
#>  generics        0.1.3    2022-07-05 [1] CRAN (R 4.2.0)
#>  ggplot2       * 3.4.0    2022-11-04 [1] CRAN (R 4.2.0)
#>  glue            1.6.2    2022-02-24 [1] CRAN (R 4.2.0)
#>  googledrive     2.0.0    2021-07-08 [1] CRAN (R 4.2.0)
#>  googlesheets4   1.0.1    2022-08-13 [1] CRAN (R 4.2.0)
#>  gtable          0.3.1    2022-09-01 [1] CRAN (R 4.2.0)
#>  haven           2.5.1    2022-08-22 [1] CRAN (R 4.2.0)
#>  hms             1.1.2    2022-08-19 [1] CRAN (R 4.2.0)
#>  htmltools       0.5.4    2022-12-07 [1] CRAN (R 4.2.0)
#>  htmlwidgets     1.6.1    2023-01-07 [1] CRAN (R 4.2.0)
#>  httpuv          1.6.8    2023-01-12 [1] CRAN (R 4.2.0)
#>  httr            1.4.4    2022-08-17 [1] CRAN (R 4.2.0)
#>  insight       * 0.18.8   2022-11-24 [1] CRAN (R 4.2.0)
#>  jquerylib       0.1.4    2021-04-26 [1] CRAN (R 4.2.0)
#>  jsonlite        1.8.4    2022-12-06 [1] CRAN (R 4.2.0)
#>  knitr           1.41     2022-11-18 [1] CRAN (R 4.2.0)
#>  later           1.3.0    2021-08-18 [1] CRAN (R 4.2.0)
#>  lattice         0.20-45  2021-09-22 [2] CRAN (R 4.2.1)
#>  lifecycle       1.0.3    2022-10-07 [1] CRAN (R 4.2.0)
#>  lubridate       1.9.0    2022-11-06 [1] CRAN (R 4.2.0)
#>  magrittr        2.0.3    2022-03-30 [1] CRAN (R 4.2.0)
#>  MASS            7.3-58.1 2022-08-03 [1] CRAN (R 4.2.0)
#>  Matrix          1.5-3    2022-11-11 [1] CRAN (R 4.2.0)
#>  memoise         2.0.1    2021-11-26 [1] CRAN (R 4.2.0)
#>  mime            0.12     2021-09-28 [1] CRAN (R 4.2.0)
#>  miniUI          0.1.1.1  2018-05-18 [1] CRAN (R 4.2.0)
#>  modelbased    * 0.8.6    2023-01-13 [1] CRAN (R 4.2.1)
#>  modelr          0.1.10   2022-11-11 [1] CRAN (R 4.2.0)
#>  multcomp        1.4-20   2022-08-07 [1] CRAN (R 4.2.0)
#>  munsell         0.5.0    2018-06-12 [1] CRAN (R 4.2.0)
#>  mvtnorm         1.1-3    2021-10-08 [1] CRAN (R 4.2.0)
#>  parameters    * 0.20.1   2023-01-11 [1] CRAN (R 4.2.0)
#>  performance   * 0.10.2   2023-01-12 [1] CRAN (R 4.2.0)
#>  pillar          1.8.1    2022-08-19 [1] CRAN (R 4.2.0)
#>  pkgbuild        1.4.0    2022-11-27 [1] CRAN (R 4.2.0)
#>  pkgconfig       2.0.3    2019-09-22 [1] CRAN (R 4.2.0)
#>  pkgload         1.3.2    2022-11-16 [1] CRAN (R 4.2.0)
#>  prettyunits     1.1.1    2020-01-24 [1] CRAN (R 4.2.0)
#>  processx        3.8.0    2022-10-26 [1] CRAN (R 4.2.0)
#>  profvis         0.3.7    2020-11-02 [1] CRAN (R 4.2.0)
#>  promises        1.2.0.1  2021-02-11 [1] CRAN (R 4.2.0)
#>  ps              1.7.2    2022-10-26 [1] CRAN (R 4.2.0)
#>  purrr         * 1.0.1    2023-01-10 [1] CRAN (R 4.2.0)
#>  R6              2.5.1    2021-08-19 [1] CRAN (R 4.2.0)
#>  Rcpp            1.0.9    2022-07-08 [1] CRAN (R 4.2.0)
#>  readr         * 2.1.3    2022-10-01 [1] CRAN (R 4.2.0)
#>  readxl          1.4.1    2022-08-17 [1] CRAN (R 4.2.0)
#>  remotes         2.4.2    2021-11-30 [1] CRAN (R 4.2.0)
#>  report        * 0.5.5    2022-08-22 [1] CRAN (R 4.2.0)
#>  reprex          2.0.2    2022-08-17 [1] CRAN (R 4.2.0)
#>  rlang           1.0.6    2022-09-24 [1] CRAN (R 4.2.0)
#>  rmarkdown       2.19     2022-12-15 [1] CRAN (R 4.2.0)
#>  rstudioapi      0.14     2022-08-22 [1] CRAN (R 4.2.0)
#>  rvest           1.0.3    2022-08-19 [1] CRAN (R 4.2.0)
#>  sandwich        3.0-2    2022-06-15 [1] CRAN (R 4.2.0)
#>  sass            0.4.4    2022-11-24 [1] CRAN (R 4.2.0)
#>  scales          1.2.1    2022-08-20 [1] CRAN (R 4.2.0)
#>  see           * 0.7.4    2022-11-26 [1] CRAN (R 4.2.0)
#>  sessioninfo     1.2.2    2021-12-06 [1] CRAN (R 4.2.0)
#>  shiny           1.7.4    2022-12-15 [1] CRAN (R 4.2.0)
#>  stringi         1.7.12   2023-01-11 [1] CRAN (R 4.2.0)
#>  stringr       * 1.5.0    2022-12-02 [1] CRAN (R 4.2.0)
#>  survival        3.5-0    2023-01-09 [1] CRAN (R 4.2.0)
#>  TH.data         1.1-1    2022-04-26 [1] CRAN (R 4.2.0)
#>  tibble        * 3.1.8    2022-07-22 [1] CRAN (R 4.2.0)
#>  tidyr         * 1.2.1    2022-09-08 [1] CRAN (R 4.2.0)
#>  tidyselect      1.2.0    2022-10-10 [1] CRAN (R 4.2.0)
#>  tidyverse     * 1.3.2    2022-07-18 [1] CRAN (R 4.2.0)
#>  timechange      0.2.0    2023-01-11 [1] CRAN (R 4.2.0)
#>  tzdb            0.3.0    2022-03-28 [1] CRAN (R 4.2.0)
#>  urlchecker      1.0.1    2021-11-30 [1] CRAN (R 4.2.0)
#>  usethis         2.1.6    2022-05-25 [1] CRAN (R 4.2.0)
#>  utf8            1.2.2    2021-07-24 [1] CRAN (R 4.2.0)
#>  vctrs           0.5.1    2022-11-16 [1] CRAN (R 4.2.0)
#>  vroom           1.6.0    2022-09-30 [1] CRAN (R 4.2.0)
#>  withr           2.5.0    2022-03-03 [1] CRAN (R 4.2.0)
#>  xfun            0.36     2022-12-21 [1] CRAN (R 4.2.0)
#>  xml2            1.3.3    2021-11-30 [1] CRAN (R 4.2.0)
#>  xtable          1.8-4    2019-04-21 [1] CRAN (R 4.2.0)
#>  yaml            2.3.6    2022-10-18 [1] CRAN (R 4.2.0)
#>  zoo             1.8-11   2022-09-17 [1] CRAN (R 4.2.0)
#> 
#>  [1] /Users/sebastiansaueruser/Rlibs
#>  [2] /Library/Frameworks/R.framework/Versions/4.2/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────