Practical data cleansing in R

What is “data cleansing” about?

In practice, data analysis typically consists of several steps that can be grouped into “preparing data” and “modeling data” (leaving communication aside here):

(Inspired by this)

Often, the first major part, “prepare”, is the most time-consuming. This can be lamented, since many analysts prefer the cool modeling aspects (after all, I want to show my math!). In practice, one rather has to get one's hands dirt…

In this post, I want to put together a checklist of frequent steps in data preparation. More precisely, I would like to detail some typical steps in “cleansing” your data. Such steps include:

  • identify missing values
  • identify outliers
  • check for overall plausibility and errors (e.g., typos)
  • identify highly correlated variables
  • identify variables with (nearly) no variance
  • identify variables with strange names or values
  • check variable classes (e.g., characters vs. factors)
  • remove/transform some variables (maybe your model does not like categorical variables)
  • rename some variables or values (especially useful when there are many of them)
  • check some overall patterns (statistical/numerical summaries)
  • center/scale variables
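
To make the checklist a bit more concrete, here is a minimal base-R sketch of a few of these steps. It is only an illustration under stated assumptions: the data frame `df`, its column names, and the 0.9 correlation cutoff are all made up for this example and do not come from the full post.

```r
# Toy data; all column names and values are hypothetical.
df <- data.frame(
  height   = c(170, 165, NA, 180, 175, 160),
  height2  = c(171, 166, 172, 181, 176, 161),  # nearly a copy of height
  constant = rep(1, 6),                        # no variance at all
  income   = c(30, 32, 31, 29, 500, NA),       # 500 looks suspicious
  group    = c("a", "b", "a", "b", "a", "b"),
  stringsAsFactors = FALSE
)

# Identify missing values: count of NAs per column
n_missing <- colSums(is.na(df))

# Check variable classes (character vs. factor vs. numeric)
classes <- sapply(df, class)

# Identify numeric variables with no variance
num <- df[sapply(df, is.numeric)]
zero_var <- names(num)[sapply(num, function(x) var(x, na.rm = TRUE) == 0)]

# Identify outlier candidates with the boxplot rule (1.5 * IQR beyond the hinges)
outliers <- boxplot.stats(df$income)$out

# Identify highly correlated pairs (the 0.9 cutoff is an arbitrary assumption)
keep <- setdiff(names(num), zero_var)
cors <- cor(num[keep], use = "pairwise.complete.obs")
high <- which(abs(cors) > 0.9 & upper.tri(cors), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cors)[high[, "row"]],
  var2 = colnames(cors)[high[, "col"]]
)

# Center and scale the remaining numeric variables (NAs are kept as NA)
scaled <- scale(num[keep])

# Overall numerical summary
summary(df)
```

Each result (`n_missing`, `zero_var`, `outliers`, `high_pairs`) is only a list of candidates to look at. Whether a flagged value or variable is really a problem remains a judgment call.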

You can read the full post including source code here (Github). Here is an output file (html).

Example: Analyse missing values