Monday, April 21, 2014

Organizing data-cleaning scripts


About two years ago it finally dawned on me that having a single gigantic R file for a project wasn't all that practical. Since then, I've been trying out a few systems for breaking a larger project into smaller scripts. Today, I came across this introduction to data cleaning in R, which nicely divides the cleaning process into several stages (shown in the figure above). The authors suggest, at a minimum, saving the data at each of these stages, which seems totally reasonable.

Roughly, the stages of the data cleaning process can be broken down into:

1) Raw data: this is the format of the original data source -- it's possible that some sort of conversion is necessary before the data can even be read into R, or that once the data are loaded, the variable types or column names have problems.

2) Technically correct data: this is the result of the most basic cleaning process -- at the very least, your data should be the "shape" you expect (the right number of rows and columns if you're expecting rectangular data), numbers should be stored as numbers rather than strings, etc. Technically correct data, despite the proper formatting, may still have erroneous values -- these may range from 'outlawed' values (like negative durations) to suspicious values (e.g. an individual's height entered as 9').

3) Consistent data: the erroneous values have been dealt with, and the data are ready for analysis.
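To make the three stages a bit more concrete, here is a minimal sketch of how they might map onto a script. This is my own illustration, not the authors' code: the toy data, the column names, the 8-foot height cutoff, and the file names are all made-up assumptions, and setting bad values to NA is just one of several ways to handle them.

```r
# Hypothetical raw data: everything arrives as text, with untidy column names.
raw <- data.frame(
  "Subject ID"   = c("1", "2", "3", "4"),
  "Duration (s)" = c("12.5", "-3", "8.1", "n/a"),
  "Height (ft)"  = c("5.8", "6.1", "9", "5.4"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Stage 1 -> 2: technically correct data (sensible names, proper types).
to_technically_correct <- function(df) {
  names(df) <- c("id", "duration_s", "height_ft")
  df$id <- as.integer(df$id)
  # Strings that can't be parsed (e.g. "n/a") become NA; the coercion
  # warning is expected, so it is suppressed.
  df$duration_s <- suppressWarnings(as.numeric(df$duration_s))
  df$height_ft  <- suppressWarnings(as.numeric(df$height_ft))
  df
}

# Stage 2 -> 3: consistent data (outlawed or suspicious values set to NA).
to_consistent <- function(df) {
  # Outlawed: a duration can't be negative.
  df$duration_s[!is.na(df$duration_s) & df$duration_s < 0] <- NA
  # Suspicious: heights above 8 feet (an assumed cutoff for this example).
  df$height_ft[!is.na(df$height_ft) & df$height_ft > 8] <- NA
  df
}

technically_correct <- to_technically_correct(raw)
consistent <- to_consistent(technically_correct)

# Save a snapshot of each stage (to a temporary directory in this sketch).
out_dir <- tempdir()
saveRDS(raw, file.path(out_dir, "01_raw.rds"))
saveRDS(technically_correct, file.path(out_dir, "02_technically_correct.rds"))
saveRDS(consistent, file.path(out_dir, "03_consistent.rds"))
```

In a real project each stage would probably live in its own script that reads the previous stage's file with readRDS(), so any step can be rerun without starting over from the raw data.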

Clearly this division won't be sufficient for all file-organization needs, but it seems like a nice thing to keep in the back of your mind.

