Previous posts on this blog have emphasized that reproducibility in economics is hindered by suboptimal tools, and DIME Analytics has proposed various toolkits to address key constraints to reproducible applied economics research (for instance, on reproducible tables and on impact evaluation data management and analysis). Today's post introduces a new DIME Analytics Stata package that increases the efficiency and documentation of data work from the very early stages: iefieldkit.
Why? Data as a research product
The key emphasis here is on documenting the process of data generation, not just to meet journal submission requirements but, more importantly, to allow others to use the data. Recent efforts from teams like BITSS and the AEA have begun emphasizing the importance of elements like data publication and, crucially, citation. Citing data requires that these data be documented: groups such as OpenICPSR, the World Bank Microdata Catalog team, and the Harvard Dataverses have structured the archival process for research data. These developments bring heightened attention to the handling, cleaning, and quality of data, beginning from its rawest form.
How? A new Stata package
As recently emphasized in the AEA reproducibility guidelines, the ideal situation for researchers producing primary data is to be able to document the full chain of events involved in generating these new data. However, as anyone who has produced data knows, this is a messy process, and it is hard to know where to start. This is exactly what the iefieldkit package aims to provide: an efficient way to document that process from the very early stages.
So, how? The keywords from the companion Stata Journal article (ungated version here) provide a great table of contents: primary data collection, ODK, SurveyCTO, data cleaning, survey harmonization, duplicates, codebooks … The package provides a workflow for collecting and processing data while simultaneously documenting the process. This set of commands is intended to serve as a standard tool enabling what we call a self-documenting workflow, in which the act of writing documentation also provides the key instructions to the Stata code. (NB: the current GitHub and SSC releases have advanced a bit since journal acceptance, so be sure to read the updated help files and DIME Wiki pages!) At the time of this launch, the iefieldkit package includes the following commands, with a minimal usage sketch after the list:
- ietestform, to validate the Stata-compatibility of SurveyCTO forms;
- ieduplicates and iecompdup, to continuously resolve duplicate observations in incoming data;
- iecodebook (and its subcommands) to rapidly clean, label, append, and document datasets.
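To make this concrete, here is a minimal sketch of how the commands might fit together in a data-collection do-file. The file names, the household ID variable hhid, and the uniquevars value are hypothetical, and the syntax can vary slightly across releases, so check the help files for the version you install:

```stata
* Hypothetical workflow sketch; file and variable names are illustrative

* 1) Before fielding: check the SurveyCTO form design for Stata compatibility
ietestform using "questionnaire.xlsx", report("form_report.csv")

* 2) During fielding: flag duplicated IDs in incoming data; decisions recorded
*    in the spreadsheet report are applied on every rerun
ieduplicates hhid using "duplicates_report.xlsx", uniquevars(key)

* 3) After fielding: generate a codebook template, fill it in by hand,
*    then apply it to rename, recode, label, and document the dataset
iecodebook template using "cleaning_codebook.xlsx"
iecodebook apply    using "cleaning_codebook.xlsx"
```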
These core iefieldkit commands share a common theme: they aim to dramatically reduce the amount of Stata code that has to be read or written for their associated tasks. This matters for four reasons:
- Data cleaning code is highly standardized, boring to write, and hard to read
- Many team members who have key contributions to make at this stage are not Stata coders
- Documenting data collection tasks is tedious and time-intensive, and therefore often skipped
- Standardizing inputs across tasks reduces mistakes
Using these commands will hopefully limit the production of hard-to-read do-files implementing years' worth of data cleaning, error correction, and data construction. In such do-files, mistakes often go unnoticed, and it can be very difficult to trace back how and when decisions were made.
The works
Instead of requiring the user to write repetitive code to deal individually with each survey question, variable, or observation, each of these commands interacts dynamically with one or more spreadsheet files. Those files provide the input instructions to the code, and they then become the documentation of how each element was handled. These instruction spreadsheets are carefully designed to be browsed with the human eye and written in natural language. This process has three core features: (1) large numbers of repetitive tasks can be written very quickly; (2) many additional quality checks are built into the commands and run for each action; and, perhaps most importantly, (3) the spreadsheet is an output in itself, since it becomes a piece of documentation that is readable with minimal technical training. This means that any team member can quickly read and understand what tasks are being undertaken and give feedback on the process, for example during survey design or duplicates resolution, without ever having to open Stata. Voilà. The figures below provide a before/after iefieldkit comparison, while the video above provides further guidance.
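As a rough textual analogue of that before/after comparison (the variable names and labels here are invented for illustration):

```stata
* Before iefieldkit: hand-written cleaning code, repeated for every variable
rename q101 age
label variable age "Respondent age in years"
rename q102 hh_size
label variable hh_size "Number of household members"
recode q103 (2 = 0 "Female") (1 = 1 "Male"), generate(male)
* ... and so on for every variable in the raw dataset

* After iefieldkit: the same renames, labels, and recodes are written once,
* in plain language, in the codebook spreadsheet, and applied in one line
iecodebook apply using "cleaning_codebook.xlsx"
```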
Wait, you said spreadsheets are now okay??
To be very clear: we always strongly recommend against processing and analyzing data in spreadsheets, as that work is not reproducible. However, there is a substantial advantage to using spreadsheets for structuring large amounts of standardized inputs and ensuring they are applied correctly. The motivation behind this effort to increase accessibility is our belief that the highest level of data quality will never be achieved unless the data work process is accessible to the full project team, particularly when the data is self-collected. Data work should not be treated as a one-person technical task, but rather as a social process that involves a wide variety of team members with different skillsets. The main beneficiary is data quality, and therefore the quality of the research. These commands should help teams take a big step toward ensuring that data quality and documentation are complete, starting before a survey is even fielded or the first piece of analytical data construction is coded.
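As one illustration of that social process, consider how a team might divide the duplicates workflow. This sketch again uses a hypothetical ID variable and file name:

```stata
* The field coordinator reruns this as new submissions arrive; unresolved
* duplicates are exported to the spreadsheet report for the team to review
ieduplicates hhid using "duplicates_report.xlsx", uniquevars(key)

* A research assistant compares the submissions behind one flagged ID
iecompdup hhid, id(1047)

* Other team members record each resolution directly in the spreadsheet
* report, no Stata required; the next ieduplicates run applies the decisions
```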
The iefieldkit package is under continuous development on GitHub and is available for download through SSC. DIME Analytics also provides extensive descriptions of the software on the DIME Wiki, as well as video guidance for using the commands.