# Data Corruption and Mucked-up Stratification: Problems and (Potential) Solutions in Primary Research

So you've spent a couple of years on a project: you came up with an interesting and possibly novel research question, designed and tested appropriate surveys, engaged local partners for the intervention, and implemented the study. Many challenges and implementation problems arise along the way, but most can be dealt with on the spot, often through hours of conference calls. The really painful stuff happens when you find out ex post that a mistake was made which can no longer be corrected. Here are two such examples of fieldwork errors and what we did to correct for them as best we could.

**(a) Lost Data Due to Software Corruption**

Data collection in the field has become much easier and more efficient with the use of PDAs to conduct surveys. Collecting responses electronically removes the separate data-entry step and the errors that come with transcribing hand-written surveys into electronic form. It also saves a non-trivial chunk of time, since the data are available pretty much as soon as they are entered and organized by the field team.

However, playing with software and PDAs comes with some risk. Even with adequate data storage and back-up protocols, software errors can simply wipe out chunks of data. Unfortunately, we faced this exact problem in a recent study of households in South Africa. Essentially, the survey firm lost a significant batch of endline data that had been collected on PDAs. No backup is available and the data are gone: corrupted and irrecoverable. The problem is that, since data were collected and stored in batches and randomization (and data collection) was at the geographic group level, the lost batch consists entirely of individuals in the treatment group. Hence, a standard attrition analysis will mechanically show higher attrition in the treatment group than in the control group.

But this is not really attrition; it is data corruption.
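To make the mechanical bias concrete, here is a minimal simulation sketch in Python. It is an illustration only, not code from our study: the number of pairs, the group size, the 10% background attrition rate, and the size of the lost batch are all made-up assumptions, and `presence_gaps` is a hypothetical helper name.

```python
import random


def presence_gaps(n_pairs=10, group_size=50, base_attrition=0.10,
                  lost_pair=0, n_lost=40, seed=0):
    """Simulate endline presence in a matched-pair design (all numbers hypothetical).

    Returns the treatment-minus-control difference in the share of baseline
    respondents with endline data, (a) for the full sample and (b) excluding
    the pair whose treatment-group batch was corrupted.
    """
    rng = random.Random(seed)
    rows = []  # (pair, treated, present_at_endline)
    for pair in range(n_pairs):
        for treated in (0, 1):
            for i in range(group_size):
                # Ordinary attrition, identical in both arms by construction.
                present = rng.random() >= base_attrition
                # Data corruption: a batch of treatment records in one pair is lost.
                if pair == lost_pair and treated == 1 and i < n_lost:
                    present = False
                rows.append((pair, treated, present))

    def gap(data):
        def share_present(arm):
            flags = [present for _, treated, present in data if treated == arm]
            return sum(flags) / len(flags)
        return share_present(1) - share_present(0)

    full = gap(rows)
    without_pair = gap([r for r in rows if r[0] != lost_pair])
    return full, without_pair


if __name__ == "__main__":
    full, without_pair = presence_gaps()
    print(f"gap, full sample:        {full:+.3f}")
    print(f"gap, excluding the pair: {without_pair:+.3f}")
```

The difference in means computed here is the same number as the coefficient on the treatment dummy in a regression of "present at endline" on treatment with no other covariates. In the full sample it is pushed negative by the lost batch; once the affected pair is excluded, only sampling noise remains.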

So how did we deal with this? We came up with three options:

(1) Ignore attrition in the full sample and use all available endline data in the analysis. At the same time, be upfront about the data corruption and explain in detail in the paper the reasons for the data loss. Also, since randomization was at the group level with matched pairs, report the attrition analysis for the rest of the sample, excluding this “problem” pair.

(2) Figure out exactly how many data points or people we "lost" and then randomly delete observations from the paired control group to balance out the sample. Conduct analysis on this revised sample.

(3) Take the extreme form of (2) and simply drop everyone in the affected pair, in both treatment and control, from all endline analysis.

In the end, after discussing with colleagues, we plan to go with option (1). Option (3) seems to impose an excessive penalty on statistical power, though we plan to run robustness checks that do exactly this (as well as (2)). Because we know the reason for the data loss (a software failure, not respondents dropping out), we treat this as attrition at random and opted for (1). Our plan is to be very open about it in the paper and explain the reasons for the data loss. We will also produce and discuss a table showing that the baseline sample is balanced among respondents for whom we do have endline data, to address concerns that data collection waves (and hence storage chunks) were correlated with respondent characteristics. Perfect? Perhaps not, but being intellectually honest and open can never be a misstep.
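The three options differ only in which endline records enter the analysis, which can be sketched as follows. This is a hedged illustration in Python rather than our actual code: the record layout (`id`, `pair`, `treated`) and the function name `analysis_samples` are assumptions made for the example.

```python
import random


def analysis_samples(endline, lost_pair, n_lost, seed=0):
    """Build the endline samples for options (1), (2) and (3).

    endline   -- list of dicts with hypothetical keys 'id', 'pair', 'treated',
                 one per respondent whose endline data survived
    lost_pair -- identifier of the matched pair that lost treatment data
    n_lost    -- number of treatment observations lost to corruption
    """
    rng = random.Random(seed)  # fixed seed so the random deletion is reproducible

    # Option (1): use everything that survived.
    option1 = list(endline)

    # Option (2): randomly delete as many observations from the paired
    # control group as were lost from the treatment group.
    controls = [r for r in endline
                if r["pair"] == lost_pair and r["treated"] == 0]
    to_drop = rng.sample(controls, min(n_lost, len(controls)))
    drop_ids = {r["id"] for r in to_drop}
    option2 = [r for r in endline if r["id"] not in drop_ids]

    # Option (3): drop the whole pair, treatment and control alike.
    option3 = [r for r in endline if r["pair"] != lost_pair]

    return option1, option2, option3
```

Running the main specification on `option1` and repeating it on `option2` and `option3` gives the robustness checks described above. Fixing the seed matters for option (2), since otherwise each run would delete a different set of control observations.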

**(b) Mucked-up Stratification**

Another example of an ex-post unsolvable problem comes from a different project in South Africa. Here, the research design called for initial stratification prior to randomization into treatment and control groups at the individual level.

We first did a listing exercise of 3,000 people. The idea was to randomly pick an analysis sample of 1,000, stratify it by x, y, z, and then randomize within strata. Unfortunately, that did not happen, because the Stata code had a small error in it that was not caught in time.

This is what the code did: the full listing of 3,000 was first stratified on x, y, z correctly. Within strata, treatment was assigned correctly. Next, all 3,000 observations (NOT within strata) were ranked by a randomly generated number. The first 1,000 were then picked to be in our study sample.

This is not stratified randomization; it is (roughly) just unconditional randomization. That is, it is equivalent to randomly picking 1,000 people from the 3,000 and then randomly assigning them to treatment and control, so nothing forces the two arms to be the same size within each stratum of the final sample.
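The difference between the two procedures is easy to see in a small sketch. The Python below is an illustration under stated assumptions, not a translation of our Stata code: a single hypothetical `stratum` label stands in for the x, y, z cells, treatment within a stratum is assigned by shuffling and alternating, and the function names are invented for the example.

```python
import random
from collections import defaultdict


def assign_within_strata(units, rng):
    """Shuffle each stratum and alternate arms, so arm sizes differ by at most one."""
    by_stratum = defaultdict(list)
    for u in units:
        by_stratum[u["stratum"]].append(u)
    for members in by_stratum.values():
        rng.shuffle(members)
        for i, u in enumerate(members):
            u["treated"] = i % 2


def intended(listing, n, rng):
    """Intended design: draw the study sample first, then randomize within strata."""
    sample = [dict(u) for u in rng.sample(listing, n)]
    assign_within_strata(sample, rng)
    return sample


def as_run(listing, n, rng):
    """What the code did: assign within strata on the whole listing, then keep
    the first n units after ordering everyone at random, ignoring strata."""
    everyone = [dict(u) for u in listing]
    assign_within_strata(everyone, rng)
    rng.shuffle(everyone)  # stands in for ranking on a random number
    return everyone[:n]


def max_arm_gap(sample):
    """Largest |control - treatment| count difference across strata."""
    counts = defaultdict(lambda: [0, 0])
    for u in sample:
        counts[u["stratum"]][u["treated"]] += 1
    return max(abs(c - t) for c, t in counts.values())


if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical listing of 3,000 people spread over 8 strata.
    listing = [{"id": i, "stratum": i % 8} for i in range(3000)]
    print("intended:", max_arm_gap(intended(listing, 1000, rng)))
    print("as run:  ", max_arm_gap(as_run(listing, 1000, rng)))
```

Under the intended design the gap between arms within any stratum is at most one by construction. Under the procedure as run, the within-stratum split of the final 1,000 is left to chance, so the gap is typically larger, although a lucky draw can still look balanced.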

Luckily, we are balanced on all the variables we wanted to stratify on and importantly we are also more-or-less balanced on the number of people in treatment and control (which was not guaranteed by the method that was employed).

So to summarize, our sample is effectively not stratified, although it happens to be balanced on the variables we wanted to stratify on (a benefit of randomization anyway). We can therefore still conduct the planned heterogeneity analysis on baseline margins with balanced subgroups. All is good, but the question is: how do we report this in the paper? Or do we at all?

One clear option is to simply ignore stratification and not mention it in the write-up. Sure, but in a world where study designs are laid out in advance, one would have to explain deviations from them. And advance reporting of research designs is where our field is likely heading. In this particular case, everything worked out and our sample is balanced, but we will still mention the original experimental design in the paper and the reason for deviating from it.

The full paper can be found here.

## Join the Conversation

Hi Bilal,

Can you elaborate a bit on what the advantages of option 2 in the first scenario are? I was unclear on why balance would be important in that situation. Is this because your original sample was balanced and you wanted to avoid the bias that comes with the typical unbalanced cluster RCT estimator?

Also, on a related note, if you happen to be looking for subjects for future blog posts, I would be very interested in one on strategies for dealing with missing data in general. This is an issue that has come up in an RCT I am working on and we are struggling with how to deal with it. The strategies that seem to be used in clinical trials don't seem to be used too often in RCTs of social programs and it is not really clear to us why.

Thanks for the comment, Doug. For the first scenario, luckily our baseline outcomes are balanced between treatment and control if we exclude the lost treatment individuals from this analysis. Hence, balance on baseline outcomes is not the issue. Rather, the only reason to consider either option 2 or 3 is to deal with mechanical bias in the attrition analysis. Specifically, if we regress "baseline individual present at endline" on a treatment dummy, ceteris paribus, we will see a negative coefficient because the lost endline data consist entirely of individuals in the treatment group. I think a bigger problem would have been if we had lost balance on baseline outcomes as well, and perhaps a future blog post can elaborate on strategies to deal with that.