An important, and stressful, part of the job when conducting studies in the field is managing the many things that do not go according to plan. Markus, in his series of field notes, has written about these roller coaster rides we call impact evaluations (see, for example, here and here). I have a post waiting to discuss the burnout that people involved in these studies are increasingly feeling these days, but today I want to talk about what went wrong in one of my projects recently and how we’re planning to deal with it.
One of my impact evaluation (IE) studies is a multi-site affair, in which a sample of 12 children in 50 schools in each of four districts has been drawn after listing all the schools in these districts. The intervention has four treatment arms, including the control group, so we are practically conducting the experiment separately in each of the four districts. The sample is small enough (relative to what is needed to detect meaningful impacts on the outcomes we’re interested in) that we need all the power we can get. Blocking the randomization into small groups of four schools, using a couple of baseline characteristics that are predictive of these outcomes, is one way of reducing the variance and increasing the statistical power. Using controls in the analysis ex-post is another.
So, after baseline data collection in each district, we have been quickly cleaning the data for two baseline characteristics, dividing the schools into blocks of four according to these data, and then providing the blocked lists to our counterparts in the government to conduct the public lottery with the schools. The names of the schools assigned to each block are written on an envelope, and the envelope is filled with four dots of different colors. The headmaster of each school comes and picks a dot from the school’s envelope and holds it up for everyone to see, and the color is then recorded. The school representatives with the same color dot, i.e. those assigned to the same treatment, are invited to a separate room, where their treatment status is explained to them.
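For readers who want to see the mechanics, the blocking-and-lottery procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the study: it blocks on a single baseline covariate (the study uses two), the arm labels and the `make_blocks` and `assign_within_blocks` helpers are hypothetical names, and the shuffle stands in for headmasters drawing colored dots.

```python
import random

# Hypothetical arm labels; the post only says there are four arms including control.
ARMS = ["control", "T1", "T2", "T3"]

def make_blocks(schools, key, block_size=4):
    """Sort schools by a baseline covariate and cut the sorted list into
    consecutive blocks ("envelopes"), so schools in a block look alike."""
    ordered = sorted(schools, key=key)
    return [ordered[i:i + block_size] for i in range(0, len(ordered), block_size)]

def assign_within_blocks(blocks, arms=ARMS, rng=None):
    """Within each block, hand out the arms in random order (one per school).
    A final block with fewer schools than arms simply uses a random subset."""
    if rng is None:
        rng = random.Random()
    assignment = {}
    for block in blocks:
        dots = list(arms)
        rng.shuffle(dots)  # stands in for drawing colored dots from the envelope
        for school, arm in zip(block, dots):
            assignment[school["id"]] = arm
    return assignment
```

With a list of school records such as `{"id": 17, "score": 0.42}` (made-up fields), `assign_within_blocks(make_blocks(schools, key=lambda s: s["score"]))` returns a mapping from school id to arm. Because every complete block contains each arm exactly once, arm means of the blocking variable cannot drift far apart.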
In the first three districts, the script was followed to the dot (no pun intended). Our counterparts were happy; the invited school representatives liked the transparency of the process. For us, the process was also great: this is my first time doing “block randomization” in one of my field projects and I was enjoying seeing mean baseline characteristics across treatment groups that were almost identical to the second decimal place. Our randomization balance table for the first impact paper was already practically finished.
Well, then came the fourth district. Due to a miscommunication between our field staff and our counterparts (really nobody’s fault; as they say, “stuff happens”), the blocking information was not relayed to those conducting the lottery, who thought that the list they had been given was already blocked. So, they prepared the envelopes as usual, going down the list in fours, and conducted the lottery as usual. When we heard that the lottery had already happened, we knew something had gone wrong (as our research assistants had not yet finished preparing the blocking data) and were praying that we had not gotten unlucky (the point of blocking is that you are taking chance out of the equation with respect to balance between treatment groups). Alas, it was not to be: the value of one of the baseline variables was almost 50% lower in one treatment group than in the other three. We had simply gotten unlucky (the chance of this outcome was less than 10%, by my calculations)…
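As an aside, one way a probability like this could be approximated is by re-running the lottery many times on the list as it was actually used. The sketch below is an assumption-laden illustration and not the calculation referred to above: `simulate_imbalance` is a hypothetical helper, `values` would be the baseline variable in list order, the variable is assumed to be positive, and "imbalance" is defined here as some arm's mean falling at least `gap` (as a proportion) below the mean of the other arms.

```python
import random
from statistics import mean

def simulate_imbalance(values, n_arms=4, gap=0.5, draws=2000, seed=0):
    """Share of simulated lotteries in which at least one arm's mean is
    `gap` (e.g. 0.5 = 50%) or more below the mean of all other arms.
    Blocks are formed by going down the list in groups of `n_arms`."""
    rng = random.Random(seed)
    blocks = [values[i:i + n_arms] for i in range(0, len(values), n_arms)]
    hits = 0
    for _ in range(draws):
        arms = [[] for _ in range(n_arms)]
        for block in blocks:
            order = list(range(n_arms))
            rng.shuffle(order)  # random arm for each school in the envelope
            for value, arm in zip(block, order):
                arms[arm].append(value)
        for a in range(n_arms):
            others = [v for b in range(n_arms) if b != a for v in arms[b]]
            if arms[a] and others and mean(arms[a]) <= (1 - gap) * mean(others):
                hits += 1
                break  # count each simulated lottery at most once
    return hits / draws
```

Calling `simulate_imbalance(values)` on the unblocked list and again on the same values sorted (i.e. blocked) gives a rough sense of how much chance the blocking would have removed.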
Having spent a handsome sum on baseline data collection and having barely powered the study to detect meaningful effects, I struggled to keep my cool in front of my laptop and immediately went into crisis management mode: what could be done? The worst-case scenario was that we had lost approximately 50 schools, or a quarter of our IE study sample.
But, this was when the diligence of our colleagues who conducted the lottery came in handy. Not only had they followed the script to a “t” by randomly assigning the schools in this final study district into the four treatment arms, but they had also done it by first assigning the schools into blocks of four. If I could find the envelopes/blocks that had caused the imbalance, I could exclude them and check balance again.
One way of thinking about blocking is that you are minimizing the variance within each envelope/bin/block. Looking at the data that our field team had put together (but had not been able to communicate to the folks conducting the public lottery), I calculated the average standard deviation of the imbalanced variable to be less than 0.3 across the 12 envelopes. In the envelopes that were used for the real lottery, the same figure was more than 0.5, with one envelope having a standard deviation above 1, and a few others around 0.7. In these high-variance envelopes, all the low-value schools had drawn the yellow dot by chance. After some examination, I concluded that if I excluded four envelopes from the final district, the baseline characteristics were balanced within this district and in the study sample as a whole. This way, we’d lose 16 schools (4 × 4), or four schools per treatment arm, rather than all 47 in the district. I breathed a sigh of relief.
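The within-envelope comparison above is easy to reproduce. The following is a minimal sketch, assuming the baseline variable is available as a plain list in the order the envelopes were filled; the helper names and the threshold are hypothetical, and the sample standard deviation is used because the post does not say which variant was computed. The post also does not state a mechanical rule for choosing the four excluded envelopes, so the flagging step is only a starting point for the kind of examination described.

```python
from statistics import mean, stdev

def chunk(values, size=4):
    """Split a list into consecutive envelopes of `size` schools."""
    return [values[i:i + size] for i in range(0, len(values), size)]

def within_block_sds(blocks):
    """Sample standard deviation of the variable inside each envelope.
    An envelope with a single school is given 0.0, since stdev needs two points."""
    return [stdev(block) if len(block) > 1 else 0.0 for block in blocks]

def average_within_block_sd(blocks):
    """The summary figure compared in the text: mean of the per-envelope SDs."""
    return mean(within_block_sds(blocks))

def flag_high_variance(blocks, threshold):
    """Indices of envelopes whose SD exceeds a (hypothetical) threshold."""
    return [i for i, sd in enumerate(within_block_sds(blocks)) if sd > threshold]
```

Comparing `average_within_block_sd(chunk(sorted(values)))` with `average_within_block_sd(chunk(values))` contrasts the intended blocking with the list-order envelopes that were actually used, and `flag_high_variance` points to the envelopes worth inspecting by hand.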
Having talked to a couple of colleagues and thought about it a fair bit more, my sense is that this is kosher going forward. From an internal validity standpoint, all the remaining 31 schools were divided into envelopes and randomly assigned to one of the treatment groups or to control. I can certainly treat the estimates from those schools as causal. External validity is more problematic in this district, as I will be excluding 16 schools that happened to have high variance within their blocks. Our research assistants are trying to figure out whether these 16 schools share any common characteristics (for example, the order in which they were listed could have been the order in which the data were collected) so that we can make a judgment on whether or not we’re excluding a special group of schools from our eligible population. But this is a second-order problem, considering that these schools form less than 10% of the total study sample. Randomization inference to establish confidence intervals is probably also out for this district, but we can live with that. Finally, we can always include these schools in our analysis (with appropriate baseline controls) to see if their inclusion makes a difference to the findings.
Any thoughts our readers may have on this would be greatly appreciated…