The New York Times recently had a piece on the retraction and re-issuance of a Spanish study based on a randomized trial of the Mediterranean Diet’s effect on heart disease. The original study was meant to randomly assign 7,447 people aged 55 to 80, as individuals, to one of three diets: a control diet (advice simply to reduce fat intake) or one of two variants of the Mediterranean Diet (in which participants were given free olive oil or free nuts). The study was originally published in the New England Journal of Medicine (NEJM) in 2013. The authors then appear to have been surprised to find their study on a list of suspicious trials. There are several parts to this story I thought would be of interest for doing impact evaluations in development, which I discuss below.
Why did the study appear on a list of suspicious trials?
The paper that came up with this list appears to have looked at the standard Table 1 comparing treatment and control means, and used the p-values from tests of equality of means to judge whether it was very unlikely that the trial was properly randomized – essentially flagging trials in which means were too similar (too many p-values close to 1) or too dissimilar (too many p-values close to 0). It identified 11 studies from the NEJM meeting these criteria. The editors report that in five of the cases the tables had mislabeled standard errors as standard deviations, in another five the test procedure used to flag suspicious trials didn’t account for the pattern of correlation amongst baseline variables, and then there was this 11th case.
This is interesting because we have discussed in the past whether it is useful to formally test for a difference in means for a study which you know to have been randomized. But this shows a case where it was useful in identifying that what happened in practice was not what the authors reported had happened.
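To see why this kind of flagging works – and why the correlated-baseline-variables caveat matters – here is a small simulation sketch (entirely my own construction, not the procedure from the paper): with independent baseline covariates and genuine randomization, the Table 1 balance-test p-values should look roughly uniform on [0, 1], so a clump of values near 0 or 1 is a red flag.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, k = 1000, 20  # participants and baseline covariates (made-up sizes)

# A properly randomized trial: covariates are independent of assignment.
# (Independence across covariates is also assumed here -- exactly the
# assumption the editors note was violated in five of the flagged trials.)
x = rng.normal(size=(n, k))
treat = rng.integers(0, 2, size=n).astype(bool)

# Balance-test p-values, one per covariate, as in a standard Table 1.
pvals = np.array([stats.ttest_ind(x[treat, j], x[~treat, j]).pvalue
                  for j in range(k)])

# Under true randomization these p-values are ~Uniform(0,1); a formal
# check is a Kolmogorov-Smirnov test of the p-values against uniformity.
ks = stats.kstest(pvals, "uniform")
print(f"KS p-value vs uniform: {ks.pvalue:.3f}")
```

A small KS p-value here would suggest the balance p-values are too clumped near 0 or 1 – the kind of pattern that put trials on the suspicious list.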
What happened to compromise the Mediterranean Diet study?
The authors issued a retraction and a new analysis that notes there were departures from the randomization procedures.
Problem 1: Some individuals in the trial were in fact not randomly assigned
There were 425 participants who shared a household with someone else in the study who had already been assigned to one of the three diets. These participants were not randomly assigned, but were instead given the same diet as the household member already enrolled. Note that I take these numbers from the NEJM paper; the New York Times article reports different numbers affected.
Problem 2: Randomization was decentralized, and was not done according to protocol at one of the eleven study sites.
At one of the 11 sites, rather than participants being randomized at the individual level, 467 participants were assigned to a diet at the clinic level (there were 11 clinics at this site). The New York Times notes that these were small villages and “Participants there complained that some neighbors were receiving free olive oil, while they got only nuts or inexpensive gifts. So the investigator decided to give everyone in each village the same diet. He never told the leaders of the study what he had done.”
Problem 3: Randomization was meant to be done following a specific table which set out the assignment, but there may have been issues in how this was used at one site.
This affected 593 participants. This was a situation where patients would come in over time and need to be allocated to the different groups. Rather than data going back to a researcher to randomize each time, sites were given tables that set out assignments for 1,000 participants, stratified into four groups. However, at one of the sites the actual numbers assigned to the different treatments differed quite a bit from the numbers that should have come from the tables, and it is unclear why.
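To make the table mechanism concrete, here is a hypothetical sketch of how such a pre-generated, stratified assignment table might be produced – the strata, block size, and arm labels are my invention, since the trial’s actual procedure is not described at this level of detail. Permuted blocks within each stratum keep the arm counts balanced as participants arrive sequentially.

```python
import random

random.seed(42)
ARMS = ["control", "olive_oil", "nuts"]

def assignment_table(n_per_stratum=250, n_strata=4, block_size=6):
    """Pre-generate a stratified assignment table: within each stratum,
    permuted blocks of 6 (2 slots per arm) keep arm counts balanced
    however many participants have enrolled so far."""
    table = {}
    for s in range(n_strata):
        seq = []
        while len(seq) < n_per_stratum:
            block = ARMS * (block_size // len(ARMS))  # 2 of each arm
            random.shuffle(block)                      # permute within block
            seq.extend(block)
        table[s] = seq[:n_per_stratum]
    return table

table = assignment_table()  # 4 strata x 250 slots = 1,000 assignments
print({arm: table[0].count(arm) for arm in ARMS})
```

With a table like this, a site’s realized arm counts should stay within a couple of participants of equality in every stratum – which is what made the large discrepancies at the problem site detectable.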
It is easy to imagine versions of all three of these problems arising in development trials, especially those where assignment to treatment has to be done in a decentralized way. A first example is Glewwe et al.’s eyeglasses experiment in China, where randomization was done in pairs of townships, and the authors note that in 5 control townships local officials used the left-over funds from treated townships to buy glasses for the controls, and that in one pair no one was offered glasses in the treatment township while glasses were offered in the control township. They drop the non-compliant townships (but present results for the full sample in an appendix). A second example is Cai et al., who note that in a microfinance experiment in China, treatment villages in one county were late to start issuing loans, so they drop this county from the analysis. A third example of randomization going awry is Ferree et al.’s work testing the use of technology to engage citizens in observing election polling stations in South Africa: a data error meant that the messages and instructions were sent to an entirely different group than intended – the authors then treat the data as a natural experiment, using the assignment that actually took place. A final example is one I previously shared as failure 8 in my learning-from-failures post – where a bank in Uganda ended up inviting the control group to training, and we had to use non-experimental methods in the end.
So what should you do if you find something similar happening in your study?
You can imagine several different solutions to this:
- Just drop the individuals who were not randomly assigned – this is what Andrew Gelman suggests. More generally, the idea is that if randomization is stratified and gets messed up in one stratum, you can drop that stratum and your design should still be balanced in the remaining strata. This changes the population you are estimating the effect for, but gives you the treatment effect for the new population of individuals in strata where randomization was correctly implemented. The authors do this as part of their re-analysis.
- Analyze the data based on what the correct assignment should have been: here the authors had given out tables that specified how individuals should be assigned to treatment and control. In other cases the randomization might be done by computer, even if it is then not implemented. You could then estimate intent-to-treat effects using the correct assignment, and LATE effects by instrumenting the treatment actually received with the treatment assigned. This seems fine to me for problems 2 and 3 above. However, I would still want to drop the individuals in problem 1 here, since I would worry about SUTVA being violated by individuals in the same household having their treatments interact. A downside of this approach is that it dilutes power, since a number of observations in the study will not receive their assigned treatment.
- Use non-experimental methods: the authors re-analyze the data controlling for a bunch of covariates, using propensity score matching, and performing sensitivity analysis to see how strong a confounder would have to be to explain the observed results.
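The intent-to-treat and LATE logic in the second option can be seen in a small simulated example (all numbers here are invented): with one-sided non-compliance, the ITT effect is the true effect diluted by the compliance rate, and the Wald/IV estimator – instrumenting actual treatment with assignment – scales it back up.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
true_effect = 2.0  # made-up effect for compliers

z = rng.integers(0, 2, size=n)   # random assignment to treatment
complier = rng.random(n) < 0.8   # 80% comply with their assignment
d = z * complier                 # one-sided non-compliance: only the
                                 # assigned can end up treated
y = true_effect * d + rng.normal(size=n)

itt = y[z == 1].mean() - y[z == 0].mean()          # intent-to-treat
first_stage = d[z == 1].mean() - d[z == 0].mean()  # compliance rate
late = itt / first_stage                           # Wald / IV estimate

# ITT is diluted toward 0.8 x 2.0 = 1.6; the Wald ratio recovers ~2.0
print(f"ITT = {itt:.2f}, first stage = {first_stage:.2f}, LATE = {late:.2f}")
```

The power cost mentioned above is visible here too: the ITT contrast is only 80% of the true effect, so detecting it requires a correspondingly larger sample.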
Does the same apply if take-up and/or implementation go awry?
Suppose that instead of telling you that people were incorrectly assigned in one of the eleven sites, I told you that the treatment was never implemented at all in one of these sites (as in the China microfinance example above), or that take-up was really low in that site. Can I just drop this site, or just drop a randomization strata where take-up is low or implementation goes awry?
I think the same arguments as above apply: if you have randomization stratified by site or some other variable, then you can drop the affected stratum and still have random assignment in the remaining strata. But this is more controversial, since it involves endogenously changing the parameter you are trying to estimate – you started off trying to estimate the impact of treatment on the full sample, but are now estimating it for the strata in which take-up was high or implementation was done correctly. The concern is that the reason take-up was low, or there was no implementation, in a particular site may be that participants/implementers expected treatment effects to be low (or negative) there. The consequence is that the estimated treatment effect for the strata where take-up and implementation do occur may be an over-estimate of the treatment effect for the original population of interest, even if it is an unbiased estimate of the effect in the sample you do have.
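A quick simulation illustrates both halves of this argument – the set-up is entirely invented. With effects that differ across strata and one stratum where assignment is botched, the estimate from the intact strata is clean, but it is an estimate for those strata only, not the full-sample average effect.

```python
import numpy as np

rng = np.random.default_rng(2)
n_per = 50_000
effects = {0: 1.0, 1: 2.0, 2: 5.0}  # made-up heterogeneous effects by stratum

strata, treat, y = [], [], []
for s, eff in effects.items():
    t = rng.integers(0, 2, size=n_per)
    if s == 2:
        t = np.ones(n_per, dtype=int)  # stratum 2 botched: everyone treated
    strata.append(np.full(n_per, s))
    treat.append(t)
    y.append(eff * t + rng.normal(size=n_per))

strata, treat, y = map(np.concatenate, (strata, treat, y))

keep = strata != 2  # drop the compromised stratum
est = y[keep & (treat == 1)].mean() - y[keep & (treat == 0)].mean()

# Unbiased for the intact strata: (1.0 + 2.0) / 2 = 1.5 -- but the
# full-sample average effect would be (1 + 2 + 5) / 3, about 2.67.
print(f"Effect in intact strata = {est:.2f}")
```

If the botched stratum is also the one where effects would have been largest (or smallest), the intact-strata estimate can be far from the parameter you originally set out to estimate, which is exactly the endogeneity concern above.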
Any other examples of randomization gone wrong to share?
I tried crowdsourcing examples on Twitter, which was helpful in identifying the eyeglasses and poll observation studies (thanks!). I know that if the randomization gets completely botched, the study may sometimes be abandoned, but if you know of other examples from development experiments, please share, so readers can see how this has been handled in different cases.