
Berk Ozler's blog

Worm Wars: A Review of the Reanalysis of Miguel and Kremer’s Deworming Study

This post was updated on July 24, 2015 in response to the increased traffic to the site from Twitter upon the publication of the replication discussed below, and the authors' response, in the International Journal of Epidemiology. I took a day to re-review the papers in question and, not surprisingly, what I said below remains as valid as it was six months ago, because all that happened is that the papers were published without much change since they first appeared on the 3ie website. However, I do have a few new thoughts (and one new table from Hamory Hicks, Kremer, and Miguel), which I discuss below. My original post remains unedited. I'd also like to thank Stéphane Helleringer for a nice response he wrote about the definition of ITT in public health: see the back and forth here.

Despite the differences in various methodological and data handling choices, which I discuss below in my original post, it is clear that the interpretation of whether one believes the results of Miguel and Kremer are robust really rests on whether one splits the data or not. It is therefore important to focus solely on this point and think about which choice is more justified and whether the issue can be dealt with another way. A good starting point is the explanation DAHH give in their pre-analysis plan as to why they decided to split the data into years and analyze it cross-sectionally rather than using the difference-in-differences method of the original MK (2004):
The data from a stepped wedge trial can be thought of as a one-way cross-over, and treated as such, by comparing before and after in the cross-over schools (group 2) and accounting for the secular trend using the non-crossing schools (groups 1 and 3). However, such an approach requires assumptions about the uniformity of the trend and the ability of the model to capture the secular change, and as such loses the advantage of randomization.
This explanation seems confused to me: common trend assumptions are something that needs to be established in observational studies when we're using diff-in-diff as an identification strategy, but in a cluster-randomized trial like this one, we have them by design: Groups 2 & 3 are perfect counterfactuals for each other if the randomization has been done correctly. Of course, if we look at a bunch of outcomes, we might find random differences in the changes from 1998-1999 between the two groups, but that's no reason to assume that there is something wrong with this approach or that it takes us away from the advantage of randomization. If anything, analyzing everything cross-sectionally by year and not controlling for the lagged value of the outcome variable is costing DAHH some statistical power.
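To make the point concrete, here is a minimal toy sketch of why the design itself supplies the counterfactual trend. The numbers are entirely made up for illustration (they are not from MK 2004 or DAHH); the point is only that the diff-in-diff estimate is the crossing group's change in the outcome net of the change in the never-crossing group:

```python
# Toy illustration (hypothetical numbers): in a stepped-wedge design,
# Group 2 crosses over to treatment in 1999 while Group 3 stays untreated.
# The diff-in-diff estimate is the difference in group-mean changes.

mean_participation = {
    # (group, year): mean school participation rate (made-up values)
    ("group2", 1998): 0.75, ("group2", 1999): 0.84,  # crosses over in 1999
    ("group3", 1998): 0.76, ("group3", 1999): 0.78,  # untreated in both years
}

change_g2 = mean_participation[("group2", 1999)] - mean_participation[("group2", 1998)]
change_g3 = mean_participation[("group3", 1999)] - mean_participation[("group3", 1998)]

# With random assignment, Group 3's change identifies the secular trend,
# so the treated group's change net of that trend is the treatment effect.
did_estimate = change_g2 - change_g3
print(round(did_estimate, 2))  # 0.07
```

Randomization is doing the work here: no assumption about the *shape* of the secular trend is needed, because in expectation the never-crossing group experiences the same trend.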

However, let's accept for a second DAHH's argument that there's something strange about Group 2 and we're wary of it. Then it seems to me that the solution is simple: why not look at the two clean groups that never change treatment status over the whole study period of 1998-1999? In other words, exclude Group 2, pool all the data for 1998 and 1999, and compare the effects between Group 1 and Group 3. Sure, we lose power from throwing out a whole study arm, but if the results stand, we're done! Thankfully, Joan Hamory Hicks was able to run this analysis and send me the table below, which is akin to Table 3 in their original response:

As you can see, all effect sizes on school participation are about 6 percentage points (pp), which is remarkably close to the effect size of 7 pp in the original study. The p-values went up from <0.01 to <0.05, but that is fully expected after shedding a third of the sample. So, even if you think that there is something strange going on with Group 2 (for which the visual inspection presented by DAHH in Figure 3 is really not sufficient), you still have similarly sized and statistically significant effects when making the cleaner comparison of Groups 1 & 3. Problem solved?

I want to conclude by making a bigger-picture point about replications. They are really an expanded version of the robustness checks that are conducted for almost any paper. It's just that the incentives are different: authors want robustness, and replicators might be tempted to find a hole or two to poke in the evidence and "debunk" the paper (if I had a dime for every "deworming debunked" tweet yesterday...). But when that happens, I start worrying about multiple hypothesis testing. We now know how, and have the tools, to deal with multiple inference corrections when the worry is Type I errors (false rejections of a correct null). But what about Type II errors? After all, this is exactly what a replicator would be after: finding a way of handling the data/analysis that makes the results go away. But how do we know whether that is a true "failure to reject" or a Type II error? Even in a study with 80% power, each independent test has a 20% chance of failing to reject the null when the true effect is positive. The more of these you try, the more likely you'll come across one or two estimates that are insignificant. What to do about that?
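The arithmetic behind this worry is simple. Treating each analytic variation as an independent test at 80% power (a simplification, since the variations reuse the same data and are correlated), the chance of at least one "insignificant" result grows quickly:

```python
# If each of k independent tests has 80% power (Type II error rate beta = 0.2),
# the chance that at least one fails to reject a true effect grows quickly with k.

def prob_at_least_one_miss(k: int, power: float = 0.8) -> float:
    """Probability that >= 1 of k independent, equally powered tests
    misses a true effect (i.e., commits a Type II error)."""
    return 1.0 - power ** k

for k in (1, 3, 5, 10):
    print(k, round(prob_at_least_one_miss(k), 3))
# k=1 -> 0.2, k=3 -> 0.488, k=5 -> 0.672, k=10 -> 0.893
```

So with ten analytic forks, a replicator would expect to stumble on at least one null result almost 90% of the time even if the true effect were exactly as originally estimated, which is the mirror image of the usual fishing-for-significance concern.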

To be fair to the authors, they were at least aware of this issue, which they mention on page 7 of the PAP:

We aim to deal with this problem by making a small number of analyses using as much of the original data as possible at each stage and concentrating initially on the direct intervention effects on the major study outcomes.

But this is where it would have been really important to have a very clear PAP, describing only a very few carefully and methodologically justified analyses, and to stick to it strictly. Instead, at every step of the way, when the authors decide whether to weight the data (cluster summaries) or not, to split the data by year, to report adjusted or unadjusted estimates, to use alternative treatment definitions that drop large numbers of observations, etc., there is a fork, and each fork opens up more roads to Type II errors. We need replications of studies that are decently powered themselves, where the replicators are careful to hoard all the power that is there and not scatter it along the way.

I hope that this update has brought some clarity to the key issues surrounding the debate about the publication of the replication results and the accompanying flurry of articles. I was an unwitting and unwilling participant in the Twitter storm that ensued, only because many of you were responsible for repeatedly pointing out that I had written the blog post below six months ago and linking to it incessantly throughout the day. I remain indebted to our readers, who are a smart and thoughtful bunch...


This post follows directly from the previous one, which is my response to Brown and Wood’s (B&W) response to “How Scientific Are Scientific Replications?” It will likely be easier for you to digest what follows if you have at least read B&W’s post and my response to it. The title of this post refers to this tweet by @brettkeller, the responses to which kindly demanded that I follow through with my promise of reviewing this replication when it got published online.

Power calculations: what software should I use?

In my experimental work, I almost always run cluster-randomized field experiments (CRTs, with T for trials), and therefore I have always used the Optimal Design software (OD for short), which is freely available and fairly easy to use, with menu-based dialogue boxes, graphs, etc. However, while preparing some materials for a course with a couple of colleagues, I came to realize that it has some strange basic limitations. That led me to invest some time into finding out about my alternatives in Stata. I thought I'd share a couple of things I learned here.
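For readers who just want the core CRT arithmetic, here is a back-of-the-envelope sketch using the standard textbook design-effect formula. This is illustrative only (equal cluster sizes, 50/50 allocation, normal-approximation z-values), not necessarily the exact computation OD or Stata's power commands implement:

```python
import math

def mde_cluster(n_clusters: int, cluster_size: int, icc: float,
                alpha_z: float = 1.96, power_z: float = 0.84) -> float:
    """Minimum detectable effect, in standard-deviation units, for a CRT
    with equal-sized clusters split 50/50 between two arms.
    alpha_z is the z-value for a two-sided 5% test; power_z for 80% power."""
    deff = 1.0 + (cluster_size - 1) * icc            # design effect
    n_effective = n_clusters * cluster_size / deff   # effective sample size
    # Two-arm MDE: (z_{1-alpha/2} + z_{power}) * SE of a standardized difference
    return (alpha_z + power_z) * math.sqrt(4.0 / n_effective)

# e.g., 50 clusters of 40 subjects with an intra-cluster correlation of 0.05
print(round(mde_cluster(n_clusters=50, cluster_size=40, icc=0.05), 3))  # ~0.215 SD
```

The design effect is the key quantity: with 40 subjects per cluster and an ICC of 0.05, each observation is worth roughly a third of an independent one, which is why ignoring clustering makes studies look far better powered than they are.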

Starting antiretroviral treatment early: an update

Almost four years ago I wrote a blog post titled "Advocating a treatment that may not help the treated?", in response to the news that starting treatment with antiretroviral drugs immediately, rather than waiting until the CD4+ count fell below the then-standard threshold of 250, significantly reduced transmission of HIV among HIV-discordant couples. The study also reported effects on the health of the HIV-infected partner and found that the evidence for any beneficial effects for the person being treated was weak at best.

Poverty Reduction: Sorting Through the Hype

After seeing PowerPoint slides of the preliminary findings over the course of more than a year, it’s nice to be able to report that the six-country study that is evaluating the “ultra-poor graduation” approach (originally associated with BRAC) is finally out.

Be an Optimista, not a Randomista (when you have small samples)

We are often in a world where we are allowed to randomly assign a treatment to assess its efficacy, but the number of subjects available for the study is small. This could be because the treatment (and its study) is very expensive, as is often the case in medical experiments; because the condition we're trying to treat is rare, leaving us with too few subjects; or because the units we're trying to treat are entities like districts or hospitals, of which there are only so many in the country/region of interest.

Preregistration of studies to avoid fishing and allow transparent discovery

The demand for pre-analysis plans that are registered at a public site, available for all consumers to examine, has recently increased in the social sciences, leading to the establishment of several social science registries. David recently included a link to Ben Olken's JEP paper on pre-analysis plans in economics. I recently came across a paper by Humphreys, de la Sierra, and van der Windt (HSW hereon) that proposes a comprehensive nonbinding registration of research. The authors end up agreeing with Ben on a number of issues, but still end up favoring a very detailed pre-analysis plan. As they also report on a mock reporting exercise, and as I am in the midst of writing a paper that utilized a pre-analysis plan and am struggling with some of the difficulties identified in their paper, I thought I'd link to it and quickly summarize it before ending the post with a few of my own thoughts.

Weekly Links March 20: Giving away TOMS shoes, evaluating anti-terrorism interventions, Ben Olken, and more...

Bruce Wydick on the impact of giving away TOMS shoes: He gives kudos to TOMS for being open to evaluation and responsive to findings, but what caught my eye was this observation: "The bad news is that there is no evidence that the shoes exhibit any kind of life-changing impact,..."

Why is Difference-in-Difference Estimation Still so Popular in Experimental Analysis?

David McKenzie pops up from under many of the empirical questions that come up in my research projects, which has not yet ceased to surprise me every time it happens, despite his prolific production. The last time it happened was a teachable moment for me, so I thought I'd share it in a short post that fits nicely under our "Tools of the Trade" tag.