

The infinite loop failure of replication in economics

By Markus Goldstein
In case you missed it, there was quite a brouhaha about worms and the replication of one particular set of results this summer (see Dave's anthology here).   I am definitely not going to wade into that debate, but there is a recent paper by Andrew Chang and Phillip Li which gives us one take on the larger issue involved:  the replication of published results.   Their conclusion is nicely captured in t

Worm Wars: A Review of the Reanalysis of Miguel and Kremer’s Deworming Study

By Berk Ozler
This post was updated on July 24, 2015 in response to the increased traffic to the site from Twitter upon the publication of the replication discussed below and the authors' response in the International Journal of Epidemiology. I took a day to re-review the papers in question and, not surprisingly, what I said below remains as valid as it was 6 months ago, because all that happened is that the papers got published without much change since they first appeared on the 3ie website. However, I do have a few new thoughts (and one new table from Hamory Hicks, Kremer, and Miguel), which I discuss below. My original post remains unedited. I'd also like to thank Stefane Helleringer for a nice response he wrote about the definition of ITT in public health: see the back and forth here.

Despite the differences in various methodological and data handling choices, which I discussed below in my original post, it is clear that whether one believes the results of Miguel and Kremer are robust really rests on whether one splits the data or not. Therefore it is important to focus solely on this point and think about which choice is more justified and whether the issue can be dealt with another way. A good starting point is the explanation DAHH give in their pre-analysis plan for why they decided to split the data into years and analyze it cross-sectionally rather than use the difference-in-difference method of the original MK (2004):
The data from a stepped wedge trial can be thought of as a one-way cross-over, and treated as such, by comparing before and after in the cross-over schools (group 2) and accounting for the secular trend using the non-crossing schools (groups 1 and 3). However, such an approach requires assumptions about the uniformity of the trend and the ability of the model to capture the secular change, and as such loses the advantage of randomization.
This explanation seems confused to me: the common trend assumption is something that needs to be established in observational studies when we're using Diff-in-Diff as an identification strategy, but in a cluster-randomized trial like this one, we have it by design: Groups 2 & 3 are perfect counterfactuals for each other if the randomization has been done correctly. Sure, if we look at a bunch of outcomes, we might find random differences in the changes from 1998-1999 between the two groups, but that's no reason to assume that there is something wrong with this approach or that it takes us away from the advantage of randomization. If anything, analyzing everything cross-sectionally by year and not controlling for the lagged value of the outcome variable is costing DAHH some statistical power.
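To make the power point concrete, here is a minimal sketch using the standard textbook case rather than anything taken from either set of authors. It assumes one baseline and one follow-up round, equal outcome variance in both rounds, and an autocorrelation rho in the outcome; the function name and the normalization (post-only variance set to 1) are my own choices for illustration. Under those assumptions, the variance of the treatment effect estimate is proportional to 1 for a post-only cross-section, 2(1 - rho) for difference-in-differences, and 1 - rho^2 when controlling for the lagged outcome.

```python
def relative_variance(rho: float) -> dict:
    """Variance of the treatment effect estimate relative to a post-only
    cross-sectional comparison (normalized to 1.0).

    Assumes one baseline round, one follow-up round, equal outcome variance
    in both rounds, and autocorrelation `rho` between rounds. These are
    illustrative assumptions, not features of the deworming data.
    """
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must lie between -1 and 1")
    return {
        "post_only": 1.0,                    # cross-section, no baseline used
        "diff_in_diff": 2.0 * (1.0 - rho),   # change scores
        "lagged_outcome": 1.0 - rho ** 2,    # follow-up regressed on baseline
    }


if __name__ == "__main__":
    for rho in (0.25, 0.5, 0.75):
        print(rho, relative_variance(rho))
```

Two things fall out of this. Controlling for the lagged outcome never does worse than the post-only cross-section. Diff-in-Diff, on the other hand, only beats the cross-section when rho is above 0.5, so how much power is left on the table depends on how persistent the outcome is.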

However, let's accept for a second DAHH's argument that there's something strange with Group 2 and we're wary of it. Then it seems to me that the solution is simple: why not look at the two clean groups that never change treatment status over the whole study period of 1998-1999? In other words, exclude Group 2, pool all the data for 1998 and 1999, and compare outcomes between Group 1 and Group 3. Sure, we lose power from throwing out a whole study arm, but if the results stand we're done! Thankfully, Joan Hamory Hicks was able to run this analysis and send me the table below, which is akin to Table 3 in their original response:

As you can see, all effect sizes on school participation are about 6 percentage points (pp), which is remarkably close to the effect size of 7 pp in the original study. The p-values went up from <0.01 to <0.05, but that is fully expected having shed a third of the sample. So, even if you think that there is something strange going on with Group 2, for which the visual inspection presented by DAHH in Figure 3 is really not sufficient, you still have similarly-sized and statistically significant effects when making the cleaner comparison of Groups 1 & 3. Problem solved?

I want to conclude by making a bigger picture point about replications. They are really an expanded version of the robustness checks that are conducted for almost any paper. It's just that the incentives are different: authors want robustness and replicators might be tempted to find a hole or two to poke in the evidence and "debunk" the paper (if I had a dime yesterday for every deworming debunked tweet...). But, when that happens, I start worrying about multiple hypothesis testing. We now know, and have tools for, how to deal with multiple inference corrections when the worry is Type I errors (false rejections of a correct null). But, what about Type II errors? After all, this is exactly what a replicator would be after: finding a manner of handling the data/analysis that makes the results go away. But, how do we know whether that is a true "failure to reject" or a Type II error? Even in studies with 80% power, there is a 20% chance that each independent test will fail to reject the null when there is in fact a positive effect. The more of these you try, the more likely you'll come across one or two estimates that are insignificant. What to do about that?
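To put a rough number on how quickly this adds up, here is a back-of-the-envelope sketch. It assumes that each alternative specification is an independent test with the same power, which is a simplification: re-analyses of the same data are correlated, so the real figures would differ. The function name is hypothetical and only there for illustration.

```python
def prob_at_least_one_miss(power: float, n_tests: int) -> float:
    """Chance that at least one of `n_tests` independent tests fails to
    reject the null even though the effect is real.

    Assumes every test has the same `power` and that tests are independent,
    which will not hold exactly for re-analyses of a single dataset.
    """
    if not 0.0 <= power <= 1.0:
        raise ValueError("power must lie between 0 and 1")
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return 1.0 - power ** n_tests


if __name__ == "__main__":
    for k in (1, 2, 5, 10):
        print(k, round(prob_at_least_one_miss(0.80, k), 3))
```

With 80% power per test, this gives 0.2 for one test, 0.36 for two, about 0.67 for five, and about 0.89 for ten. So under these assumptions, a replicator who tries ten specifications on a true effect should expect at least one insignificant estimate almost nine times out of ten.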

To be fair to the authors, they were at least aware of this issue, which they mention on page 7 of the PAP:

We aim to deal with this problem by making a small number of analyses using as much of the original data as possible at each stage and concentrating initially on the direct intervention effects on the major study outcomes.

But this is where it would have been really important to have a very clear PAP, describing only a very few, carefully methodologically justified analyses, and to stick very strictly to it. Instead, every step of the way, when the authors decide whether to weight the data (cluster summaries), whether to split the data by year, whether to report adjusted or unadjusted estimates, whether to use alternative treatment definitions that drop large numbers of observations, etc., there is a fork, and each fork opens up more roads to Type II errors. We need replications of studies that are decently powered themselves, where the replicators are careful to hoard all the power that is there and not scatter it along the way.

I hope that this update has brought some clarity to the key issues surrounding the debate about the publication of the replication results and the accompanying flurry of articles. I was an unwitting and unwilling participant in the Twitter storm that ensued, only because many of you were responsible for repeatedly pointing out the fact that I had written the blog post below six months ago and linking to it incessantly throughout the day. I remain indebted to our readers, who are a smart and thoughtful bunch...


This post follows directly from the previous one, which is my response to Brown and Wood’s (B&W) response to “How Scientific Are Scientific Replications?” It will likely be easier for you to digest what follows if you have at least read B&W’s post and my response to it. The title of this post refers to this tweet by @brettkeller, the responses to which kindly demanded that I follow through with my promise of reviewing this replication when it got published online.

Getting beyond the mirage of external validity

By Markus Goldstein
This post is coauthored with Eliana Carranza
No thoughtful technocrat would copy a program in every detail for a given context in her or his country.    That's because they know (among other things) that economics is not a science but a social (or dismal even) science, and so replication in the fashion of chemistry isn't an option.  For economics, external validity in the strict scientific sense is a mirage.

Response to Brown and Wood's "How Scientific Are Scientific Replications? A Response"

By Berk Ozler
I thank Annette Brown and Benjamin Wood (B&W from hereon) for their response to my previous post about the 3ie replication window. It not only clarified some of the thinking behind their approach, but arrived at an opportune moment – just as I was preparing a new post on part 2 of the replication (or reanalysis as they call it) of Miguel and Kremer’s 2004 Econometrica paper titled “Worms: Identifying Impacts on Education and Health in the Presence of Treatment Externalities,” by Davey et al. (2014b) and the response (Hicks, Kremer, and Miguel 2014b, HKM from hereon).  While I appreciate B&W’s clarifications, I respectfully disagree on two key points, which also happen to illustrate why I think the reanalysis of the original data by Davey et al. (2014b) ends up being flawed.

How scientific are scientific replications? A response by Annette N. Brown and Benjamin D.K. Wood

A few months ago, Berk Ozler wrote an impressive blog post about 3ie’s replication program that posed the question “how scientific are scientific replications?” As the folks at 3ie who oversee the replication program, we want to take the opportunity to answer that question. Our simple answer is, they are not meant to be.

Guest Post by Sebastian Galiani: Replication in Social Sciences: Generalization of Cause-and-Effect Constructs

I agree with the general point raised by Berk in his previous post in this blog (read it here). We need to discuss when and how to conduct scientific replication of existing research in social sciences. I also agree with him that, at least in economics, pure replication analysis (which in my view is the only genuine replication analysis) is of secondary interest. I hope to return to this issue in a future contribution to this blog. Instead, I believe that we should emphasize replication of relevant and internally valid studies both in similar and different environments. There is now excessive confidence in the knowledge gathered by a single study in a particular environment, perhaps as a result of a misconstruction of the virtues of experimentation in social sciences. As Donald T. Campbell once wrote (1969):

Calling all skeptics

By Markus Goldstein

Have you seen an impact evaluation result that gives you pause? Well, now there’s an institutional way to check on results of already published evaluations.    3ie recently announced a program for replication. They are going to focus on internal validity – replicating the results with the existing data and/or using different data from the same population to check results (in some cases).