People are talking about a “replication revolution”. They don’t mean what you think they mean…



Yesterday, Markus blogged about a new initiative by 3ie to replicate studies in economics. It is indeed called “3ie’s Replication Program”. Below I argue that while this may be a worthy endeavor, it is mislabeled…

The 3ie website states: “The replications funded are ‘internal replications’—those that use the data from the original study, and possibly existing secondary datasets from the same location, in order to check the validity and robustness of the estimations and recommendations.” Even that sentence does not make me think “replication”. If you’re doing robustness checks by using different estimation techniques, different controls, variants of outcome variables, or other data sets, you’re not doing a replication study: you’re doing robustness checks.

From looking at the link on the website, I imagine that this will work something like this: you either have access to the data and the .do files from the authors’ web pages or the journal site, or you request them. Once you get them, the first thing to check would be that the .do files run as they should and produce the tables in the paper. Once this test is passed, then what? Recreate some variables (if you have access to the raw data)? Change estimation techniques? Reexamine the authors’ interpretations? Except for the first test, none of this is replication of study findings. Furthermore, what prevents this well-intentioned exercise from itself falling prey to the data torture and data mining that we’re worried about in the first place? Should the “replication” proposal be registered ahead of time and be limited only to what’s included within?
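To make that first step concrete: verifying that the original analysis reproduces the published numbers can be automated. Below is a minimal, purely illustrative Python sketch of what such a check amounts to – re-run the analysis on the original data and flag any discrepancy with the published estimate beyond rounding error. All data, names, and numbers here are hypothetical stand-ins, not drawn from any actual study.

```python
# Hypothetical sketch of a "pure replication" check: re-run the original
# analysis and verify the recomputed estimate matches the published value.
from statistics import mean

# Stand-in for the original study's microdata: (treatment flag, outcome).
data = [
    (1, 12.1), (1, 11.4), (1, 13.0), (1, 12.6),
    (0, 10.2), (0, 9.8),  (0, 10.5), (0, 10.1),
]

def treatment_effect(rows):
    """Difference in mean outcomes between treatment and control groups."""
    treated = [y for d, y in rows if d == 1]
    control = [y for d, y in rows if d == 0]
    return mean(treated) - mean(control)

PUBLISHED_ESTIMATE = 2.125  # the estimate reported in the paper's table (illustrative)
TOLERANCE = 1e-6            # allow only rounding-level discrepancies

recomputed = treatment_effect(data)
replicates = abs(recomputed - PUBLISHED_ESTIMATE) < TOLERANCE
print(recomputed, replicates)
```

Anything beyond this – recomputing the estimate with a different estimator, different controls, or different variable definitions – changes the analysis itself, which is exactly the distinction between checking reproducibility and running robustness checks.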

See, when other disciplines talk about a “replication revolution”, they are talking about replicating experiments – for real. For example, the piece in Nature that I linked to in my last blog post, on the slew of studies that do not replicate, is about running the same experiment over and over again, under the exact same circumstances, to see whether you obtain the same outcomes in the treatment and control groups. This is something that should – and I mean “should in theory” – be possible under laboratory conditions.

Of course, even in sciences like psychology, or for lab experiments in economics, replication of this kind is definitely possible in theory and worth conducting, but it is very hard: it is almost impossible to recreate the exact same conditions under which an experiment was run. And it is definitely impossible to replicate field experiments in economics. Suppose you tried: can you draw a similarly large random sample from the study area? Maybe they already all know about the previous program, which had large spillovers. Two years have passed, and the underlying economic conditions that can affect the heterogeneity of impacts have changed. You want to hire the same NGO to run the program exactly the same way, but they have moved on. The data collection firm has new staff, so sample attrition is different… You get the point: it can’t be done – exact replication is not for those of us conducting field experiments. You can’t falsify a set of results by trying to replicate them exactly.

So, what do we want instead? We want “conceptual replications” – to borrow a term from the aforementioned Nature piece. Give researchers money to follow up on specific studies, but give them incentives to do it in a systematic manner, so that we can both come as close as possible to rerunning the same program in different settings and tweak the experiment creatively and purposefully to learn important lessons.

About six months ago, at a conference on conditional cash transfers (CCTs), I proposed this to donors who provide funding for impact evaluations; it has not gone anywhere yet. My idea was to fund follow-up work to individual studies that have shown promise but are definitely not ready for prime time. The follow-up study would not be just one more study in another locale: it would be a thorough multi-country design, perhaps with multiple teams of researchers (preferably including the original team) working in tandem, trying the same thing but perhaps in slightly different ways. The key would be that they would all work together from the beginning to design this coordinated, ambitious effort, and collaborate. The current approach of letting “a thousand flowers bloom” does not work, for multiple reasons. For one, Ioannidis argues in this paper that when multiple teams are in a chase for statistically significant results, the findings are less reliable.

But this type of work is hard to do – you would have to be crazy to try to set it up on your own. I know many people who are burned out trying to run the one or two large and important field studies that they are leading: imagine them trying to set up a follow-up multi-country study (after having spent years on the original one). There are fewer than a handful of these types of ambitious efforts going on, most likely a sub-optimal number, and funding, incentives, and logistical support for that kind of effort would make a difference. The funding hurdle is especially hard: we’d have to convince donors to dish out millions of dollars for one ambitious, multi-country, multi-team study. That seems risky to donors, who have to show results to their funders or their taxpayers: better to diversify the portfolio. But letting a thousand flowers bloom does not help replicate a concept that is in need of some proof.

In medicine, many drugs that are successful in Phase 2 trials fail in Phase 3 trials. See, for example, this piece by Malcolm Gladwell in the New Yorker (you can read the abstract; a subscription is needed for the whole article). There are many reasons for this: heterogeneity of impacts; moving from a particular sample to the general population; moving from controlled conditions to general conditions, i.e., from efficacy to effectiveness. We should treat one-off studies in economics like the Phase 2 efficacy trials. Their effectiveness still needs to be confirmed with much larger samples all around the world, while these economic “drugs” are tweaked, with dosages carefully varied, etc. We need encouragement and funding for this kind of work.

I have little doubt that the new initiative by 3ie is a worthy endeavor. As the website states, this work will discourage both outright dishonesty and carelessness on the part of researchers that were not caught by the referees during the journal review process. It is also likely to be cheap enough not to compete with the resources needed for the kind of replications we need in economics. Perhaps one suggestion: why not work with journals to hire capable graduate students to do this work? Journals, whether we like it or not, are the gateway to these publications. Many of them, at least the good ones, already make the authors do a slew of robustness checks, produce online appendices, make the data available, etc. What they don’t do is check the .do files and the actual data analysis. Couldn’t 3ie work with journals, which would perform the kind of “validity and robustness checks” it is after before a paper is published? This is more work for editors, but a system could be established where a percentage of papers is selected randomly.

In the meantime, we’re waiting for donors to announce an initiative that funds collaborative efforts to move from efficacy to effectiveness in development economics. Have research teams join forces and collaborate across countries and settings to answer an important question rather than race each other towards the unknown…


Berk Ozler

Lead Economist, Development Research Group, World Bank

Join the Conversation

David Roodman
May 24, 2012

Berk, just as you have little doubt that the 3ie initiative is a worthy (and cheap) endeavor, I have little doubt that what you advocate is worthy. Why then launch the post with unnecessary criticism over semantics? The 3ie program's use of the term "replication" is reasonable, and the concise program documentation (…) makes clear what 3ie means by the word. I don't think semantic disputes are helpful when the usage of a word is reasonable and clear.

I think the term "conceptual replication" traces to a 2001 paper by psychologist John Hunter (…), who offered this typology: statistical replication, scientific replication, and conceptual replication. I think the typology embraces all the kinds of replication you discuss.

Disclosure: I serve on the advisory board for the 3ie replication program.

Berk Ozler
May 25, 2012

Actually, if a word has three distinct and pertinent meanings and you name a new initiative to refer exclusively to one of those meanings, especially when many in the field are using it in reference to the other two in their debates, then I am not sure that the usage is as clear as it could be -- I'd actually say it is rather loose.

I like the three distinctions in the typology you cited. Perhaps better to call the new initiative "3ie's Statistical Replication Initiative"...


Berk Ozler
May 24, 2012

Hi David,

Thanks. I don't think there is any criticism of 3ie or its initiative in my post. I do, however, think that the use of a term can be reasonable and clear to its user but still be misunderstood by the wider audience. In the debate on "replication" that I have been following (and referred to in my post) in the media, blogs, Twitter, and Facebook in the past few months, most of which concerns failures to replicate in the fields of psychology or medicine and a little bit of economics field experiments, people have rarely been referring to "statistical replication" when they mention failures to replicate. Most of that recent debate seems to have centered on scientific or conceptual replications (the examples I gave in last week's post were also referring to these). That's my reading of the debate, in any case...

You have been involved in a very public statistical replication, and so perhaps think of statistical replication first when you hear the word, which is fair. My post was simply clarifying, for readers who may have been tuned into the debate I was following and could easily be confused, what was being proposed by "3ie's replication program". Blogs are a good tool for offering such clarifications, and I took advantage. No conflict or offense was intended and none should be taken by anyone involved...



Ben Wood
May 29, 2012

Thanks, Berk and David, for furthering the replication dialogue. I wanted to offer a few points of clarification from 3ie. While Berk’s point on external versus internal replications is well taken, terminology varies across researchers. For example, the Reproducibility Project, which is an interesting new endeavor, uses the term reproducibility instead of (external) replication. 3ie commissioned Vegard Iversen and Richard Palmer-Jones (who also sits on our Replication Program Advisory Group) to write a detailed history of, and best practices for, replication. We rely on their (forthcoming) working paper to define the stages of internal replication for our program: pure, statistical, and scientific. These definitions are derived from the previous work of Hamermesh and Easley et al., with scientific replication including the possibility of incorporating other pre-existing datasets or testing alternative theories of change.

Berk’s other point, on the possibilities of data mining, is also appreciated. The replication researchers will submit formal proposals with their replication plans, so that comparisons between planned and final replications will be possible. 3ie will consider ways to make these plans public. Thanks for the suggestion.

We certainly hope that a major initiative to conduct Phase 3 trials materializes sometime. Incentivizing and publicizing completed internal replications might just be the first step needed to move the development community towards large-scale external replication research.