Yesterday, Markus blogged about a new initiative by 3ie to replicate studies in economics. It is indeed called “3ie’s Replication Program”. Below I argue that while this may be a worthy endeavor, it is mislabeled…
The 3ie website states: “The replications funded are ‘internal replications’—those that use the data from the original study, and possibly existing secondary datasets from the same location, in order to check the validity and robustness of the estimations and recommendations.” Even that sentence does not make me think “replication”. If you’re doing robustness checks by using different estimation techniques, different controls, variants of outcome variables, or other data sets, you’re not doing a replication study: you’re doing robustness checks.
From looking at the link at the website, I imagine that this will work something like this: you either get the data and the .do files from the authors’ web pages or the journal site, or you request them. Once you have them, the first thing to check would be that the .do files run as they should and produce the tables in the paper. Once this test is passed, then what? Recreate some variables (if you have access to the raw data)? Change estimation techniques? Reexamine the authors’ interpretations? Except for the first test, none of this is replication of study findings. Furthermore, what prevents the very torture of the data and data mining that we’re worried about in the first place from creeping into this well-intentioned exercise itself? Should the “replication” proposal be registered ahead of time and be limited only to what’s included within?
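That first test, checking that the code reproduces the published tables, is mechanical enough to sketch. The snippet below is only an illustration in Python: the function name, the coefficient labels and the numbers are all made up, and it assumes the published table reports estimates rounded to three decimals.

```python
def find_mismatches(published, reproduced, decimals=3):
    """Return the labels whose reproduced estimate does not match the
    published one once rounding in the printed table is allowed for.

    `published` and `reproduced` are dicts mapping a coefficient label
    to a number. All names here are hypothetical, for illustration only.
    """
    # Half a unit in the last printed decimal place.
    tolerance = 0.5 * 10 ** (-decimals)
    mismatches = []
    for label, printed_value in published.items():
        if label not in reproduced:
            # The rerun did not produce this estimate at all.
            mismatches.append(label)
        elif abs(reproduced[label] - printed_value) > tolerance:
            mismatches.append(label)
    return sorted(mismatches)


# Made-up numbers: one coefficient matches up to rounding, one does not.
published_table = {"treatment": 0.123, "female": -0.045}
rerun_output = {"treatment": 0.12349, "female": -0.0391}
print(find_mismatches(published_table, rerun_output))
```

Everything beyond a clean pass on a check like this (new controls, new estimators, new data) is a robustness exercise rather than a replication.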
See, when other disciplines are talking about a “replication revolution”, they are talking about replicating experiments – for real. For example, the piece in Nature that I linked to in my last blog post, on the slew of studies that do not replicate, is talking about running the same experiment over and over again, under the same exact circumstances, to see if you can obtain the same outcomes in the treatment and control groups. This is something that should, and I mean “should in theory”, be possible under laboratory conditions.
Of course, even in sciences like psychology or lab experiments in economics, replication of this kind is possible in theory and worth conducting, but it is very hard. It is almost impossible to replicate the exact same conditions under which an experiment was run. And it is definitely impossible to replicate field experiments in economics. Suppose you tried: can you draw a similarly large random sample from the study area? Maybe they already all know about the previous program, which had large spillovers. Two years have passed and the underlying economic conditions that can affect the heterogeneity of impacts have changed. You want to hire the same NGO to run the program exactly the same way, but they have moved on. The data collection firm has new staff, so sample attrition is different… You get the point: can’t do it – exact replication is not for those of us conducting field experiments. You can’t falsify a set of results by trying to replicate it exactly.
So, what do we want instead? We want “conceptual replications” – to borrow a term from the aforementioned Nature piece. Give researchers money to follow up on specific studies, but give them incentives to do it in a systematic manner, so that we can both come as close as possible to rerunning the same program in different settings and tweak the experiment creatively and purposefully to learn important lessons.
About six months ago, at a conference on conditional cash transfers (CCTs), I proposed this to donors who provide funding for impact evaluations: it has not gone anywhere yet. My idea was to fund follow-up work to individual studies that have perhaps shown promise but are definitely not ready for prime time. But the follow-up study would not be just one other study in another locale: it would be a thorough multi-country design, perhaps with multiple teams of researchers (preferably including the original team) working in tandem, trying the same thing but perhaps in slightly different ways. The key would be that they would all work together from the beginning to design this coordinated, ambitious effort, and collaborate. The current approach of letting “a thousand flowers bloom” does not work for multiple reasons. For one, Ioannidis argues in this paper that when multiple teams are in a chase for statistically significant results, the findings are less reliable.
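To put a rough number on that last point: assuming the paper in question is Ioannidis’s 2005 essay “Why Most Published Research Findings Are False”, it gives an expression for the probability that a claimed finding is true when n independent teams study the same question and at least one of them finds significance. The Python sketch below reproduces that expression as I recall it; the parameter values (pre-study odds of 1, 5% significance level, 80% power) are purely illustrative and not taken from any study.

```python
def post_study_probability(prior_odds, alpha, beta, n_teams):
    """Probability that a claimed finding is true when n_teams independent
    teams test the same hypothesis and at least one finds significance.

    prior_odds: pre-study odds R that the relationship is real
    alpha:      type I error rate of each study
    beta:       type II error rate of each study (power = 1 - beta)
    """
    # Share of cases that are true relationships AND flagged by someone.
    true_and_flagged = prior_odds * (1 - beta ** n_teams)
    # R + 1 - (1 - alpha)^n - R * beta^n, i.e. true positives plus false
    # positives, both counted as "at least one team finds significance".
    any_flagged = (prior_odds + 1
                   - (1 - alpha) ** n_teams
                   - prior_odds * beta ** n_teams)
    return true_and_flagged / any_flagged


# Illustrative values only.
for teams in (1, 5, 10):
    print(teams, round(post_study_probability(1.0, 0.05, 0.20, teams), 3))
```

With these inputs the probability falls as the number of competing teams rises, which is the sense in which an uncoordinated chase for significant results makes any single reported finding less trustworthy.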
But this type of work is hard to do – you would have to be crazy to try to set it up on your own. I know many people who are burned out trying to run the one or two large and important field studies that they are leading: imagine them trying to set up a follow-up multi-country study (after having spent years on the original one). There are fewer than a handful of these types of ambitious efforts going on, most likely a sub-optimal number, and funding, incentives, and logistical support for that kind of effort would make a difference. The funding hurdle is especially hard: we’d have to convince donors to dish out millions of dollars for one ambitious, multi-country, multi-team study. That seems risky to donors, who have to show results to their funders or their taxpayers: better to diversify the portfolio. But letting a thousand flowers bloom does not deliver the kind of replication that a concept in need of some proof requires.
In medicine, many drugs that are successful in Phase 2 trials fail in Phase 3 trials. See, for example, this piece by Malcolm Gladwell in the New Yorker (you can read the abstract; a subscription is needed for the whole article). There are many reasons for this: heterogeneity of impacts; moving from a particular sample to the general population; moving from controlled conditions to general conditions, i.e. from efficacy to effectiveness. We should treat one-off studies in economics like Phase 2 efficacy trials. Their effectiveness still needs to be confirmed with much larger samples all around the world, while these economic “drugs” need to be tweaked, with dosages carefully varied, etc. We need encouragement and funding for this kind of work.
I have little doubt that the new initiative by 3ie is a worthy endeavor. As the website states, this work will discourage both outright dishonesty and carelessness by researchers that the referees did not catch during the journal review process. It is also likely to be cheap enough not to compete for the resources needed for the kind of replications we need in economics. Perhaps one suggestion: why not work with journals to hire capable graduate students to do this work? Journals, whether we like it or not, are the gateway to these publications. Many of them, at least the good ones, already make the authors do a slew of robustness checks, produce online appendices, make the data available, etc. What they don’t do is check the .do files and the actual data analysis. Couldn’t 3ie work with journals, which would perform the kind of “validity and robustness checks” it is after before a paper is published? This is more work for editors, but a system could be established where a percentage of papers are selected randomly.
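The random selection itself is trivial to set up. As a minimal sketch in Python, assuming a journal keeps a list of accepted manuscript IDs (the IDs, the 10% share and the function name below are all hypothetical), an editor could draw the papers to be checked like this:

```python
import random


def select_for_audit(paper_ids, fraction, seed):
    """Draw a random subset of accepted papers for a code-and-data check.

    A fixed seed makes the draw reproducible, so authors can verify that
    they were picked by chance rather than singled out.
    """
    ids = list(paper_ids)
    # Audit at least one paper whenever any were accepted.
    sample_size = min(len(ids), max(1, round(len(ids) * fraction)))
    rng = random.Random(seed)
    return sorted(rng.sample(ids, sample_size))


# Hypothetical manuscript IDs for one volume of a journal.
accepted = [f"MS-{number:03d}" for number in range(1, 21)]
print(select_for_audit(accepted, fraction=0.10, seed=2012))
```

Publishing the seed after the draw would let anyone confirm the selection, while keeping it private until then means authors cannot know in advance whether their files will be checked.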
In the meantime, we’re waiting for donors to announce an initiative that funds collaborative efforts to move from efficacy to effectiveness in development economics. Have research teams join forces and collaborate across countries and settings to answer an important question rather than race each other towards the unknown…