
Towards policy irrelevance? Thoughts on the experimental arms race and Chris Blattman’s predictions

By David McKenzie

Chris Blattman posted an excellent (and surprisingly viral) post yesterday with the title “why I worry experimental social science is headed in the wrong direction”. I wanted to share my thoughts on his predictions.
He writes:
Take experiments. Every year the technical bar gets raised. Some days my field feels like an arms race to make each experiment more thorough and technically impressive, with more and more attention to formal theories, structural models, pre-analysis plans, and (most recently) multiple hypothesis testing. The list goes on. In part we push because we want to do better work. Plus, how else to get published in the best places and earn the respect of your peers?
It seems to me that all of this is pushing social scientists to produce better quality experiments and more accurate answers. But it’s also raising the size and cost and time of any one experiment.

Leading him to predict that:
I think it is also going to push experimenters to increase sample sizes, to be able to meet these more strenuous standards. If so, I’d expect this to reduce the quantity of field experiments that get done.
I also expect that higher standards will be disproportionately applied to experiments. So in some sense it will raise the bar for some work over others. Younger and junior scholars will have stronger incentives to do observational work.
…to me the danger is this: That all the effort to make experiments more transparent and accurate in the end instead limits how well we understand the world, and that a reliance on too few studies makes our theory and judgment and policy worse rather than better.
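Blattman’s sample-size point can be made concrete with a back-of-the-envelope power calculation (my own illustration, not from either post): correcting for multiple hypothesis testing shrinks the per-test significance level, which mechanically inflates the sample needed to keep the same power. A minimal sketch, using a standard normal approximation for a two-sided two-sample test of means:

```python
# Illustration only: how a multiple-testing correction inflates sample size.
# Normal approximation; effect size is in standard-deviation units.
from statistics import NormalDist

def n_per_arm(effect_size, alpha=0.05, power=0.80):
    """Approximate sample size per arm for a two-sided two-sample test of means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for desired power
    return 2 * ((z_alpha + z_power) / effect_size) ** 2

# One primary outcome at alpha = 0.05 vs. ten outcomes with a
# Bonferroni-adjusted alpha of 0.05/10 = 0.005 (hypothetical numbers):
base = n_per_arm(0.2)
bonferroni = n_per_arm(0.2, alpha=0.05 / 10)
print(round(base), round(bonferroni))  # → 392 666
```

For a small effect size of 0.2, moving from one pre-specified outcome to ten (with a simple Bonferroni correction) raises the required sample per arm by roughly 70 percent — exactly the kind of cost pressure the quoted prediction describes.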

Back to the past? Will experiments move away from directly testing big government policies?
As I read it, Chris sees a future in which teams of senior researchers conduct massively expensive large-scale experiments with lots of bells and whistles in just a few settings. But I think some of the trends he is talking about push strongly in another direction – ironically, back towards many of the original types of experiments done in development economics.

Consider two types of experiments. Type A is a researcher working with a small NGO program, with a single firm, or with an intervention they have designed and funded themselves. Here the researcher has a large degree of control over the intervention being done, how it is implemented, the selection of individuals in the study, whether there are multiple treatment arms, and the timeline of the project. But the trade-off is that it is a small project, and external validity concerns are pretty high. Many of my initial firm experiments took this form, as did a lot of the early experiments in health and education in development.
Experiment type B is evaluating a large government policy implemented at scale. This solves the sample size issue and eases external validity concerns. But here the researcher has much less control over what is being done, how it is implemented, the timeline, etc. I’ve done a lot more of this type of work recently in many experiments at the World Bank, as have a number of other researchers in the second wave of experiments in development. There has been rapid growth in these types of evaluations as funders like the World Bank, 3ie, and DFID have encouraged this work.

The problem is that many of the advances Chris points to are much easier to handle in a type A experiment than in a type B. You want a clearly laid out ex ante hypothesis restricted to just a couple of outcomes, along with several treatment arms that allow me to estimate the key parameters of a structural model? Sure, I’ll think through all these issues first and design the perfect experiment to test between competing hypotheses. I might even be able to do this with a really short timeframe. A recent example epitomizing this type is this nice paper by Emily Breza, Supreet Kaur and Yogita Shamdasani on the morale effects of pay inequality: they basically set up their own small firm, run a month-long experiment in which they offer different pay rates to 378 Indian workers, and see what happens to productivity. More generally, I expect to see a lot more of what Sendhil Mullainathan calls mechanism experiments.

In contrast, consider trying to evaluate the impact of a large government policy. The range of potential outcomes that may be affected is much larger; most of the time I don’t really understand what the intervention is until after it gets implemented (and even then I don’t always understand all the details, or it is a bundle of different things); the time frame is at the mercy of the government and likely to be super long; and it is often going to be way more expensive to collect data. But if we want to know the impacts of government policies, this directly answers that question.

My observation is that it is getting harder and harder to publish these type B experiments, for many of the reasons Chris mentioned. At first it was enough to say “here is a massive government policy designed to address an important public need, for which we have no evidence as to its effectiveness” and report on an experiment that shows what the impact is. (Although it is worth noting that in one of the most famous examples of a randomized experiment of a government program, Progresa/Oportunidades, the paper reporting the treatment effects on the main outcome (education) was published in the JDE, and it was only the later papers that used these data to estimate a structural model, look at spillovers, or distinguish between competing theories that were published in top-5 journals.) But unless your evaluation is of a major U.S. policy, it is becoming harder to sell policy impact papers to top journals.

Researchers therefore have incentives to move away from engaging with policy, back to situations where they can control every detail of the experiment themselves – which has always been the approach of the lab experimenters. So the question is whether field experiments in economics will just become a branch of lab experiments, informing policy only indirectly by pointing to mechanisms.

The double-double standard persists
As a final point, I agree with Chris’s prediction that higher standards will be disproportionately applied to experiments. One of the first posts I did on this blog was a rant against the external validity double standard being applied to field experiments in development. A lot of these other trends are likely to carry a double standard too – pre-analysis plans being a case in point: such approaches will almost never be used for observational studies, even though p-hacking is in practice a much bigger worry for non-experimental work.


Submitted by Rachel Glennerster

While there are many benefits to doing large RCTs with governments, and I would hate to see it become harder to get these published, we need to be careful in assuming that RCTs with governments are more policy relevant than those with NGOs. The NGO Pratham reaches more kids than many governments. Plus, an RCT that tests underlying human behavior may generalize better than a specific program, and thus be more policy relevant in the long run. Think, for example, of the RCTs (with NGOs) that tested pricing of preventative goods.

Submitted by Ashu Handa

One thing the perfectly controlled study cannot mimic is implementation failure, another reason why studying messy government programs is important. Read more here at
