When done well, randomized experiments at least provide internal validity – they tell us the average impact of a particular intervention in a particular location with a particular sample at a particular point in time. Of course we would then like to use these results to predict how the same intervention would work in other locations or with other groups or in other time periods. An interesting new paper by Hunt Allcott and Sendhil Mullainathan provides an approach for doing this, and points out some of the pitfalls involved.
They consider the results from 14 nearly identical energy conservation experiments involving 550,000 households across the U.S. The treatment in these experiments is to mail Home Energy Reports to residential electricity customers, comparing their energy use to that of their neighbors, and also providing them with action steps to conserve energy. Several studies have looked at the outcomes of these experiments, including a paper by Costa and Kahn which got some attention for the finding that Republicans tended to increase their energy usage when they found out they were using less energy than their neighbors whereas Democrats didn’t.
The company that does these experiments works with local utilities in each area, and at the time of writing, experiments had been conducted at 14 different sites, with only “minor” differences in implementation across sites (some variations in frequency of the reports and in the presentation). The mean average treatment effect (ATE) across experiments is a 2.06 percent reduction in electricity use, varying by a factor of 2.4, from 1.37 percent to 3.32 percent across sites. This variation is economically meaningful, as it determines whether the program is cost-effective or not.
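As a quick check on the arithmetic, the "factor of 2.4" is simply the ratio of the largest to the smallest site-level effect quoted above. The variable names here are my own:

```python
# Smallest and largest site-level ATEs quoted in the text (percent reductions).
smallest_ate = 1.37
largest_ate = 3.32

# Ratio of the largest to the smallest effect across the 14 sites.
ratio = largest_ate / smallest_ate
print(round(ratio, 2))  # 2.42
```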
Extrapolating to non-treatment sites
Under a strong version of external validity, it should be possible to extrapolate treatment effects from one location to another once one takes account of observable differences between locations – i.e. we should be able to use our experimental results and observable data to say how a treatment will perform in any particular location.
The authors have quite detailed data at both the aggregate and household levels. For each experimental site, they use data from the other 13 sites to try to predict what the ATE would be. They use two methods for this prediction: parametric extrapolation, and reweighting using inverse probability weights. They can then compare the predicted treatment effect to the effect the experiment at that site actually found, and test whether the difference is significant or not.
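To make the leave-one-site-out idea concrete, here is a minimal sketch on simulated data. It is not the authors' code or specification: the data-generating process, the single covariate `x`, the linear model with a treatment-by-covariate interaction, and all function names are assumptions of mine, and only the parametric extrapolation is shown (not the reweighting approach). The test statistic also ignores sampling error in the prediction, so it is rougher than what a real analysis would use.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_site(n, x_mean, site_shift):
    """Hypothetical site: x is one household covariate, t is random assignment."""
    x = rng.normal(x_mean, 1.0, n)
    t = rng.integers(0, 2, n)
    # Assumed treatment effect: depends on x plus a site shift not captured by x.
    effect = -2.0 - 0.5 * x + site_shift
    y = x + effect * t + rng.normal(0.0, 1.0, n)
    return x, t, y

def experimental_ate(t, y):
    """Difference in means between treated and control, with its standard error."""
    y1, y0 = y[t == 1], y[t == 0]
    ate = y1.mean() - y0.mean()
    se = np.sqrt(y1.var(ddof=1) / y1.size + y0.var(ddof=1) / y0.size)
    return ate, se

def predicted_ate(other_sites, x_target):
    """Fit y ~ 1 + x + t + t*x on the pooled other sites, then evaluate
    the implied treatment effect at the target site's mean covariate."""
    x = np.concatenate([s[0] for s in other_sites])
    t = np.concatenate([s[1] for s in other_sites])
    y = np.concatenate([s[2] for s in other_sites])
    design = np.column_stack([np.ones_like(x), x, t, t * x])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[2] + beta[3] * x_target.mean()

# 14 simulated sites that differ in an observable (mean of x) and an unobservable shift.
sites = [simulate_site(2000, rng.normal(0, 1), rng.normal(0, 0.3)) for _ in range(14)]

results = []
for i, (x, t, y) in enumerate(sites):
    others = sites[:i] + sites[i + 1:]      # the other 13 sites
    pred = predicted_ate(others, x)         # extrapolated ATE for site i
    actual, se = experimental_ate(t, y)     # what site i's own experiment shows
    z = (pred - actual) / se                # rough test; ignores error in pred
    results.append((pred, actual, z))

# Share of sites where predicted and actual ATEs differ at roughly the 5% level.
share_rejected = np.mean([abs(z) > 1.96 for _, _, z in results])
print(f"rejected in {share_rejected:.0%} of simulated sites")
```

In this toy setup, the rejections come from the `site_shift` term, which plays the role of site-level variation that observables do not explain.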
The bad news is that they reject equality of the predicted and actual ATEs in 47 percent of the cases – that is, there is economically significant variation in the treatment effects across locations that is not explained by observables.
On the other hand, they find the extrapolations from the experimental data do much better than non-experimental estimates of the treatment effect using data from the actual location – so while experiments don’t extrapolate perfectly, they do better than non-experimental alternatives.
A weaker version of external validity is to argue that we might not be able to accurately predict how a treatment might work in a specific location, but we should be able to at least predict how it will do on average if experiments were run in a random sample of settings. But this is where partner selection bias comes in – the authors find that the electric utilities where the experiments were run had different ownership, were larger, and tended to be in wealthier states with stronger environmental regulation than other utilities in the U.S. – and that these characteristics were correlated with the average treatment effects.
This is but one example, and so the question is whether it applies more broadly. Some of the original work on job-training programs in the U.S. found non-random selection of sites into the program. Allcott and Mullainathan also consider the case of microfinance experiments – showing that the MFIs that have participated in experiments differ significantly on a number of dimensions from the average MFI in the Microfinance Information Exchange database.
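A small simulation can illustrate why partner selection matters for this weaker notion. Everything in it is hypothetical: the single standardized characteristic `size`, the direction and strength of its link to the treatment effect, and the selection rule are assumptions chosen for illustration, not estimates from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n_utilities = 5000

# Hypothetical standardized utility characteristic (e.g. size).
size = rng.normal(0.0, 1.0, n_utilities)

# Assumed true site-level ATEs (percent change in use), correlated with size.
true_ate = -2.0 - 0.4 * size + rng.normal(0.0, 0.2, n_utilities)

# Assumed selection rule: larger utilities are more likely to become partners.
p_partner = 1.0 / (1.0 + np.exp(-(size - 2.0)))
is_partner = rng.random(n_utilities) < p_partner

population_mean = true_ate.mean()          # what a random sample of sites would recover
partner_mean = true_ate[is_partner].mean() # what the self-selected partner sites show

print(f"all utilities: {population_mean:.2f}, partner sites: {partner_mean:.2f}")
```

Because selection and treatment effects both depend on `size` here, the average across partner sites overstates the reduction one would see in a random sample of utilities, even though each site-level experiment is internally valid.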
Where to from here then?
To date we have very few examples where the same experiment has been attempted in multiple locations, making it hard to assess how easy it is to extrapolate. The results of this study are therefore very interesting and act as a warning against blithely taking the results of one study and expecting them to hold elsewhere. So what should we do?
First, the authors argue this is not a reason to abandon experiments (non-experimental methods do even worse) and, as I’ve argued before, external validity issues are often just as much of an issue in non-experimental studies in development. The authors note that when treatment effects are difficult to generalize, there is all the more reason to obtain internally valid estimates in the target population of interest.
Second, they ask researchers to provide more discussion of context and how the results of experiments might be expected to differ in other settings of policy interest.
Third, they note that mechanisms might generalize more easily than ATEs across sites and domains, so spending more attention on identifying these mechanisms could be fruitful.
Most of all, I think this paper should be considered as the start of an exciting new line of work. I see many of our existing experiments as proof of concept – showing that a particular policy can work somewhere and under some relevant conditions is certainly not trivial. But as we learn more about whether an intervention can work, the question of when and how much it will work increasingly comes to the forefront, and so efforts such as this one are likely to become increasingly important.