Towards a more systematic approach to external validity: understanding site-selection bias



The impact evaluation (IE) of a new policy or program aims to inform the decision on wider adoption and even, perhaps, national scale-up. Yet in practice an IE often involves a study in one localized area – a sample “site” in the terminology of a newly revised working paper by Hunt Allcott. This working paper leverages a unique program roll-out in the U.S. to explore the challenges and pitfalls that arise when generalizing IE results from a handful of sites to a larger context. The leap from impact estimates taken in one site to the larger world is not always straightforward.
To extrapolate results from a single evaluation to a wider population, practitioners need to assume that there is no unobservable characteristic that (a) influences the treatment effect, and (b) varies between the study sample and the wider population. (Note that differences in observed characteristics are also not ideal but can, in principle, be controlled for.) Allcott calls this two-part assumption the assumption of external unconfoundedness – a necessary assumption if we wish to recommend the adoption of a new program on the basis of an evaluation conducted in a handful of sites.
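One way to write the assumption down, in potential-outcomes notation, is sketched here. The symbols are chosen for illustration and may not match the working paper's exactly.

```latex
% Assumed notation (for illustration only):
%   Y_i(1), Y_i(0): potential outcomes for unit i with and without the program
%   D_i = 1 if unit i is in the evaluation sample, 0 if in the wider population
%   X_i: observed characteristics
\[
  D_i \;\perp\; \bigl(Y_i(1) - Y_i(0)\bigr) \;\big|\; X_i
\]
% which implies that conditional average effects carry over from sample to population:
\[
  \mathbb{E}\bigl[Y_i(1) - Y_i(0) \mid X_i, D_i = 1\bigr]
  = \mathbb{E}\bigl[Y_i(1) - Y_i(0) \mid X_i, D_i = 0\bigr]
\]
```

Read this way, any unobserved characteristic that both shifts the treatment effect and differs between sample and population breaks the independence condition.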
When should we worry about external unconfoundedness? When the IE is subject to site- or implementer-selection bias. Typically this bias arises when the pilot program is deliberately located in an area that enables program success. For example, if pilot program success is due to the high local capacity for implementation – and this capacity does not exist elsewhere – then our impact estimates will be biased as a guide to impact elsewhere. Alternatively, the selected site may contain characteristics that influence successful implementation – perhaps the area has more cohesive social networks or higher human capital than elsewhere. Insofar as these characteristics can be measured, site-selection bias can be accounted for, but it is notoriously difficult to measure and observe all possible relevant characteristics.
The new working paper is an extended look at the validity of the external unconfoundedness assumption in the context of the Opower program – an information campaign directed to electricity consumers in the U.S. with the goal of promoting energy conservation. Amazingly, Opower has conducted sequential RCTs of program effectiveness in 111 sites (!) around the U.S.
The previous version of the Allcott working paper focused on treatment heterogeneity as the program was introduced in the first ten sites – each site constituting a city or county. Rich micro data on 512,000 households in these ten sites not only include details on power consumption but also allow for an investigation of the determinants of treatment heterogeneity, which in turn can inform out-of-sample predictions and an investigation of site-selection bias.
What does the Opower intervention entail? It’s a simple information campaign for electricity consumers in which the program sends treated households a tailored home energy report that compares their electricity use with that of the average neighbor and provides energy conservation tips. These reports are sent either every month or every quarter for two years.
Results from the first 10 sites suggest that the Opower program reduces electricity usage by 1.7 percent – a point estimate that suggests a total savings of $2.3 billion if scaled nationally. After these first 10 sites, the program went on to conduct randomized experiments at an additional 101 sites. However, for these later 101 sites, the average program effect is estimated at only 1.2 percent of total usage – still a substantial savings in dollar terms but almost 30% less than the initial program estimate, and a difference large enough to affect the determination of program cost-effectiveness.
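The “almost 30% less” figure follows directly from the two point estimates. The short sketch below reproduces it. The rescaled dollar figure is only a back-of-the-envelope illustration: it assumes national savings scale linearly with the percentage effect, which the post does not claim.

```python
# Point estimates quoted in the post (percent reduction in electricity use).
early_effect_pct = 1.7   # first 10 sites
later_effect_pct = 1.2   # later 101 sites

# Relative shortfall of the later estimate compared with the early one.
relative_shortfall = (early_effect_pct - later_effect_pct) / early_effect_pct

# ASSUMPTION (not from the post): national savings scale linearly with the
# percentage effect, so the $2.3 billion figure can be rescaled proportionally.
early_savings_bn = 2.3
implied_later_savings_bn = early_savings_bn * later_effect_pct / early_effect_pct

print(f"Later estimate is {relative_shortfall:.1%} below the early estimate")
print(f"Implied national savings under linear scaling: ${implied_later_savings_bn:.2f} billion")
```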
To explore potential drivers of site-specific differences, the author leverages micro data to model the initial site selection probability of the first 10 study sites. The focus falls on four types of potentially key mediating covariates: (a) pre-program electricity usage; (b) the local presence of additional conservation programs; (c) utility-level characteristics such as ownership type; and (d) population preferences for energy conservation, since sites differ in their share of consumers with strong preferences for environmental conservation. These conservation preferences are proxied in the data by income, education, political party vote share, and the share of hybrid vehicles. It turns out that site selection probability is indeed positively correlated with population preferences – the early sites for the Opower study had many households that cared about conservation and this, in turn, led to atypically large treatment effect estimates.
The main take-home message is not that IE is not useful. Indeed, an earlier paper by Allcott concluded that non-experimental evaluations of the Opower program give very misleading estimates of program impact – much more misleading than the estimates from the first 10 Opower sites.
But external unconfoundedness needs to be taken seriously and not just in this particular context. As corollary evidence, Allcott reviews the IE literature around micro-finance with a focus on the type of micro-finance institutions that have partnered with researchers to conduct randomized trials. A comparison of these partner institutions vis-à-vis the global population of MFIs reveals systematic differences in characteristics such as default rates and organizational structure that, in turn, may affect program performance.
While the Allcott paper leverages remarkably rich data to explore site-selection bias, the underlying problem has been a long-standing concern for impact evaluation. Indeed much of the IE with which I am involved purposively takes this concern into account by introducing the evaluated program in a variety of settings within the national context.
So how can we be more systematic in identifying possible external confounders in impact estimates? Clearly we need to think about partner choice and whether potential implementing partners look like the broader population of potential partners (note that this holds true even for government-implemented programs when they are implemented in select regions of a country). We should also start to regularly compare the characteristics of the study sample with the characteristics of the population for which the program is ultimately intended. For any study that wishes to draw larger policy implications, perhaps we should view these analytic steps as akin to the investigation of covariate balance that helps to establish the internal validity of impact estimates.
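As a rough illustration of what such a comparison might look like, the sketch below computes a standardized mean difference between study sites and the intended target sites for a single characteristic, in the same spirit as a covariate balance table. The function name and the data are hypothetical. Nothing here is taken from the Allcott paper.

```python
from math import sqrt
from statistics import mean, variance


def standardized_difference(sample, population):
    """Difference in means divided by the pooled standard deviation.

    Values near zero suggest the study sites resemble the target
    population on this characteristic. How large is too large is a
    judgment call.
    """
    pooled_sd = sqrt((variance(sample) + variance(population)) / 2)
    if pooled_sd == 0:
        raise ValueError("no variation in either group; difference is undefined")
    return (mean(sample) - mean(population)) / pooled_sd


# HYPOTHETICAL data: share of hybrid vehicles (%) by site.
study_sites = [2.1, 1.8, 2.5, 2.9, 2.2]
target_sites = [0.9, 1.2, 1.5, 0.8, 1.1, 1.6, 1.0, 1.3]

smd = standardized_difference(study_sites, target_sites)
print(f"Standardized difference in hybrid share: {smd:.2f}")
```

Repeating this for each observed characteristic gives a simple table of sample-versus-population differences. It says nothing about unobserved characteristics, which is where external unconfoundedness still has to be argued rather than checked.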


Jed Friedman

Senior Economist, Development Research Group, World Bank

Join the Conversation

Sean Muller
May 15, 2014

Thanks for an interesting post.
You might be interested in a paper I'm presenting at ABCDE next month, which has the same basic punchline but with more formal detail. And a bit more on other literatures (Rothwell, etc). An older version has been circulating I think but a slightly updated version is here:
Thoughts/comments welcome.

Jed Friedman
May 15, 2014

Sean, thanks very much for your comment and the link to the paper - I will read it asap! Jed

Claudia Sepúlveda
May 15, 2014

The 2014 ABCDE will be held at the World Bank headquarters, June 2-3, on the theme of the Role of Theory in Development Economics. The conference will have a session on the Validity of Experiments. Three papers will be presented and David McKenzie will be the discussant. 
The three papers to be presented are as follows:

“Designing Experiments to Measure Spillover Effects” presented by Berk Özler (World Bank and University of Otago, New Zealand)

“Randomized trials for Policy: a Review of the External Validity of Treatment Effects” presented by Sean Muller (University of Cape Town, South Africa)

“’Revisit’ the Silk Road: A Quasi-Experiment Approach Estimating the Effects of Railway Speed-Up Project on China-Central Asia Exports” presented by Hangtian Xu (Tohoku University, Japan)

Let's start the conversation and register for the conference at:

May 26, 2014

Interesting stuff. It's the sort of thing that on the one hand, when it is brought up, one tends to react "Of course you'd expect some variation ...", and yet is so often forgotten when people say "We know intervention X works ..."
Implementer bias is something that has concerned me for some time, or to be more precise the failure of many funding agencies to consider its implications. See