Will that successful intervention over there get results over here? We can never answer with full certainty, but a few steps may help


Imagine you are a local policy maker who just read about a new effective social program in another city and you want to determine whether this program would work for your area. This question, of course, concerns the external validity of program impacts, which we have discussed repeatedly here on this blog (see here and here for recent examples). The act of extrapolating evaluation results to other settings must always be driven, in part, by key assumptions – it’s unavoidable. But there are analytic methods that sometimes assist this extrapolation, thereby reducing the severity of the necessary assumptions.

A 2005 paper by Hotz, Imbens, and Mortimer predicted the results of a job-training program in new contexts by adjusting expected treatment impact for differences in the observable characteristics of participants. They do this by, you guessed it, balancing the observed characteristics of the control population in the program area and the new area through a matching estimator. They then estimate a treatment effect for the new area with the matching weights that reflect the characteristics of the new population.
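To make the reweighting idea concrete, here is a minimal sketch that replaces the matching estimator with something much cruder: it estimates treatment effects within covariate strata at the program site, then averages those effects using the new site's mix of strata. The stratum labels, the toy data, and the function names are all hypothetical, and this post-stratification shortcut is a stand-in for illustration, not the estimator used in the paper.

```python
from collections import defaultdict

def stratum_effects(records):
    """Treated-minus-control mean outcome within each covariate stratum.

    records: iterable of (stratum, treated, outcome) from the program site.
    Strata missing either a treated or a control observation are skipped.
    """
    cells = defaultdict(lambda: [0.0, 0, 0.0, 0])  # treated sum, n, control sum, n
    for stratum, treated, outcome in records:
        cell = cells[stratum]
        if treated:
            cell[0] += outcome
            cell[1] += 1
        else:
            cell[2] += outcome
            cell[3] += 1
    return {s: c[0] / c[1] - c[2] / c[3]
            for s, c in cells.items() if c[1] and c[3]}

def reweighted_effect(effects, target_strata):
    """Average the stratum effects over the new site's population."""
    usable = [s for s in target_strata if s in effects]
    if not usable:
        raise ValueError("no overlap between program site and new site")
    return sum(effects[s] for s in usable) / len(usable)

# Hypothetical program-site data: (stratum, treated, employed)
program_site = (
    [("no_hs", True, y) for y in (1, 1, 0, 1)]
    + [("no_hs", False, y) for y in (0, 1, 0, 0)]
    + [("hs", True, y) for y in (1, 1, 1, 1)]
    + [("hs", False, y) for y in (1, 1, 1, 0)]
)
effects = stratum_effects(program_site)

# Hypothetical new site: three quarters of residents lack a high-school diploma
new_site = ["no_hs", "no_hs", "no_hs", "hs"]
predicted = reweighted_effect(effects, new_site)
```

In this toy case the program-site effect is 0.5 for one stratum and 0.25 for the other, so a new site that leans toward the first stratum gets a predicted effect above the program-site average.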

Now a newly published paper by Carlos Flores and Oscar Mitnik extends this method to a multi-site setting wherein a policy maker can look at the results of a program implemented in numerous locations and infer local impacts by leveraging all of this information at the same time. The example is, again, a job-training program. This time the program was piloted in 5 cities in the US that randomized the offer of training to currently unemployed workers. For this multi-site setting, the authors estimate a generalized propensity score with a multinomial logit or probit in order to find comparable individuals in each of the program sites for every individual in the new site. After these scores are estimated, the region of common overlap needs to be determined through the adoption of a trimming rule similar to the trimming rules when the treatment is binary.
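As a rough illustration of what a trimming rule does, the sketch below takes already-estimated generalized propensity scores (one probability per site for each individual) and keeps only individuals who have a non-trivial chance of belonging to every site. The fixed cutoff of 0.1 and the toy scores are assumptions made for illustration; the paper's actual rule may well differ, and the score estimation step (the multinomial logit or probit) is left out entirely.

```python
def trim_to_overlap(scores, cutoff=0.1):
    """Indices of individuals whose score for every site is at least `cutoff`.

    scores: list of tuples, one per individual, holding the estimated
    probability of belonging to each site (each tuple sums to one).
    """
    return [i for i, s in enumerate(scores) if min(s) >= cutoff]

# Hypothetical scores for five individuals across three sites
scores = [
    (0.50, 0.30, 0.20),  # plausible member of any site: kept
    (0.90, 0.05, 0.05),  # looks like site 1 only: dropped
    (0.34, 0.33, 0.33),  # kept
    (0.10, 0.10, 0.80),  # sits exactly on the cutoff: kept
    (0.02, 0.08, 0.90),  # looks like site 3 only: dropped
]
kept = trim_to_overlap(scores)
```

Individuals who could only plausibly come from one site have no comparable counterparts elsewhere, which is why they are set aside before any cross-site comparison.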

If differences in program performance arise only from imbalances in the observable characteristics of participants between the old and new sites, then the method sketched above should predict program performance fairly well. Yet even after adjusting for differences in individual characteristics, other factors can clearly cause the outcomes of the comparison groups to diverge across study sites. For example, location-specific features, such as the capacity to implement the program, would surely also affect program impacts, and the covariate rebalancing approach doesn’t address this.

In the specific case of job-training programs, an obvious cause of divergence would be the local labor market conditions. One way to account for these conditions is to simply control for relevant labor market measures in the main estimating regression. What Flores and Mitnik do instead is model the effect of local conditions on outcomes in the pre-intervention period and then use the estimated parameters to adjust the outcomes of interest in the post-randomization period.
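A stripped-down version of this two-step adjustment might look like the sketch below. It assumes a single local-conditions measure (an unemployment rate) and a linear relationship, both of which are simplifications chosen for illustration; the numbers and function names are hypothetical and the model in the paper is specific to its own data.

```python
def ols_slope(x, y):
    """Slope from a one-regressor least-squares fit of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx

def adjust_outcomes(outcomes, conditions, slope, reference):
    """Shift each outcome to what the pre-period model implies it would be
    if local conditions were at the common `reference` level."""
    return [y - slope * (c - reference) for y, c in zip(outcomes, conditions)]

# Step 1: pre-intervention period, hypothetical data.
# Local unemployment rate (percent) and employment rate.
pre_unemployment = [4, 6, 8, 10]
pre_employment = [0.70, 0.65, 0.60, 0.55]
slope = ols_slope(pre_unemployment, pre_employment)

# Step 2: post-randomization control-group outcomes in two hypothetical sites
post_employment = [0.60, 0.50]
post_unemployment = [5, 9]
adjusted = adjust_outcomes(post_employment, post_unemployment, slope, reference=7)
```

In this contrived example the two sites differ by ten percentage points before adjustment and coincide afterwards, because the whole gap is explained by local unemployment. Real data would rarely be so tidy.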

To assess the success of this overall approach for program impact extrapolation, Flores and Mitnik measure the extent to which these corrective measures equalize the outcomes of interest in the control groups in the various sites. To capture the degree of similarity in outcomes, the authors calculate the normalized root mean squared distance (rmsd) – the square root of the sum of squared deviations of each site outcome from the global mean – to benchmark improvements in inference. The main outcome explored is whether the individual was ever employed in the two years following program onset.
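For readers who want to see the statistic spelled out, here is one plausible reading of it in code. The sketch averages the squared deviations of the site means from their unweighted mean, takes the square root, and divides by that mean. Both the averaging and the choice of normalizer are assumptions on my part, so the paper's exact formula may differ; the site means are made up.

```python
import math

def normalized_rmsd(site_means):
    """Root mean squared distance of site means from their overall mean,
    expressed as a share of that overall mean."""
    center = sum(site_means) / len(site_means)
    mean_sq_dev = sum((m - center) ** 2 for m in site_means) / len(site_means)
    return math.sqrt(mean_sq_dev) / abs(center)

# Hypothetical control-group employment rates in three sites
spread_out = normalized_rmsd([0.5, 0.6, 0.7])
identical = normalized_rmsd([0.6, 0.6, 0.6])
```

The statistic is zero when every site's control group has the same mean outcome and grows as the sites diverge, which is what makes it a convenient yardstick for comparing adjustment methods.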

The unadjusted rmsd of ever employed is 0.121, six times greater than the rmsd of 0.020 that would be expected in an experiment. (This reference experimental value is calculated through a placebo experiment in which placebo sites are randomly assigned to individuals while keeping the number of observations in each site the same as in the actual data.) Without taking into account local economic conditions, the generalized propensity score approach, which balances individual characteristics, reduces the rmsd to 0.099 – so comparability is improved, but still remains far from the randomized benchmark. Once outcomes are adjusted for local conditions, the rmsd is further reduced to 0.041. And if one of the five sites – Riverside, California – which has distinctly different local conditions, is excluded from the analysis, the rmsd actually reaches as low as 0.020. Thus, for sites with local labor market conditions similar to those of the remaining study sites, a local policy maker could likely infer program impacts fairly closely with this method.
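The placebo benchmark described in the parenthesis can be sketched as a simple permutation exercise: shuffle the site labels across individuals, which keeps each site's sample size fixed, recompute the rmsd, and average over many shuffles. The code below does this on made-up data. The rmsd definition it uses (root of the average squared deviation of site means, divided by the overall mean) is an assumed reading, and the number of draws and the seed are arbitrary.

```python
import math
import random

def site_means(labels, outcomes):
    """Mean outcome for each site label."""
    totals = {}
    for label, y in zip(labels, outcomes):
        s, n = totals.get(label, (0.0, 0))
        totals[label] = (s + y, n + 1)
    return [s / n for s, n in totals.values()]

def normalized_rmsd(means):
    center = sum(means) / len(means)
    return math.sqrt(sum((m - center) ** 2 for m in means) / len(means)) / abs(center)

def placebo_rmsd(labels, outcomes, draws=500, seed=0):
    """Average rmsd when site labels are randomly reassigned.

    Shuffling the label list permutes who belongs to which site while
    leaving the number of observations per site unchanged.
    """
    rng = random.Random(seed)
    shuffled = list(labels)
    total = 0.0
    for _ in range(draws):
        rng.shuffle(shuffled)
        total += normalized_rmsd(site_means(shuffled, outcomes))
    return total / draws

# Hypothetical data: three sites of 50 people with employment rates 0.8, 0.6, 0.4
labels = ["A"] * 50 + ["B"] * 50 + ["C"] * 50
outcomes = [1] * 40 + [0] * 10 + [1] * 30 + [0] * 20 + [1] * 20 + [0] * 30

actual = normalized_rmsd(site_means(labels, outcomes))
benchmark = placebo_rmsd(labels, outcomes)
```

The benchmark is not zero, since even randomly formed groups differ by chance. It shows how much cross-site variation sampling noise alone would produce, and the actual rmsd is judged against that floor.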

Unfortunately, for sites that look like Riverside, California, the possible program impact heterogeneity there cannot be fully separated from heterogeneity in local conditions. This last point speaks to the limitations of these impact projections: in situations where unobserved local characteristics interact with the program to affect impact, we would fail to accurately extrapolate results. In these cases little can be done with the existing data except, perhaps, to bound impacts in one direction. There would be no substitute for careful theorizing on the causal mechanisms of the program and how they might apply in the new context.


Jed Friedman

Senior Economist, Development Research Group, World Bank

Join the Conversation

Lant Pritchett
February 08, 2014

I like this comment as it reveals that the whole movement for basing policy on "rigorous evidence" (by which was meant either RCTs or evidence that people liked the identification of) is on its last legs.
First, suppose an RCT has been done on impact of job training on earnings in five American cities and not yours, say my home town of Boise Idaho. Suppose there are some observables that are available in the five cities and that hence I can predict from the RCT the likely causal impact in Boise. One can do that, but at this stage all of the propaganda of "rigorous" is lost. There is nothing defensibly more "rigorous" about this evidence than any other evidence that could be deployed. That is, I might have an OLS estimate of the association of wages and actual job training from implementation in Boise. There is no sense in which the extrapolation of evidence from an RCT elsewhere is more "rigorous" than OLS in Boise. That is, we know the internal validity issues that "correlation is not causation" for OLS but we also know the problems with external validity. So this is a tradeoff between two pieces of non-rigorous evidence so all rhetoric about "using rigorous evidence" is now irrelevant as in the proposed use the RCT evidence isn't rigorous.
(see Pritchett and Sandefur 2013 for empirical examples where it appears the external validity problems of variation across context are much worse than internal validity problems with causal identification).
Second, it is worth noting how wildly at odds what is being discussed is from most potential development applications. This is extrapolating from Riverside California to, say, Boise Idaho. But what about Riverside California to Cali Colombia? Or Ankara Turkey? or Nairobi Kenya? We would have to suppose that the variation in causal impact across contexts is mostly? primarily? exclusively? captured by measures that are measured in the RCT sites and that these measures are also comparable to the measures in policy application contexts? But without a complete, coherent, theoretically sound and empirically validated model that provides the "invariance laws" these adjustments are stabs in the dark.
Take the example of job training. Suppose that job training is more effective in US cities with lower measured unemployment rates because, it just so happens, given the US context these proxy across cities/regions for strong labor demand. But the "unemployment rate" may well be raised by transfer programs that allow for greater search. So it could be a poor city has a lower measured unemployment even with strength of labor demand. In this case not only would the causal impact lack external validity but the adjustments to external validity would lack external validity as lower unemployment could be associated with less impact in a city in a poor country rather than bigger impact as the adjustment would suggest. Maybe my example is false--but in extrapolating results across contexts no one knows if it is or not and the evidence doesn't tell us.
So in nearly all development applications we are in exactly the position Jed suggests in which "little can be done" with existing RCT estimates.

Jed Friedman
February 10, 2014

Hi Lant, thanks so much for your thoughts. I agree that extending policy lessons from setting A to setting B requires structural assumptions (so they better be carefully considered) and likely adjustments through quasi-experimental methods. There is just no way around this and the blind recommendation of "RCT-rigorous" evidence from setting A to B is terrible practice. Fortunately in my daily experience I don't see this that often. If anything I fear we may be evaluating too much - i.e. re-evaluating the whole kitty in A' (a setting highly similar to A), instead of thinking carefully about how the lessons from A translate to A'.
It reminds me of the story of the government official in, say, Zambia, who says "don't tell me about evidence from Rwanda - I only want to know what will work here". And in many senses this official is correct and acting in the best interests of her constituents and the state budget.
I would still posit there is room for RCTs as one tool she should consider for situated policy learning - in some settings it may be the best choice, in others it may be entirely inapplicable, and perhaps for some questions the "OLS" would yield the closest approximation (and would also be the cheapest approach!). We truly need to systematically think about when, where, and how we recommend a full-blown prospective RCT evaluation versus other approaches. And this decision needs to take into account the resources the study would command as well as the existing evidence base for the question asked.
Relatively small RCTs are particularly good for illuminating previously untested interventions/technologies assuming that SUTVA is not violated, etc. RCTs are particularly bad if we care about the performance of a whole system while intervening in a part of the system with something like information that can easily contaminate controls.
So I am quite sympathetic to your view, but perhaps don't lean all the way in your direction. For me, RCTs belong in the toolkit. But if it's the only hammer we employ then we are doing development a great disservice.

Oscar Mitnik
February 23, 2014

Dear Jed and Lant,
Thanks Jed for your careful explanation of the results in our paper. First, we want to make clear that we sympathize very much with the comments from both Jed and Lant. We believe that in the paper we make two points that are useful regarding the topics discussed in your comments: 1) If one has control individuals from RCTs in different locations, and one is not able to at least have individuals with similar observable characteristics between control groups in different locations, then it is very likely that no comparison is feasible of the treatment effects of those RCTs. In our paper we check this through a common support condition (multi-site), which in our paper drops a very large number of the individuals in Riverside. This alone shows that Riverside is probably quite different from the other sites, and it is not a good idea to use it in comparisons. 2) It is indeed important to consider possible site-specific differences across locations (e.g., local economic conditions) when information from an RCT in one location is to be used to make inferences about the potential effect from the implementation of a given intervention in another location. Our attempt at controlling for local economic conditions is rather specific to the data we have. But, the whole point is that we are explicitly imposing a model, and as with any model, it needs to be evaluated for reasonableness. Indeed, these types of adjustments would necessarily need to be specific to the particular context, which is in spirit close to Jed's point about “careful theorizing”. Also, we agree with Lant that, at this level, it is not clear a priori whether extrapolating from an RCT is necessarily better than employing “non-rigorous evidence” (e.g., OLS) to evaluate the effects of the intervention at the location of interest using only data from that specific location.
(Of course, in general, it is better to have alternative estimators of a given effect and choose the appropriate one depending on the situation at hand.)
Regarding Lant's comment about trying to apply these types of adjustment methods across countries, we completely share the strong concerns about the difficulties and dangers of doing so. Definitely anyone attempting to do that would need to approach the exercise with extreme caution. However, we still think that the first part of our approach, that of controlling for differences in individual characteristics, can be very useful, even if one is worried about differences in the environment that may overwhelm the differences in individual characteristics. To put it in another way: if one is not even able to show that at least based on certain characteristics there is enough “overlap” between the groups across locations, then it is time to stop the exercise and not even continue trying to assess external validity. But, if one were able to show that imposing some type of common support condition can eliminate a large portion of the differences in control groups, the next step would be to evaluate whether a carefully developed model could allow controlling for differences in the environment. For example, if several RCTs are produced in different regions within a country, then at least in those cases these types of adjustments may be useful (whether they can be useful across countries or not, the context and which countries are being compared would very much determine that). Of course, as also mentioned by Lant, at this point whether such an approach would be better than a “non-rigorous” approach using data only from the location of interest would likely depend on the application at hand.
It is great that more and more people are discussing these issues. We enjoyed reading the Pritchett and Sandefur 2013 paper (and agree with many of its points). We also encourage those interested in these topics to take a look at the recent papers by Rajeev Dehejia (2013, http://ideas.repec.org/p/unu/wpaper/wp2013-011.html) and by Allcott and Mullainathan (2012, http://www.nber.org/papers/w18373).
Carlos Flores & Oscar Mitnik