Published on Development Impact

Will that successful intervention over there get results over here? We can never answer with full certainty, but a few steps may help.

Imagine you are a local policy maker who just read about a new effective social program in another city and you want to determine whether this program would work for your area. This question, of course, concerns the external validity of program impacts, which we have discussed repeatedly here on this blog (see here and here for recent examples). The act of extrapolating evaluation results to other settings must always be driven, in part, by key assumptions – it’s unavoidable. But there are analytic methods that sometimes assist this extrapolation, thereby reducing the severity of the necessary assumptions.

A 2005 paper by Hotz, Imbens, and Mortimer predicted the results of a job-training program in new contexts by adjusting the expected treatment impact for differences in the observable characteristics of participants. They did this by, you guessed it, balancing the observed characteristics of the control populations in the program area and the new area through a matching estimator. They then estimated a treatment effect for the new area using matching weights that reflect the characteristics of the new population.
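To make the mechanics concrete, here is a minimal sketch of this covariate-rebalancing step. It uses inverse-odds reweighting, a close cousin of the matching estimator in the paper, and assumes hypothetical pandas DataFrames `experimental` (covariates plus `treated` and `outcome` columns) and `new_site` (covariates only); all column names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

covariates = ["age", "education", "prior_earnings"]  # hypothetical covariates

# Fit a model for the probability that an individual belongs to the new
# site rather than the experimental site, given observables.
stacked = pd.concat([experimental[covariates], new_site[covariates]])
labels = np.r_[np.zeros(len(experimental)), np.ones(len(new_site))]
model = LogisticRegression(max_iter=1000).fit(stacked, labels)

# Inverse-odds weights up-weight experimental individuals who look like
# new-site residents, so the reweighted sample mimics the new population.
p = model.predict_proba(experimental[covariates])[:, 1]
weights = p / (1 - p)

# Weighted difference in means: the predicted effect in the new site,
# valid only if observables capture all site-to-site differences.
t = experimental["treated"].to_numpy(dtype=bool)
effect = (np.average(experimental.loc[t, "outcome"], weights=weights[t])
          - np.average(experimental.loc[~t, "outcome"], weights=weights[~t]))
print(f"Predicted treatment effect in the new site: {effect:.3f}")
```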

Now a newly published paper by Carlos Flores and Oscar Mitnik extends this method to a multi-site setting, wherein a policy maker can look at the results of a program implemented in numerous locations and infer local impacts by leveraging all of this information at once. The example is, again, a job-training program. This time the program was piloted in five US cities, with the offer of training randomized among currently unemployed workers. For this multi-site setting, the authors estimate a generalized propensity score with a multinomial logit or probit in order to find comparable individuals in each of the program sites for every individual in the new site. Once these scores are estimated, the region of common overlap is determined by adopting a trimming rule similar to those used when the treatment is binary.
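A rough sketch of this multi-site step might look like the following, assuming a pooled DataFrame `df` of control individuals across sites with a `site` label and the covariates above; the 0.05 trimming threshold is illustrative, not the paper's exact rule.

```python
from sklearn.linear_model import LogisticRegression

# A multinomial logit over site labels yields a generalized propensity
# score: each individual's probability of belonging to each site.
gps_model = LogisticRegression(max_iter=1000).fit(df[covariates], df["site"])
gps = gps_model.predict_proba(df[covariates])  # one column per site

# Trim to the region of common overlap: keep individuals who have a
# non-negligible probability of belonging to *every* site, so that each
# person has comparable counterparts everywhere.
in_overlap = (gps > 0.05).all(axis=1)
trimmed = df[in_overlap]
print(f"{in_overlap.mean():.0%} of observations lie in the common-overlap region")
```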

If the only reason for differences in program performance is imbalance in the observable characteristics of participants between the old and new sites, then the method sketched above should predict program performance fairly well. Yet even after adjusting for differences in individual characteristics, other factors can clearly cause the outcomes of the comparison groups to diverge across study sites. For example, location-specific features – such as the capacity to implement the program – would surely also affect program impacts, and the covariate-rebalancing approach doesn't address this.

In the specific case of job-training programs, an obvious cause of divergence would be the local labor market conditions. One way to account for these conditions is to simply control for relevant labor market measures in the main estimating regression.  What Flores and Mitnik do instead is model the effect of local conditions on outcomes in the pre-intervention period and then use the estimated parameters to adjust the outcomes of interest in the post-randomized period.
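In code, this two-step adjustment could look roughly like the sketch below; it assumes site-level DataFrames `pre` and `post` with an `outcome` column and hypothetical labor market measures, and uses plain OLS as a stand-in for the authors' model.

```python
import statsmodels.api as sm

conditions = ["unemployment_rate", "employment_growth"]  # hypothetical measures

# Step 1: estimate how local labor market conditions relate to outcomes
# in the pre-intervention period, before the program could matter.
X_pre = sm.add_constant(pre[conditions])
fit = sm.OLS(pre["outcome"], X_pre).fit()

# Step 2: strip the estimated contribution of local conditions from the
# post-period outcomes, leaving residualized outcomes that are more
# comparable across sites.
post["adjusted_outcome"] = post["outcome"] - post[conditions] @ fit.params[conditions]
```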

To assess the success of this overall approach to program impact extrapolation, Flores and Mitnik measure the extent to which these corrective measures equalize the outcomes of interest across the control groups in the various sites. To capture the degree of similarity in outcomes, the authors calculate the normalized root mean squared distance (rmsd) – the square root of the average squared deviation of each site's outcome from the global mean – to benchmark improvements in inference. The main outcome explored is whether the individual was ever employed in the two years following program onset.
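As a concrete illustration, the rmsd can be computed in a few lines; this sketch assumes pandas Series `outcome` (1 if ever employed) and `site`, and may differ from the authors' exact normalization.

```python
import numpy as np

# Root mean squared distance of site-level means from the global mean:
# a summary of how far comparison-group outcomes diverge across sites.
site_means = outcome.groupby(site).mean()
rmsd = np.sqrt(((site_means - site_means.mean()) ** 2).mean())
print(f"rmsd across sites: {rmsd:.3f}")
```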

The unadjusted rmsd of ever employed is 0.121, six times greater than what would be expected in an experiment, where the rmsd = 0.020. (This reference experimental value is calculated through a placebo exercise in which placebo sites are randomly assigned to individuals while keeping the number of observations in each site the same as in the actual data.) Without taking into account local economic conditions, the generalized propensity score approach, which balances individual characteristics, reduces the rmsd to 0.099 – so comparability is improved, but still remains far from the randomized benchmark. Once outcomes are adjusted for local conditions, the rmsd falls further to 0.041. And if one of the five sites – Riverside, California – which has distinctly different local conditions, is excluded from the analysis, the rmsd reaches as low as 0.020. Thus, for sites whose local labor market conditions resemble those of the sites studied, a local policy maker could likely infer program impacts fairly closely with this method.
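The placebo benchmark described in parentheses can be approximated with a simple permutation exercise, reusing the hypothetical `outcome` and `site` series from the sketch above; the 1,000 draws are an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
placebo = []
for _ in range(1000):
    # Randomly reassign site labels while preserving each site's size,
    # then recompute the rmsd; any dispersion left is pure sampling noise.
    shuffled = rng.permutation(site.to_numpy())
    means = outcome.groupby(shuffled).mean()
    placebo.append(np.sqrt(((means - means.mean()) ** 2).mean()))
print(f"experimental benchmark rmsd: {np.mean(placebo):.3f}")
```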

Unfortunately, for sites that look like Riverside, California, the possible program impact heterogeneity there cannot be fully separated from heterogeneity in local conditions. This last point speaks to the limitations of these impact projections: in situations where unobserved local characteristics interact with the program to affect its impact, we would fail to accurately extrapolate results. In these cases little can be done with the existing data except, perhaps, to bound impacts in one direction. There is no substitute for careful theorizing about the causal mechanisms of the program and how they might apply in the new context.
 

Join the Conversation

Lant Pritchett
I like this comment as it reveals that the whole movement for basing policy on "rigorous evidence" (by which was meant either RCTs or evidence that people liked the identification of) is on its last legs.

First, suppose an RCT has been done on the impact of job training on earnings in five American cities, and not yours – say my home town of Boise, Idaho. Suppose there are some observables that are available in the five cities and that hence I can predict from the RCT the likely causal impact in Boise. One can do that, but at this stage all of the propaganda of "rigorous" is lost. There is nothing defensibly more "rigorous" about this evidence than any other evidence that could be deployed. That is, I might have an OLS estimate of the association of wages and actual job training from implementation in Boise. There is no sense in which the extrapolation of evidence from an RCT elsewhere is more "rigorous" than OLS in Boise. That is, we know the internal validity issue that "correlation is not causation" for OLS, but we also know the problems with external validity. So this is a tradeoff between two pieces of non-rigorous evidence, and all rhetoric about "using rigorous evidence" is now irrelevant, as in the proposed use the RCT evidence isn't rigorous. (See Pritchett and Sandefur 2013 for empirical examples where it appears the external validity problems of variation across context are much worse than the internal validity problems with causal identification.)

Second, it is worth noting how wildly at odds what is being discussed is from most potential development applications. This is extrapolating from Riverside, California to, say, Boise, Idaho. But what about Riverside, California to Cali, Colombia? Or Ankara, Turkey? Or Nairobi, Kenya? We would have to suppose that the variation in causal impact across contexts is mostly? primarily? exclusively? captured by measures that are measured in the RCT sites, and that these measures are also comparable to the measures in policy application contexts. But without a complete, coherent, theoretically sound and empirically validated model that provides the "invariance laws," these adjustments are stabs in the dark.

Take the example of job training. Suppose that job training is more effective in US cities with lower measured unemployment rates because, it just so happens, given the US context these proxy across cities/regions for strong labor demand. But the "unemployment rate" may well be raised by transfer programs that allow for greater search. So a city in a poor country could have lower measured unemployment even with weak labor demand. In this case not only would the causal impact lack external validity, but the adjustments for external validity would lack external validity, as lower unemployment could be associated with less impact in a city in a poor country rather than the bigger impact the adjustment would suggest. Maybe my example is false – but in extrapolating results across contexts no one knows if it is or not, and the evidence doesn't tell us. So in nearly all development applications we are in exactly the position Jed suggests, in which "little can be done" with existing RCT estimates.


paridhi gupta
Was a nice one; helped a lot in my school project. Good information given and nice language used!
