Imagine you are a local policy maker who just read about a new effective social program in another city and you want to determine whether this program would work for your area. This question, of course, concerns the external validity of program impacts, which we have discussed repeatedly here on this blog (see here and here for recent examples). The act of extrapolating evaluation results to other settings must always be driven, in part, by key assumptions – it’s unavoidable. But there are analytic methods that sometimes assist this extrapolation, thereby reducing the severity of the necessary assumptions.
A 2005 paper by Hotz, Imbens, and Mortimer predicted the results of a job-training program in new contexts by adjusting the expected treatment impact for differences in the observable characteristics of participants. They do this by, you guessed it, balancing the observed characteristics of the control population in the program area and the new area through a matching estimator. They then estimate a treatment effect for the new area using matching weights that reflect the characteristics of the new population.
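To fix ideas, here is a minimal sketch of this kind of covariate rebalancing, assuming one program site and one new target site. It uses a simple propensity-score-style reweighting rather than the specific matching estimator in the Hotz, Imbens, and Mortimer paper, and the function and array names are hypothetical.

```python
# Minimal sketch (not the authors' exact estimator): reweight the program-site
# sample so its covariate distribution resembles the new site's, then take a
# weighted treatment-control contrast. All inputs are hypothetical numpy arrays.
import numpy as np
from sklearn.linear_model import LogisticRegression

def reweighted_effect(X_prog, y_prog, d_prog, X_new):
    """X_prog, y_prog, d_prog: covariates, outcomes, and treatment dummies at
    the program site; X_new: covariates of the new site's population."""
    # Model the probability of belonging to the new site given covariates.
    X_all = np.vstack([X_prog, X_new])
    site = np.concatenate([np.zeros(len(X_prog)), np.ones(len(X_new))])
    ps = LogisticRegression(max_iter=1000).fit(X_all, site).predict_proba(X_prog)[:, 1]
    # Odds weights up-weight program-site individuals who look like the new site.
    w = ps / (1.0 - ps)
    treated, control = d_prog == 1, d_prog == 0
    return (np.average(y_prog[treated], weights=w[treated])
            - np.average(y_prog[control], weights=w[control]))
```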
Now a newly published paper by Carlos Flores and Oscar Mitnik extends this method to a multi-site setting wherein a policy maker can look at the results of a program implemented in numerous locations and infer local impacts by leveraging all of this information at the same time. The example is, again, a job-training program, this time piloted in five US cities that randomized the offer of training to currently unemployed workers. For this multi-site setting, the authors estimate a generalized propensity score with a multinomial logit or probit in order to find comparable individuals in each of the program sites for every individual in the new site. After these scores are estimated, the region of common support must be determined by adopting a trimming rule similar to those used when the treatment is binary.
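A rough sketch of the generalized propensity score step is below, assuming scikit-learn's multinomial logit (the default for multiclass labels with the lbfgs solver) and a deliberately simple minimum-probability trimming rule; the paper's actual trimming rule is more involved, and the 0.05 threshold here is just a placeholder.

```python
# Illustrative sketch, not the paper's exact procedure: estimate a generalized
# propensity score over sites and trim observations off the common support.
import numpy as np
from sklearn.linear_model import LogisticRegression

def gps_with_trimming(X, site_labels, min_prob=0.05):
    """X: individual covariates; site_labels: integer site id per individual."""
    # With multiclass labels and the default lbfgs solver, scikit-learn fits a
    # multinomial logit; predict_proba gives each individual's probability of
    # belonging to each site (the generalized propensity score).
    gps = LogisticRegression(max_iter=1000).fit(X, site_labels).predict_proba(X)
    # A simple common-support rule: keep only individuals whose estimated
    # probability of membership exceeds min_prob for every site.
    on_support = (gps >= min_prob).all(axis=1)
    return gps[on_support], on_support
```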
If the only reason program performance might differ is imbalance in the observable characteristics of participants between the old and new sites, then the method sketched above should predict program performance fairly well. Yet even after adjusting for differences in individual characteristics, it is obvious that other factors can cause the outcomes of the comparison groups to diverge across study sites. For example, location-specific features – such as the capacity to implement the program – would surely also affect program impacts, and the covariate-rebalancing approach doesn't address this.
In the specific case of job-training programs, an obvious cause of divergence is local labor market conditions. One way to account for these conditions is simply to control for relevant labor market measures in the main estimating regression. What Flores and Mitnik do instead is model the effect of local conditions on outcomes in the pre-intervention period and then use the estimated parameters to adjust the outcomes of interest in the post-randomization period.
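The mechanics might look something like the sketch below, assuming local conditions are summarized by a few labor market measures (say, local unemployment rates); this is a bare-bones linear version of the idea, not the paper's exact adjustment, and all names are hypothetical.

```python
# Sketch: learn how local labor market conditions relate to outcomes in the
# pre-intervention period, then strip the cross-site variation attributable to
# those conditions from the post-period outcomes.
import numpy as np
from sklearn.linear_model import LinearRegression

def adjust_for_local_conditions(y_pre, Z_pre, y_post, Z_post):
    """y_pre, Z_pre: pre-period outcomes and local labor-market measures;
    y_post, Z_post: the same objects for the post-randomization period."""
    fit = LinearRegression().fit(Z_pre, y_pre)   # pre-period relationship
    predicted = fit.predict(Z_post)              # local-conditions component
    # Remove each observation's deviation of that component from its mean, so
    # cross-site differences driven by local conditions are stripped out while
    # the overall outcome level is preserved.
    return y_post - (predicted - predicted.mean())
```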
To assess the success of this overall approach to program impact extrapolation, Flores and Mitnik measure the extent to which these corrective measures equalize the outcomes of interest across the control groups in the various sites. To capture the degree of similarity in outcomes, the authors calculate the normalized root mean squared distance (rmsd) – the square root of the sum of squared deviations of each site's outcome from the global mean – to benchmark improvements in inference. The main outcome explored is whether the individual was ever employed in the two years following program onset.
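For concreteness, here is one plausible reading of that comparability metric; the exact normalization follows the paper and may differ from this sketch.

```python
# Sketch of a root-mean-squared distance across sites: how far each site's mean
# control-group outcome sits from the overall mean, on average.
import numpy as np

def site_rmsd(site_means):
    """site_means: mean outcome (e.g., share ever employed) for the control
    group at each site."""
    site_means = np.asarray(site_means, dtype=float)
    deviations = site_means - site_means.mean()
    return float(np.sqrt(np.mean(deviations ** 2)))
```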
The unadjusted rmsd of ever employed is 0.121, six times greater than what would be expected in an experiment, where the rmsd = 0.020. (This reference experimental value is calculated through a placebo experiment in which placebo sites are randomly assigned to individuals while keeping the number of observations in each site the same as in the actual data.) Without taking into account local economic conditions, the generalized propensity score approach, which balances individual characteristics, reduces the rmsd to 0.099 – so comparability is improved, but still remains far from the randomized benchmark. Once outcomes are adjusted for local conditions, the rmsd is further reduced to 0.041. And if one of the five sites – Riverside, California, which has distinctly different local conditions – is excluded from the analysis, the rmsd falls as low as 0.020. Thus, for new sites whose local labor market conditions resemble those of the sites studied, a local policy maker could likely infer program impacts fairly closely with this method.
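The placebo benchmark mentioned above can be computed along the following lines; this is a sketch under the assumption that the benchmark is the average rmsd over many random re-assignments of individuals to placebo sites of the original sizes.

```python
# Sketch of the placebo benchmark: shuffle individuals across placebo "sites"
# with the same sizes as the real sites, recompute the rmsd of site means, and
# average over many draws to see how much dispersion pure chance produces.
import numpy as np

def placebo_rmsd(outcomes, site_sizes, n_draws=1000, seed=0):
    """outcomes: pooled individual outcomes; site_sizes: number of control
    observations at each real site (must sum to len(outcomes))."""
    rng = np.random.default_rng(seed)
    cuts = np.cumsum(site_sizes)[:-1]
    draws = []
    for _ in range(n_draws):
        shuffled = rng.permutation(np.asarray(outcomes, dtype=float))
        means = np.array([grp.mean() for grp in np.split(shuffled, cuts)])
        draws.append(np.sqrt(np.mean((means - means.mean()) ** 2)))
    return float(np.mean(draws))
```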
Unfortunately, for sites that look like Riverside, California, the possible program impact heterogeneity there cannot be fully separated from heterogeneity in local conditions. This last point speaks to the limitations of these impact projections: in situations where unobserved local characteristics interact with the program to affect impact, we would fail to extrapolate results accurately. In these cases little can be done with the existing data except, perhaps, to bound impacts in one direction. There would be no substitute for careful theorizing about the causal mechanisms of the program and how they might apply in the new context.