Impact evaluations already take a long time: a year or two to set up the project and collect a baseline, time for the intervention to take place, and then a year or two to collect follow-up data. Suddenly you are at five years or more, and that is before trying to negotiate the journal publication process. Given tenure clocks, funding cycles, and windows for influencing policy, it is no wonder that most evaluations of active labor market interventions and firm interventions focus on impacts over periods of 12 to 18 months. At least this provides some period of measurement in which financial returns from investments can be compared to costs, but the cost-effectiveness of interventions depends heavily on whether they have long-term impacts. Measuring only short-term impacts is even more of a problem for educational and early-childhood interventions, where researchers typically tell us that a program led to a 0.2 standard deviation improvement in test scores, whereas what we really want to know is whether it helps the children find better jobs and be less likely to be poor as adults.
Surrogate Index to the rescue?
A new working paper by Susan Athey, Raj Chetty, Guido Imbens and Hyunseung Kang (ungated version here that should update tomorrow to the latest version) discusses how one can combine multiple short-term outcomes into an index (the surrogate index) which is the predicted value of the long-term outcome of interest given these short-term outcomes, and, in turn, get an estimate of the long-term treatment impact from the treatment impact on this index.
This approach is best illustrated through the empirical example they use in their paper. The intervention was a job-assistance program in California that was implemented via a randomized experiment to help welfare recipients find work. The key outcome of interest was employment, and using quarterly administrative data on employment, past research tracked the participants and found that treatment raised employment rates by 6.4 percentage points 9 years after treatment. The key question is then whether one could have learned this long-term treatment effect if data on the participants were only available for short-term follow-ups.
Their approach is as follows:
Step 1: Form a surrogate index by using an observational data source to predict the relationship between short-term outcomes and the long-term outcome of interest.
The surrogate index is the conditional expectation of the long-term outcome given the intermediate outcomes (and any pre-treatment covariates). Here they take administrative data just for the treatment group in one experimental site for illustrative purposes, and run:
Employment after 9 years = a + b1*Quarter 1 employment + b2*Quarter 2 employment +…+ b6*Quarter 6 employment + e
In this case, they combine employment data from the first six quarters and use them to predict long-term employment. When many potential short-term outcomes are available, machine-learning methods like LASSO or random forests could be used to make this prediction.
Step 2: Use your impact evaluation to estimate the impact of your treatment on this surrogate index.
Here, using the coefficients from step 1, all the short-term outcomes are combined together into this surrogate index, and you simply regress:
Surrogate index = c + d*Treatment + e
The coefficient d then gives the estimated long-term effect of the treatment. In their case, they investigate how many quarters of data they need to use, and find that with 6 quarters or more they are able to get fairly accurate estimates of the 9-year impact.
Bonus: they also note that even when the long-term outcome is available, the surrogate index can still be helpful in estimating the long-term treatment effect more precisely.
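The two steps above can be sketched on simulated data. This is a minimal illustration and not the authors' code: the data-generating process, the function names `simulate` and `add_constant`, and all parameter values are assumptions made up for the example. Surrogacy and comparability hold by construction here, because the long-term outcome depends on treatment only through the six quarters of employment and is generated the same way in both samples.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(treated):
    """Hypothetical data: six quarters of employment and a long-term outcome."""
    n = treated.shape[0]
    ability = rng.normal(size=n)  # latent employability
    # Treatment raises the probability of working in each short-term quarter.
    p_work = 1 / (1 + np.exp(-(ability - 0.5 + 0.6 * treated)))
    quarters = (rng.random((n, 6)) < p_work[:, None]).astype(float)
    # Surrogacy by construction: long-term employment depends only on the quarters.
    p_long = 0.1 + 0.1 * quarters.sum(axis=1)
    long_term = (rng.random(n) < p_long).astype(float)
    return quarters, long_term

def add_constant(x):
    """Prepend a column of ones for the regression intercept."""
    return np.column_stack([np.ones(len(x)), x])

n = 20_000

# Step 1: in an observational (untreated) sample, regress the long-term
# outcome on the six short-term outcomes to get the index coefficients.
q_obs, y_obs = simulate(np.zeros(n))
coefs, *_ = np.linalg.lstsq(add_constant(q_obs), y_obs, rcond=None)

# Step 2: in the experimental sample, build the surrogate index from the
# short-term outcomes and compare its mean across treatment and control.
# With a single binary regressor this difference equals the coefficient d.
treat = rng.integers(0, 2, size=n)
q_exp, y_exp = simulate(treat)
index = add_constant(q_exp) @ coefs
surrogate_effect = index[treat == 1].mean() - index[treat == 0].mean()

# Benchmark that would normally be unavailable: the actual long-term effect.
direct_effect = y_exp[treat == 1].mean() - y_exp[treat == 0].mean()

print(f"surrogate-index estimate: {surrogate_effect:.3f}")
print(f"direct long-term estimate: {direct_effect:.3f}")
```

In this simulated setting the two estimates should be close, and the surrogate-index estimate is typically the less noisy of the two, which is in the spirit of the bonus point above.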
What assumptions are needed for this to work?
1. Unconfoundedness: this is the standard requirement that treatment be orthogonal to potential outcomes (potentially after conditioning on pre-treatment variables). This will be satisfied in randomized experiments, but will need the usual additional amount of work to justify in non-experimental evaluations.
2. Surrogacy: this requires that the long-term outcome is independent of the treatment, conditional on the full set of surrogates. This is unlikely to be satisfied if you use just a single short-term outcome (e.g. a test score) as a surrogate, but may be more plausible if you have a whole range of intermediate outcomes that together span the causal chain between treatment and the long-term outcome. In their employment example, they show that just using a single quarter of employment does not work well, and it is only by combining multiple quarters of short-term employment outcomes that non-linear employment dynamics can be captured that matter for predicting long-term outcomes. In the education literature, the concern here is that education may affect long-term earnings not just through test scores and school attendance rates, but also through a range of soft skills that may not always be measured.
3. Comparability: this requires that the conditional distribution of the long-term outcome given the surrogates is the same in the observational and experimental samples. In their example, they show that using data from the treatment group in Riverside does seem to work in also predicting the association between short-term employment dynamics and long-term employment in three other Californian cities. But we might be more concerned with this assumption if you have data on the treatment effects of job-training on short-term employment outcomes in Guatemala, but then use social security data from Mexico to examine the association between short-term employment and long-term employment to predict the surrogate index. The authors suggest that, over time, a “library” of surrogate indices could be developed to systematically catalog sets of surrogates that match long-term outcomes of interest.
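For readers who prefer symbols, the three assumptions and the reason Step 2 recovers the long-term effect can be written out as follows. The notation is chosen for this sketch and may differ from the paper's: W is the treatment indicator, S the vector of surrogates, Y the long-term outcome, X the pre-treatment covariates, and P indicates whether an observation comes from the experimental (E) or observational (O) sample.

```latex
\begin{align*}
\text{Unconfoundedness:}\quad & W \perp \bigl(Y(0), Y(1), S(0), S(1)\bigr) \mid X,\ P = E \\
\text{Surrogacy:}\quad & W \perp Y \mid S, X,\ P = E \\
\text{Comparability:}\quad & Y \mid S, X, P = O \ \sim\ Y \mid S, X, P = E \\[4pt]
\text{Surrogate index:}\quad & h(s, x) = \mathbb{E}[\,Y \mid S = s, X = x, P = O\,]
\end{align*}
% In a randomized experiment, for w = 0, 1:
\begin{align*}
\mathbb{E}[Y \mid W = w, P = E]
 &= \mathbb{E}\bigl[\,\mathbb{E}[Y \mid S, X, W = w, P = E] \mid W = w, P = E\bigr] \\
 &= \mathbb{E}\bigl[\,\mathbb{E}[Y \mid S, X, P = E] \mid W = w, P = E\bigr]
    && \text{(surrogacy)} \\
 &= \mathbb{E}\bigl[\,h(S, X) \mid W = w, P = E\bigr]
    && \text{(comparability)}
\end{align*}
```

Taking the difference between w = 1 and w = 0 shows that the treatment effect on the long-term outcome equals the treatment effect on the surrogate index, which is the coefficient d in Step 2.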
What if these assumptions don’t hold?
The authors note that:
· Out-of-sample validation can be used to make us more confident in the surrogacy assumption. For example, they show that a surrogate index based on 6 quarters of data closely tracks experimental impacts 2, 3, and 4 years after random assignment, so seeing this approach work well over the medium term makes one more confident in extrapolating it to the longer term.
· Bounding approaches can be used to examine robustness to violations of this assumption, using an approach similar to Oster (2019).
A potential concern I have is similar to an issue that arises in mediation analysis: how to think about treatment effect heterogeneity in these cases. Here’s an example, motivated by work I did on wage subsidies for young women in Jordan, where these women are, say, 22 years old at the time of intervention, and I would ideally like to know what the impacts are on employment by age 30:
· Suppose that young women fall into three types (always work, only work when given subsidy, never work). The wage subsidy then has an average treatment effect which raises employment in the short-term (say quarterly employment rates at age 23 and 24), with this effect being entirely concentrated on the compliers (those who only work when given the subsidy).
· I could then take data from the Jordanian social security system for young women from several years ago, and would find that quarterly employment rates at age 23 and 24 are strongly associated with employment rates at age 30 (the people working at 23 and 24 are working at 30, the ones hardly working at 23 and 24 are not). So these short-term employment rates seem like a good surrogate.
· So I would then predict a long-term impact on employment from this treatment, based on the short-term results. But the issue here is that the type of women for whom the treatment influences short-term results (the compliers) are not the type of women for whom short-term employment is a good predictor of long-term employment.
This type of heterogeneity would violate the surrogacy assumption, since long-term employment would not be independent of treatment after conditioning on the surrogate. The key here is that the surrogacy assumption requires that how an individual attains a given value of the surrogate index does not matter for its predictive power. That is, it should not matter for predicting employment at age 30 whether someone is employed at age 23 because they were a complier who received the treatment or because they would always be employed. In my view, this assumption is more plausible when you have multiple rounds of short-term data over time, including some time when treatment is not being received, and when you have a range of different variables. So in my employment example, I would be more confident if I formed the surrogate index using multiple quarters of data on employment, wages, hours worked, attitudes towards future work, and mental health than if I just used a single one-year follow-up to measure two or three employment-related outcomes.
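A small simulation makes the three-type example concrete. It is a deliberately extreme sketch: the type shares are invented, and I assume that compliers stop working entirely once the subsidy ends, so the true effect on employment at age 30 is zero. The function name `outcomes` and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30_000
type_names = ["always", "complier", "never"]
type_shares = [0.3, 0.3, 0.4]  # assumed shares, purely illustrative

def outcomes(types, treated):
    """Short-term (age 23-24) and long-term (age 30) employment by type."""
    short = (types == "always") | ((types == "complier") & treated)
    # Assumption: only always-workers are employed at 30; the subsidy has
    # no lasting effect on compliers.
    long_run = types == "always"
    return short.astype(float), long_run.astype(float)

# Observational sample (older cohort, no subsidy): estimate the surrogate
# index as the mean of long-term employment at each short-term value.
types_obs = rng.choice(type_names, size=n, p=type_shares)
s_obs, y_obs = outcomes(types_obs, np.zeros(n, dtype=bool))
index_by_s = {s: y_obs[s_obs == s].mean() for s in (0.0, 1.0)}

# Experimental sample: the subsidy is randomly assigned.
types_exp = rng.choice(type_names, size=n, p=type_shares)
treated = rng.random(n) < 0.5
s_exp, y_exp = outcomes(types_exp, treated)
index = np.where(s_exp == 1.0, index_by_s[1.0], index_by_s[0.0])

predicted_effect = index[treated].mean() - index[~treated].mean()
true_effect = y_exp[treated].mean() - y_exp[~treated].mean()

print(f"effect predicted by surrogate index: {predicted_effect:.3f}")
print(f"actual long-term effect: {true_effect:.3f}")
```

In the observational data, short-term employment predicts long-term employment perfectly, because only always-workers are employed in either period. The index therefore attributes a long-term gain to every complier pushed into short-term work, and the predicted effect is roughly the complier share even though the actual long-term effect is approximately zero.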
I look forward to seeing applications and validation of this approach in development settings, to start building this “library” of knowledge of when we might be able to use surrogate indices to get accurate estimates of long-term treatment effects from short-term follow-ups.