Of all the impact evaluation methods, the one that consistently (and justifiably) comes last in the methods courses we teach is matching. We de-emphasize this method because it requires the strongest assumptions to yield a valid estimate of causal impact. The most important of these is unconfoundedness: the assumption that selection into treatment can be captured entirely as a function of covariates observed in the data. That is, we must assume there is no selection on unobservables, an assumption that, in general, makes me and others very uncomfortable.
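For readers who prefer notation, the assumption is usually written as below. The symbols are my own shorthand, not taken from either paper: D is the treatment indicator, X the observed covariates, and Y(0) and Y(1) the potential outcomes without and with treatment. The second condition (common support) is the standard companion assumption that comparable non-treated units exist for the treated.

```latex
\bigl(Y(0),\, Y(1)\bigr) \;\perp\; D \;\mid\; X,
\qquad
0 < \Pr(D = 1 \mid X) < 1
```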
There are additional challenges with matching, not the least of which is a veritable forest of matching methods and numerous other modeling decisions (such as bandwidth choice or trimming decisions) with no clear guidance on the “correct” decision. Fortunately, two important papers published earlier this year should help guide our choice of method, as well as tell us how much bias to expect when the critical identifying assumption doesn’t hold.
The first paper, by Huber, Lechner, and Wunsch and published in the Journal of Econometrics, presents a new approach to the simulation-based assessment of matching estimators, which they term an “empirical Monte Carlo study”. This approach uses a real, large-scale administrative data set of German workers, some of whom have participated in state-run job-training programs. Originally constructed to estimate the impacts of these programs, here the data are used to simulate “placebo treatments” among the non-treated. In this analytic trick, the “true” selection model is stipulated to be a saturated model estimated on the actual treated and control observations, while the “true” program effect is known to be zero by construction, since it’s a placebo treatment on a sub-sample of controls. Different variants of matching estimators can then be applied to repeatedly drawn sub-samples in order to compare impact estimates against the known true effect.
The authors choose this approach, rather than stipulating the data generating process as in a more traditional Monte Carlo analysis, out of an understandable wish to exploit real data, and hence real selection problems and real dependencies between treatment status and outcomes. Of course this approach implies that any lessons from the analysis are most applicable to similar settings of active labor market program evaluations, with unknown applicability to other settings (although this limitation holds for any particular Monte Carlo approach, whether with real or generated data).
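To make the placebo logic concrete, here is a minimal sketch of how I understand the design. Everything in it is illustrative: the data are synthetic stand-ins for the administrative records, and the function names (`fit_logit`, `placebo_estimates`, `naive_difference`), sample sizes, and coefficients are my own assumptions rather than anything from the authors' code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logit(X, d, iters=25):
    """Logistic regression by Newton-Raphson; X must contain a constant column."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ beta)
        grad = X.T @ (d - p)
        hess = (X * (p * (1 - p))[:, None]).T @ X
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def placebo_estimates(X, d, y, estimator, n_draws=200, sample_size=500, seed=0):
    """Apply `estimator` to repeated placebo samples drawn from the non-treated."""
    rng = np.random.default_rng(seed)
    beta = fit_logit(X, d)                 # selection model estimated on the real data
    X_c, y_c = X[d == 0], y[d == 0]        # keep only the non-treated
    p_c = sigmoid(X_c @ beta)              # their "true" placebo selection probabilities
    out = []
    for _ in range(n_draws):
        idx = rng.choice(len(y_c), size=sample_size, replace=False)
        d_placebo = (rng.random(sample_size) < p_c[idx]).astype(int)
        out.append(estimator(X_c[idx], d_placebo, y_c[idx]))
    return np.array(out)

def naive_difference(X, d, y):
    """Unadjusted difference in means, used here only as a deliberately bad benchmark."""
    return y[d == 1].mean() - y[d == 0].mean()

# Synthetic stand-in for the real data: one covariate drives selection and outcome.
rng = np.random.default_rng(1)
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
d = (rng.random(n) < sigmoid(X @ np.array([-1.0, 1.0]))).astype(int)
y = 2.0 * X[:, 1] + rng.normal(size=n)

estimates = placebo_estimates(X, d, y, naive_difference)
print("mean placebo estimate (true effect is 0):", estimates.mean())
```

Because nobody in the placebo samples was actually treated, any average estimate away from zero is bias. The naive difference in means fails this check badly here, which is exactly the kind of contrast the design is meant to expose.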
As the known effect size is zero in these placebo tests, the actual estimated effect will be contrasted across the following estimators:
- inverse probability weighting (à la Hirano and Imbens)
- one-to-one matching
- radius matching (with and without bias correction)
- kernel matching
- standard parametric models
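As a rough illustration of two entries in this list, the sketch below implements inverse probability weighting and one-to-one nearest-neighbor matching for the effect on the treated. It rests on my own simplifying assumptions: the data are simulated, the propensity score is treated as known rather than estimated, matching is done with replacement, and the function names are hypothetical. It is not the specification used in the paper.

```python
import numpy as np

def ipw_att(y, d, p):
    """Effect on the treated via inverse probability weighting (normalized weights)."""
    w = p / (1 - p) * (d == 0)             # controls reweighted to resemble the treated
    return y[d == 1].mean() - np.sum(w * y) / np.sum(w)

def nn_match_att(y, d, p):
    """One-to-one nearest-neighbor matching on the propensity score, with replacement."""
    p_t, p_c = p[d == 1], p[d == 0]
    y_t, y_c = y[d == 1], y[d == 0]
    nearest = np.abs(p_t[:, None] - p_c[None, :]).argmin(axis=1)
    return float(np.mean(y_t - y_c[nearest]))

# Simulated data with selection on one observable and a true effect of zero.
rng = np.random.default_rng(0)
n = 4000
x = rng.normal(size=n)
p = 1.0 / (1.0 + np.exp(-(x - 0.5)))       # propensity score, assumed known here
d = (rng.random(n) < p).astype(int)
y = x + rng.normal(size=n)                 # outcome does not depend on treatment

naive = y[d == 1].mean() - y[d == 0].mean()
ipw_est = ipw_att(y, d, p)
nn_est = nn_match_att(y, d, p)
print("naive:", naive, "IPW:", ipw_est, "1-to-1 matching:", nn_est)
```

Both estimators should land much closer to zero than the naive comparison in this toy setting, where selection really is on observables only.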
Both the small- and large-sample performances of these estimators are contrasted, as well as results with and without trimming extreme observations (and results with other key parameter variations, such as the percent treated and the strength of selection in the matching equation). All told, hundreds of variants of methods, sample sizes, and tuning parameters are contrasted in terms of their bias, precision, and mean squared error (MSE). (The programming code for all of these variants can be found at this link.)
Given all of these comparisons there are, of course, a lot of nuanced results. However, the main take-away messages are the following:
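The three criteria are linked: MSE equals squared bias plus variance, which is why an estimator with little bias can still rank poorly if it is imprecise. A small sketch, using made-up placebo estimates and a hypothetical helper name (`summarize`), shows the bookkeeping.

```python
import numpy as np

def summarize(estimates, true_effect=0.0):
    """Bias, standard deviation, and MSE of simulated estimates around a known effect."""
    est = np.asarray(estimates, dtype=float)
    bias = est.mean() - true_effect
    sd = est.std()                         # ddof=0, so the decomposition holds exactly
    mse = np.mean((est - true_effect) ** 2)
    return {"bias": bias, "sd": sd, "mse": mse}

# Made-up estimates from four placebo draws; the true effect is zero by construction.
stats = summarize([0.1, -0.2, 0.3, 0.0])
print(stats)
```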
(1) Trimming control observations that have “too large” a weight in the estimates makes a big difference in bias reduction (and the authors propose a sensible modified trimming rule – I refer to the paper for details).
(2) Direct one-to-one matching often has the least bias in large samples but lacks precision and so has a high MSE. These estimators also tend to have relatively large small-sample bias.
(3) The dominant performer in most settings is regression-adjusted radius matching (with a fairly large radius). The dominance of this method is especially apparent when there are a large number of matching covariates. Importantly, this method also performs best when the selection equation is deliberately mis-specified, suggesting that it may be the most robust to at least moderate mis-specification of the selection equation.
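To give a feel for points (1) and (3), here is a simplified sketch of radius matching on the propensity score with a linear regression adjustment and an optional cap on how much weight any single control may carry. It is my own stripped-down illustration, not the authors' estimator or their modified trimming rule: the function names, the `radius` and `max_share` values, the linear adjustment in the propensity score, and the simulated data are all assumptions.

```python
import numpy as np

def radius_weights(p_t, p_c, radius):
    """Total weight each control receives across all treated units (sums to one)."""
    w = np.zeros(len(p_c))
    for pi in p_t:
        dist = np.abs(p_c - pi)
        matched = dist <= radius
        if not matched.any():              # empty radius: fall back to the nearest control
            matched = dist == dist.min()
        w[matched] += 1.0 / matched.sum()
    return w / w.sum()

def trim_weights(w, max_share):
    """Zero out controls whose share of total weight exceeds max_share, then renormalize."""
    trimmed = np.where(w / w.sum() <= max_share, w, 0.0)
    return trimmed / trimmed.sum()

def radius_match_att(y, d, p, radius, max_share=None, adjust=True):
    y_t, p_t = y[d == 1], p[d == 1]
    y_c, p_c = y[d == 0], p[d == 0]
    w = radius_weights(p_t, p_c, radius)
    if max_share is not None:
        w = trim_weights(w, max_share)
    y0_hat = np.sum(w * y_c)               # matched-control mean outcome
    if adjust:
        # Regression adjustment: correct for the remaining gap in propensity scores
        # between the treated and their weighted controls, using a linear fit on controls.
        slope, _ = np.polyfit(p_c, y_c, 1)
        y0_hat += slope * (p_t.mean() - np.sum(w * p_c))
    return float(y_t.mean() - y0_hat)

# Simulated data with a true effect of zero and a known propensity score.
rng = np.random.default_rng(0)
n = 3000
x = rng.normal(size=n)
p = 1.0 / (1.0 + np.exp(-(x - 0.5)))
d = (rng.random(n) < p).astype(int)
y = x + rng.normal(size=n)

naive = y[d == 1].mean() - y[d == 0].mean()
est = radius_match_att(y, d, p, radius=0.1, max_share=0.01)
toy = trim_weights(np.array([1.0, 1.0, 1.0, 1.0, 16.0]), max_share=0.5)
print("naive:", naive, "radius matching:", est, "toy trimmed weights:", toy)
```

A wide radius averages over many controls, which buys precision at the cost of worse matches, and the regression adjustment is what mops up the bias those looser matches introduce. Note that this one-pass trim can fail if every control exceeds the cap, so a real implementation would need to guard against that.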
A comprehensive selection equation is critical for matching, and so matching, when done well, should be a data-intensive approach. If we had unlimited resources to collect information that would render the currently unobserved observable, then the identifying assumption behind matching (full selection on observables) would be innocuous. But the reality is often far from this ideal. And when the unconfoundedness assumption is just plain wrong, we will have a biased estimate of the causal impact. But to what degree?
The second recent paper, by Lechner and Wunsch and published in Labour Economics, explores this question using the same large employer-employee administrative data as in the first paper. These data contain unusually rich information on worker, job, firm, and region characteristics that should provide a more credible selection correction for evaluations of employment search assistance and job training programs.
Lechner and Wunsch’s approach is the same: simulate placebo participants from among the non-participants, so that the selection model is known (it is estimated using actual participants and non-participants, and those estimates then become the “true” selection parameters in the placebo data). As before, the known effect in the placebo test is zero. The matching estimator used is the bias-adjusted radius matching that came out as most preferred in the Huber et al. paper.
Previous non-experimental job-training evaluations have attempted to remove bias by including basic socio-demographic data on unemployed workers, combined with pre-treatment outcomes (short-run labor market histories) and regional information. This information does reduce the bias in estimating the effects of job training on subsequent employment or earnings. However, matching on these characteristics alone will overstate the impact of job training on subsequent employment by roughly 20% to 40%, depending on gender, as found by Lechner and Wunsch.
The big concern in non-experimental evaluations of job training programs is the inability to control for selection on unobserved motivation, productivity, and employability. It turns out in this German data exercise that the addition of less commonly available information on longer-run worker labor histories and unemployment spells as well as good measures of worker health is necessary to remove the remaining bias. Thus these measures (labor market histories and health) are able to capture important determinants that have previously been unmeasured.
Now, of course, the external validity of these findings applies most closely to job training or search assistance programs found in OECD countries. It would be naive to take the specific lessons in these papers to other settings without thinking through their applicability. Nevertheless this exercise is a smart demonstration of the challenges that matching estimators face. Rich data are required to justify causal identification based on selection on observables. In the case of active labor market programs in Germany, basic socio-demographic and regional information (of the type commonly found in studies of this sort) removes some of the bias from non-experimental matching, but additional information, such as longer-term labor market histories and health, is necessary to reduce the bias very close to zero.