Many economists are deeply skeptical about the credibility of matching estimators for identifying treatment impacts. For example, in a 2019 blog post titled “Why so much hate against propensity score matching?”, Paul Hünermund wrote “Apparently, in the year 2019 it’s not possible anymore to convince people in an econ seminar with a propensity score matching (or any other matching on observables, for that matter).” Jennifer Doleac writes “This is your regular reminder that propensity score matching is typically not a good way to measure causal effects. Yes, there are exceptions. Whatever you want to do probably isn’t it.” In his recent textbook, Scott Cunningham (p. 208) writes “Economists are oftentimes skeptical that CIA (the conditional independence assumption) can be achieved in any data set – almost as an article of faith”.
It’s been said that “friends don’t let friends do IV” – should we be saying the same about matching? Or when will matching be more plausible, and what do you need to do to argue for this plausibility?
The problem with matching (and with OLS for non-experimental causal inference)
This is not a post about HOW you should do matching, so I’m not going to focus on whether you use propensity scores, nearest neighbors based on Mahalanobis distances, coarsened exact matching, or some other method (including some flexible OLS within the common support). Whichever method you have used, the assumption is that after you have matched treated and control units on observables, any difference in outcomes reflects the causal effect of treatment, and not the influence of some unobserved variable.
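To make the setup concrete, here is a minimal numpy sketch of one of these methods — nearest-neighbor matching on the Mahalanobis distance — applied to simulated toy data. The function name, sample sizes, and the built-in treatment effect of 1 are all my own illustrative assumptions, not from any study discussed here:

```python
import numpy as np

def mahalanobis_match(X_treat, X_ctrl):
    """For each treated unit, return the index of the nearest control
    unit under the Mahalanobis metric (matching with replacement)."""
    # Pool covariates to estimate the covariance matrix used in the metric
    cov = np.cov(np.vstack([X_treat, X_ctrl]).T)
    cov_inv = np.linalg.inv(cov)
    matches = []
    for x in X_treat:
        d = X_ctrl - x  # differences from this treated unit to every control
        dist = np.einsum("ij,jk,ik->i", d, cov_inv, d)  # squared Mahalanobis
        matches.append(int(np.argmin(dist)))
    return np.array(matches)

# Toy data: 3 covariates, 50 treated units, 200 controls,
# outcomes depend on covariates plus a true treatment effect of 1
rng = np.random.default_rng(0)
X_t = rng.normal(0.3, 1.0, size=(50, 3))
X_c = rng.normal(0.0, 1.0, size=(200, 3))
y_t = X_t.sum(axis=1) + 1.0 + rng.normal(size=50)
y_c = X_c.sum(axis=1) + rng.normal(size=200)

idx = mahalanobis_match(X_t, X_c)
att_hat = np.mean(y_t - y_c[idx])  # simple ATT estimate from matched pairs
```

The identifying assumption is exactly the one in the text: once treated units are compared to their nearest matches on observables, the remaining outcome gap is attributed to treatment alone.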
The problem is that our economic models are typically all about self-selection. Models such as the Roy model have individuals self-selecting into occupations based on their unobserved productivity; the related Borjas model has individuals choosing whether to emigrate on the basis of their unobserved earnings determinants at home and abroad; while the Heckman et al. notion of essential heterogeneity has individuals selecting whether or not to participate in a program in part on the basis of their likely treatment effect. As Imbens and Rubin (2015, p. 264-265) write in their textbook:
“In observational studies it is less clear why such similar units should receive different treatment assignments. Especially in settings where the units are individuals and the assignment mechanism is based on individual choices, one might be concerned that individuals who look ex ante identical (i.e. identical in terms of pre-treatment characteristics) but who make different choices must be different in unobserved ways that invalidates a causal interpretation of differences in their outcomes”
The standard textbook approach to thinking about whether matching will deliver reliable causal estimates has been through statistical plausibility tests. The first is to compare matching estimators to those from randomized trials, examining whether matching can approximate the estimate from an experiment in a particular setting. This has provided some lessons on the types of variables that are useful to match on (e.g. matching on multiple years of labor earnings data when examining impacts of a labor training program, making sure people are from the same labor markets, and that data are collected in the same way for treatment and controls). But knowing that matching worked OK for this subsample of a U.S. labor training experiment in the 1980s may not make us that confident it will work for evaluating another type of program in another setting.
The second type of statistical test has then been a set of diagnostic and plausibility tests that can be done with your data without having an experiment to compare to. Chapter 21 of Imbens and Rubin’s textbook provides a great example of this, going through three types of checks:
1. Estimating the effect of the treatment on unaffected outcomes, particularly the lagged outcome. E.g. if you have data for 5 years of earnings pre-treatment, match the data on years t-5 to t-2, and then test whether there is a difference in the t-1 earnings.
2. Estimating the effect of a pseudo-treatment on an outcome: for example, if you have two sets of possible control groups (e.g. applicants not selected for the program and non-applicants, or individuals in two neighboring states), run your model on these two groups and see whether there is a difference in outcomes.
3. Assessing sensitivity of estimates to different choices of controls: e.g. if you match on 5 years of pre-treatment earnings versus 3 years, do the results differ much?
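The first check can be sketched in a few lines: match on the earlier pre-treatment years, then test for a gap in the held-out t-1 year, which treatment cannot have affected. This is a toy illustration with simulated earnings data — the `placebo_gap` function, the persistent-earnings data-generating process, and all numbers are my own assumptions, not from Imbens and Rubin:

```python
import numpy as np

def placebo_gap(earn_treat, earn_ctrl):
    """Match treated to control units on earnings in years t-5..t-2
    (columns 0..3), then return the mean gap in the held-out placebo
    year t-1 (column 4). A gap far from zero is a red flag."""
    X_t, X_c = earn_treat[:, :4], earn_ctrl[:, :4]
    gaps = []
    for i, x in enumerate(X_t):
        j = np.argmin(((X_c - x) ** 2).sum(axis=1))  # nearest neighbor
        gaps.append(earn_treat[i, 4] - earn_ctrl[j, 4])
    return float(np.mean(gaps))

rng = np.random.default_rng(1)
# 5 pre-treatment years of earnings for 40 treated and 160 control units,
# with a persistent individual component so years are correlated;
# treated units are positively selected on that component
a_t = rng.normal(0.5, 1.0, size=(40, 1))
a_c = rng.normal(0.0, 1.0, size=(160, 1))
earn_t = a_t + rng.normal(size=(40, 5))
earn_c = a_c + rng.normal(size=(160, 5))

gap = placebo_gap(earn_t, earn_c)  # should be much closer to 0 than the
                                   # raw treated-control gap of about 0.5
```

Here matching on four years of earnings proxies for the persistent component driving selection, so the placebo gap shrinks toward zero; if it stayed large, you would doubt the matching estimate for the actual post-treatment outcome too.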
These are useful checks to help make a case for plausibility, but they still do not help answer the fundamental question that bothers so many people in thinking about matching as a solution: “Why, if these units are so similar, did one get treated and another not?”
Rhetorical plausibility and cases where matching is more (and less) plausible
What the textbooks and papers don’t talk about enough is that a good use of matching needs to make not just a statistical case, but also a rhetorical case for why it is plausible. That is, just as instrumental variables relies not just on statistical tests but also on a skillful use of context and theory to justify the exclusion restriction, we need the same discussion for matching. Judea Pearl (2013, p. 350) notes this nicely (in making the case for using a DAG to justify the story):
“the golden rule of causal analysis: No causal claim can be established by a purely statistical method, be it propensity scores, regression, stratification, or any other distribution-based design… The admissibility … can be established only by appealing to the causal knowledge available to the investigator”
So when does it seem more plausible that, conditional on matched observables, the reason one unit gets treated and another does not is random, or at least not due to a variable that is also correlated with the outcome?
· Case 1 (Separate decision-maker with limited information decides on treatment): A first case is when the person making the decision about treatment assignment is different from the individual actually receiving the treatment, and has limited information in doing so, all of which is observed by the econometrician. This type of scenario occurs when people self-select into applying for a program, but then program administrators may make decisions based only on the limited information on the application form, all of which you have in your data. This has analogs to the judge leniency IV designs: the assumption is that it is just which official deals with the application that drives treatment choice, and that this should not independently affect the program outcome.
An early example of this is Angrist (1998), who wants to look at the effect of voluntary military service on earnings, and argues that by looking at applicants who have all self-selected, and then observing the main characteristics of the applicants that the military uses to select individuals (age, schooling, test scores, etc.), that any differences remaining are ignorable – that is, that it is some bureaucratic procedure unrelated to future earnings that determines who gets chosen.
A second example of this approach is in Ly and Riegert (2014), who are looking at the effect of classmates on student achievement. They argue that high school principals do not know their first-year students at the time they assign them to classes, so have to do the allocation just based on a limited set of student characteristics which are in the registration files, and then when it comes to separating students who look nearly identical, it is as if random who gets assigned to which class.
Delius and Sterck (2020) provide a very recent example, looking at the impact of a cash transfer program for refugees on businesses that get a license. They look at businesses that self-selected into applying for licenses, and observe all the information about the businesses that the World Food Program used from the application form to select businesses. It is not made clear exactly how the officials then decided among similar applicants, but the implicit assumption is that it was just bureaucratic happenstance.
Berk’s recent paper gives another type of example in this category – bureaucratic errors might determine who gets selected and who doesn’t. For example, they note that spelling differences, the case backlog, and issues linking registrations of different members could explain why one household gets selected and another household with similar observables did not.
A key tricky issue with this approach is that you want the decision-maker to decide on the basis of just a few variables, but the decision can’t be too deterministic. If there is a strictly deterministic rule (e.g. the military will take anyone aged under 30 who has at least finished high school and scores at least 75% on the test), then the common support condition will be violated – and we would want to consider regression discontinuity instead. So we need bureaucratic discretion, not just rules, and we need that discretion to be arbitrary: the bureaucrat must not be using private knowledge about the applicants that would also influence outcomes. To really make this convincing, we want as much information provided as possible about exactly how the decision-maker decides among similar cases.
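The common support problem with a deterministic rule can be made visible directly in the data: under a strict rule, many treated units have no control unit anywhere near them in covariate space. A toy sketch, where the `off_support` function, the caliper of 0.25, and the eligibility rule are all my own illustrative assumptions:

```python
import numpy as np

def off_support(X_treat, X_ctrl, caliper):
    """Fraction of treated units with no control unit within `caliper`
    (Euclidean distance on covariates standardized by control moments)."""
    mu, sd = X_ctrl.mean(axis=0), X_ctrl.std(axis=0)
    Zt, Zc = (X_treat - mu) / sd, (X_ctrl - mu) / sd
    no_match = sum(np.min(np.linalg.norm(Zc - z, axis=1)) > caliper
                   for z in Zt)
    return no_match / len(Zt)

rng = np.random.default_rng(2)
age = rng.uniform(18, 45, size=500)
score = rng.uniform(50, 100, size=500)
X = np.column_stack([age, score])

# Deterministic rule: treat everyone under 30 with a score of 75 or more
treated = (age < 30) & (score >= 75)
frac_rule = off_support(X[treated], X[~treated], caliper=0.25)

# Contrast: random assignment, where treated and control supports overlap
treated_r = rng.random(500) < 0.3
frac_rand = off_support(X[treated_r], X[~treated_r], caliper=0.25)
```

Under the deterministic rule, only treated units right at the eligibility boundary have close controls, so a large share of treated units sit off the common support; under random assignment that share is close to zero.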
· Case 2 (Capacity limits and small frictions/noise): A second related answer to the question “why, if these individuals were so similar, did one unit get treated and the other not?” is that there were binding capacity constraints, perhaps coupled with some noise in the process of applying. This is the argument John Gibson and I made in using matching (combined with difference-in-differences) to evaluate a new seasonal worker program from Vanuatu and Tonga to New Zealand. As in case 1, there was selection by employers and villages on a few observable characteristics like English literacy, age, and gender. Then there was excess demand for this migration opportunity, so among similar applicants, not all could be chosen.
· Case 3 (randomization substrata): In work with Gabriel Ibarra and Claudia Ruiz, we conducted an RCT of a financial literacy intervention in Mexico that only had 0.8% take-up. As I previously blogged, rather than stopping at the insignificant ITT, we also wanted to examine whether there was an effect for those who actually got treated. We used matching to match treated individuals who took up treatment to similar individuals in the control group on 18 months of monthly pre-treatment credit card behavior. Our answer to the question of why one individual got treated and another with the same observable characteristics did not is then that it was due to the random allocation of training.
· Case 4 (Decision-maker cares about a different outcome than the evaluator/unobserved costs affect take-up but not outcomes): this is a more subtle case than the rest, but an important one. Imbens and Rubin (2015) and Imbens (2004) discuss how take-up of treatment may be driven by unobserved differences between units that are themselves unrelated to the outcome of interest, when the evaluator cares about a different outcome than the decision-maker. They give an example of a firm deciding on whether or not to take up a new technology, where the firm wants to maximize profits, and costs of take-up are not observed. The costs then may determine why two observationally equivalent firms have different technology use (since the costs affect profits, which is what the firm cares about), but if the evaluator then wants to examine the impact of the technology on firm output, the unobserved costs of technology take-up may not affect this (of course this assumes that firms don’t have unobserved productivity that is correlated with their unobserved cost structure). They also give a second example relating to drug use, where the cost structure of different insurance plans may determine whether a physician gives drug A vs drug B to a patient, but if we are interested in health outcomes, perhaps the specific insurance cost structure does not independently affect this.
An implication of this is that matching may be plausible for identifying causal impacts of a treatment on one outcome, but not on another outcome in the same dataset/sample.
The bottom line of all of these cases is the need to really understand the context well, to understand how the decision about treatment gets made, and to be able to provide rhetorical plausibility as well as statistical plausibility for using matching.
Many economic questions do not fit these cases
The other key point to note is that many attempted uses of matching do not correspond to these cases – and it is these indiscriminate uses of matching that have given the approach such a bad name. I have seen way too many papers that just attempt to use matching without any plausible reason why two identical individuals will end up with different treatments. For example, papers which use propensity score matching to look at the impact of migration from Mexico to the U.S. on some outcome – a situation where there is no separate bureaucratic decision-maker, no capacity constraints, and theory tells us that individuals are self-selecting on a whole host of unobserved characteristics like risk aversion, wealth shocks, language ability, marital frictions, etc. that are likely to also affect whatever outcome is of interest. Or attempts to evaluate firm support programs which match applicants to non-applicants on the basis of some firm performance data, when the worry is that those who apply may be more entrepreneurial, have experienced a productivity shock, be more politically connected, etc.
How much this matters depends on the setting, and on what your goal is. The bias from self-selection may be second-order in some settings – for example, in comparing matching estimators to experimental estimates of migration from Tonga to New Zealand, John Gibson, Steven Stillman and I find the matching estimators overstate the income gains to migrating by 20% - but given the experimental treatment effect is a 263% increase in income, the matching bias is small in comparison. This gets to the question of what types of questions need careful identification for learning, which I’ve written about here.