Maybe 2022 will be the year in which the new difference-in-differences (DiD) literature has matured enough that we no longer need to learn about a new paper questioning the old ways of doing things every few days. After an explosion of recent DiD papers, there is at least a sense that the literature is now mature enough to overview and synthesize. In December, I linked to one new overview paper by Clément de Chaisemartin and Xavier D’Haultfoeuille. The new year has seen it joined by a really nicely written and easy-to-read working paper with the lovely (crowd-sourced) title “What’s Trending in Difference-in-Differences?” by Jonathan Roth, Pedro Sant’Anna, Alyssa Bilinski and John Poe. I thought I’d summarize some key takeaways from the Roth et al. paper. They note that most of the new DiD literature can be seen as relaxing the assumptions of the classic DiD model (two periods, parallel trends, and a large sample of independent units) in one of three ways: introducing multiple periods and staggered treatment timing; considering potential violations of parallel trends; and departing from the assumption of observing a sample of many independent clusters sampled from a super-population. I’ll highlight some key lessons on each, tie them to some of our previous blogs, and then note some further things to consider.
1. Dealing with multiple periods and staggered treatment timing
The basic issue here should be very familiar by now to most readers of our blog, since we have had at least three posts (post 1, post 2, post 3) on the issues that can arise when there is heterogeneity of treatment effects over time or across units. The key issue is that once you have treatment effect heterogeneity, there is no single treatment effect, and instead you will recover a weighted average of different comparisons between units. The standard two-way fixed effects model (see Wooldridge for an expanded model that can work) will not only make “clean” comparisons between treated and not-yet-treated units, but also “forbidden” comparisons such as comparing later-treated to earlier-treated units. This can result in negative weights being used in the weighted average, and can even result in the weighted average having the opposite sign to the treatment effects of all the individual units. There are a variety of different estimators that rely on only using “clean” controls, and with such a proliferation of suggested alternatives, researchers may be unsure which to use. A few key points the Roth et al. paper makes on this are:
· The most important one perhaps for applied researchers is that their practical experience is that these heterogeneity-robust DiD estimators typically produce similar answers to one another – so the most important thing is to use at least one of these new methods.
· Once we allow for heterogeneity of treatment effects, there is no longer a single estimand of interest – and there are decisions to be made about what weighted average you care about. For example, suppose your treatment is implemented at the state level, and seven states adopted the treatment in 1992, two states in 1994, one state in 1996, and twenty states in 2000. You now have four cohorts of treated units. But even if we take an “event-study” parameter as our estimand of interest, where we want the weighted average of the treatment effects k years after implementation, we still need to decide whether we should weight each of the four cohorts equally, or give more weight to cohorts where more states were treated. So you need to be very clear about what precisely you are estimating, and why it is an economically interesting weighted average.
· The different estimators vary in which units they use as clean controls, e.g. whether they use the never treated, the not-yet-treated, or just the last-to-be-treated; in the number of pre-periods they use in constructing the differences (e.g. whether you just difference based on the period right before treatment, or use multiple pre-treatment periods); and in the weights they use in aggregating. They also vary in what they assume about parallel trends – whether they need to hold for all combinations of periods and groups, or whether weaker parallel trends assumptions are needed. In practice this can result in a trade-off between requiring stronger assumptions for the benefit of getting more precision, or allowing for weaker assumptions but with potentially noisier estimates.
· Things get even more complicated when treatments are continuous, or when they turn on and off over time, and typically even stronger assumptions are required in these cases.
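To make the "forbidden comparison" point concrete, here is a minimal sketch in Python. It is not taken from the Roth et al. paper: the two-unit, three-period panel, the effect sizes, and all variable names are hypothetical, chosen so the arithmetic can be checked by hand. Untreated outcomes are set to zero, so parallel trends holds exactly and the only thing going wrong is the treatment effect growing with time since treatment.

```python
import numpy as np

# Hypothetical noise-free panel: two units, three periods (0, 1, 2).
# Untreated outcomes are 0 everywhere, so parallel trends holds exactly
# and each observed outcome equals that unit's treatment effect.
treat_start = {"early": 1, "late": 2}   # first treated period (assumed)
effect_by_exposure = {0: 1.0, 1: 4.0}   # assumed effect: 1 on impact, 4 a period later

rows = []
for unit, start in treat_start.items():
    for t in range(3):
        d = int(t >= start)
        outcome = effect_by_exposure[t - start] if d else 0.0
        rows.append((unit, t, d, outcome))

# Two-way fixed effects regression: y on intercept, unit dummy, period dummies, D.
X = np.array([[1.0, u == "late", t == 1, t == 2, d] for u, t, d, _ in rows],
             dtype=float)
y = np.array([r[3] for r in rows])
twfe = np.linalg.lstsq(X, y, rcond=None)[0][-1]   # coefficient on D

Y = {(u, t): y_ut for u, t, _, y_ut in rows}

# "Clean" 2x2: early unit switches on in period 1, late unit is not yet treated.
clean = (Y["early", 1] - Y["early", 0]) - (Y["late", 1] - Y["late", 0])

# "Forbidden" 2x2: late unit switches on in period 2, and the already-treated
# early unit (whose effect is still growing) serves as the control.
forbidden = (Y["late", 2] - Y["late", 1]) - (Y["early", 2] - Y["early", 1])

print(f"TWFE coefficient:      {twfe:.2f}")
print(f"Clean comparison:      {clean:.2f}")
print(f"Forbidden comparison:  {forbidden:.2f}")
```

In this toy panel every treatment effect is positive (1 or 4), yet the two-way fixed effects coefficient is -0.5: it is the equal-weighted average of the clean comparison (1) and the forbidden comparison (-2). The heterogeneity-robust estimators differ in detail, but they share the idea of keeping only comparisons like the first one.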
2. Relaxing or allowing the parallel trends assumption to be violated
We have also discussed some of this recent literature in two posts (post 1, post 2), where we discussed when the parallel trends assumption would be more plausible (see also this post), issues involved in testing for pre-trends (including low power and pre-testing bias), and robustness approaches that allow for some violations of parallel trends. Some additional things that I think are important from this section of the paper are:
· In contexts with staggered treatment timing, the problems with standard two-way fixed effects estimation also affect the pre-trend coefficients in the typical event-study plot used to illustrate no pre-trends, so the new methods for dealing with staggered DiD are needed when looking at pre-trends too.
· Beware of adding covariate*time interactions. The authors note that the parallel trends assumption may be more plausible conditional on covariates. One common way applied researchers try to deal with this is to add interactions of some covariates with time. However, they note that if there are heterogeneous treatment effects that also depend on the covariate, this can lead to biased estimates for the ATT, and so semi- or non-parametric methods that require weaker homogeneity assumptions can be used.
· Alternative approaches to allow for parallel trends conditional on Xs include combining matching or inverse-probability weighting with DiD, with doubly-robust estimation methods now available that are valid if either the propensity-score model or the outcome model is correctly specified.
· While Ancova is your friend for RCTs, be a bit cautious about matching on pre-treatment outcomes in non-experimental DiD. This point came up in a comment Jon wrote on my adversarial or “long and squiggly” post on DiD: the concern is that if the treatment and control units do have different outcome distributions pre-treatment, then matching may select control units that had particularly large shocks right before treatment, and that then experience mean reversion, which biases the estimated effect.
· The idea of using multiple control groups to bound violations of parallel trends – they give an example where groups are industries, and one set of controls is from industries that are more cyclical than the treated group, and another control group is from less cyclical industries – and so we might expect the counterfactual movement of the economy for the treated group to lie between these two.
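The multiple-control-group idea in the last bullet amounts to simple arithmetic, sketched below in Python. The function name `did_bounds` and all the numbers are hypothetical illustrations, not values from the paper. The sketch assumes that the treated group's counterfactual change lies between the changes observed in the two control groups, which is exactly the assumption a researcher would need to defend.

```python
def did_bounds(treated_change, control_changes):
    """Bound the treatment effect on the treated, assuming the treated group's
    counterfactual change lies between the smallest and largest observed
    control-group changes."""
    lower = treated_change - max(control_changes)
    upper = treated_change - min(control_changes)
    return lower, upper


# Hypothetical before-after changes in the outcome for each group.
treated_change = 5.0
control_changes = {
    "more_cyclical_industries": 4.0,
    "less_cyclical_industries": 1.0,
}

lower, upper = did_bounds(treated_change, control_changes.values())
print(f"Effect bounded between {lower} and {upper}")
```

With these made-up numbers, each control group on its own would give a point estimate (1 using the more cyclical industries, 4 using the less cyclical ones), and the bracketing assumption turns those two estimates into the interval from 1 to 4 rather than forcing a choice between them.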
3. Relaxing sampling assumptions
The third set of recent literature they discuss looks at how to do inference under deviations from the assumption that we have a large number of independent clusters from a super-population.
· A first case is when there are not that many clusters (e.g. only a few states get treated). Solutions include model-based approaches, wild-bootstrap, and permutation-based approaches. As part of this discussion, they offer their take on how to think about the seminal Card and Krueger (1994) minimum wage study, which compared employment in New Jersey and Pennsylvania after New Jersey raised its minimum wage. Since there are only two states here, one approach would be to just take these states as fixed, and then model state-level shocks as a violation of the parallel trends assumption, using the methods discussed in part 2 above.
· A second case is using design-based inference, where the set of units (e.g. states) is not assumed to come from a bigger population, and instead randomness just comes from treatment assignment. The main takeaway from this approach is to cluster at the level of effective treatment assignment.
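As a rough illustration of the permutation-based approach with few clusters, the sketch below reassigns treatment across states in every possible way and asks how unusual the observed DiD estimate is. Everything here is hypothetical: the six states, their before-after changes, and the helper name `diff_in_mean_changes`. It also assumes treatment is as good as randomly assigned across states, and it tests the sharp null of no effect for any state, so it should be read as one simple variant rather than the specific procedure any paper recommends.

```python
from itertools import combinations

# Hypothetical before-after change in the outcome for each state (cluster).
changes = {"A": 5.0, "B": 4.0, "C": 1.0, "D": 0.5, "E": -0.5, "F": 0.0}
treated = ("A", "B")  # states that actually received treatment (assumed)


def diff_in_mean_changes(treated_set):
    """DiD estimate: mean change among treated states minus mean change
    among the remaining (control) states."""
    t = [changes[s] for s in treated_set]
    c = [v for s, v in changes.items() if s not in treated_set]
    return sum(t) / len(t) - sum(c) / len(c)


observed = diff_in_mean_changes(treated)

# Permute at the level of treatment assignment: every way of picking
# two treated states out of six (the observed assignment is included).
placebo = [diff_in_mean_changes(combo)
           for combo in combinations(changes, len(treated))]

# Two-sided p-value: share of assignments at least as extreme as observed.
p_value = sum(abs(s) >= abs(observed) - 1e-12 for s in placebo) / len(placebo)

print(f"Observed DiD estimate: {observed:.2f}")
print(f"Permutation p-value:   {p_value:.3f} ({len(placebo)} assignments)")
```

With two treated states out of six there are only 15 possible assignments, so the smallest attainable p-value is 1/15, which is what this example produces. That floor is a useful reminder of how little information a handful of clusters can carry.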
Essential readings: The paper finishes with two really useful tables for applied researchers. The first is a checklist for DiD practitioners and the second is a table of statistical packages in R and Stata for implementing all these new methods.
What’s missing? The paper does a great job of drawing out the main lessons and synthesizing the approaches in this new literature, and hopefully makes it much easier for people to navigate. It explains the key issues without going too heavily into formulas, and should be really helpful for thinking through what statistical methods to use.
The tricky part that is not covered is advice on the rhetorical/contextual/economic, as distinct from econometric, issues that arise in using these methods. That is, while the new papers clarify very well the statistical assumptions needed for estimation, effective use of these methods also requires being able to understand what the threats to these assumptions are in different contexts, and to make a plausible rhetorical argument as to why we should think the assumptions hold. This is much harder to codify, but a body of folk wisdom has accumulated over time that it would be nice to also collect in an article providing advice for practitioners. Perhaps an excellent topic for a future blogpost.