When your difference-in-differences has too many differences


Violations of the parallel trends assumption

Difference-in-differences is a popular method for estimating program impacts using non-experimental data. However, it relies crucially on the assumption of parallel trends — that individuals receiving the program would have, on average, experienced the same changes in their outcomes had they not received the program as individuals who did not receive the program.

 

This motivates a natural question: what should you do when you want to estimate difference-in-differences, but the parallel trends assumption is violated? Recent papers have proposed methods that make difference-in-differences estimates robust to violations of parallel trends (discussed on this blog here). Relatedly, recent work has shown how parallel trends violations can be introduced by dynamic treatment effects combined with staggered adoption, and amplified by treatment effect heterogeneity (discussed in this blog here). As a complementary approach, in this post we go back to basics and revisit a more old-school way of addressing the issue: how do we estimate difference-in-differences when parallel trends holds, but only among individuals with the same observable characteristics?

 

Parallel trends conditional on observables

We discuss a generalization of the parallel trends assumption: parallel trends conditional on observables. This generalization is fairly natural. We assume that individuals with a particular set of fixed observable characteristics who receive the program would have, on average, experienced the same changes in their outcomes had they not received the program as individuals with the same characteristics who did not receive the program.
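In potential-outcomes notation (the symbols here are our own shorthand, assumed for illustration rather than taken from any of the papers discussed in this post), let $Y_{it}(0)$ be the outcome of individual $i$ in period $t \in \{0, 1\}$ without the program, $D_i$ an indicator for receiving the program, and $X_i$ the fixed observable characteristics. The two versions of the assumption can then be sketched as:

```latex
% Unconditional parallel trends
\mathbb{E}[Y_{i1}(0) - Y_{i0}(0) \mid D_i = 1]
  = \mathbb{E}[Y_{i1}(0) - Y_{i0}(0) \mid D_i = 0]

% Parallel trends conditional on observables, for each value x of X_i
\mathbb{E}[Y_{i1}(0) - Y_{i0}(0) \mid D_i = 1, X_i = x]
  = \mathbb{E}[Y_{i1}(0) - Y_{i0}(0) \mid D_i = 0, X_i = x]
```

The second condition only compares program recipients to non-recipients who share the same value of $X_i$, which is why it can hold even when the first fails.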

 

To give one common example, suppose we were interested in the impact of beginning to export on firm productivity. Parallel trends may be unlikely to hold: firms that begin to export were likely already on very different productivity trajectories from the average firm. It is worth noting that these firms look very different from the average firm: they tend to be much larger at baseline, and likely operate in more dynamic industries. It may be much more likely that parallel trends holds conditional on baseline firm size and industry: this would be true if firms that began to export would have experienced the same productivity growth had they not begun to export as firms that were the same size at baseline and in the same industry, but did not begin to export. Alternatively, we may think parallel trends conditional on observables does not hold in this context: this would be the case if firms that begin to export would have experienced faster productivity growth, had they not begun to export, than other firms that are the same size at baseline and in the same industry that did not begin to export.

 

So suppose parallel trends holds, but only conditional on observables: in this case, how do we estimate difference-in-differences? We discuss three approaches below, and close with a point on inference. In each case, we discuss a paper that applies that approach. To focus entirely on allowing for parallel trends conditional on observables, we’ll only consider examples with two time periods, where one group receives the program in the second period; as discussed at the start of this blog post, additional concerns emerge with dynamic treatment effects and either multiple waves of adoption or when individuals that always receive the program are used as a comparison group.

 

Introducing observable controls

First, we can introduce time-invariant controls into our difference-in-differences estimating equation. Simply adding these observables as linear controls is not enough: difference-in-differences already differences out the level effect of any time-invariant characteristic, observed or unobserved, so level controls do nothing to address differential trends. The concern is instead that units with different observables may be on different trends. One therefore needs to control for the interaction of the time-invariant observables with time fixed effects.
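To make this concrete, here is a minimal simulated sketch in Python (using NumPy) for the two-period case. Everything in it is an assumption for illustration: the data-generating process, the variable names (`x`, `d`, `post`), and the true effect of 2.0 are invented and do not come from any of the papers discussed. It compares a plain difference-in-differences regression, one that adds the baseline characteristic as a level control, and one that also adds its interaction with the post-period indicator.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000
true_effect = 2.0

# Hypothetical baseline characteristic (e.g. standardized firm size).
x = rng.normal(size=n)
# Units with larger x are more likely to be treated ...
d = (x + rng.normal(size=n) > 0).astype(float)
# ... and are also on steeper outcome trends, so unconditional parallel
# trends fails while parallel trends conditional on x holds.
y0 = x + rng.normal(size=n)
y1 = y0 + 0.5 + 1.0 * x + true_effect * d + rng.normal(scale=0.5, size=n)

# Stack the two periods into a long (balanced) panel.
y = np.concatenate([y0, y1])
post = np.concatenate([np.zeros(n), np.ones(n)])
D = np.concatenate([d, d])
X = np.concatenate([x, x])
ones = np.ones(2 * n)


def did_coef(columns):
    """OLS of y on the given columns; returns the coefficient on D*post,
    which must be the fourth column."""
    design = np.column_stack(columns)
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[3]


base = [ones, D, post, D * post]
naive = did_coef(base)                       # no controls
level_control = did_coef(base + [X])         # x in levels only
interacted = did_coef(base + [X, X * post])  # x and x interacted with time

print(f"no controls:           {naive:.2f}")
print(f"x as level control:    {level_control:.2f}")
print(f"x and x*post controls: {interacted:.2f}  (true effect {true_effect})")
```

In this simulated balanced panel, adding `x` in levels leaves the estimate where it was, while adding `x * post` moves it close to the true effect. With more than two periods, the analogous specification would interact `x` with each period indicator.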

 

For example, Duflo (2001) [working paper] estimated the impact of a national school construction program on educational attainment and wages. The paper implements difference-in-differences in cohort age (comparing younger cohorts directly impacted by the program to older cohorts that had already completed primary school) and district-of-birth (comparing districts with high program intensity to districts with low program intensity). However, a key concern is that parallel trends may not hold: districts with high program intensity tended to have lower baseline enrollment rates (as the program explicitly targeted low enrollment rate districts) and were likely to receive other government programs, and baseline enrollment and access to other government programs may predict growth in educational attainment and wages. Duflo (2001) therefore also presents specifications that control for the interaction of time fixed effects with baseline enrollment and the intensity of a national water and sanitation program, and shows that the results are unaffected.

 

Propensity score weighting

Second, we can combine difference-in-differences with propensity score weighting. As noted previously on this blog, propensity score weighting does not have the best reputation, as results can often be sensitive to seemingly arbitrary specification choices. However, matching-based approaches to causal inference in panel data have become more popular with the development of synthetic control methods (discussed on this blog here), leveraging the intuition that combining panel data methods with matching-based methods can provide robustness that neither set of methods alone can. With difference-in-differences, propensity score weighting involves the following steps:

  1. Estimate the probability of receiving the program as a function of observable characteristics.
  2. Reweight observations so that the treatment group and the control group have identical reweighted distributions of this probability.
  3. Run difference-in-differences using the weights above.

A wide variety of approaches have been proposed for Steps 1 and 2, and they are well beyond the scope of this blog post. Importantly, after Step 2, standard placebo checks can be run for both cross-sectional analysis (comparing characteristics of the reweighted treatment and control groups) and difference-in-differences (comparing trends in pre-periods between the reweighted treatment and control groups).
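The three steps can be sketched on simulated two-period data as follows. This is only one of many possible implementations, under stated assumptions: the propensity score is a logit in a single invented covariate `x`, fitted with a few Newton iterations; control units are weighted by p/(1-p) so that they resemble the treated group; and all names and parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
true_effect = 2.0

# Simulated data: treatment probability and outcome trends both depend on x.
x = rng.normal(size=n)
d = (rng.uniform(size=n) < 1 / (1 + np.exp(-x))).astype(float)
# Change in the outcome between the two periods.
dy = 0.5 + 1.0 * x + true_effect * d + rng.normal(scale=0.5, size=n)

# Step 1: estimate the probability of treatment with a logit in x,
# fitted by Newton's method.
Z = np.column_stack([np.ones(n), x])
beta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-Z @ beta))
    hessian = (Z * (p * (1 - p))[:, None]).T @ Z
    beta = beta + np.linalg.solve(hessian, Z.T @ (d - p))
pscore = 1 / (1 + np.exp(-Z @ beta))

# Step 2: reweight controls by p / (1 - p); treated units keep weight 1.
weights = np.where(d == 1, 1.0, pscore / (1 - pscore))
treated, control = d == 1, d == 0

# Balance check: the reweighted control mean of x should be close to
# the treated mean of x.
x_treated = x[treated].mean()
x_control_raw = x[control].mean()
x_control_weighted = np.average(x[control], weights=weights[control])

# Step 3: difference-in-differences on the outcome changes, using the weights.
naive_did = dy[treated].mean() - dy[control].mean()
weighted_did = dy[treated].mean() - np.average(dy[control], weights=weights[control])

print(f"mean x: treated {x_treated:.2f}, controls {x_control_raw:.2f}, "
      f"reweighted controls {x_control_weighted:.2f}")
print(f"unweighted DiD {naive_did:.2f}, weighted DiD {weighted_did:.2f}")
```

The balance comparison on `x` is the cross-sectional placebo check mentioned above; with additional pre-periods, the same weights could be used to compare pre-period trends.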

 

For example, Smith & Todd (2005) [working paper] apply difference-in-differences with propensity score weighting to evaluate the impacts of a job training program targeting unemployed workers in the United States on earnings. They revisit the influential analysis by LaLonde (1986), who found that non-experimental methods (including difference-in-differences) failed to reproduce experimental estimates of the impacts of the same job training program. To do so, they build on Dehejia & Wahba (1999), who find that non-experimental methods are more effective in that context with the inclusion of better control variables. They then implement a variety of approaches, and find that the combination of difference-in-differences and propensity score weighting is most robust to changes in specification.

 

Doubly robust methods

Lastly, controls interacted with time fixed effects (as in Duflo (2001)) and propensity score weighting (as in Smith & Todd (2005)) can be combined to produce estimates that are more robust to misspecification. In recent work, Sant’Anna & Zhao (2020) provide theory- and simulation-based justification for this intuition, along with an R package for implementing their preferred approach.
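The intuition can be illustrated with a simplified sketch for two-period panel data. It is written in the spirit of a doubly robust estimator and is not a reproduction of the authors' package or of their preferred first-step estimators: the helper names (`logit_pscore`, `dr_did`), the logit and linear working models, and the simulated data are all assumptions made for illustration. The idea is to subtract a control-group outcome regression from the outcome changes, and then compare treated units to propensity-score-reweighted controls.

```python
import numpy as np


def logit_pscore(Z, d, iters=25):
    """Fitted treatment probabilities from a logit of d on Z (Newton's method)."""
    beta = np.zeros(Z.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-Z @ beta))
        hessian = (Z * (p * (1 - p))[:, None]).T @ Z
        beta = beta + np.linalg.solve(hessian, Z.T @ (d - p))
    return 1 / (1 + np.exp(-Z @ beta))


def dr_did(dy, d, Z_ps, Z_or):
    """Doubly robust style DiD on outcome changes dy.

    Z_ps: regressors for the propensity score model.
    Z_or: regressors for the outcome-change regression fitted on controls.
    """
    ps = logit_pscore(Z_ps, d)
    control = d == 0
    gamma, *_ = np.linalg.lstsq(Z_or[control], dy[control], rcond=None)
    resid = dy - Z_or @ gamma            # outcome change net of predicted trend
    w_treat = d / d.mean()               # weights averaging over treated units
    w_ctrl = (1 - d) * ps / (1 - ps)     # odds weights on control units
    w_ctrl = w_ctrl / w_ctrl.mean()
    return np.mean((w_treat - w_ctrl) * resid)


rng = np.random.default_rng(2)
n = 5000
true_effect = 2.0
x = rng.normal(size=n)
d = (rng.uniform(size=n) < 1 / (1 + np.exp(-x))).astype(float)
dy = 0.5 + 1.0 * x + true_effect * d + rng.normal(scale=0.5, size=n)

full = np.column_stack([np.ones(n), x])  # correctly specified working model
const = np.ones((n, 1))                  # misspecified: ignores x entirely

both_right = dr_did(dy, d, full, full)
ps_wrong = dr_did(dy, d, const, full)
or_wrong = dr_did(dy, d, full, const)
both_wrong = dr_did(dy, d, const, const)

print(f"both models use x:        {both_right:.2f}")
print(f"propensity model wrong:   {ps_wrong:.2f}")
print(f"outcome model wrong:      {or_wrong:.2f}")
print(f"both wrong (plain DiD):   {both_wrong:.2f}  (true effect {true_effect})")
```

In this simulation the estimate stays close to the true effect as long as at least one of the two working models includes `x`, and collapses to the biased unadjusted difference-in-differences only when both ignore it.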

 

Inference can be messy

It’s common for applied researchers to work in settings where outcomes are significantly correlated across units, in which case using robust inference methods is important. This can be a challenge with multi-step propensity-score-based methods: using robust inference methods only in the second step does not account for the uncertainty introduced by the first step, and off-the-shelf propensity-score-based packages may not always allow for cluster-robust inference. In general, block bootstrap approaches to inference can correct for both correlated errors and multi-step estimation procedures.
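A block (cluster) bootstrap for a two-step estimator can be sketched as below. The key design choice is that whole clusters are resampled with replacement and both steps, the propensity score and the weighted difference-in-differences, are re-estimated on every bootstrap sample. The cluster structure, the helper names (`fit_pscore`, `weighted_did`), the logit working model, and the number of replications are all assumptions for illustration.

```python
import numpy as np


def fit_pscore(x, d, iters=25):
    """Step 1: logit of treatment on x, fitted by Newton's method."""
    Z = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(iters):
        p = 1 / (1 + np.exp(-Z @ beta))
        hessian = (Z * (p * (1 - p))[:, None]).T @ Z
        beta = beta + np.linalg.solve(hessian, Z.T @ (d - p))
    return 1 / (1 + np.exp(-Z @ beta))


def weighted_did(x, d, dy):
    """Steps 1-3: estimate the propensity score, reweight controls, run DiD."""
    ps = fit_pscore(x, d)
    w = ps / (1 - ps)
    treated, control = d == 1, d == 0
    return dy[treated].mean() - np.average(dy[control], weights=w[control])


rng = np.random.default_rng(3)
n_clusters, per_cluster = 100, 20
cluster = np.repeat(np.arange(n_clusters), per_cluster)
n = cluster.size

x = rng.normal(size=n)
d = (rng.uniform(size=n) < 1 / (1 + np.exp(-x))).astype(float)
# A cluster-level shock makes outcome changes correlated within clusters.
shock = rng.normal(scale=0.5, size=n_clusters)[cluster]
dy = 0.5 + 1.0 * x + 2.0 * d + shock + rng.normal(scale=0.5, size=n)

estimate = weighted_did(x, d, dy)

# Block bootstrap: draw clusters with replacement, keep every unit in each
# drawn cluster, and redo the full multi-step procedure on each sample.
members = [np.flatnonzero(cluster == g) for g in range(n_clusters)]
n_boot = 200
draws = np.empty(n_boot)
for b in range(n_boot):
    sampled = rng.integers(0, n_clusters, size=n_clusters)
    idx = np.concatenate([members[g] for g in sampled])
    draws[b] = weighted_did(x[idx], d[idx], dy[idx])

se = draws.std(ddof=1)
ci = np.percentile(draws, [2.5, 97.5])
print(f"estimate {estimate:.2f}, bootstrap SE {se:.3f}, "
      f"95% percentile interval [{ci[0]:.2f}, {ci[1]:.2f}]")
```

Because the propensity score is refitted inside `weighted_did` on every draw, the bootstrap distribution reflects first-step estimation error as well as within-cluster correlation.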