Difference-in-differences (DiD) analysis is one of the most widely applicable methods of analyzing the impact of a policy change. Moreover, the analysis seemed very straightforward. For example, in the two-period case, we simply estimate the linear regression:
Y = a + b*Treated + c*Post + d*Treated*Post + e
Here we observe all units before treatment and then again afterwards, Treated is a dummy variable indicating whether or not a unit is treated, Post is a dummy variable indicating the post-treatment period, and d is our difference-in-differences estimator: the change in Y for treated units less the change in Y for control units. Until recently, the most complicated thing people typically worried about was getting the standard errors right.
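To make the two-period regression concrete, here is a minimal Python sketch using only the standard library. It simulates a two-period panel and checks that the coefficient d on Treated*Post equals the change in mean Y for treated units less the change for control units. The data-generating numbers, the seed, and the helper names (simulate, ols, did_from_means) are made up for illustration and do not come from any of the papers discussed.

```python
import random

def did_from_means(rows):
    """rows: list of (y, treated, post) tuples with treated, post in {0, 1}."""
    def cell_mean(t, p):
        ys = [y for y, tr, po in rows if tr == t and po == p]
        return sum(ys) / len(ys)
    change_treated = cell_mean(1, 1) - cell_mean(1, 0)
    change_control = cell_mean(0, 1) - cell_mean(0, 0)
    return change_treated - change_control

def ols(X, y):
    """Solve the normal equations (X'X) beta = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [v / A[col][col] for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]

def simulate(n_units=500, true_effect=2.0, seed=0):
    """Hypothetical panel: every unit is observed once before and once after."""
    rng = random.Random(seed)
    rows = []
    for i in range(n_units):
        treated = i % 2                 # half the units are treated
        unit_level = rng.gauss(0, 1)    # fixed unit-specific level
        for post in (0, 1):
            y = (1.0 + 0.5 * treated + 1.5 * post
                 + true_effect * treated * post + unit_level + rng.gauss(0, 1))
            rows.append((y, treated, post))
    return rows

rows = simulate()
X = [[1.0, t, p, t * p] for _, t, p in rows]   # columns: 1, Treated, Post, Treated*Post
y = [row[0] for row in rows]
a, b, c, d = ols(X, y)
did_means = did_from_means(rows)
print(f"regression d = {d:.4f}, difference of mean changes = {did_means:.4f}")
```

Because the regression is saturated in Treated and Post, the two numbers agree up to floating-point error. Neither calculation says anything about the standard errors, which would need to account for within-unit correlation.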
The last couple of years have seen an explosion of new papers about difference-in-differences estimation (see for example, Pam Jakiela’s recent post on Goodman-Bacon’s decomposition). This new literature has been building up on my to-read list, so I thought I’d tackle a few at a time, and give you a flavor of what some of this new work means for applying difference-in-differences in practice. A first set of papers looks at the key underlying assumption of difference-in-differences, the parallel trends assumption. Recall that this assumption is that the untreated units provide the appropriate counterfactual of the trend that the treated units would have followed if they had not been treated – that is, that the two groups would have had parallel trends.
A recent example debating whether parallel trends hold
A recent example that gained some attention comes from a difference-in-differences analysis by Kearney and Levine (2015), published in the AER. Their study looks at the impact of MTV’s 16 and Pregnant TV show on teen pregnancies, essentially comparing areas which had high pre-program viewership of MTV to areas with low pre-program viewership. Their finding is that this program reduced teen pregnancy rates. The authors conduct a test of parallel trends in pre-treatment periods, and cannot reject the null of parallel trends, which they use to bolster their case for the parallel trends assumption.
Jaeger, Joyce and Kaestner (2019) then re-analyze this case, and argue that there are reasons to believe the parallel trends assumption may not hold. They note, first, that pre-treatment viewership rates of MTV are correlated with factors like race and unemployment rates, and that controlling for differential time trends by an area’s racial/ethnic composition or unemployment rate causes the result to disappear. Moreover, they show that the pre-treatment parallel trends test is rejected if one looks over longer periods pre-treatment. Kearney and Levine argue back here.
When will the parallel trends assumption be more plausible?
Kahn-Lang and Lang (2019) use the 16 and Pregnant debate to make some more general points about DiD analysis and pre-trends.
1. DiD will generally be more plausible if the treatment and control groups are similar in LEVELS to begin with, not just in TRENDS. They note that any paper should address why the original levels of the experimental and control groups differ, and why we should think this same mechanism would not impact trends. For example, in the above case, why did areas differ in initial MTV viewership, and why should we believe this will be uncorrelated with future trends?
Implication for applied work: Always show a graph showing the levels of the two series you are comparing over time, not just their difference. I also prefer DiD on a matched sample for this reason – if you can make the levels more similar, I am more willing to think the trends will be too. Ryan et al. (2018) illustrate, via simulations, that matched DiD does well at dealing with non-parallel trends in a context of health policy interventions.
2. If the two groups aren’t similar ex ante in levels and distribution, then functional form assumptions matter a lot. As a simple case, if the teen pregnancy rates differ in levels between treatment and control areas beforehand, then parallel trends cannot hold simultaneously for both the level and log of the pregnancy rate – so we need to take a stand on whether we think they will evolve with the same absolute changes or the same percentage changes.
Implication for applied work: Be careful about functional forms, and justify your choice.
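The levels-versus-logs point in 2 can be checked with simple arithmetic. The sketch below uses made-up untreated pregnancy rates (hypothetical numbers, not taken from any of the papers) for two groups that start at different levels. It computes a placebo DiD, with no treatment anywhere, once in levels and once in logs.

```python
import math

def did(treat_pre, treat_post, ctrl_pre, ctrl_post, transform=lambda v: v):
    """2x2 difference-in-differences on cell means, after applying transform."""
    f = transform
    return (f(treat_post) - f(treat_pre)) - (f(ctrl_post) - f(ctrl_pre))

# Hypothetical untreated outcomes (say, births per 1,000 teens); nobody is treated.
ctrl_pre, ctrl_post = 20.0, 15.0
treat_pre = 40.0

# Case A: both groups experience the same absolute change (-5).
treat_post_a = treat_pre + (ctrl_post - ctrl_pre)    # 35.0
# Case B: both groups experience the same percentage change (-25%).
treat_post_b = treat_pre * (ctrl_post / ctrl_pre)    # 30.0

a_levels = did(treat_pre, treat_post_a, ctrl_pre, ctrl_post)
a_logs = did(treat_pre, treat_post_a, ctrl_pre, ctrl_post, transform=math.log)
b_levels = did(treat_pre, treat_post_b, ctrl_pre, ctrl_post)
b_logs = did(treat_pre, treat_post_b, ctrl_pre, ctrl_post, transform=math.log)

print(f"Case A (parallel in levels): DiD in levels = {a_levels:.3f}, in logs = {a_logs:.3f}")
print(f"Case B (parallel in logs):   DiD in levels = {b_levels:.3f}, in logs = {b_logs:.3f}")
```

In each case the placebo DiD is zero only under the functional form in which trends really are parallel. Under the other one it shows a spurious "effect", even though nothing happened to either group.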
What do we learn from failure to reject parallel trends in the pre-treatment data?
Kahn-Lang and Lang note that “Increasingly, researchers point to a statistically insignificant pre-trend test to argue that they therefore accept the null hypothesis of parallel trends. There is no doubt that testing for a common pre-trend plays an important role in validating the parallel trends assumption underlying DiD. However, failing to reject that outcomes in years prior to treatment exhibit parallel trends, should not be confused with establishing the validity of the parallel trends counterfactual. Moreover, clearly, not rejecting the null hypothesis is not equivalent to confirming it.”
Roth (2019) identifies a couple of key problems with the current practice of pre-trend testing for parallel trends, and offers an improved procedure.
1. These tests are often underpowered, and failure to reject parallel trends could mask important bias from non-parallel trends. Using data from 12 published DiD papers in AEA journals, he finds that the magnitude of violations of parallel trends against which there is 50 percent and 80 percent power can be sizeable, and often comparable in size to the estimated treatment effect.
2. Reporting DiD effects conditional on surviving a test of parallel trends introduces a pre-testing problem, which can exacerbate the bias from an underlying trend, and lead confidence intervals to have the wrong coverage rates. This is illustrated nicely in this three-period simulation, in which there really are non-parallel trends, with the treatment group increasing linearly relative to the control group. With sampling noise, the draws in which the treatment-control difference happens to be lower at baseline are the ones that flatten the pre-trend and lead to non-rejection of parallel trends (a horizontal line between t=-1 and t=0 would mean no pre-trend), but these same draws also overstate the treatment effect. The bottom figure illustrates this: in reality there is no treatment effect here, since the DiD is the same in the pre-period as in the post-period, but the cases with an insignificant pre-trend yield upward-biased treatment effect estimates, and the confidence intervals will undercover the true value.
Note that, in contrast, when there really are parallel trends, conditioning on surviving a parallel trends pre-test does not induce bias, but does result in confidence intervals being slightly too conservative.
3. His paper provides a method for constructing “corrected event-study” plots that correct for this pre-testing process. The paper provides some details for a median-unbiased estimator to do this, but I suspect many readers will welcome provision of the code for doing this when it becomes available.
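The low-power and pre-testing problems in points 1 and 2 can be reproduced in a stylized Monte Carlo. The sketch below is not Roth’s code or his corrected estimator. It is a minimal three-period illustration under assumed values: the treated-minus-control gap grows by delta each period, the true treatment effect is zero, and each period’s estimated gap carries independent normal noise with standard deviation sigma. The function name simulate_pretest and all parameter values are hypothetical.

```python
import math
import random

def simulate_pretest(delta=1.0, sigma=1.0, n_sims=50000, crit=1.96, seed=0):
    rng = random.Random(seed)
    se = math.sqrt(2) * sigma   # s.e. of a difference between two period estimates
    all_effects, passed_effects = [], []
    for _ in range(n_sims):
        # Estimated treated-minus-control gap at t = -1, 0, 1: a linear
        # pre-existing trend plus sampling noise. The true treatment effect is 0.
        gap = {t: delta * t + rng.gauss(0, sigma) for t in (-1, 0, 1)}
        pre_trend = gap[0] - gap[-1]    # placebo DiD over the pre-period
        effect = gap[1] - gap[0]        # DiD "treatment effect" estimate
        all_effects.append(effect)
        if abs(pre_trend) / se < crit:  # pre-test fails to reject parallel trends
            passed_effects.append(effect)
    return {
        "share_passing_pretest": len(passed_effects) / n_sims,
        "mean_effect_all": sum(all_effects) / len(all_effects),
        "mean_effect_passed": sum(passed_effects) / len(passed_effects),
    }

result = simulate_pretest()
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

With these assumed values, most draws pass the pre-test even though the trends are not parallel, which is the low-power problem. The average estimate among draws that pass is further from the true effect of zero than the average over all draws, which is the pre-testing problem.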
Pre-testing is not a substitute for logical reasoning
Kahn-Lang and Lang note that while Roth’s estimator is an improvement, it is not a substitute for the need for logical reasoning about why parallel trends should apply. They note that “authors should perform a thorough comparison of the differences between the treatment and control groups including demographic composition, other factors that could have differentially affected each group, and comparison of trends as far back as possible”.
Implication for applied work: if you are using DiD, you should have an explicit discussion in the paper of why it is reasonable to think the parallel trends assumption is justified, whether there were other policies or sectoral trends going on that might be a threat, etc. That is, don’t just say “we fail to reject parallel trends in the pre-period, suggesting that the DiD assumption is satisfied”.
Stay tuned for my next post, where I cover several papers on what to do when the parallel trends assumption does not seem to hold.