# An Adversarial or “Long and Squiggly” Test of the Plausibility of Parallel Trends in Difference-in-Differences Analysis

Don’t worry, this is not a complicated post on yet another of the expanding set of theory papers on difference-in-differences. Instead, I want to offer my heuristic thoughts on when I find graphs illustrating parallel trends to be more or less informative, and on what, in my mind, makes the parallel trends assumption more plausible. If you take anything from this post, it should be:

• Plot the raw treatment and control series, not just their difference; and

• The longer and squigglier your pre-treatment trends, the more plausible I find the parallel trends assumption.

To do this, I am going to consider the classic difference-in-differences case with no staggered timing, with a group that gets treated after time 0, and where the treated group trend jumps up by 2 units immediately after treatment. I will show four cases A, B, C and D that plot the treatment means and the means for a control or comparison group, and argue that the parallel trends assumption seems more and more plausible to me as we move from one case to the next – and then draw out the lessons we learn from this. While this post is about difference-in-differences, the same ideas apply to synthetic controls.
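To make this setup concrete, here is a minimal Python sketch of the simple two-group, before/after difference-in-differences calculation. The numbers are illustrative assumptions rather than the data behind the figures: a control series with a linear trend, three pre-treatment periods (-2, -1, 0), and a treated series that tracks it exactly and then jumps by 2 units from period 1 onward. The function name `did_estimate` is hypothetical.

```python
from statistics import mean

def did_estimate(treated, control, periods, treat_start=1):
    """Change in the treated mean (post minus pre) minus the same change for controls."""
    pre = [i for i, t in enumerate(periods) if t < treat_start]
    post = [i for i, t in enumerate(periods) if t >= treat_start]
    d_treated = mean(treated[i] for i in post) - mean(treated[i] for i in pre)
    d_control = mean(control[i] for i in post) - mean(control[i] for i in pre)
    return d_treated - d_control

# Illustrative (made-up) group means: treatment happens after time 0.
periods = list(range(-2, 4))                      # -2, -1, 0, 1, 2, 3
control = [10 + 0.5 * t for t in periods]         # linear trend, untreated
treated = [c + (2.0 if t >= 1 else 0.0)           # same path, +2 once treated
           for c, t in zip(control, periods)]

print(did_estimate(treated, control, periods))
```

Because the two series are exactly parallel before treatment in this toy example, the estimate recovers the 2-unit jump. The rest of the post is about when we should believe that the control series is a good stand-in for the treated group's counterfactual path.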

Recall that the parallel trends assumption is that the untreated units provide a good counterfactual of the trend that the treated units would have followed if they had not been treated. This is ultimately an untestable assumption, but there is a long tradition of viewing this assumption as being more plausible if the trends are parallel in the pre-treatment period. In a couple of previous posts I discuss formal approaches to pre-trend testing, and robustness approaches that allow for some deviations from parallel trends.

Case A vs Case B: Parallel Trends with and without level differences.

Let’s start by considering the following two cases. Both have three rounds of pre-treatment data, and show parallel linear trends with the exact same slope for the treatment and control groups pre-treatment. The only difference between the two cases is that the two groups start much further apart in levels in Case A than in Case B, and so the event study plot would be identical in both cases. Yet if you gave me a choice between these two control groups and asked whether I had a preference, I would always want Case B over Case A – and I think almost all applied researchers would agree.

This point, that the parallel trends assumption will be more plausible if the treatment and control groups are more similar in levels to begin with, and not just in trends, is one made in Kahn-Lang and Lang (2019), and discussed in my previous post. They discuss it with regard to the concern that whatever led to the initial difference in levels in Case A could very well also lead to future differences in trends.

Suppose the outcome here is income. Then your adversarial referee might ask what sort of process would lead us to think income would grow by a constant amount in levels year after year? Maybe it makes sense to think instead of a constant percentage growth in the absence of the intervention, and look at log incomes. Then in Case A, the control group is having faster percentage income growth pre-treatment than the treatment group, and so parallel trends is not holding. While the log mean and mean of the logs will differ unless the distributions are also the same, the percent growth rates are going to be much closer to parallel in Case B than Case A.
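The functional-form point can be checked with a few lines of arithmetic. The sketch below uses made-up group means (they are assumptions, not the values plotted in Cases A and B): the treated group and both candidate control groups grow by exactly 1 unit per period in levels, but the Case A control starts far below the treated group while the Case B control starts close to it. The helper `pre_trend_gap` is a hypothetical name, and taking the log of group means is only an approximation to working with mean log income, as noted above.

```python
import math

def pre_trend_gap(treated, control, transform=lambda x: x):
    """Treated change minus control change, first to last pre-period."""
    t_change = transform(treated[-1]) - transform(treated[0])
    c_change = transform(control[-1]) - transform(control[0])
    return t_change - c_change

# Made-up pre-treatment group means; every series rises by 1 unit per period.
treated_pre = [10.0, 11.0, 12.0]
control_a = [5.0, 6.0, 7.0]       # Case A: large gap in levels
control_b = [10.5, 11.5, 12.5]    # Case B: similar levels

for name, ctrl in [("A", control_a), ("B", control_b)]:
    in_levels = pre_trend_gap(treated_pre, ctrl)
    in_logs = pre_trend_gap(treated_pre, ctrl, transform=math.log)
    print(f"Case {name}: gap in levels = {in_levels:.3f}, gap in logs = {in_logs:.3f}")
```

In levels, both pre-trend gaps are zero. In logs, the Case A control grows noticeably faster than the treated group, while the Case B gap stays close to zero, which is the sense in which Case B is more robust to the referee's choice of functional form.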

Case B vs Case C: more or less pre-treatment data

Let’s keep Case B, and compare it to a new Case C, which has 10 periods of pre-treatment data rather than the three in Case B. Again, if given the choice, I would always prefer to have the data in Case C over that in Case B. From a statistical viewpoint, you might want this because it gives you more power for testing for a parallel trend in the pre-treatment period, or because it gives you narrower confidence intervals for your bounds if you make assumptions on how sharply any differences in trends can evolve (as in the approach of Rambachan and Roth (2019)). But the reason I find it more comforting is that Case C offers a lot more time periods in which a difference in trends could have emerged, and yet none did. It also helps in ruling out the possibility of Ashenfelter dips or anticipation effects that might not get caught with only a few rounds of pre-treatment data.

Case C vs Case D: more or less “squiggly”

Finally, let’s compare Case C to another case that also has 10 periods of pre-treatment data (Case D). The treatment-control difference is exactly the same in the two cases, so in an event study plot that just plots the treatment-control difference against time, you would not be able to tell these two cases apart. But I think Case D gives me a lot more confidence in the plausibility of parallel trends than Case C. It is often quite easy to mimic a linear trend. That is why we have fantastic websites that show lots of ridiculous associations like the strong association between per capita cheese consumption and the number of people who died by becoming tangled in their bedsheets. What we want to do is put the parallel trends assumption under stress. We want a bunch of shocks to have hit the treatment group before the intervention, and to see whether the comparison group moves in the same way during these shocks. So in Case D, we see that when the treatment group had a sudden decline in periods -8 and -2, so did the comparison group, and when it had a sudden increase in period -4, so did the comparison group. This will be even better if we can bring in some outside knowledge and understand whether these are simply seasonal patterns or actually reflect some shocks, but both are useful to observe.
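A small sketch shows why the event study plot cannot separate these two cases while the raw series can. All numbers are assumptions chosen for illustration: a linear trend over periods -9 to 0, a constant treated-control gap of 0.5, and common shocks in periods -8, -4 and -2 for Case D. The `squiggliness` measure (the standard deviation of period-to-period changes) is just one hypothetical way to quantify how much the pre-period stress-tests the comparison, not a formal statistic from the literature.

```python
from statistics import pstdev

periods = list(range(-9, 1))               # 10 pre-treatment periods: -9, ..., 0
trend = [10 + 0.5 * t for t in periods]    # smooth linear path
shocks = {-8: -1.5, -4: 1.5, -2: -1.5}     # hypothetical shocks hitting both groups

control_c = trend                                          # Case C: straight lines
treated_c = [y + 0.5 for y in control_c]
control_d = [y + shocks.get(t, 0.0) for y, t in zip(trend, periods)]  # Case D
treated_d = [y + 0.5 for y in control_d]

def gaps(treated, control):
    """Treated minus control in each period: what an event study plot shows."""
    return [t - c for t, c in zip(treated, control)]

def squiggliness(series):
    """Standard deviation of period-to-period changes; 0 for a straight line."""
    changes = [b - a for a, b in zip(series, series[1:])]
    return pstdev(changes)

print(gaps(treated_c, control_c) == gaps(treated_d, control_d))
print(squiggliness(control_c), squiggliness(control_d))
```

The period-by-period gaps are identical in the two cases, so the difference plot is the same. Only the raw series reveal that in Case D the comparison group followed the treated group through several sharp movements, whereas in Case C it only had to match a straight line.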

Putting all this together, I am going to find parallel trends more convincing if the treatment and control series are similar in levels, track one another over a long number of periods before treatment, and the pre-trend lines that track one another are really squiggly. Plotting the raw data for the two series is crucial for being able to judge this – so don’t just plot the difference in means over time as in a standard event study plot.

In practice, using matching to help identify a subset of the observations that look more similar on levels and pre-trends will often increase plausibility according to these criteria.
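As a rough sketch of what such matching can look like, the code below pairs a treated unit with the control unit whose whole pre-treatment path is closest in Euclidean distance, so that both levels and period-to-period movements count. The unit ids, the data, and the helper `nearest_control` are hypothetical, and this one-to-one matching with replacement is a simplification rather than the procedure used in any particular paper.

```python
import math

def nearest_control(treated_path, control_paths):
    """Id of the control unit whose pre-period path is closest to the treated path."""
    return min(control_paths,
               key=lambda cid: math.dist(treated_path, control_paths[cid]))

# Made-up pre-treatment outcome paths (one value per pre-period).
treated_path = [4.0, 5.5, 3.0, 4.5]
control_paths = {
    "c1": [8.0, 9.5, 7.0, 8.5],   # same squiggles, very different level
    "c2": [4.2, 5.4, 3.1, 4.6],   # similar level and similar squiggles
    "c3": [4.0, 4.2, 4.3, 4.5],   # similar level, but smooth
}

print(nearest_control(treated_path, control_paths))
```

Matching on the full path favors the control that is close in levels and also shares the ups and downs, which is the combination the criteria above reward. Matching only on a fitted slope would not distinguish between these candidates as sharply.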

A couple of practical examples

A first example comes from a paper I have forthcoming in the WBER with Gabriel Lara Ibarra and Claudia Ruiz Ortega. This evaluates the impact of financial literacy workshops given to credit card clients of a Mexican bank who are starting to run into difficulties with repayment. Clients were randomized to treatment or control, but take-up of the workshop was very low. Those who took it up were, to begin with, already more likely than the full control group to be paying more than the minimum payment (the level difference seen in the left panel of Example 1 below), but we cannot reject that both have a common linear trend before the intervention. Nevertheless, we would be nervous about relying on the parallel trends assumption and DiD with this full sample. Using nearest neighbor matching, we instead match the treatment group to a subset of the control group that had similar repayment behavior in terms of both levels and trends for a full 16 months pre-intervention (right panel). This includes a reasonable amount of squiggliness, with both series showing a drop together in mid-2015, then recovery. Assuming that these two groups, which had tracked each other so closely for 16 months, would have continued to track each other for at least a few more months had the workshops not happened seems a lot more plausible than making this assumption for the full sample.

Example 1: Making more than the minimum payment on their credit card

A second example comes from a paper (ungated version) published last year in the AEJ Applied by Tatyana Deryugina, Alexander MacKay and Julian Reif. Their goal is to examine how the price elasticity of demand for electricity evolves over time, using a policy in Illinois that generated shocks to residential electricity prices – treated communities are ones in which a referendum was passed on whether to participate in an aggregation program that affected the price of electricity, and control communities were ones where this was not passed. They note that electricity usage is highly seasonal, and the degree of seasonality varies widely across different communities. Comparing all treated communities to all control communities might then be problematic. They therefore use nearest neighbor matching, matching each treated community to its nearest five neighbors.

Example 2 (figure 3 in their paper) illustrates the trends. Panel A shows the full sample. Unfortunately, they have already done some form of seasonal adjustment and removed level differences, so we cannot see how similar or dissimilar the communities are on levels. We see that the (adjusted) treatment and full control communities track each other still reasonably well in the pre-treatment (pre-2011/12) period, but panel C shows that in any given month the deviation can be quite large. Panel B shows the matched series (again, unfortunately, not showing the levels). A nice feature is that they use only the 2008 and 2009 data for this matching, so can show that the two series continue to track each other nicely in 2010 and early 2011, with panel D showing an almost zero difference – and then after treatment the two series start to diverge a bit. The series is long, and encompasses some big spikes/shocks (the authors note events like heatwaves can cause spikes).
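The idea of matching on only part of the pre-treatment window and then checking the fit on the remaining pre-treatment periods can be sketched as follows. The series are invented for illustration, and `holdout_gap` is a hypothetical helper: the first `n_match` periods are assumed to be the ones used for matching, and the later pre-treatment periods are held out for validation.

```python
from statistics import mean

def holdout_gap(treated, control, n_match):
    """Mean absolute treated-control gap in pre-periods not used for matching."""
    held_out = zip(treated[n_match:], control[n_match:])
    return mean(abs(t - c) for t, c in held_out)

# Made-up pre-treatment series: first 4 periods used to match, last 2 held out.
treated = [5.0, 6.0, 4.0, 5.0, 6.5, 5.5]
tracks_well = [5.1, 5.9, 4.1, 5.0, 6.4, 5.6]   # close in both windows
overfit = [5.0, 6.0, 4.0, 5.0, 5.0, 7.0]       # perfect in-window, then diverges

n_match = 4
print(holdout_gap(treated, tracks_well, n_match))
print(holdout_gap(treated, overfit, n_match))
```

The second control fits the matching window perfectly but misses the held-out periods badly, which is the over-fitting concern: a good fit on the periods you matched on is not evidence in itself, while a good fit on periods you did not match on is.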

The treatment effect is pretty small here, so having this exact tracking of treatment and control over so many months is essential for helping to convince us that the post-treatment divergence is a real treatment effect and not just the type of pre-treatment spike seen in panel C.

## Authors

Steven Stillman
March 10, 2021

David, I'm curious about your thoughts on using a matching approach like the one you highlight here to "pick" the best control group versus using a synthetic control approach to do this and then integrating this into a d-in-d analysis. The ideas are clearly closely related and I don't think I've seen anyone do a direct comparison.

March 10, 2021

Steve, my understanding is that this integration of the synthetic control and DiD methods is part of the new synthetic difference-in-differences and matrix completion approaches. See e.g. Arkhangelsky et al. https://arxiv.org/abs/1812.09970. Another approach is the doubly-robust DiD method of Sant'Anna and Zhao: https://arxiv.org/abs/1812.01723. But I haven't seen work that compares the performance of these new approaches to the strategy of just using nearest neighbor matching to pick matches and then applying DiD.

March 10, 2021

Jonathan Roth, one of the researchers doing a lot of this new interesting DiD work, shared these comments that he gave me permission to share with everyone:

• I think it's really important that the researcher has a mental model of the types of confounding that might produce non-parallel trends in the post-treatment period and how those confounds are related (or not) to the types of confounds that affect pre-treatment outcomes. When you're talking about the benefits of having more periods and "squigglier" trends in the pre-period, I think the implicit model you have in mind is that there are shocks to the outcome for both treatment and control that are coming from a relatively stationary process, and you want to make sure those shocks have the same effect on the treatment and the comparison groups. That's very natural in a lot of contexts. But there are other cases where this might not be so sensible. Say people enroll in a job training program in the few months after they lose their job. You could have two groups of people who all had jobs over a period of 30 years, and some of them lost their job towards the end of the sample. Their earnings would probably track each other pretty closely up until very close to the time that the subset of people lost their jobs. So if you analyzed the data at the annual level, you'd have long and squiggly parallel pre-treatment trends, but that doesn't mean your research design is valid! What you'd really need to do to detect a problem is get more granular data around the time that the job loss occurred, in which case you'd see divergence just before they enroll. It's also not hard to imagine cases where the pre-treatment shocks are very different from the post-treatment shocks -- e.g., if I'm interested in the effects of early lockdown policies on overall mortality rates during covid, it wouldn't be very re-assuring to show me that mortality had long, squiggly, parallel pre-trends from 2000-2019. 
So I would try to emphasize that the convincing-ness of the pre-trends really depends on the types of shocks that you're worried about, and you should think economically about what is reasonable or not in a given context.
• Pardon my somewhat shameless plug of our own work, but your discussion of cases A and B is related to this paper of Pedro and mine (When Is Parallel Trends Sensitive to Functional Form?) which studies when the validity of the parallel trends assumption depends on functional form. We show that parallel trends doesn't depend on functional form if and only if you have a parallel trends-type assumption for the full distribution of outcomes. We also show that this is essentially equivalent to having (quasi-)randomized treatment or stationary potential outcomes, or some "mixture" of the two. So when evaluating the plausibility of parallel trends, I think it's important to consider the full distributions of the outcomes, and whether the mechanisms that could lead to a "robust" parallel trends assumption are plausible.
• Re matching and synthetic controls: I think it's very important to do the pre-treatment validation in periods that you didn't match on. Otherwise, there are concerns that you're just over-fitting the control group, and this could actually make things worse in practice (Daw & Hatfield have a nice paper about this (https://pubmed.ncbi.nlm.nih.gov/29957834/); I have some stuff related to this in my pre-testing paper as well). Again, I think having a context-specific motivated model of the types of things that could go wrong helps in deciding whether this is a concern or not.