# What Are We Estimating When We Estimate Difference-in-Differences?

Difference-in-differences estimation is one of the most widely used quasi-experimental tools for measuring the impacts of development policies. In 2018, I calculate that more than 5 percent of articles published in the *Journal of Development Economics* used a difference-in-differences (or “DD”) methodology. In DD estimation, a researcher compares the change in outcomes in a (non-random) treatment group before vs. after treatment to the change in outcomes in a comparison group over the same time period (even though the comparison group never received treatment). The figure below illustrates the basic idea.

The DD estimate of the treatment effect is: (B – A) – (D – C). Intuitively, pre-treatment differences between the treatment group and the comparison group reflect selection bias, while pre-period vs. post-period changes in outcomes within the comparison group reflect time trends. The DD approach removes these confounds (under certain assumptions) by differencing them out, leaving us with a credible quasi-experimental estimate of the treatment effect of interest. A more detailed overview of DD methods is available here.
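The arithmetic can be sketched in a few lines of Python. This assumes that A and B are the treatment group's pre- and post-period mean outcomes and that C and D are the comparison group's; the function name `did_2x2` and all of the numbers are invented for illustration.

```python
def did_2x2(treat_pre, treat_post, comp_pre, comp_post):
    """Two-by-two difference-in-differences from four group means."""
    return (treat_post - treat_pre) - (comp_post - comp_pre)

# Hypothetical group means (not taken from any real data set)
A, B = 40.0, 55.0   # treatment group: pre-period, post-period
C, D = 50.0, 58.0   # comparison group: pre-period, post-period

dd = did_2x2(A, B, C, D)
print(dd)  # (55 - 40) - (58 - 50) = 7.0
```

The pre-period gap between the groups (A vs. C) drops out of the calculation, as does the change common to both groups.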

The DD approach is often used to estimate the impacts of policies that are implemented at different times in different regions, for example policies like Medicaid and food stamps that were implemented at different times in different U.S. states. In development, this approach was recently used by Jesse Anttila-Hughes, Lia Fernald, Paul Gertler, Patrick Krause, and Bruce Wydick to study the impacts of infant formula on child mortality in low- and middle-income countries. In such settings, researchers typically implement the DD using two-way fixed effects models controlling for both period-specific and unit-specific shocks.
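For concreteness, a two-way fixed effects specification of this kind is usually written along the following lines. The notation is an assumption introduced here for illustration, not taken from the papers mentioned above: $y_{it}$ is the outcome for unit $i$ in period $t$, $\alpha_i$ and $\gamma_t$ are unit and period fixed effects, and $D_{it}$ is a dummy equal to one once unit $i$ has been treated.

$$
y_{it} = \alpha_i + \gamma_t + \beta D_{it} + \varepsilon_{it}
$$

The coefficient $\beta$ is what gets reported as the DD estimate.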

Though the two-way fixed effects approach to DD is widely used, the formal justification for treating it as a DD estimator is often fairly ad hoc. A recent working paper by Andrew Goodman-Bacon, an Assistant Professor of Economics at Vanderbilt University, looks under the hood of the two-way fixed effects approach to DD estimation. Goodman-Bacon shows that any two-way fixed effects estimate of DD relying on variation in treatment timing can be decomposed into a weighted average of all possible two-by-two difference-in-differences estimators that can be constructed from the panel data set.

Consider a (hypothetical) data set comprising two types of countries: one group that made primary education free in 2000, and another group that made primary education free in 2005. Call the first group the “early” countries and the second group the “late” countries. Suppose you had data on primary school completion rates every year from 1990 to 2010. Such a data set allows for two distinct DD comparisons. First, we could focus on the period from 1990 to 2004 (the left panel in the figure below). In that time frame, the late adopter countries are “never treated” in the sense that they do not implement free primary education – so they can be used as a comparison group to estimate the impact of free primary on completion rates in the early adopter countries. However, we can also construct a second DD estimate of the treatment effect of free primary school by focusing on the years 2001 to 2010. During that period, the treatment status of the early adopters never changes – they remain treated throughout – so they can be used as a comparison group to estimate the impact of free primary on completion rates in the late adopting countries.

Goodman-Bacon shows that any two-way fixed effects estimate of DD with variation in treatment timing can be decomposed in this way. It is a weighted average of (1) comparisons between (relatively) early adopters and later adopters over the periods when the later adopters are not yet treated, (2) comparisons between early adopters and later adopters over the periods when the early adopters are treated – so that they can be used as a comparison group for the later adopters, and (3) comparisons between different timing groups (e.g., early adopters or later adopters) and the never-treated group, if there is one.
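The two timing comparisons in the free primary education example can be sketched as below. This is a noise-free toy panel, not real data: the baseline levels, the common trend, and the constant effect of 5 points are all invented, and the helper names (`outcome`, `did`) are hypothetical. With a constant treatment effect, both two-by-two comparisons return the same number.

```python
from statistics import mean

ADOPT = {"early": 2000, "late": 2005}   # adoption years from the example
LEVEL = {"early": 60.0, "late": 50.0}   # hypothetical baseline completion rates
TREND = 0.5                             # hypothetical common yearly trend
EFFECT = 5.0                            # hypothetical constant treatment effect

def outcome(group, year):
    """Completion rate: group level + common trend + effect once treated."""
    treated = year >= ADOPT[group]
    return LEVEL[group] + TREND * (year - 1990) + EFFECT * treated

def did(treat, comp, start, end):
    """2x2 DD for `treat` vs. `comp` over start..end, split at treat's adoption year."""
    years = range(start, end + 1)
    pre = [y for y in years if y < ADOPT[treat]]
    post = [y for y in years if y >= ADOPT[treat]]

    def change(group):
        return mean(outcome(group, y) for y in post) - mean(outcome(group, y) for y in pre)

    return change(treat) - change(comp)

# Early adopters vs. not-yet-treated late adopters, 1990 to 2004
dd_early = did("early", "late", 1990, 2004)
# Late adopters vs. already-treated early adopters, 2001 to 2010
dd_late = did("late", "early", 2001, 2010)
print(dd_early, dd_late)
```

The sketch stops at the individual two-by-two estimates; it does not compute the weights that the decomposition attaches to each of them.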

Beyond being interesting in its own right, this decomposition has several important implications. First, two-way fixed effects estimates of DD that rely on variation in treatment timing only recover the average treatment effect when treatment effects are homogeneous. When treatment effects are heterogeneous across units, OLS over-weights units with more variance in treatment status in order to achieve a more precise estimate of the treatment effect. Hence, units that are treated near the middle of the evaluation window receive relatively more weight. If one wishes to estimate the average treatment effect on the treated, some sort of re-weighting is required.
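To see why mid-window adopters carry more weight, note that a 0/1 treatment dummy that is switched on for a share p of the periods has variance p(1 - p), which peaks at p = 0.5. The sketch below computes that term for three hypothetical adoption years in a 1990 to 2010 panel. It illustrates only the variance ingredient; the actual weights in the decomposition also depend on group sizes, and the function name is invented here.

```python
def treatment_variance(adopt_year, first_year=1990, last_year=2010):
    """Variance of a 0/1 treatment dummy that switches on in adopt_year and stays on."""
    n_years = last_year - first_year + 1
    share_treated = (last_year - adopt_year + 1) / n_years
    return share_treated * (1 - share_treated)

# Hypothetical adoption years: near the start, the middle, and the end of the panel
for year in (1992, 2000, 2008):
    print(year, round(treatment_variance(year), 3))
```

The unit adopting in 2000 has the largest treatment variance of the three, so it is the one OLS leans on most heavily.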

Second, and more troublingly, DD estimates are biased when treatment effects change over time within unit. Intuitively, this occurs because already treated units serve as controls in some of the two-by-two DDs underlying the weighted average. When treatment effects are not constant over time (so, for example, the treatment effect in the first year after treatment differs from the treatment effect five years after treatment), using already treated units as controls necessarily biases estimates of the treatment effect (by introducing a term representing the change in the treatment effect on the already treated units). In such situations, Goodman-Bacon’s analysis shows that two-way fixed effects estimators are not appropriate, and alternative approaches (e.g., event study estimation) should be used.
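A small noise-free example makes the problem concrete. It assumes, purely for illustration, that the treatment effect grows by one point for each year of exposure and that the early and late groups adopt in 2000 and 2005; the helper names are hypothetical. Comparing late adopters to already-treated early adopters then differences out not only the common trend but also the continued growth of the early group's effect.

```python
from statistics import mean

ADOPT = {"early": 2000, "late": 2005}
GROWTH = 1.0   # hypothetical: effect grows by 1 point per year of exposure

def effect(group, year):
    """Treatment effect in `year`: zero before adoption, then rising with exposure."""
    return GROWTH * max(0, year - ADOPT[group] + 1)

def outcome(group, year):
    # Common linear trend plus the time-varying treatment effect (no noise)
    return 50.0 + 0.5 * (year - 1990) + effect(group, year)

def did(treat, comp, start, end):
    """2x2 DD for `treat` vs. `comp` over start..end, split at treat's adoption year."""
    years = range(start, end + 1)
    pre = [y for y in years if y < ADOPT[treat]]
    post = [y for y in years if y >= ADOPT[treat]]

    def change(group):
        return mean(outcome(group, y) for y in post) - mean(outcome(group, y) for y in pre)

    return change(treat) - change(comp)

# Average effect on late adopters over their treated years, 2005 to 2010
true_late = mean(effect("late", y) for y in range(2005, 2011))
# Late adopters vs. already-treated early adopters, 2001 to 2010
dd_late = did("late", "early", 2001, 2010)
print(true_late, dd_late)  # 3.5 -1.5
```

In this toy case the true average effect on the late adopters is 3.5, yet the two-by-two estimate is -1.5, because the early group's effect rose by 5 points on average between the two sub-periods. The early-versus-late comparison over 1990 to 2004 is unaffected, since its comparison group is not yet treated.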

In light of this, it is important to check whether common trends are satisfied in any panel data set being used for DD analysis. When treatment turns on at different times, one way to do this graphically is to re-center and stack all the possible two-by-two DD comparisons. Critically, identification relies on common trends both before and after treatment – as discussed above. The good news is that, because of the re-weighting of the underlying two-by-two DD estimators, some violations of common trends are worse than others. Goodman-Bacon provides instructions for testing “variance-weighted common trends” to assess the severity of any observed deviations.

If this all sounds like it raises the bar for DD analysis: it does. Fortunately, there’s already a Stata command to help you implement Goodman-Bacon’s DD decomposition. Search for “bacondecomp” and you will find Goodman-Bacon, Goldring, and Nichols’ (2019) helpful tool for decomposing and plotting the underlying variation in your two-by-two DD estimates.

Thank you for writing this so concisely. I have been seeing snippet conversations of this on Twitter and was having a hard time following along.

Can you point me to background on how event study estimators differ from a D-in-D with two-way fixed effects? I think folks often refer to the latter (perhaps incorrectly) as an "event study D-in-D".

Hi Kelly!

Yes, people sometimes use the terms interchangeably, but I think of the event study specification as a diff-in-diff that includes lagged treatment dummies (and potentially leads as well) - in other words, separate dummies for the 1, 2, 3, etc. years/periods after treatment starts. A recent paper that focuses more on the event study specification is here: https://arxiv.org/pdf/1804.05785.pdf.

Pam

"When treatment effects are not constant over time (so, for example, the treatment effect in the first year after treatment differs from the treatment effect five years after treatment), using already treated units as controls necessarily biases estimates of the treatment effect (by introducing a term representing the change in the treatment effect on the already treated units)."

I do not think "biases estimates" is a good enough criticism. In what way does it bias the estimate? If the treatment effects are positive and increase over time, using already treated units as controls will, as far as I can see, bias the estimates downwards. If the effects are still significant, one can always add the qualifier that the "true effects", if such a thing exists, must be higher.

You can see Andrew’s discussion of the bias on pages 11-13 of the paper: https://cdn.vanderbilt.edu/vu-my/wp-content/uploads/sites/2318/2019/07/….

If the treatment effect were monotonically increasing in magnitude, it’s true that you might be able to argue that the two-way FE estimator is a lower bound of some sort – though if this rate of increase were large enough, some of your two-by-two DD estimators might be signed incorrectly. And, of course, if treatment effects are initially large and then diminish as people adapt or equilibrium conditions change, then estimates might be biased in either direction.

Thanks Pamela! This is such great service. Question: what if the treatment is continuous (intensity rather than a binary on/off)? Is there a good solution out there (bacondecomp seems to only work with binary treatment)?

Have you seen this paper?

http://restud.com/wp-content/uploads/2017/08/MS19615manuscript.pdf

Don’t know of a good blog summary of that one, though...

Can these pages from the World Bank blog be offered in printable versions?

Hi Pam. Thanks for this great summary of Goodman-Bacon's paper. One thing that is not clear to me is what happens when we estimate a two-way fixed effects triple DD. You mention that "First, two-way fixed effects estimates of DD that rely on variation in treatment timing only recover the average treatment effect when treatment effects are homogeneous. When treatment effects are heterogeneous across units, OLS over-weights units with more variance in treatment status in order to achieve a more precise estimate of the treatment effect." So, if we include another interaction (with some X) with the aim of accounting for the treatment heterogeneity, might we be able to correct this kind of bias?

This is an excellent blog, Pamela! Thank you so much for writing this up. I now finally understand all the implications of a two-way fixed effects approach to a DD estimate.

Question: Have you tried to estimate the "variance-weighted common trends", using the Stata code? It is not clear to me from the help file how to do that. According to the paper, one needs two types of weights. But the estimation only stores one type of weights.

Hi Pamela, thanks very much for the blog post, super interesting! After reading this, I was wondering if you've come across situations where units may receive the treatment not just once over the total estimation window, but more than once. For instance, referring to the hypothetical dataset you discuss in the blog, this would be the case if a country made primary school free in 2000, then non-free again in 2002, and free again in 2005. When units are countries, this sounds unrealistic. But to use a more micro example, referring to the paper you cited in a comment above (https://arxiv.org/pdf/1804.05785.pdf), this would be the case if individuals are hospitalized more than once over the total window, as opposed to once only. To handle this, would you have to remove units whose treatment path isn't 0 up to an event date, and then 1 afterwards? Or is there a way to explore these multiple discontinuities within units? Many thanks!

Hi Jerome, I am facing the same problem. I have a treatment that varies over time (i.e., a municipality treated in one year may or may not be treated the next year). I am wondering whether you now have an informed answer about how to deal with this type of heterogeneity. Thank you!

Hi Ana. The basic intuition is the same as in the DD case. When you run two-way fixed effects (TWFE), it is equivalent (by the Frisch-Waugh-Lovell theorem) to running a regression of the residuals from a regression of y on your TWFE on the residuals from a regression of your treatment dummy on TWFE. The weight that any observation gets in your estimation is proportional to the residual from that latter regression (of treatment on the fixed effects). We get particularly worried when some treated observations receive *negative* weight, which can happen when you have an observation from a high-average-treatment unit observed in a high-average-treatment period - and that is *much* less likely when treatment turns off and on, because you don't see most of the units treated in the later periods. But in general, with TWFE you are assuming that there is a linear relationship between residualized outcomes and residualized treatment, and when that doesn't hold, you can get an estimated effect that is outside the support of the actual treatment effects on individual units. I've written a bit more on this here in case it is helpful: https://arxiv.org/abs/2103.13229.
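The residual-on-residual logic can be sketched on a tiny, made-up balanced panel with staggered adoption (three units, four periods; all names and numbers here are invented for illustration). In a balanced panel, the residual from regressing a variable on unit and period dummies is the value minus its unit mean, minus its period mean, plus the grand mean, so no regression library is needed.

```python
# Hypothetical balanced panel: rows are units, columns are periods 0..3.
D = [
    [0, 1, 1, 1],   # adopts in period 1
    [0, 0, 1, 1],   # adopts in period 2
    [0, 0, 0, 1],   # adopts in period 3
]

def two_way_residuals(x):
    """Residuals from regressing x on unit and period dummies (balanced panel only)."""
    n_units, n_periods = len(x), len(x[0])
    unit_mean = [sum(row) / n_periods for row in x]
    period_mean = [sum(row[t] for row in x) / n_units for t in range(n_periods)]
    grand_mean = sum(unit_mean) / n_units
    return [[x[i][t] - unit_mean[i] - period_mean[t] + grand_mean
             for t in range(n_periods)] for i in range(n_units)]

def twfe_coefficient(y, d):
    """TWFE slope via Frisch-Waugh-Lovell: residualized y on residualized d."""
    ry, rd = two_way_residuals(y), two_way_residuals(d)
    num = sum(a * b for row_y, row_d in zip(ry, rd) for a, b in zip(row_y, row_d))
    den = sum(b * b for row in rd for b in row)
    return num / den

# Treated observations whose residualized treatment (and hence weight) is negative
resid_D = two_way_residuals(D)
negative_treated = [(i, t) for i in range(3) for t in range(4)
                    if D[i][t] == 1 and resid_D[i][t] < 0]
print(negative_treated)

# Outcomes with unit effects, period effects, and a constant effect of 2.0
y = [[10 * i + t + 2.0 * D[i][t] for t in range(4)] for i in range(3)]
beta = twfe_coefficient(y, D)
print(round(beta, 6))
```

Here the earliest adopter in the last period is a treated observation with a negative residual: it comes from the unit with the highest average treatment, observed in the period with the highest average treatment. With a constant effect the slope still recovers 2.0; the negative weight only causes trouble once effects differ across observations.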

Pam

THANK YOU!!!

Hi, Pamela. I’m considering a DID-style design in which the post status is not absorbing: like a recession episode, it only lasts for several years (say, from t1 to t2). In this case, can I claim that I’m using a DID design, or merely estimating the average treatment effect from an event study? A related question: if I add a triple-difference term in a further estimation exercise, can I claim that I’m using a DDD design, or is it just an event study with some heterogeneity? Thank you.