Explaining why we should believe your DiD assumptions


As a follow-up to my last post on new developments in difference-in-differences, where I ended by saying I would like to see more advice on how researchers should discuss the economic and rhetorical case for the validity of DiD, not just the statistical case, I thought I would look at how recent papers in the literature have discussed the issue. I took a quick look at recent (2021 and February 2022) issues of the QJE, and focused on three papers that rely on difference-in-differences:

·       Bleemer (2022)’s paper on the effect of ending race-based affirmative action in public universities in California, which compares changes in outcomes for Blacks and Hispanics (treated) to those for Whites and Asians (control).

·       Cantoni and Pons (2021)’s paper on the impact of strict ID laws on voting in the U.S., which uses the staggered introduction of strict ID requirements in 11 states, mostly with Republican state governments (treatment), and compares them to the remaining states (controls).

·       Cameron et al. (2021)’s paper on the impact of criminalizing sex work in Indonesia, which compares changes in outcomes in one district in East Java where sex work was suddenly criminalized (treatment), to changes in outcomes in two neighboring districts where the policy was not changed.

These papers all use a variety of statistical methods to assess and assert the validity of their DiD estimates – including some of the new methods that allow for staggered DiD, methods to examine robustness to violations of parallel trends, and wild bootstrap or permutation methods to account for small samples. But I found all three lacking in their discussion of why we should think the policy change is independent of future changes/why parallel trends should hold (identification), and of how much independent variation there is/what the equivalent randomized experiment would be (inference). Through these illustrations, my hope is to spur more thinking on this issue in applied research, and to get researchers to write more clearly about how they think about these issues in their contexts. It is commonplace in papers that use IV to spend a lot of time arguing for the plausibility of the ultimately untestable exclusion restriction, but DiD papers have not done the same for the untestable assumption about future trends.

Why should we think parallel trends will hold/why should we think the policy change is not forward-looking?

While papers often look at pre-trends to provide some support for the parallel trends assumption, the concern is whether parallel trends would continue to hold in the future in the absence of treatment. This is not something that is statistically testable, and so must be argued through knowledge of the context. But none of the papers I looked at did this.

For example, I would have liked to see:

·       In the Bleemer (2022) paper, discussion of why we should think trends for Blacks and Hispanics will parallel those of Whites and Asians, especially given that the initial levels are different. For example, one of the outcomes the author looks at is wages and employment. But there is research showing that Black and Hispanic workers have much more cyclical unemployment and earnings than White workers – so it is not clear that we should expect the groups to react the same way when the economy suffers shocks. This is a case where I would be more convinced if there were some matching on observables, to take a subset of both groups that seemed much more similar to one another, for whom labor market movements may be more plausibly parallel. Another concern would be one of concavity in gains: perhaps White and Asian workers, who already have higher levels of education, might be expected to start leveling off in the future, since there is less room for improvement.

·       In the Cantoni and Pons (2021) paper, it is unclear why we should think changes in voting in largely Republican-run states should continue to parallel those in largely Democratic-run states. For example, one concern might be that Republican state lawmakers start imposing these laws because they see demographics starting to change in their states and want to stop changes in vote shares before they start occurring. The paper provides no discussion of why particular states decided to introduce the laws when they did. It is also unclear whether, in the counterfactual where Republican legislatures did not impose stricter ID laws, we should also expect them not to have implemented other policies that might influence voting.

·       In the Cameron et al. (2021) paper, there is at least a discussion of what the public rationale given for the policy change was (a “birthday present” for the district influenced by religious pressure), and survey evidence that it was unexpected. But we might expect lawmakers who start criminalizing sex work for religious reasons to either reflect changing attitudes in the district and/or to also be implementing other policies that might have affected sex work even if they hadn’t criminalized it. There is again no discussion in the paper of why we should think trends would have continued to be parallel.

As these examples hopefully illustrate, it is not enough to just look at pre-trends and statistical comparisons: we want authors to explicitly note what the main concerns to identification might be in their context, and then discuss what contextual or economic evidence they have that the policy change is not forward-looking in any way, and that parallel trends would continue to hold in the absence of treatment. This echoes the conclusion of Jon Roth’s forthcoming AER Insights paper, where he says:

“I urge researchers to use context-specific economic knowledge to inform the discussion and analysis of possible violations of parallel trends. Bringing economic knowledge to bear on how parallel trends might plausibly be violated in a given context will yield stronger, more credible inferences than relying on the statistical significance of pre-trends tests alone.”
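To make concrete why a passing pre-trends test is not the same as parallel trends holding post-treatment, here is a minimal Python sketch with entirely made-up numbers (not from any of the papers discussed). Two groups have identical pre-period trends, but the treated group's counterfactual trend shifts after the policy date for reasons unrelated to treatment, so every pre-period placebo DiD is exactly zero while the naive DiD estimate is still biased:

```python
import numpy as np

# Hypothetical illustration: parallel trends hold before t = 0,
# but the treated group's UNTREATED trend would have changed after t = 0
# (e.g. a group-specific shock), violating parallel trends post-treatment.
periods = np.arange(-3, 4)          # event time; treatment starts at t = 0
true_effect = 2.0

# Potential outcomes without treatment: parallel pre-period, diverging after.
control = 10 + 1.0 * periods
treated_untreated = 5 + 1.0 * periods + np.where(periods >= 0, 0.5 * periods, 0)
treated_observed = treated_untreated + np.where(periods >= 0, true_effect, 0)

# Placebo "pre-trends test": difference in period-to-period changes pre-treatment.
pre = periods < 0
pre_did = np.diff(treated_observed[pre]) - np.diff(control[pre])
print("pre-period placebo DiDs:", pre_did)   # all exactly zero: test passes

# Naive DiD: (post minus pre mean for treated) minus (same for control).
post = periods >= 0
did = (treated_observed[post].mean() - treated_observed[pre].mean()) \
    - (control[post].mean() - control[pre].mean())
print("DiD estimate:", did, "vs true effect:", true_effect)  # biased upward
```

The point of the sketch is simply that the pre-trends test is silent about the counterfactual after treatment, which is why context-specific arguments are needed.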

How much independent variation is there really/what would the equivalent randomized experiment look like?

In my last blog post, I mentioned how Roth et al. (2022) discuss the classic example of DiD analysis of a minimum wage change in New Jersey vs Pennsylvania, where there are N=2 states. I found all three papers also a bit unclear in thinking about the policy changes from a design-based perspective, which has implications for inference. For example:

·       In the Bleemer (2022) paper, the equivalent randomized experiment would seem to be one that randomly chose whether to have affirmative action for one or zero of the two groups (Blacks and Hispanics vs. Whites and Asians). The policy is a state-wide policy change, and so this seems like the NJ/PA minimum wage case.

·       In the Cantoni and Pons (2021) paper, 11 out of the 50 states get treated. But they are mostly Republican-led states, and it does not seem like we should think of this as a situation where 11 out of 50 states were randomly chosen to have strict ID rules. It is unclear whether we should instead think of small groupings of states (e.g. the South, the Midwest, etc.), or think of this as something that could only plausibly be implemented in a subset of states (those that aren’t heavily Democratic), or something else.

·       In the Cameron et al. (2021) paper, the authors survey 17 worksites across their 3 districts, and so bootstrap standard errors among these 17 worksites. But the effective level of policy implementation here is the district, and there are only 3 districts, so it seems the equivalent randomized experiment is instead one that chooses 1 out of 3 districts to treat.

Thus in all cases, it seems we have to rely heavily on the functional form assumptions coming from correctly specified parallel trends. As Roth et al. (2022) suggest, we should then be concerned about shocks at the level of effective policy decision-making as violations of this assumption, which should be addressed through methods that examine robustness to violations of parallel trends. But again, this is an area where I think there is a lack of clarity in current discussions of DiD, more advice is needed, and authors need to be more explicit about what they are assuming.
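To illustrate the design-based inference point with made-up numbers (a hypothetical sketch, not the Cameron et al. analysis): when the policy is assigned at the district level and only 1 of 3 districts is treated, there are only 3 possible placebo assignments, so a permutation (randomization-inference) p-value can never fall below 1/3, no matter how many worksites are surveyed within each district:

```python
import numpy as np

# Three districts, one actually treated; values are invented for illustration.
district_change = np.array([-4.0, 0.5, 1.0])   # outcome change per district
treated = 0                                     # index of the treated district

def did_stat(assignment):
    # DiD-style statistic: treated district's change minus mean control change.
    controls = np.delete(district_change, assignment)
    return district_change[assignment] - controls.mean()

observed = did_stat(treated)
# Placebo distribution: pretend each district in turn had been treated.
placebo = np.array([did_stat(d) for d in range(len(district_change))])

# Permutation p-value: share of assignments at least as extreme as observed.
p_value = np.mean(np.abs(placebo) >= np.abs(observed))
print("observed DiD:", observed, "permutation p-value:", p_value)  # p = 1/3
```

With only 3 possible assignments, even the most extreme observed statistic yields p = 1/3, which is the sense in which the worksite-level bootstrap overstates the amount of independent variation.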

Authors

David McKenzie

Lead Economist, Development Research Group, World Bank

Join the Conversation

An economist who uses DD
January 26, 2022

Hi David,

Thanks for this thoughtful and thought-provoking blog post. Two questions:

1) You say that "Thus in all cases, it seems like we have to rely heavily on the functional form assumptions coming from correctly specified parallel trends"

I didn't follow this part. Could you expand/clarify why, from the examples you gave, it follows that identification relies on functional form assumptions? Isn't that pretty much always the case in DD papers that use observational data (like the three examples above)?

2) Do you have any examples of a paper that (i) Uses a DD strategy on observational data, and (ii) provides a thoughtful discussion of the two points you mention? (i.e., why parallel trends should hold (identification) and how much independent variation there is in the data (inference)).

Thanks!

January 26, 2022

Thanks, I was looking for a recent example that did this very well, and in my (admittedly quick) search did not come across anything. One of the comments I received on Twitter pointed to Gentzkow (2006) (https://web.stanford.edu/~gentzkow/research/tv_turnout.pdf). It does a good job of discussing explicitly some of the threats to identification: e.g. "Although the panel design controls for both cross-sectional differences correlated with the timing of television’s introduction and common changes over time, the results could still be biased if there were negative shocks to information or voting that hit the largest and richest cities in the mid-1940s, medium-sized cities in the late-1940s, and small cities and rural areas in the early-1950s. One way to address this issue is to make use of observable county characteristics."… (read the paper to see what he does). He does not do so well on the inference discussion – standard errors are clustered by county, but TV markets are much bigger.