# Why is Difference-in-Difference Estimation Still so Popular in Experimental Analysis?

David McKenzie pops out from under many empirical questions that come up in my research projects, which still surprises me every time it happens, despite his prolific output. The last time it happened was a teachable moment for me, so I thought I’d share it in a short post that fits nicely under our “Tools of the Trade” tag.

“Beyond Baseline and Follow-up: The Case for More T in Experiments,” is a paper David blogged about here more than three years ago. One of the implications of the analysis in that paper is as follows: “When autocorrelations are low, there are large improvements in power to be had from using ANCOVA instead of difference-in-differences in analysis.” Simply put, ANCOVA implies controlling for the baseline (lagged) value of the outcome variable in the regression rather than differencing it out in the more common difference-in-difference (DD) specification.
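To make the two specifications concrete, here is a minimal sketch on simulated two-period data (the setup and variable names are illustrative, not from the study): DD regresses the change in the outcome on treatment, while ANCOVA regresses the follow-up outcome on treatment and the baseline value.

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho, effect = 500, 0.2, 0.5          # illustrative sample size, autocorrelation, true effect
treat = rng.integers(0, 2, n)           # random assignment to treatment

# Two rounds of a unit-variance outcome with autocorrelation rho
y0 = rng.normal(size=n)                                           # baseline
y1 = rho * y0 + np.sqrt(1 - rho**2) * rng.normal(size=n) + effect * treat

# Difference-in-differences: regress the change on treatment
X_dd = np.column_stack([np.ones(n), treat])
b_dd, *_ = np.linalg.lstsq(X_dd, y1 - y0, rcond=None)

# ANCOVA: regress follow-up on treatment, controlling for the lagged outcome
X_anc = np.column_stack([np.ones(n), treat, y0])
b_anc, *_ = np.linalg.lstsq(X_anc, y1, rcond=None)

print(b_dd[1], b_anc[1])   # both are unbiased for the treatment effect
```

Both coefficients recover the same treatment effect in a randomized experiment; the difference, as the paper shows, is in their variances when ρ is low.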

Despite the fact that this is a highly cited paper (108 times since 2012 according to Google Scholar), my impression is that using ANCOVA instead of DD has not yet become standard practice in the typical scenario of an experiment with one baseline and one follow-up (or multiple follow-ups, each of which is analyzed separately to assess the trajectory of impacts). As the implications for power can REALLY matter when the autocorrelation of the outcome variable is low, I thought I’d give an example here from my own work to perhaps convert a few more applied researchers.

In a cluster-randomized experiment to improve the quality of caregiving at childcare centers in Malawi, we assigned 200 centers to four treatments and sampled 12 three- and four-year-old children from each center. While the final outcomes are developmental assessments at the child level, a plausible pathway towards such improvements is a transformation of the classrooms: how caregivers interact with the children, what activities are being conducted, what play and learning materials are available, etc. To measure these intermediate outcomes, we had two trained enumerators sit in each center for 1-2 hours and record a checklist of 30+ items. We collected these data at baseline before random assignment of schools into different treatment groups, then at first follow-up and second follow-up. The default plan was to conduct a DD analysis for both the final outcomes at the child level and the intermediate outcomes at the center level.

However, it turns out that while our child-level outcomes are highly autocorrelated – a common finding in studies with test scores – the index of classroom observations is not: the autocorrelation coefficient is less than 0.2. This means that a slight baseline imbalance between two treatment arms is not very predictive of the difference between those arms at follow-up. David’s paper shows that it is inefficient to fully correct for such baseline imbalances: the exact ratio of DD variance to ANCOVA variance is 2/(1+ρ), where ρ is the autocorrelation coefficient, meaning that the DD variance is about 68% larger than the ANCOVA variance in my case, where ρ=0.19.
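The variance ratio is easy to check numerically. A quick sketch, plugging the paper's 2/(1+ρ) formula into ρ = 0.19:

```python
rho = 0.19
ratio = 2 / (1 + rho)      # Var(DD) / Var(ANCOVA), from McKenzie's formula
print(ratio)               # ≈ 1.68: DD variance is about 68% larger

# Standard errors scale with the square root of the variance,
# so t-stats under DD shrink by a factor of sqrt(ratio)
print(ratio ** 0.5)        # ≈ 1.30
```

Note that at ρ = 1 the ratio is exactly 1 (DD loses nothing), while at ρ = 0 it is 2: with no autocorrelation, differencing out the baseline doubles the variance relative to ANCOVA.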

So, what ends up happening when I estimate the effect of the combined treatment vs. the control group on the index of classroom observations? With DD, I get a large standardized effect of 0.45 standard deviations that is not statistically significant at the 90% level of confidence (t-stat=1.46). Using ANCOVA, I get a similarly large and educationally meaningful effect of 0.58 SD that is statistically significant at the 99% level of confidence (t-stat=2.6). The effect sizes from the two specifications are not identical because of a small and insignificant imbalance at baseline of 0.15 SD (we blocked the school-level randomization on averages of child-level outcomes like test scores and anthropometrics rather than on this variable, hence the random variation). Note that the t-stat would still have gone down from 2.6 to 1.9 by moving from ANCOVA to DD even if the effect size had remained identical at 0.58; alternatively, the smaller effect size of 0.45 would still have a t-stat of about 2 with the ANCOVA standard errors.
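The counterfactual t-stats in the parenthetical can be reproduced as back-of-the-envelope arithmetic, using only the reported point estimates and t-stats to back out the implied standard errors:

```python
# Implied standard errors from the reported estimates and t-stats
se_ancova = 0.58 / 2.6     # ≈ 0.223 SD
se_dd     = 0.45 / 1.46    # ≈ 0.308 SD

# The same 0.58 effect with DD standard errors:
print(0.58 / se_dd)        # ≈ 1.9
# The smaller 0.45 effect with ANCOVA standard errors:
print(0.45 / se_ancova)    # ≈ 2.0
```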

So, here is a case where what you tell your counterpart at the ministry of education depends on how you choose to analyze your data: is it a large effect that we don’t have the power to detect, or a similarly large effect that is very significant by conventional standards in economics and education? My interpretation of David’s paper is that it’s foolish to leave statistical power unused just because DD has been the default specification for many of us analyzing such data in experiments. My solution so far, in sharing the findings with colleagues and at a presentation to our counterparts in Malawi, has been to present the ANCOVA results as the preferred estimates, while also mentioning the loss of precision when DD is employed – the latter providing the most conservative estimate of the classroom effects.

The paper has much more practical advice and is a must-read for those designing studies or analyzing data in two-round RCTs with economic outcomes that are not highly autocorrelated. In particular, it discusses how to trade off the number of survey rounds against the cross-sectional sample size, and how to divvy up a fixed number of rounds between pre- and post-treatment. For example, in our case, with autocorrelation so low, even three pre-treatment rounds of data collection would not give us more power than a simple comparison of post-treatment outcomes, but more post-treatment rounds (perhaps centered around the one-year and two-year follow-ups) would have led to more power by averaging out the random noise. Given that the two follow-ups are only a year apart, we might also present an average post-treatment effect rather than the more standard reporting of effects separately at first and second follow-up. The paper also makes a point that turned out to be inherent to our study: we’re interested in multiple outcomes, some of which are highly autocorrelated while others are not. We knew the former at the outset but only found out the latter after the first follow-up. In such cases, even if you knew everything before designing your study, your choices would involve some difficult trade-offs.
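The gain from averaging follow-up rounds can be illustrated with a standard result (not specific to the paper): the variance of the mean of r equicorrelated rounds with correlation ρ is σ²(1 + (r−1)ρ)/r, so low ρ means each extra round averages away a lot of noise.

```python
rho = 0.19   # autocorrelation between follow-up rounds (illustrative)

def var_of_mean(r, rho, sigma2=1.0):
    """Variance of the average of r rounds with pairwise correlation rho."""
    return sigma2 * (1 + (r - 1) * rho) / r

print(var_of_mean(1, rho))   # 1.0: a single follow-up round
print(var_of_mean(2, rho))   # 0.595: two rounds cut the variance by ~40%
```

By contrast, at ρ close to 1 the second round would add almost nothing, which is why the n-vs-T choice hinges on the autocorrelation of the outcome.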

Next time you’re analyzing data from an RCT with T=2 using an outcome with low autocorrelation, remember that power gains from using ANCOVA are not just hypothetical: they can be quite large.

## Join the Conversation

Thanks!!! A good reminder now that we are writing a pre-analysis plan... Particularly useful if you know the autocorrelation of outcomes after the baseline.

This is a great paper and I've read it many times. Especially well taken is the trade-off between n and T given a fixed number of surveys K=nT, but it's not always a rectangular problem: additional rounds may come with considerable fixed costs.