World Bank Blogs
http://blogs.worldbank.org/planet.xml
IBRD and IDA: Working for a World Free of Poverty.
Why is Difference-in-Difference Estimation Still so Popular in Experimental Analysis?
http://blogs.worldbank.org/impactevaluations/why-difference-difference-estimation-still-so-popular-experimental-analysis
David McKenzie pops out from under many empirical questions that come up in my research projects – something that still surprises me every time it happens, despite <a href="https://sites.google.com/site/decrgdmckenzie/publications-by-topic" rel="nofollow">his prolific production</a>. The last time it happened was a teachable moment for me, so I thought I’d share it in a short post that fits nicely under our “Tools of the Trade” tag.<br />
<br />
“<a href="http://siteresources.worldbank.org/DEC/Resources/Beyond_Baseline_and_FollowUpJDE_final.pdf" rel="nofollow">Beyond Baseline and Follow-up: The Case for More T in Experiments</a>,” is a paper David blogged about <a href="http://blogs.worldbank.org/impactevaluations/node/733" rel="nofollow">here</a> more than three years ago. One of the implications of the analysis in that paper is as follows: “When autocorrelations are low, there are large improvements in power to be had from using ANCOVA instead of difference-in-differences in analysis.” Simply put, ANCOVA implies controlling for the baseline (lagged) value of the outcome variable in the regression rather than differencing it out in the more common difference-in-difference (DD) specification.<br />
<br />
Despite the fact that this is a highly cited paper (108 times since 2012 according to <a href="https://scholar.google.com/citations?view_op=view_citation&hl=en&user=EUhiltEAAAAJ&cstart=20&pagesize=80&citation_for_view=EUhiltEAAAAJ:j3f4tGmQtD8C" rel="nofollow">Google Scholar</a>), my impression is that using ANCOVA instead of DD has not yet become standard practice in the typical scenario of an experiment with one baseline and one follow-up (or multiple follow-ups, each of which is analyzed separately to assess the trajectory of impacts). As the implications for power can REALLY matter when the autocorrelation of the outcome variable is low, I thought I’d give an example here from my own work to perhaps convert a few more applied researchers.<br />
<br />
In a cluster-randomized experiment to improve the quality of caregiving at childcare centers in Malawi, we assigned 200 centers to four treatments and sampled 12 three- and four-year-old children from each center. While the final outcomes are developmental assessments at the child level, a plausible pathway towards such improvements is a transformation of the classrooms: how caregivers interact with the children, what activities are being conducted, what play and learning materials are available, etc. To measure these intermediate outcomes, we had two trained enumerators sit in each center for 1-2 hours and record a checklist of 30+ items. We collected these data at baseline before random assignment of schools into different treatment groups, then at first follow-up and second follow-up. The default plan was to conduct a DD analysis for both the final outcomes at the child level and the intermediate outcomes at the center level.<br />
<br />
However, it turns out that while our child-level outcomes are highly autocorrelated – a common finding in studies with test scores – the index of classroom observations is not: the autocorrelation coefficient is less than 0.2. This means that a slight baseline imbalance between two treatment arms is not very predictive of the difference between those arms at follow-up. David’s paper shows that it is inefficient to fully correct for such baseline imbalances: the ratio of DD variance to ANCOVA variance is 2/(1+ρ), where ρ is the autocorrelation coefficient, meaning that in my case, with ρ=0.19, the DD variance is about 68% larger than the ANCOVA variance.<br />
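The variance comparison above is easy to plug in directly. A minimal sketch (the function name is mine, not from the paper):

```python
# Variance ratio of the DD estimator to the ANCOVA estimator, from
# McKenzie's "Beyond Baseline and Follow-up":
#   Var(DD) / Var(ANCOVA) = 2 / (1 + rho),
# where rho is the autocorrelation of the outcome between rounds.

def dd_to_ancova_variance_ratio(rho: float) -> float:
    """Return Var(DD) / Var(ANCOVA) for autocorrelation rho."""
    return 2.0 / (1.0 + rho)

# The classroom-observation index in this post has rho of roughly 0.19:
ratio = dd_to_ancova_variance_ratio(0.19)
print(f"Var(DD)/Var(ANCOVA) = {ratio:.3f}")  # ~1.68: DD variance ~68% larger
# Standard errors scale with the square root of the variance:
print(f"SE inflation from DD = {ratio ** 0.5:.3f}")  # ~1.30
```

Note that at ρ=1 the ratio is exactly 1 (DD and ANCOVA are equally efficient), and the DD penalty grows as ρ falls towards zero.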
<br />
So, what ends up happening when I estimate the effect of combined treatment vs. the control group on the index of classroom observations is that with DD I get a large standardized effect of 0.45 standard deviations that is not statistically significant at the 90% level of confidence (t-stat=1.46). Using ANCOVA, I get a similarly large and educationally meaningful effect of 0.58 SD that is statistically significant at the 99% level of confidence (t-stat=2.6). The effect sizes from the two specifications are not identical because of the small and insignificant imbalance of 0.15 SD at baseline (we blocked the school-level randomization on averages of child-level outcomes like test scores and anthropometrics rather than on this variable, hence the random variation at baseline). Note that the t-stat would still have gone down from 2.6 to 1.9 by moving from ANCOVA to DD even if the effect size had remained identical at 0.58; alternatively, the smaller effect size of 0.45 would still have a t-stat of 2 with the ANCOVA standard errors.<br />
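The precision gap between the two specifications is easy to reproduce by simulation. This is an illustrative sketch, not our actual analysis code: outcomes are simulated unit-level draws rather than cluster means, and, to keep the code dependency-free, the ANCOVA adjustment uses the known autocorrelation ρ rather than an estimated regression slope.

```python
# Monte Carlo comparison of DD vs ANCOVA precision under low autocorrelation.
import random
import statistics

def one_experiment(n=100, rho=0.19, tau=0.0, rng=random):
    """Return (DD estimate, ANCOVA estimate) for one simulated trial."""
    dd_arm, an_arm = [], []
    for treated in (0, 1):
        diffs, adj = [], []
        for _ in range(n):
            y0 = rng.gauss(0, 1)  # baseline outcome
            # Follow-up: correlation rho with baseline, plus treatment effect.
            y1 = rho * y0 + (1 - rho**2) ** 0.5 * rng.gauss(0, 1) + tau * treated
            diffs.append(y1 - y0)      # DD differences out the baseline
            adj.append(y1 - rho * y0)  # ANCOVA controls for the baseline
        dd_arm.append(statistics.mean(diffs))
        an_arm.append(statistics.mean(adj))
    return dd_arm[1] - dd_arm[0], an_arm[1] - an_arm[0]

random.seed(12345)
dd_ests, an_ests = zip(*(one_experiment() for _ in range(2000)))
ratio = statistics.variance(dd_ests) / statistics.variance(an_ests)
print(f"empirical Var(DD)/Var(ANCOVA): {ratio:.2f}")  # near 2/(1+0.19), ~1.68
```

Across the 2,000 replications the empirical variance ratio lines up with the theoretical 2/(1+ρ), which is the same inflation factor that drags the t-stat from 2.6 down towards 1.9 in the example above.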
<br />
So, here is a case where what you tell your counterpart at the ministry of education depends on how you choose to analyze your data: is it a large effect that we don’t have the power to detect, or a similarly large effect that is very significant by conventional standards in economics and education? My interpretation of David’s paper is that it’s foolish to leave statistical power unused just because DD has been the default specification for many of us analyzing such data in experiments. My solution so far, in sharing the findings with colleagues and at a presentation to our counterparts in Malawi, has been to present the ANCOVA results as the preferred estimates while also mentioning the loss of precision when DD is employed – the latter providing the most conservative estimate of the classroom effects.<br />
<br />
The paper has much more practical advice and is a must-read for those designing studies or analyzing data in two-round RCTs with economic outcomes that are not highly autocorrelated. In particular, the paper discusses how to trade off more rounds of data against a larger cross-sectional sample size, and how to divvy up a fixed number of rounds between pre- and post-treatment. For example, in our case, with autocorrelation so low, even three pre-treatment rounds of data collection would not give us more power than a simple comparison of post-treatment outcomes, but more post-treatment rounds (perhaps centered around the one-year and the two-year follow-ups) would have led to more power by averaging out the random noise. Given that the two follow-ups are not so far apart from each other (one year), we might also present an average post-treatment effect rather than the more standard reporting of effects separately at first and second follow-up. The paper also makes a point that turned out to be inherent to our study: we’re interested in multiple outcomes, some of which are highly autocorrelated while others are not. We knew the former at the outset but only found out the latter after the first follow-up. In such cases, even if you knew everything before designing your study, your choices would involve some difficult trade-offs.<br />
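A back-of-the-envelope way to see the gain from extra post-treatment rounds: if the rounds are equicorrelated with correlation ρ, the variance of their average shrinks with the number of rounds r. This is a rough illustration of the averaging logic, not a full power calculation from the paper:

```python
# Variance of the average of r equicorrelated measurement rounds:
#   Var(mean of r rounds) = sigma^2 * (1 + (r - 1) * rho) / r
# Low rho means more independent noise per round, so averaging helps more.

def variance_of_round_average(r: int, rho: float, sigma2: float = 1.0) -> float:
    """Variance of the mean of r rounds with pairwise correlation rho."""
    return sigma2 * (1 + (r - 1) * rho) / r

for r in (1, 2, 3):
    v = variance_of_round_average(r, rho=0.19)
    print(f"r={r} post rounds: variance factor {v:.3f}")
```

With ρ around 0.19, averaging two follow-ups cuts the outcome variance to roughly 0.6 of a single round; a highly autocorrelated outcome, by contrast, would gain much less from extra rounds.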
<br />
Next time you’re analyzing data from an RCT with T=2 using an outcome with low autocorrelation, remember that power gains from using ANCOVA are not just hypothetical: they can be quite large.<br />
Mon, 23 Feb 2015 09:23:00 -0500 | Berk Ozler
A Curated List of Our Postings on Technical Topics – Your One-Stop Shop for Methodology
http://blogs.worldbank.org/impactevaluations/curated-list-our-postings-technical-topics-your-one-stop-shop-methodology
Rather than the usual list of Friday links, this week I thought I’d follow up on <a href="http://blogs.worldbank.org/impactevaluations/introducing-ask-guido" rel="nofollow">our post by Guido Imbens</a> yesterday on clustering and the post earlier this week by <a href="http://blogs.worldbank.org/impactevaluations/hawthorne-effect-what-do-we-really-learn-watching-teachers-and-others" rel="nofollow">Dave Evans on Hawthorne effects</a> with a curated list of our technical postings, to serve as a one-stop shop for your technical reading. I’ve focused here on our posts on methodological issues in impact evaluation – we also have a whole lot of posts on how to conduct surveys and measure certain concepts that I’ll leave for another time.<br />
<strong>Random Assignment, Registration and Reporting</strong><br />
<a href="http://blogs.worldbank.org/impactevaluations/tools-of-the-trade-doing-stratified-randomization-with-uneven-numbers-in-some-strata" rel="nofollow">Doing stratified randomization with uneven numbers in the Strata</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/how-randomize-using-many-baseline-variables-guest-post-thomas-barrios" rel="nofollow">How to randomize using many baseline variables</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/public-randomization-ceremonies" rel="nofollow">Public randomization ceremonies</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/designing-experiments-to-measure-spillover-effects" rel="nofollow">Designing experiments to measure spillover effects</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/what-are-mechanism-experiments-and-should-we-be-doing-more-of-them" rel="nofollow">Mechanism experiments</a> and <a href="http://blogs.worldbank.org/impactevaluations/inside-the-black-box-why-do-things-work-0" rel="nofollow">opening up the black box</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/sampling-weights-matter-for-rct-design" rel="nofollow">Sample weights and RCT design</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/a-pre-analysis-plan-checklist" rel="nofollow">A pre-analysis plan check-list</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/trying-out-new-trial-registries" rel="nofollow">The New Trial Registries</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/what-isn-t-reported-impact-evaluations" rel="nofollow">What isn’t reported in impact evaluations but maybe should be</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/tools-trade-joint-test-orthogonality-when-testing-balance">Randomization checks: testing for joint orthogonality</a><br />
<strong>Propensity Score Matching</strong><br />
Guido Imbens on <a href="https://blogs.worldbank.org/impactevaluations/introducing-ask-guido" rel="nofollow">clustering standard errors with matching</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/tools-trade-recent-tests-matching-estimators-through-evaluation-job-training-programs" rel="nofollow">Testing different matching estimators as applied to job training programs</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/tools-of-the-trade-the-covariate-balanced-propensity-score" rel="nofollow">The covariate balanced propensity score</a><br />
<strong>Difference-in-Differences</strong><br />
<a href="http://blogs.worldbank.org/impactevaluations/often-unspoken-assumptions-behind-difference-difference-estimator-practice" rel="nofollow">The often unspoken assumptions behind diff-in-diff</a><br />
<strong>Other Evaluation Methods</strong><br />
The <a href="https://blogs.worldbank.org/impactevaluations/evaluating-regulatory-reforms-using-the-synthetic-control-method" rel="nofollow">synthetic control method</a>, as applied to regulatory reforms<br />
<a href="http://blogs.worldbank.org/impactevaluations/guest-post-by-alan-de-brauw-regression-discontinuity-impacts-with-an-implicit-index-evaluating-el-sa" rel="nofollow">Regression discontinuity with an implicit index</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/using-spatial-variation-program-performance-identify-causal-impact-0" rel="nofollow">Using spatial variation</a> in program performance to identify impacts<br />
<a href="http://blogs.worldbank.org/impactevaluations/guest-post-by-howard-white-can-we-do-small-n-impact-evaluations" rel="nofollow">Small n impact evaluation methods</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/can-we-trust-shoestring-evaluations" rel="nofollow">Can we trust shoestring evaluations?</a><br />
<strong>Analysis</strong><br />
Regression adjustment in randomized experiments (<a href="http://blogs.worldbank.org/impactevaluations/regression-adjustment-in-randomized-experiments-is-the-cure-really-worse-than-the-disease" rel="nofollow">part one</a>, <a href="http://blogs.worldbank.org/impactevaluations/guest-post-by-winston-lin-regression-adjustment-in-randomized-experiments-is-the-cure-really-worse-0" rel="nofollow">part two</a>)<br />
<a href="http://blogs.worldbank.org/impactevaluations/tools-of-the-trade-when-to-use-those-sample-weights" rel="nofollow">When to use survey weights</a> in analysis<br />
<a href="http://blogs.worldbank.org/impactevaluations/tools-of-the-trade-a-quick-adjustment-for-multiple-hypothesis-testing" rel="nofollow">Adjustments for multiple hypothesis testing</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/help-for-attrition-is-just-a-phone-call-away-a-new-bounding-approach-to-help-deal-with-non-response" rel="nofollow">Bounding approaches to deal with attrition</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/whether-to-probit-or-to-probe-it-in-defense-of-the-linear-probability-model" rel="nofollow">Linear probability models versus probits</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/tools-of-the-trade-dealing-with-multiple-lotteries" rel="nofollow">Dealing with multiple lotteries</a><br />
Estimating standard errors with small clusters (<a href="http://blogs.worldbank.org/impactevaluations/annals-of-good-ie-practice-getting-those-standard-errors-correct-in-small-sample-clustered-studies" rel="nofollow">part one</a>, <a href="http://blogs.worldbank.org/impactevaluations/tools-of-the-trade-estimating-correct-standard-errors-in-small-sample-cluster-studies-another-take" rel="nofollow">part two</a>)<br />
<a href="http://blogs.worldbank.org/impactevaluations/tools-of-the-trade-beyond-mean-decompositions-with-an-application-to-the-gender-wage-gap-in-china" rel="nofollow">Decomposition methods</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/you-think-randomized-controlled-trials-are-great-actually-they-are-even-better-than-that-guest-post" rel="nofollow">Estimation of treatment effects with incomplete compliance</a><br />
<strong>Power Calculations and Improving Power</strong><br />
<a href="http://blogs.worldbank.org/impactevaluations/does-the-intra-class-correlation-matter-for-power-calculations-if-i-am-going-to-cluster-my-standard" rel="nofollow">Does the intra-cluster correlation matter for power calculations if I am going to cluster my standard errors?</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/power-calculations-for-propensity-score-matching" rel="nofollow">Power calculations for propensity score matching</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/power-calculations-101-dealing-with-incomplete-take-up" rel="nofollow">Power calculations 101: dealing with incomplete take-up</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/collecting-more-rounds-of-data-to-boost-power-the-new-stuff" rel="nofollow">Collecting more rounds of data to boost power</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/on-improving-power-in-small-sample-studies" rel="nofollow">Improving power in small samples</a><br />
<strong>On External Validity</strong><br />
<a href="http://blogs.worldbank.org/impactevaluations/weighting-for-external-validity-then-waiting-for-election-results" rel="nofollow">Weighting for external validity</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/will-successful-intervention-over-there-get-results-over-here-we-can-never-answer-full-certainty-few" rel="nofollow">Will that successful intervention over there get results here?</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/learn-live-without-external-validity" rel="nofollow">Learn to live without external validity</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/questioning-external-validity-regression-estimates-why-they-can-be-less-representative-you-think" rel="nofollow">Why the external validity of regression estimates can be less than you think</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/why-similarity-wrong-concept-external-validity" rel="nofollow">Why similarity is the wrong concept for external validity</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/a-rant-on-the-external-validity-double-double-standard" rel="nofollow">A rant on the external validity double standard</a><br />
<strong>Jargony Terms in Impact Evaluations</strong><br />
<a href="http://blogs.worldbank.org/impactevaluations/hawthorne-effect-what-do-we-really-learn-watching-teachers-and-others" rel="nofollow">The Hawthorne Effect</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/are-john-henry-effects-as-apocryphal-as-their-eponym" rel="nofollow">The John Henry Effect</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/is-it-the-program-or-is-it-participation-randomization-and-placebos" rel="nofollow">Placebo effects</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/how-should-we-understand-clinical-equipoise-when-doing-rcts-development" rel="nofollow">Clinical Equipoise</a><br />
<strong>Stata Tricks</strong><br />
<a href="http://blogs.worldbank.org/impactevaluations/tools-trade-graphing-impacts-standard-error-bars" rel="nofollow">Graphing impacts with Standard Error Bars</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/tools-of-the-trade-intra-cluster-correlations" rel="nofollow">Calculating the intra-cluster correlation</a><br />
<a href="https://blogs.worldbank.org/impactevaluations/generating-regression-and-summary-statistics-tables-stata-checklist-and-code" rel="nofollow">Generating regression and summary statistics tables in Stata: A checklist and code</a><br />
Fri, 21 Feb 2014 07:46:00 -0500 | David McKenzie
The often (unspoken) assumptions behind the difference-in-difference estimator in practice
http://blogs.worldbank.org/impactevaluations/often-unspoken-assumptions-behind-difference-difference-estimator-practice
This post is co-written with <a href="http://www.eco.uc3m.es/~ricmora/" rel="nofollow">Ricardo Mora</a> and <a href="http://www.eco.uc3m.es/~ireggio/" rel="nofollow">Iliana Reggio</a><br />
<br />
The difference-in-difference (DID) evaluation method should be very familiar to our readers – a method that infers program impact by comparing the pre- to post-intervention change in the outcome of interest for the treated group relative to a comparison group. The key assumption here is what is known as the “Parallel Paths” assumption, which posits that the average change in the comparison group represents the counterfactual change in the treatment group if there were no treatment. It is a popular method in part because the data requirements are not particularly onerous – it requires data from only two points in time – and the results are robust to any possible confounder as long as the confounder doesn’t violate the Parallel Paths assumption. When data on several pre-treatment periods exist, researchers like to check the Parallel Paths assumption by testing for differences in the pre-treatment trends of the treatment and comparison groups. Equality of pre-treatment trends may lend confidence, but it cannot directly test the identifying assumption, which is by construction untestable. Researchers also tend to explicitly model the “natural dynamics” of the outcome variable by including flexible time dummies for the control group and a parametric time trend differential between the control and the treated in the estimating specification.<br />
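In its simplest form, the pre-trend check described above is just a placebo DD on pre-treatment data: with two pre-treatment rounds, compare the average pre-period *change* in the treatment group to that in the comparison group. A minimal sketch with made-up numbers (the function name and data are hypothetical):

```python
# Placebo pre-trend test: two-sample t-statistic on pre-treatment changes.
import math
import statistics

def pre_trend_t_stat(changes_treat, changes_comp):
    """Unequal-variance t-statistic for equality of mean pre-trends."""
    m1, m0 = statistics.mean(changes_treat), statistics.mean(changes_comp)
    v1 = statistics.variance(changes_treat) / len(changes_treat)
    v0 = statistics.variance(changes_comp) / len(changes_comp)
    return (m1 - m0) / math.sqrt(v1 + v0)

# Each entry is one unit's outcome at pre-period 2 minus pre-period 1.
treat = [0.4, -0.1, 0.3, 0.0, 0.2, -0.2, 0.1, 0.3]
comp = [0.2, 0.1, -0.3, 0.4, 0.0, 0.1, -0.1, 0.2]
print(f"pre-trend t-stat: {pre_trend_t_stat(treat, comp):.2f}")
```

A small t-statistic here lends confidence, but, as the post stresses, passing this placebo test does not validate the Parallel Paths assumption for the post-treatment period itself.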
<br />
Typically, the applied researcher’s practice of DID ends at this point. Yet <a href="http://e-archivo.uc3m.es/handle/10016/16065" rel="nofollow">a very recent working paper</a> by Ricardo Mora and Iliana Reggio (two co-authors of this post) points out that DID-as-commonly-practiced implicitly involves other assumptions instead of Parallel Paths, assumptions perhaps unknown to the researcher, which may influence the estimate of the treatment effect. These assumptions concern the dynamics of the outcome of interest, both before and after the introduction of treatment, and the implications of the particular dynamic specification for the Parallel Paths assumption.<br />
<!--break--> <br />
As stated, researchers often supplement the DID specification with a time trend of some parametric form such as a (perhaps group specific) linear trend. But by including this linear trend, the identifying assumption shifts from the standard Parallel Paths to what can be termed Parallel Growths, since now deviation from a trend line identifies impact (alternatively, we can think of Parallel Growths as a Parallel Path assumption in first differences).<br />
<br />
The switch from Parallel Paths to Parallel Growths highlights a line of reasoning that Ricardo and Iliana formally extend to a general family of Parallel Assumptions valid for higher order differencing, such as a difference of double-differencing (what might be called a Parallel Accelerations assumption), and so on. Arguably, higher order Parallel Assumptions present weaker identifying assumptions than Parallel Paths – we no longer need the trend in the comparison group to proxy for the counterfactual trend of the treatment group but rather the <em>growth</em> (i.e. the change in trend) in the comparison group to proxy for the counterfactual <em>growth</em>. But there is a trade-off in empirical practice, since differencing the data tends to exacerbate any measurement error present in the outcome measures. So the extent to which we can benefit from higher order Parallel Assumptions is determined by our data on a case-by-case basis.<br />
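A tiny deterministic example makes the Paths-vs-Growths distinction concrete. The series below are invented for illustration: the treated group trends up twice as fast as the comparison group before treatment, and a true treatment effect of 5 is added in the post period.

```python
# Parallel Paths vs Parallel Growths on made-up group means.
# Three periods: t=1,2 are pre-treatment, t=3 is post-treatment.
comp = [2, 3, 4]    # comparison group means: pre-trend slope 1
treat = [2, 4, 11]  # treated group means: pre-trend slope 2, plus effect 5 at t=3

# Standard DD (Parallel Paths): compare the last pre-to-post change.
dd = (treat[2] - treat[1]) - (comp[2] - comp[1])
print(f"DD estimate: {dd}")  # 6 -- biased upward by the differential trend

# Difference of double-differences (Parallel Growths): compare the *change
# in trend* across groups, which nets out group-specific linear trends.
ddd = ((treat[2] - treat[1]) - (treat[1] - treat[0])) \
    - ((comp[2] - comp[1]) - (comp[1] - comp[0]))
print(f"Parallel-Growths estimate: {ddd}")  # 5 -- recovers the true effect
```

With noiseless group means the higher order difference recovers the effect exactly; with real data, each extra difference also amplifies measurement error, which is the trade-off noted above.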
<br />
Ricardo and Iliana then develop a general additive regression model with fully flexible dynamics – this has the advantage of being able to test for possible restrictions on the dynamics rather than simply positing a particular parametric form. The model also doesn’t impose equivalence between alternative parallel assumptions. In fact this model can test for such equivalence:<br />
<br />
<img alt="" src="https://blogs.worldbank.org/impactevaluations/files/impactevaluations/Equation_21Nov2013.PNG" style="height:50px; width:250px" /><br />
<br />
The framework above allows for fully flexible pre-treatment trend differentials between the treated and comparison group and also allows for a comparison of any two consecutive parallel assumptions such as Paths vs. Growths. Here <em>Y</em> is the outcome of interest and time runs from<em> t1</em> until <em>T</em> with the intervention beginning at some point between <em>t2</em> and <em>T</em>. The binary indicator variable <em>I </em>designates time-periods while <em>D</em> indicates treated units. In practice, researchers often estimate a more restrictive equation than this one – even when the data permit this more flexible model. Here is <a href="http://ideas.repec.org/a/uwp/jhriss/v40y2005i2p559-590.html" rel="nofollow">one paper that does use this specification</a> to look at the effects of school-desegregation in the U.S.<br />
<br />
Ricardo and Iliana then review all DiD papers published in ten well-known economics journals over the past three years and focus on those that (a) adopt a DiD framework with more than one pre-treatment time period and (b) have made the data publicly available. There are nine papers that meet these criteria. The topics of study in these papers range from the effect of Daylight Saving Time on US residential electricity use to the effects of WWI-related male mortality on marriage market outcomes in France. All nine papers adopt more restrictive estimating equations than the one above. In fact, most of the 13 specifications in the nine papers restrict pre-treatment dynamics to be equivalent between treatment and comparison groups. Most also impose a constant treatment effect in post-treatment periods, thus ignoring the possible dynamics of treatment.<br />
<br />
Eleven of the 13 specifications report significant treatment effects in the original papers. In contrast, by applying the flexible model to the data, Ricardo and Iliana find:<br />
<ul>
<li>
In the 11 cases that estimate significant impacts, once re-estimated with the fully flexible model and with an explicit Parallel Paths assumption, only 5 remain precisely estimated and many of the 11 have substantively different point estimates.</li>
<li>
With the Parallel Growths assumption this number falls to 3 of 11 cases.</li>
<li>
Tests for the constancy of post-treatment effects for 11 of the specifications wind up rejecting the absence of dynamic effects in 6 of the instances. It seems post-treatment dynamic effects often matter and ideally should be modeled in a more flexible manner.</li>
<li>
A test of the equivalence of Parallel Paths and Parallel Growth assumptions rejects equivalence in 5 out of the 13 specifications. In these cases the arguably weaker assumption of Parallel Growth results in significantly different findings than Parallel Paths.</li>
</ul>
<br />
Now it’s true that standard errors are higher in general with the fully-flexible model (especially with the Parallel Growths assumption tested with first-differenced data) and in many cases equality between the treatment effect reported in the published paper and the estimate under the flexible model cannot be rejected. As Ricardo and Iliana conclude, “with the fully flexible model we obtain results that coincide in sign and significance level with the original results in approximately one third of the cases. We interpret this outcome as suggesting that for many empirical applications, the models used are unduly restrictive.”<br />
<br />
Here is a call to think twice about our DiD specifications. Data permitting, the more flexible proposed model above can serve as a benchmark at the start of any DiD analysis to test the robustness of alternative Parallel Assumptions and alternative dynamic specifications. At the very least this exercise may serve to guide more informed parsimonious models.<br />
<br />
p.s. – Ricardo and Iliana are currently writing an ado file that will implement many of these tests of parallel-assumption equivalence and dynamics. We’ll post a link when it is ready for sharing.<br />
Thu, 21 Nov 2013 07:41:00 -0500 | Jed Friedman