World Bank Blogs
http://blogs.worldbank.org/planet.xml
IBRD and IDA: Working for a World Free of Poverty.
A Curated List of Our Postings on Technical Topics – Your One-Stop Shop for Methodology
http://blogs.worldbank.org/impactevaluations/curated-list-our-postings-technical-topics-your-one-stop-shop-methodology
This is a curated list of our technical postings, to serve as a one-stop shop for your technical reading. I’ve focused here on our posts on methodological issues in impact evaluation – we also have a whole lot of posts on how to conduct surveys and measure certain concepts that I’ll leave for another time. <em>Updated August 20, 2015.</em><br />
<strong>Random Assignment</strong><br />
<a href="http://blogs.worldbank.org/impactevaluations/allocating-treatment-and-control-multiple-applications-applicant-and-ranked-choices" rel="nofollow">Allocating treatment and control with multiple applications per applicant and ranked choices</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/optimization-just-re-randomization-redux-thoughts-recent-don-t-randomize-optimize-papers" rel="nofollow">Is optimization just re-randomization redux? Thoughts on the "don't randomize, optimize" papers</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/be-optimista-not-randomista-when-you-have-small-samples" rel="nofollow">Be an optimista, not a randomista, when you have small samples</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/tips-randomization-wild-adding-waitlist" rel="nofollow">Tips for randomization in the wild: adding a waitlist</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/my-email-correspondence-how-randomize-field" rel="nofollow">How to randomize in the field</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/stratified-randomization-and-fifa-world-cup" rel="nofollow">Stratified randomization and the FIFA world cup</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/tools-of-the-trade-doing-stratified-randomization-with-uneven-numbers-in-some-strata" rel="nofollow">Doing stratified randomization with uneven numbers in some strata</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/how-randomize-using-many-baseline-variables-guest-post-thomas-barrios" rel="nofollow">How to randomize using many baseline variables</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/public-randomization-ceremonies" rel="nofollow">Public randomization ceremonies</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/designing-experiments-to-measure-spillover-effects" rel="nofollow">Designing experiments to measure spillover effects</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/what-are-mechanism-experiments-and-should-we-be-doing-more-of-them" rel="nofollow">Mechanism experiments</a> and <a href="http://blogs.worldbank.org/impactevaluations/inside-the-black-box-why-do-things-work-0" rel="nofollow">opening up the black box</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/sampling-weights-matter-for-rct-design" rel="nofollow">Sample weights and RCT design</a><br />
<strong>Pre-analysis plans and reporting</strong><br />
<a href="http://blogs.worldbank.org/impactevaluations/preregistration-studies-avoid-fishing-and-allow-transparent-discovery" rel="nofollow">Pre-registration of studies to avoid fishing and allow transparent discovery</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/tools-trade-joint-test-orthogonality-when-testing-balance" rel="nofollow">A joint test of orthogonality when testing for baseline balance</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/a-pre-analysis-plan-checklist" rel="nofollow">A pre-analysis plan check-list</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/trying-out-new-trial-registries" rel="nofollow">The New Trial Registries</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/what-isn-t-reported-impact-evaluations" rel="nofollow">What isn’t reported in impact evaluations but maybe should be</a><br />
<strong>Propensity Score Matching</strong><br />
Guido Imbens on <a href="https://blogs.worldbank.org/impactevaluations/introducing-ask-guido" rel="nofollow">clustering standard errors with matching</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/tools-trade-recent-tests-matching-estimators-through-evaluation-job-training-programs" rel="nofollow">Testing different matching estimators as applied to job training programs</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/tools-of-the-trade-the-covariate-balanced-propensity-score" rel="nofollow">The covariate balanced propensity score</a><br />
<strong>Difference-in-Differences</strong><br />
<a href="http://blogs.worldbank.org/impactevaluations/often-unspoken-assumptions-behind-difference-difference-estimator-practice" rel="nofollow">The often unspoken assumptions behind diff-in-diff</a><br />
<strong>Regression Discontinuity</strong><br />
<a href="http://blogs.worldbank.org/impactevaluations/curves-all-wrong-places-gelman-and-imbens-why-not-use-higher-order-polynomials-rd" rel="nofollow">Curves in all the wrong places: Gelman and Imbens on why not to use higher-order polynomials in RD</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/guest-post-by-alan-de-brauw-regression-discontinuity-impacts-with-an-implicit-index-evaluating-el-sa" rel="nofollow">Regression discontinuity with an implicit index</a><br />
<strong>Other Evaluation Methods</strong><br />
<a href="http://blogs.worldbank.org/impactevaluations/evaluating-argentine-regional-tourism-policy-using-synthetic-controls-tan-linda-que-enamora" rel="nofollow">Evaluating an Argentine tourism policy using synthetic controls: tan linda que enamora?</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/impact-narrative-guest-post-bruce-wydick" rel="nofollow">Impact as narrative</a><br />
The <a href="https://blogs.worldbank.org/impactevaluations/evaluating-regulatory-reforms-using-the-synthetic-control-method" rel="nofollow">synthetic control method</a>, as applied to regulatory reforms<br />
<a href="http://blogs.worldbank.org/impactevaluations/using-spatial-variation-program-performance-identify-causal-impact-0" rel="nofollow">Using spatial variation</a> in program performance to identify impacts<br />
<a href="http://blogs.worldbank.org/impactevaluations/guest-post-by-howard-white-can-we-do-small-n-impact-evaluations" rel="nofollow">Small n impact evaluation methods</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/can-we-trust-shoestring-evaluations" rel="nofollow">Can we trust shoestring evaluations?</a><br />
<strong>Analysis</strong><br />
<a href="http://blogs.worldbank.org/impactevaluations/another-reason-prefer-ancova-dealing-changes-measurement-between-baseline-and-follow" rel="nofollow">Another reason to prefer Ancova: dealing with measurement changes between baseline and follow-up</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/endogenous-stratification-surprisingly-easy-way-bias-your-heterogeneous-treatment-effect-results-and" rel="nofollow">Endogenous stratification: the surprisingly easy way to bias your heterogeneous treatment effects and what to do instead</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/why-difference-difference-estimation-still-so-popular-experimental-analysis" rel="nofollow">Why is difference-in-difference estimation still so popular in experimental analysis?</a><br />
Regression adjustment in randomized experiments (<a href="http://blogs.worldbank.org/impactevaluations/regression-adjustment-in-randomized-experiments-is-the-cure-really-worse-than-the-disease" rel="nofollow">part one</a>, <a href="http://blogs.worldbank.org/impactevaluations/guest-post-by-winston-lin-regression-adjustment-in-randomized-experiments-is-the-cure-really-worse-0" rel="nofollow">part two</a>)<br />
<a href="http://blogs.worldbank.org/impactevaluations/tools-of-the-trade-when-to-use-those-sample-weights" rel="nofollow">When to use survey weights</a> in analysis<br />
<a href="http://blogs.worldbank.org/impactevaluations/tools-of-the-trade-a-quick-adjustment-for-multiple-hypothesis-testing" rel="nofollow">Adjustments for multiple hypothesis testing</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/help-for-attrition-is-just-a-phone-call-away-a-new-bounding-approach-to-help-deal-with-non-response" rel="nofollow">Bounding approaches to deal with attrition</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/whether-to-probit-or-to-probe-it-in-defense-of-the-linear-probability-model" rel="nofollow">Linear probability models versus probits</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/tools-of-the-trade-dealing-with-multiple-lotteries" rel="nofollow">Dealing with multiple lotteries</a><br />
Estimating standard errors with small clusters (<a href="http://blogs.worldbank.org/impactevaluations/annals-of-good-ie-practice-getting-those-standard-errors-correct-in-small-sample-clustered-studies" rel="nofollow">part one</a>, <a href="http://blogs.worldbank.org/impactevaluations/tools-of-the-trade-estimating-correct-standard-errors-in-small-sample-cluster-studies-another-take" rel="nofollow">part two</a>)<br />
<a href="http://blogs.worldbank.org/impactevaluations/tools-of-the-trade-beyond-mean-decompositions-with-an-application-to-the-gender-wage-gap-in-china" rel="nofollow">Decomposition methods</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/you-think-randomized-controlled-trials-are-great-actually-they-are-even-better-than-that-guest-post" rel="nofollow">Estimation of treatment effects with incomplete compliance</a><br />
<strong>Power Calculations and Improving Power</strong><br />
<a href="http://blogs.worldbank.org/impactevaluations/my-mailbox-should-i-work-only-subsample-my-control-group-if-i-have-big-take-problems" rel="nofollow">Should I work with only a subsample of my control group if I have take-up problems?</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/power-calculations-what-software-should-i-use" rel="nofollow">Power calculations: what software should I use?</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/does-the-intra-class-correlation-matter-for-power-calculations-if-i-am-going-to-cluster-my-standard" rel="nofollow">Does the intra-cluster correlation matter for power calculations if I am going to cluster my standard errors?</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/power-calculations-for-propensity-score-matching" rel="nofollow">Power calculations for propensity score matching</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/power-calculations-101-dealing-with-incomplete-take-up" rel="nofollow">Power calculations 101: dealing with incomplete take-up</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/collecting-more-rounds-of-data-to-boost-power-the-new-stuff" rel="nofollow">Collecting more rounds of data to boost power</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/on-improving-power-in-small-sample-studies" rel="nofollow">Improving power in small samples</a><br />
<strong>On External Validity</strong><br />
<a href="http://blogs.worldbank.org/impactevaluations/getting-beyond-mirage-external-validity" rel="nofollow">Getting beyond the mirage of external validity</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/all-those-external-validity-issues-impacts-they-apply-costs-too" rel="nofollow">All those external validity issues with impacts? They apply to costs too</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/external-validity-seen-other-quantitative-social-sciences-and-gaps-our-practice" rel="nofollow">External validity as seen from other quantitative social sciences and the gaps in our practices</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/towards-more-systematic-approach-external-validity-understanding-site-selection-bias" rel="nofollow">Towards a more systematic approach to external validity: understanding site selection bias</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/weighting-for-external-validity-then-waiting-for-election-results" rel="nofollow">Weighting for external validity</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/will-successful-intervention-over-there-get-results-over-here-we-can-never-answer-full-certainty-few" rel="nofollow">Will that successful intervention over there get results here?</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/learn-live-without-external-validity" rel="nofollow">Learn to live without external validity</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/questioning-external-validity-regression-estimates-why-they-can-be-less-representative-you-think" rel="nofollow">Why the external validity of regression estimates can be less than you think</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/why-similarity-wrong-concept-external-validity" rel="nofollow">Why similarity is the wrong concept for external validity</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/a-rant-on-the-external-validity-double-double-standard" rel="nofollow">A rant on the external validity double standard</a><br />
<strong>Jargony Terms in Impact Evaluations</strong><br />
<a href="http://blogs.worldbank.org/impactevaluations/proposed-taxonomy-behavioral-responses-evaluation" rel="nofollow">A proposed taxonomy of behavioral responses to evaluation</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/quantifying-hawthorne-effect" rel="nofollow">Quantifying the Hawthorne effect</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/hawthorne-effect-what-do-we-really-learn-watching-teachers-and-others" rel="nofollow">The Hawthorne Effect</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/are-john-henry-effects-as-apocryphal-as-their-eponym" rel="nofollow">The John Henry Effect</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/is-it-the-program-or-is-it-participation-randomization-and-placebos" rel="nofollow">Placebo effects</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/how-should-we-understand-clinical-equipoise-when-doing-rcts-development" rel="nofollow">Clinical Equipoise</a><br />
<strong>Stata Tricks</strong><br />
<a href="http://blogs.worldbank.org/impactevaluations/generating-regression-and-summary-statistics-tables-stata-checklist-and-code" rel="nofollow">Generating regression and summary statistics tables in Stata</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/tools-trade-graphing-impacts-standard-error-bars" rel="nofollow">Graphing impacts with Standard Error Bars</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/tools-of-the-trade-intra-cluster-correlations" rel="nofollow">Calculating the intra-cluster correlation</a><br />
<strong>Replication</strong><br />
<a href="http://blogs.worldbank.org/impactevaluations/worm-wars-anthology" rel="nofollow">Worm wars: the anthology</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/worm-wars-review-reanalysis-miguel-and-kremer-s-deworming-study" rel="nofollow">Worm wars: a review of the reanalysis of the Miguel and Kremer deworming study</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/response-brown-and-woods-how-scientific-are-scientific-replications-response" rel="nofollow">Response to Brown and Wood's response</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/how-scientific-are-scientific-replications-response-annette-n-brown-and-benjamin-dk-wood" rel="nofollow">Brown and Wood's response on "how scientific are scientific replications"</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/how-scientific-are-scientific-replications" rel="nofollow">How scientific are scientific replications?</a><br />
<strong>Systematic reviews and meta-analysis</strong><br />
<a href="http://blogs.worldbank.org/impactevaluations/how-systematic-systematic-review-case-improving-learning-outcomes" rel="nofollow">How systematic is that systematic review? The case of learning outcomes</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/how-standard-standard-deviation-cautionary-note-using-sds-compare-across-impact-evaluations" rel="nofollow">How standard is a standard deviation? A cautionary note on using SDs to compare across impact evaluations</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/notes-aeas-present-bias-20-years-should-we-give-sds-effect-size" rel="nofollow">Should we give up on SDs for measuring effect size?</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/what-do-600-papers-20-types-interventions-tell-us-about-how-much-impact-evaluations-generalize-guest" rel="nofollow">What do 600 papers on 20 types of interventions tell us about what types of interventions generalize?</a><br />
Thu, 20 Aug 2015 07:46:00 -0400 – David McKenzie
Another reason to prefer Ancova: dealing with changes in measurement between baseline and follow-up
http://blogs.worldbank.org/impactevaluations/another-reason-prefer-ancova-dealing-changes-measurement-between-baseline-and-follow
A few months ago, Berk <a href="http://blogs.worldbank.org/impactevaluations/why-difference-difference-estimation-still-popular-experimental-analysis" rel="nofollow">blogged</a> about my paper on <a href="https://ideas.repec.org/a/eee/deveco/v99y2012i2p210-221.html" rel="nofollow">the case for more T</a>, and in particular, on the point that Ancova estimation can deliver a lot more power than difference-in-differences when outcomes are not strongly autocorrelated. I continue to get a number of questions about this paper, and some of them have recently led me to emphasize another potential benefit of Ancova which I don’t discuss in the paper – namely, <strong>it can be a useful way of dealing with changes in measurement between baseline and follow-up.</strong><br />
<br />
Let me discuss three different types of changes in measurement, and how Ancova handles each of them better than difference-in-differences.<br />
<br />
<em>1. Changes in who the outcome gets measured for between baseline and follow-up, because the baseline data are missing for some observations. </em>For example, in your baseline survey perhaps some firms didn’t report profits, or perhaps you only had enough funding to test half of the kids, etc.<br />
With Ancova, you can simply dummy out the baseline data for these observations. Create a dummy variable <em>missingbaseline</em>, set <em>Y0</em> to zero where the baseline is missing, and then run Y = a + b*Treat + c*Y0 + d*missingbaseline + any other controls.<br />
<br />
With difference-in-differences you would typically throw away these observations, since you can’t take a difference if you don’t have the baseline.<br />
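As a minimal sketch of this bookkeeping (hypothetical variable names, and Python rather than the Stata most readers would use), the dummying-out step is:

```python
def prepare_baseline(y0_raw):
    """Prepare a baseline outcome with missing values for ANCOVA.

    Missing baselines (None) are set to zero and flagged with a dummy,
    so the regression
        Y1 = a + b*Treat + c*Y0 + d*missingbaseline + other controls
    keeps every observation instead of dropping it, as DD would.
    """
    y0_filled = [0.0 if v is None else v for v in y0_raw]
    missing_dummy = [1 if v is None else 0 for v in y0_raw]
    return y0_filled, missing_dummy
```

For example, `prepare_baseline([2.5, None, 1.0])` returns `([2.5, 0.0, 1.0], [0, 1, 0])`, ready to enter the regression alongside the treatment dummy.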
<br />
<em>2. Changes in the recall period</em> <em>for measurement. </em>This has happened in several projects recently. For example, the baseline data asked about monthly profits, but then in the follow-up because of high non-response for monthly recall (see change 1 above), we went to a weekly recall. In another example, we asked about whether the firm had done several innovative activities in the last year at baseline, but then since the first follow-up was only 6 months after treatment, asked about innovative activities in the last 6 months at follow-up.<br />
With Ancova, this creates no problems. We can run a regression like:<br />
Weekly profits = a +b*Treat + c*Baseline monthly profits + other controls<br />
And here c controls for baseline monthly profits to the extent that they are useful in explaining weekly profits at follow-up. But it is still very clear that we are estimating the treatment effect at follow-up on weekly profits.<br />
In contrast, with difference-in-differences you might try to convert the monthly profits to a weekly figure before taking the difference; you would then be estimating the treatment effect on the difference between weekly profits now and some transform of monthly profits before, which is a harder outcome to explain.<br />
<br />
3. <em>Changes in how an outcome is measured, especially when forming indices. </em>This covers a range of measurement changes. For example, you might change the wording of a question based on feedback from the baseline that some respondents got confused. Or you might have an outcome which is an index of a whole bunch of questions, and change which precise questions are asked at follow-up versus baseline (e.g. you might not use the exact same test questions both times, or the exact same subset of questions intended to measure some personality trait); or you may have respondents play a different game or activity to measure some behavior, to avoid issues with them learning from their baseline attempt.<br />
This issue is dealt with in the Ancova in the same way as the change in the recall period – you control for whatever measure you have at baseline, and the Ancova decides how much weight to give it by how useful it is in predicting the follow-up outcome. In contrast, with DD it becomes unclear what exactly the difference is measuring.<br />
<br />
There is often a concern about changing how you measure an outcome from one survey round to another. This gets beaten into us when we study poverty measurement, where the worry is that changes in levels of poverty from one period to the next might arise from changes in the measurement method rather than actual changes. But in an RCT world, where your interest is in treatment effects, I think using Ancova gives you more latitude to make improvements or changes in your outcome measure between baseline and follow-up as you learn more information. (Note that the same does not apply when you are doing multiple follow-ups to allow estimating a pooled treatment effect or to estimate impact trajectories, which is another part of the same paper – then you do want to keep the outcome measure consistent across follow-up rounds, even if you have modified it from the baseline.)<br />
Mon, 22 Jun 2015 07:50:00 -0400 – David McKenzie
Why is Difference-in-Difference Estimation Still so Popular in Experimental Analysis?
http://blogs.worldbank.org/impactevaluations/why-difference-difference-estimation-still-so-popular-experimental-analysis
David McKenzie pops out from under many empirical questions that come up in my research projects, which still surprises me every time it happens, despite <a href="https://sites.google.com/site/decrgdmckenzie/publications-by-topic" rel="nofollow">his prolific production</a>. The last time it happened was a teachable moment for me, so I thought I’d share it in a short post that fits nicely under our “Tools of the Trade” tag.<br />
<br />
“<a href="http://siteresources.worldbank.org/DEC/Resources/Beyond_Baseline_and_FollowUpJDE_final.pdf" rel="nofollow">Beyond Baseline and Follow-up: The Case for More T in Experiments</a>,” is a paper David blogged about <a href="http://blogs.worldbank.org/impactevaluations/node/733" rel="nofollow">here</a> more than three years ago. One of the implications of the analysis in that paper is as follows: “When autocorrelations are low, there are large improvements in power to be had from using ANCOVA instead of difference-in-differences in analysis.” Simply put, ANCOVA implies controlling for the baseline (lagged) value of the outcome variable in the regression rather than differencing it out in the more common difference-in-difference (DD) specification.<br />
<br />
Despite the fact that this is a highly cited paper (108 times since 2012 according to <a href="https://scholar.google.com/citations?view_op=view_citation&hl=en&user=EUhiltEAAAAJ&cstart=20&pagesize=80&citation_for_view=EUhiltEAAAAJ:j3f4tGmQtD8C" rel="nofollow">Google Scholar</a>), my impression is that using ANCOVA instead of DD has not yet become standard practice in the typical scenario of an experiment with one baseline and one follow-up (or multiple follow-ups, each of which is analyzed separately to assess the trajectory of impacts). As the implications for power can REALLY matter when the autocorrelation of the outcome variable is low, I thought I’d give an example here from my own work to perhaps convert a few more applied researchers.<br />
<br />
In a cluster-randomized experiment to improve the quality of caregiving at childcare centers in Malawi, we assigned 200 centers to four treatments and sampled 12 three- and four-year-old children from each center. While the final outcomes are developmental assessments at the child level, a plausible pathway towards such improvements is a transformation of the classrooms: how caregivers interact with the children, what activities are being conducted, what play and learning materials are available, etc. To measure these intermediate outcomes, we had two trained enumerators sit in each center for 1-2 hours and record a checklist of 30+ items. We collected these data at baseline before random assignment of schools into different treatment groups, then at first and second follow-up. The default plan was to conduct a DD analysis for both the final outcomes at the child level and the intermediate outcomes at the center level.<br />
<br />
However, it turns out that while our child-level outcomes are highly autocorrelated – a common finding in studies with test scores – the index of classroom observations is not: the autocorrelation coefficient is less than 0.2. This means that a slight baseline imbalance between two treatment arms is not really predictive of the difference at follow-up. David’s paper shows that it is inefficient to fully correct for such baseline imbalances: the exact ratio of DD variance to ANCOVA variance is 2/(1+ρ), where ρ is the autocorrelation coefficient, meaning that the variance of the DD estimate is about 68% larger in my case, where ρ=0.19.<br />
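That arithmetic is easy to check (a sketch of the variance-ratio formula from David's paper, not code from the study):

```python
def dd_to_ancova_variance_ratio(rho):
    """Var(DD) / Var(ANCOVA) with one baseline and one follow-up,
    as a function of the outcome's autocorrelation rho."""
    return 2 / (1 + rho)

# rho = 0.19: the DD variance is about 68% larger than ANCOVA's
low = dd_to_ancova_variance_ratio(0.19)   # ~1.68
# rho = 1: a perfectly autocorrelated outcome loses nothing from differencing
high = dd_to_ancova_variance_ratio(1.0)   # 1.0
```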
<br />
So, what ends up happening when I estimate the effect of the combined treatment vs. the control group on the index of classroom observations is that I get a large standardized effect of 0.45 standard deviations that is not statistically significant at the 90% level of confidence (t-stat=1.46). Using ANCOVA, I get a similarly large and educationally meaningful effect of 0.58 SD that is statistically significant at the 99% level of confidence (t-stat=2.6). The effect sizes from the two specifications differ because of a small and insignificant baseline imbalance of 0.15 SD (we blocked the school-level randomization on averages of child-level outcomes like test scores and anthropometrics rather than on this variable, hence the random variation). Note that the t-stat would still have fallen from 2.6 to 1.9 by moving from ANCOVA to DD even if the effect size had remained identical at 0.58; alternatively, the smaller effect size of 0.45 would still have a t-stat of 2 with the ANCOVA standard errors.<br />
<br />
So, here is a case where what you tell your counterpart at the ministry of education depends on how you choose to analyze your data: is it a large effect that we don’t have the power to detect, or a similarly large effect that is very significant by conventional standards in economics and education? My interpretation of David’s paper is that it’s foolish to leave statistical power unused just because DD has been the default specification for many of us analyzing such data in experiments. My solution so far, in sharing the findings with colleagues and at a presentation to our counterparts in Malawi, has been to present the ANCOVA results as the preferred estimates while also noting the loss of precision when DD is employed – the latter providing the most conservative estimate of the classroom effects.<br />
<br />
The paper has much more practical advice and is a must-read for those designing studies or analyzing data in two-round RCTs with economic outcomes that are not highly autocorrelated. In particular, the paper discusses how to trade off the number of rounds of data against the cross-sectional sample size, and how to divvy up a fixed number of rounds between pre- and post-treatment. For example, in our case, with autocorrelation so low, even three pre-treatment rounds of data collection would not give us more power than a simple comparison of post-treatment outcomes, but more post-treatment rounds (perhaps centered around the one-year and two-year follow-ups) would have led to more power by averaging out the random noise. Given that the two follow-ups are not so far apart (one year), we might also present an average post-treatment effect rather than the more standard reporting of effects separately at first and second follow-up. The paper also makes a point that turned out to be inherent to our study: we’re interested in multiple outcomes, some of which are highly autocorrelated while others are not. We knew the former at the outset but only found out the latter after the first follow-up. In such cases, even if you knew everything before designing your study, your choices would involve some difficult trade-offs.<br />
<br />
Next time you’re analyzing data from an RCT with T=2 using an outcome with low autocorrelation, remember that power gains from using ANCOVA are not just hypothetical: they can be quite large.<br />
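A quick Monte Carlo sketch (simulated data with a zero true effect, not the Malawi dataset) shows the precision gap at low autocorrelation:

```python
import random
import statistics

def residualize(x, y):
    """Residuals from a simple OLS regression of y on x (with intercept)."""
    mx, my = statistics.mean(x), statistics.mean(y)
    b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return [(c - my) - b * (a - mx) for a, c in zip(x, y)]

def simulate_estimators(rho, n=200, reps=400, seed=42):
    """Spread of DD and ANCOVA treatment-effect estimates when the
    true effect is zero and the outcome has autocorrelation rho."""
    rng = random.Random(seed)
    dd, ancova = [], []
    for _ in range(reps):
        treat = [1] * (n // 2) + [0] * (n // 2)
        rng.shuffle(treat)
        y0 = [rng.gauss(0, 1) for _ in range(n)]
        y1 = [rho * y + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1) for y in y0]
        # DD: difference in mean changes between arms
        ch = [b - a for a, b in zip(y0, y1)]
        dd.append(statistics.mean([c for c, t in zip(ch, treat) if t])
                  - statistics.mean([c for c, t in zip(ch, treat) if not t]))
        # ANCOVA: coefficient on treat controlling for y0, via
        # Frisch-Waugh (residualize y1 and treat on y0, then regress)
        r1 = residualize(y0, y1)
        rt = residualize(y0, [float(t) for t in treat])
        ancova.append(sum(a * b for a, b in zip(rt, r1)) / sum(a * a for a in rt))
    return statistics.pstdev(dd), statistics.pstdev(ancova)

dd_sd, ancova_sd = simulate_estimators(rho=0.2)
# dd_sd / ancova_sd should be near sqrt(2 / (1 + 0.2)), i.e. about 1.29
```

The simulated standard deviations line up with the 2/(1+ρ) variance ratio: at ρ=0.2 the DD estimate is noticeably noisier than the ANCOVA one for the same data.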
Mon, 23 Feb 2015 09:23:00 -0500 – Berk Ozler
The often (unspoken) assumptions behind the difference-in-difference estimator in practice
http://blogs.worldbank.org/impactevaluations/often-unspoken-assumptions-behind-difference-difference-estimator-practice
This post is co-written with <a href="http://www.eco.uc3m.es/~ricmora/" rel="nofollow">Ricardo Mora</a> and <a href="http://www.eco.uc3m.es/~ireggio/" rel="nofollow">Iliana Reggio</a>.<br />
<br />
The difference-in-difference (DID) evaluation method should be very familiar to our readers – a method that infers program impact by comparing the pre- to post-intervention change in the outcome of interest for the treated group relative to a comparison group. The key assumption here is what is known as the “Parallel Paths” assumption, which posits that the average change in the comparison group represents the counterfactual change in the treatment group if there were no treatment. It is a popular method in part because the data requirements are not particularly onerous – it requires data from only two points in time – and the results are robust to any possible confounder as long as it doesn’t violate the Parallel Paths assumption. When data on several pre-treatment periods exist, researchers like to check the Parallel Paths assumption by testing for differences in the pre-treatment trends of the treatment and comparison groups. Equality of pre-treatment trends may lend confidence, but it cannot directly test the identifying assumption, which is by construction untestable. Researchers also tend to explicitly model the “natural dynamics” of the outcome variable by including flexible time dummies for the control group and a parametric time trend differential between the control and the treated in the estimating specification.<br />
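For readers who want the mechanics, the two-period estimator is just a difference of mean changes (a minimal sketch with made-up numbers):

```python
from statistics import mean

def diff_in_diff(pre_treat, post_treat, pre_comp, post_comp):
    """Two-period DID: the change in the treated group minus the change
    in the comparison group, which under Parallel Paths stands in for
    the treated group's counterfactual change."""
    return ((mean(post_treat) - mean(pre_treat))
            - (mean(post_comp) - mean(pre_comp)))

# Treated group rises 4 -> 9, comparison rises 3 -> 5: impact = 5 - 2 = 3
effect = diff_in_diff([4, 4], [9, 9], [3, 3], [5, 5])
```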
<br />
Typically, the applied researcher’s practice of DID ends at this point. Yet <a href="http://e-archivo.uc3m.es/handle/10016/16065" rel="nofollow">a very recent working paper</a> by Ricardo Mora and Iliana Reggio (two co-authors of this post) points out that DID-as-commonly-practiced implicitly involves other assumptions instead of Parallel Paths, assumptions perhaps unknown to the researcher, which may influence the estimate of the treatment effect. These assumptions concern the dynamics of the outcome of interest, both before and after the introduction of treatment, and the implications of the particular dynamic specification for the Parallel Paths assumption.<br />
<!--break--> <br />
As stated, researchers often supplement the DID specification with a time trend of some parametric form, such as a (perhaps group-specific) linear trend. But by including this linear trend, the identifying assumption shifts from the standard Parallel Paths to what can be termed Parallel Growths, since impact is now identified by deviation from the trend line (alternatively, we can think of Parallel Growths as the Parallel Paths assumption applied to first differences).<br />
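The first-differences reading of Parallel Growths is easy to see in code: difference each unit's series once, then apply the ordinary DID comparison to the differenced data. A small illustrative sketch with synthetic numbers (not from any of the papers discussed):

```python
import numpy as np

# Three periods per unit: t = 0, 1 are pre-treatment; t = 2 is post.
Y = np.array([
    [10.0, 12.0, 17.0],   # treated: growth +2 pre, +5 post (own trend plus a 3-unit effect)
    [10.0, 11.0, 12.0],   # comparison: steady +1 growth throughout
])
treated = np.array([True, False])

dY = np.diff(Y, axis=1)            # first differences ("growths"), shape (units, T-1)
pre_growth = dY[:, 0]              # growth before treatment
post_growth = dY[:, 1]             # growth spanning the treatment date
# Parallel-Growths DID: change in growth, treated minus comparison
effect = (post_growth - pre_growth)[treated].mean() - \
         (post_growth - pre_growth)[~treated].mean()
print(effect)  # 3.0
```

The comparison group's growth does not change (+1 to +1), so under Parallel Growths the treated unit's counterfactual growth stays at +2, and the extra 3 units are the estimated impact.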
<br />
The switch from Parallel Paths to Parallel Growths highlights a line of reasoning that Ricardo and Iliana formally extend to a general family of Parallel Assumptions valid for higher-order differencing, such as a difference of double-differencing (what might be called a Parallel Accelerations assumption) and so on. Arguably, higher-order Parallel Assumptions are weaker identifying assumptions than Parallel Paths – we no longer need the trend in the comparison group to proxy for the counterfactual trend of the treatment group, but rather the <em>growth</em> (i.e. the change in trend) in the comparison group to proxy for the counterfactual <em>growth</em>. But there is a trade-off in empirical practice, since differencing the data tends to exacerbate any measurement error present in the outcome measures. So the extent to which we can benefit from higher-order Parallel Assumptions is determined by our data on a case-by-case basis.<br />
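The same mechanics extend to higher orders: difference the data twice and compare the change in "accelerations" across groups. Another toy sketch under a Parallel Accelerations assumption, again with invented numbers:

```python
import numpy as np

# Four periods per unit; treatment occurs between t = 2 and t = 3.
Y = np.array([
    [10.0, 12.0, 16.0, 25.0],   # treated: accelerating pre-trend, plus a treatment effect
    [10.0, 11.0, 13.0, 16.0],   # comparison: constant acceleration, no treatment
])
treated = np.array([True, False])

d2 = np.diff(Y, n=2, axis=1)    # second differences ("accelerations")
# Parallel-Accelerations DID: change in acceleration around the treatment date
effect = (d2[:, 1] - d2[:, 0])[treated].mean() - \
         (d2[:, 1] - d2[:, 0])[~treated].mean()
print(effect)  # 3.0
```

The comparison group's acceleration is unchanged (1 then 1), so the treated unit's counterfactual acceleration stays at its pre-treatment value of 2; the actual jump to 5 implies a 3-unit effect. Note how each extra round of `np.diff` would compound any noise in `Y` – the measurement-error trade-off described above.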
<br />
Ricardo and Iliana then develop a general additive regression model with fully flexible dynamics – this has the advantage of allowing tests of possible restrictions on the dynamics rather than simply positing a particular parametric form. The model also doesn’t impose equivalence between alternative Parallel Assumptions; in fact, it can test for such equivalence:<br />
<br />
<img alt="" src="https://blogs.worldbank.org/impactevaluations/files/impactevaluations/Equation_21Nov2013.PNG" style="height:50px; width:250px" /><br />
<br />
The framework above allows for fully flexible pre-treatment trend differentials between the treated and comparison groups and also allows for a comparison of any two consecutive Parallel Assumptions, such as Paths vs. Growths. Here <em>Y</em> is the outcome of interest and time runs from <em>t1</em> until <em>T</em>, with the intervention beginning at some point between <em>t2</em> and <em>T</em>. The binary indicator variable <em>I</em> designates time periods while <em>D</em> indicates treated units. In practice, researchers often estimate a more restrictive equation than this one – even when the data permit the more flexible model. Here is <a href="http://ideas.repec.org/a/uwp/jhriss/v40y2005i2p559-590.html" rel="nofollow">one paper that does use this specification</a> to look at the effects of school desegregation in the U.S.<br />
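Data permitting, the fully flexible specification amounts to interacting a complete set of time dummies with the treatment-group indicator: pre-treatment interactions absorb trend differentials, and post-treatment interactions trace out the dynamic effect. Here is one way such a regression might be set up in Python on simulated data (the variable names and simulated design are my own, not the authors'):

```python
import numpy as np

# Simulated long-format panel: 40 units over 4 periods; periods 0-1 are
# pre-treatment, the first half of units receive a 2-unit effect from period 2.
rng = np.random.default_rng(0)
n_units, T, treat_start = 40, 4, 2
unit = np.repeat(np.arange(n_units), T)
period = np.tile(np.arange(T), n_units)
D = (unit < n_units // 2).astype(float)
true_effect = np.where((D == 1.0) & (period >= treat_start), 2.0, 0.0)
y = 1.0 * period + 0.5 * D + true_effect + rng.normal(0.0, 0.1, n_units * T)

# Design matrix: intercept, period dummies (base period 0), D, and D x period dummies
cols = [np.ones_like(y)]
for t in range(1, T):
    cols.append((period == t).astype(float))
cols.append(D)
for t in range(1, T):
    cols.append(D * (period == t))
X = np.column_stack(cols)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# The last T-1 coefficients are the D x period interactions: the period-1
# interaction should be near 0 (no pre-trend differential here), and the
# period-2 and period-3 interactions should be near the true effect of 2.
dynamic = beta[-(T - 1):]
print(np.round(dynamic, 2))
```

Nothing here forces a constant effect after treatment or a common pre-trend; those restrictions could instead be tested against this benchmark.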
<br />
Ricardo and Iliana then review all DiD papers published in ten well-known economics journals over the past three years and focus on those that (a) adopt a DiD framework with more than one pre-treatment time period and (b) have made their data publicly available. Nine papers meet these criteria. The topics of study range from the effect of Daylight Saving Time on U.S. residential electricity use to the effects of WWI-related male mortality on marriage-market outcomes in France. All nine papers adopt more restrictive estimating equations than the one above. In fact, most of the 13 specifications in the nine papers restrict pre-treatment dynamics to be equivalent between treatment and comparison groups. Most also impose a constant treatment effect in post-treatment periods, thus ignoring the possible dynamics of treatment.<br />
<br />
Eleven of the 13 specifications report significant treatment effects in the original papers. In contrast, applying the flexible model to the data, Ricardo and Iliana find:<br />
<ul>
<li>
Of the 11 cases that estimate significant impacts, once re-estimated with the fully flexible model under an explicit Parallel Paths assumption, only 5 remain precisely estimated, and many have substantively different point estimates.</li>
<li>
Under the Parallel Growths assumption, this number falls to 3 of the 11 cases.</li>
<li>
Tests for the constancy of post-treatment effects, run for 11 of the specifications, reject the absence of dynamic effects in 6 instances. It seems post-treatment dynamic effects often matter and ideally should be modeled in a more flexible manner.</li>
<li>
A test of the equivalence of the Parallel Paths and Parallel Growths assumptions rejects equivalence in 5 of the 13 specifications. In these cases the arguably weaker Parallel Growths assumption yields significantly different findings than Parallel Paths.</li>
</ul>
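The constancy test in the third bullet can be cast as a standard F-test of a restricted model (a single post-treatment dummy) against a nested model with period-specific effects. A sketch on simulated data – this is a generic nested-model test of my own construction, not the authors' exact procedure:

```python
import numpy as np

def f_test_nested(y, X_restricted, X_full):
    """F statistic for a restricted OLS model against a nested full model."""
    def rss(X):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        return resid @ resid
    q = X_full.shape[1] - X_restricted.shape[1]   # number of restrictions
    dof = len(y) - X_full.shape[1]                # residual degrees of freedom
    return (rss(X_restricted) - rss(X_full)) / q / (rss(X_full) / dof)

# Simulated panel where the treatment effect grows over time (1 at t=2, 2 at t=3),
# so a constant-effect restriction should be rejected.
rng = np.random.default_rng(1)
period = np.tile(np.arange(4), 50)                # 50 units, periods 0..3
D = np.repeat([1.0, 0.0], 100)                    # first 25 units treated
true_effect = D * np.where(period >= 2, period - 1.0, 0.0)
y = period + true_effect + rng.normal(0.0, 0.1, 200)

dums = [(period == t).astype(float) for t in range(1, 4)]
post = (period >= 2).astype(float)
X_restricted = np.column_stack([np.ones(200)] + dums + [D, D * post])
X_full = np.column_stack([X_restricted, D * (period == 3)])
F = f_test_nested(y, X_restricted, X_full)        # a large F rejects a constant effect
print(F)
```

Here the restricted model forces one coefficient on `D * post`, while the full model lets the period-3 effect differ; the very large F statistic correctly flags the dynamic effect that the restricted specification would average away.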
<br />
Now it’s true that standard errors are generally higher with the fully flexible model (especially under the Parallel Growths assumption, tested with first-differenced data), and in many cases equality between the treatment effect reported in the published paper and the estimate under the flexible model cannot be rejected. As Ricardo and Iliana conclude, “with the fully flexible model we obtain results that coincide in sign and significance level with the original results in approximately one third of the cases. We interpret this outcome as suggesting that for many empirical applications, the models used are unduly restrictive.”<br />
<br />
Here, then, is a call to think twice about our DiD specifications. Data permitting, the more flexible model above can serve as a benchmark at the start of any DiD analysis, testing the robustness of results to alternative Parallel Assumptions and alternative dynamic specifications. At the very least, this exercise may serve to guide more informed parsimonious models.<br />
<br />
P.S. – Ricardo and Iliana are currently writing a Stata ado file that will implement many of these tests of parallel-assumption equivalence and dynamics. We’ll post a link when it is ready for sharing.<br />
Thu, 21 Nov 2013 07:41:00 -0500 | Jed Friedman