This is a curated list of our technical postings, to serve as a one-stop shop for your technical reading. I’ve focused here on our posts on methodological issues in impact evaluation – we also have a whole lot of posts on how to conduct surveys and measure certain concepts that I’ll leave for another time. <em>Updated August 20, 2015.</em><br />
<strong>Random Assignment</strong><br />
<a href="http://blogs.worldbank.org/impactevaluations/allocating-treatment-and-control-multiple-applications-applicant-and-ranked-choices" rel="nofollow">Allocating treatment and control with multiple applications per applicant and ranked choices</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/optimization-just-re-randomization-redux-thoughts-recent-don-t-randomize-optimize-papers" rel="nofollow">Is optimization just re-randomization redux? Thoughts on the "don't randomize, optimize" papers</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/be-optimista-not-randomista-when-you-have-small-samples" rel="nofollow">Be an optimista, not a randomista, when you have small samples</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/tips-randomization-wild-adding-waitlist" rel="nofollow">Tips for randomization in the wild: adding a waitlist</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/my-email-correspondence-how-randomize-field" rel="nofollow">How to randomize in the field</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/stratified-randomization-and-fifa-world-cup" rel="nofollow">Stratified randomization and the FIFA world cup</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/tools-of-the-trade-doing-stratified-randomization-with-uneven-numbers-in-some-strata" rel="nofollow">Doing stratified randomization with uneven numbers in the Strata</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/how-randomize-using-many-baseline-variables-guest-post-thomas-barrios" rel="nofollow">How to randomize using many baseline variables</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/public-randomization-ceremonies" rel="nofollow">Public randomization ceremonies</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/designing-experiments-to-measure-spillover-effects" rel="nofollow">Designing experiments to measure spillover effects</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/what-are-mechanism-experiments-and-should-we-be-doing-more-of-them" rel="nofollow">Mechanism experiments</a> and <a href="http://blogs.worldbank.org/impactevaluations/inside-the-black-box-why-do-things-work-0" rel="nofollow">opening up the black box</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/sampling-weights-matter-for-rct-design" rel="nofollow">Sample weights and RCT design</a><br />
<strong>Pre-analysis plans and reporting</strong><br />
<a href="http://blogs.worldbank.org/impactevaluations/preregistration-studies-avoid-fishing-and-allow-transparent-discovery" rel="nofollow">Pre-registration of studies to avoid fishing and allow transparent discovery</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/tools-trade-joint-test-orthogonality-when-testing-balance" rel="nofollow">A joint test of orthogonality when testing for baseline balance</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/a-pre-analysis-plan-checklist" rel="nofollow">A pre-analysis plan check-list</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/trying-out-new-trial-registries" rel="nofollow">The New Trial Registries</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/what-isn-t-reported-impact-evaluations" rel="nofollow">What isn’t reported in impact evaluations but maybe should be</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/tools-trade-joint-test-orthogonality-when-testing-balance" rel="nofollow">Randomization checks: testing for joint orthogonality</a><br />
<strong>Propensity Score Matching</strong><br />
Guido Imbens on <a href="https://blogs.worldbank.org/impactevaluations/introducing-ask-guido" rel="nofollow">clustering standard errors with matching</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/tools-trade-recent-tests-matching-estimators-through-evaluation-job-training-programs" rel="nofollow">Testing different matching estimators as applied to job training programs</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/tools-of-the-trade-the-covariate-balanced-propensity-score" rel="nofollow">The covariate balanced propensity score</a><br />
<strong>Difference-in-Differences</strong><br />
<a href="http://blogs.worldbank.org/impactevaluations/often-unspoken-assumptions-behind-difference-difference-estimator-practice" rel="nofollow">The often unspoken assumptions behind diff-in-diff</a><br />
<strong>Regression Discontinuity</strong><br />
<a href="http://blogs.worldbank.org/impactevaluations/curves-all-wrong-places-gelman-and-imbens-why-not-use-higher-order-polynomials-rd" rel="nofollow">Curves in all the wrong places: Gelman and Imbens on why not to use higher-order polynomials in RD</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/guest-post-by-alan-de-brauw-regression-discontinuity-impacts-with-an-implicit-index-evaluating-el-sa" rel="nofollow">Regression discontinuity with an implicit index</a><br />
<strong>Other Evaluation Methods</strong><br />
<a href="http://blogs.worldbank.org/impactevaluations/evaluating-argentine-regional-tourism-policy-using-synthetic-controls-tan-linda-que-enamora" rel="nofollow">Evaluating an Argentine tourism policy using synthetic controls: tan linda que enamora?</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/impact-narrative-guest-post-bruce-wydick" rel="nofollow">Impact as narrative</a><br />
The <a href="https://blogs.worldbank.org/impactevaluations/evaluating-regulatory-reforms-using-the-synthetic-control-method" rel="nofollow">synthetic control method</a>, as applied to regulatory reforms<br />
<a href="http://blogs.worldbank.org/impactevaluations/using-spatial-variation-program-performance-identify-causal-impact-0" rel="nofollow">Using spatial variation</a> in program performance to identify impacts<br />
<a href="http://blogs.worldbank.org/impactevaluations/guest-post-by-howard-white-can-we-do-small-n-impact-evaluations" rel="nofollow">Small n impact evaluation methods</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/can-we-trust-shoestring-evaluations" rel="nofollow">Can we trust shoestring evaluations?</a><br />
<strong>Analysis</strong><br />
<a href="http://blogs.worldbank.org/impactevaluations/another-reason-prefer-ancova-dealing-changes-measurement-between-baseline-and-follow" rel="nofollow">Another reason to prefer Ancova: dealing with measurement changes between baseline and follow-up</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/endogenous-stratification-surprisingly-easy-way-bias-your-heterogeneous-treatment-effect-results-and" rel="nofollow">Endogenous stratification: the surprisingly easy way to bias your heterogeneous treatment effects and what to do instead</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/why-difference-difference-estimation-still-so-popular-experimental-analysis" rel="nofollow">Why is difference-in-difference estimation still so popular in experimental analysis?</a><br />
Regression adjustment in randomized experiments (<a href="http://blogs.worldbank.org/impactevaluations/regression-adjustment-in-randomized-experiments-is-the-cure-really-worse-than-the-disease" rel="nofollow">part one</a>, <a href="http://blogs.worldbank.org/impactevaluations/guest-post-by-winston-lin-regression-adjustment-in-randomized-experiments-is-the-cure-really-worse-0" rel="nofollow">part two</a>)<br />
<a href="http://blogs.worldbank.org/impactevaluations/tools-of-the-trade-when-to-use-those-sample-weights" rel="nofollow">When to use survey weights</a> in analysis<br />
<a href="http://blogs.worldbank.org/impactevaluations/tools-of-the-trade-a-quick-adjustment-for-multiple-hypothesis-testing" rel="nofollow">Adjustments for multiple hypothesis testing</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/help-for-attrition-is-just-a-phone-call-away-a-new-bounding-approach-to-help-deal-with-non-response" rel="nofollow">Bounding approaches to deal with attrition</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/whether-to-probit-or-to-probe-it-in-defense-of-the-linear-probability-model" rel="nofollow">Linear probability models versus probits</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/tools-of-the-trade-dealing-with-multiple-lotteries" rel="nofollow">Dealing with multiple lotteries</a><br />
Estimating standard errors with small clusters (<a href="http://blogs.worldbank.org/impactevaluations/annals-of-good-ie-practice-getting-those-standard-errors-correct-in-small-sample-clustered-studies" rel="nofollow">part one</a>, <a href="http://blogs.worldbank.org/impactevaluations/tools-of-the-trade-estimating-correct-standard-errors-in-small-sample-cluster-studies-another-take" rel="nofollow">part two</a>)<br />
<a href="http://blogs.worldbank.org/impactevaluations/tools-of-the-trade-beyond-mean-decompositions-with-an-application-to-the-gender-wage-gap-in-china" rel="nofollow">Decomposition methods</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/you-think-randomized-controlled-trials-are-great-actually-they-are-even-better-than-that-guest-post" rel="nofollow">Estimation of treatment effects with incomplete compliance</a><br />
<strong>Power Calculations and Improving Power</strong><br />
<a href="http://blogs.worldbank.org/impactevaluations/my-mailbox-should-i-work-only-subsample-my-control-group-if-i-have-big-take-problems" rel="nofollow">Should I work with only a subsample of my control group if I have take-up problems?</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/power-calculations-what-software-should-i-use" rel="nofollow">Power calculations: what software should I use?</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/does-the-intra-class-correlation-matter-for-power-calculations-if-i-am-going-to-cluster-my-standard" rel="nofollow">Does the intra-cluster correlation matter for power calculations if I am going to cluster my standard errors?</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/power-calculations-for-propensity-score-matching" rel="nofollow">Power calculations for propensity score matching</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/power-calculations-101-dealing-with-incomplete-take-up" rel="nofollow">Power calculations 101: dealing with incomplete take-up</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/collecting-more-rounds-of-data-to-boost-power-the-new-stuff" rel="nofollow">Collecting more rounds of data to boost power</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/on-improving-power-in-small-sample-studies" rel="nofollow">Improving power in small samples</a><br />
<strong>On External Validity</strong><br />
<a href="http://blogs.worldbank.org/impactevaluations/getting-beyond-mirage-external-validity" rel="nofollow">Getting beyond the mirage of external validity</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/all-those-external-validity-issues-impacts-they-apply-costs-too" rel="nofollow">All those external validity issues with impacts? They apply to costs too</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/external-validity-seen-other-quantitative-social-sciences-and-gaps-our-practice" rel="nofollow">External validity as seen from other quantitative social sciences and the gaps in our practices</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/towards-more-systematic-approach-external-validity-understanding-site-selection-bias" rel="nofollow">Towards a more systematic approach to external validity: understanding site selection bias</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/weighting-for-external-validity-then-waiting-for-election-results" rel="nofollow">Weighting for external validity</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/will-successful-intervention-over-there-get-results-over-here-we-can-never-answer-full-certainty-few" rel="nofollow">Will that successful intervention over there get results here?</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/learn-live-without-external-validity" rel="nofollow">Learn to live without external validity</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/questioning-external-validity-regression-estimates-why-they-can-be-less-representative-you-think" rel="nofollow">Why the external validity of regression estimates can be less than you think</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/why-similarity-wrong-concept-external-validity" rel="nofollow">Why similarity is the wrong concept for external validity</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/a-rant-on-the-external-validity-double-double-standard" rel="nofollow">A rant on the external validity double standard</a><br />
<strong>Jargony Terms in Impact Evaluations</strong><br />
<a href="http://blogs.worldbank.org/impactevaluations/proposed-taxonomy-behavioral-responses-evaluation" rel="nofollow">A proposed taxonomy of behavioral responses to evaluation</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/quantifying-hawthorne-effect" rel="nofollow">Quantifying the Hawthorne effect</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/hawthorne-effect-what-do-we-really-learn-watching-teachers-and-others" rel="nofollow">The Hawthorne Effect</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/are-john-henry-effects-as-apocryphal-as-their-eponym" rel="nofollow">The John Henry Effect</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/is-it-the-program-or-is-it-participation-randomization-and-placebos" rel="nofollow">Placebo effects</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/how-should-we-understand-clinical-equipoise-when-doing-rcts-development" rel="nofollow">Clinical Equipoise</a><br />
<strong>Stata Tricks</strong><br />
<a href="http://blogs.worldbank.org/impactevaluations/generating-regression-and-summary-statistics-tables-stata-checklist-and-code" rel="nofollow">Generating regression and summary statistics tables in Stata</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/tools-trade-graphing-impacts-standard-error-bars" rel="nofollow">Graphing impacts with Standard Error Bars</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/tools-of-the-trade-intra-cluster-correlations" rel="nofollow">Calculating the intra-cluster correlation</a><br />
<a href="https://blogs.worldbank.org/impactevaluations/generating-regression-and-summary-statistics-tables-stata-checklist-and-code" rel="nofollow">Generating regression and summary statistics tables in Stata: A checklist and code</a><br />
<strong>Replication</strong><br />
<a href="http://blogs.worldbank.org/impactevaluations/worm-wars-anthology" rel="nofollow">Worm wars: the anthology</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/worm-wars-review-reanalysis-miguel-and-kremer-s-deworming-study" rel="nofollow">Worm wars: a review of the reanalysis of the Miguel and Kremer deworming study</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/response-brown-and-woods-how-scientific-are-scientific-replications-response" rel="nofollow">Response to Brown and Wood's response</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/how-scientific-are-scientific-replications-response-annette-n-brown-and-benjamin-dk-wood" rel="nofollow">Brown and Wood's response on "How scientific are scientific replications?"</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/how-scientific-are-scientific-replications" rel="nofollow">How scientific are scientific replications?</a><br />
<strong>Systematic reviews and meta-analysis</strong><br />
<a href="http://blogs.worldbank.org/impactevaluations/how-systematic-systematic-review-case-improving-learning-outcomes" rel="nofollow">How systematic is that systematic review? The case of learning outcomes</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/how-standard-standard-deviation-cautionary-note-using-sds-compare-across-impact-evaluations" rel="nofollow">How standard is a standard deviation? A cautionary note on using SDs to compare across impact evaluations</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/notes-aeas-present-bias-20-years-should-we-give-sds-effect-size" rel="nofollow">Should we give up on SDs for measuring effect size?</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/what-do-600-papers-20-types-interventions-tell-us-about-how-much-impact-evaluations-generalize-guest" rel="nofollow">What do 600 papers on 20 types of interventions tell us about what types of interventions generalize?</a><br />
Tools of the trade: The covariate balanced propensity score
<p><SPAN style="FONT-SIZE: 10pt">The primary goal of an impact evaluation study is to estimate the causal effect of a program, policy, or intervention. Randomized assignment of treatment enables the researcher to draw causal inference in a relatively assumption-free manner. If randomization is not feasible, there are more assumption-driven methods, termed quasi-experimental, such as regression discontinuity or propensity score matching. For many of our readers this summary is nothing new. But fortunately in our “community of practice” new statistical tools are developed at a rapid rate. Here is another exciting tool, a methodological extension of matching, termed:</SPAN></p>
<p><I><SPAN style="FONT-SIZE: 10pt">The covariate balanced propensity score</SPAN></I></p>
<p><SPAN style="FONT-SIZE: 10pt">Now a matching estimator is considered by many to be the least preferred quasi-experimental IE method because of the strong identifying assumptions that this method requires, especially in settings where participants have a choice to participate. I share this view. However, the machinery that facilitates matching – the propensity score – is elegant and can be useful in a variety of settings. The propensity score makes matching a practical exercise as it reduces the likely insurmountable problem of matching on many dimensions to a straightforward match on only one dimension (first demonstrated in the seminal <A href="http://www.jstor.org/discover/10.2307/2335942?uid=3738344&uid=2&uid=4&sid=21101102166263">1983 Rosenbaum and Rubin paper</A>). Many neat applications of the propensity score have been worked out, including the propensity weighted regression estimate by <A href="http://www.jstor.org/discover/10.2307/1555493?uid=3738344&uid=2&uid=4&sid=21101099490363"><FONT color=#0000ff>Hirano, Imbens, and Ridder</FONT></A> that yields an estimate of the average treatment effect.</SPAN></p>
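<p><SPAN style="FONT-SIZE: 10pt">As a rough illustration of how propensity score weighting can recover an average treatment effect, here is a minimal Python sketch of a normalized inverse-probability-weighted estimate. It is in the spirit of the weighting idea mentioned above, not a reproduction of the Hirano, Imbens, and Ridder estimator. The simulated data, the function name <code>ipw_ate</code>, and the use of the true (rather than an estimated) propensity score are all assumptions made for illustration.</SPAN></p>

```python
import numpy as np

def ipw_ate(y, t, e):
    """Normalized inverse-probability-weighted estimate of the average treatment effect.

    y: outcomes, t: 0/1 treatment indicator, e: propensity scores P(t=1 | x).
    """
    y, t, e = (np.asarray(a, dtype=float) for a in (y, t, e))
    w1 = t / e                # weights for treated units
    w0 = (1 - t) / (1 - e)    # weights for control units
    return np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)

# Simulated data (hypothetical): x raises both take-up and the outcome.
rng = np.random.default_rng(0)
n = 20000
x = rng.normal(size=n)
e = 1.0 / (1.0 + np.exp(-x))            # true propensity score
t = rng.binomial(1, e)
y = 2.0 * t + x + rng.normal(size=n)    # true treatment effect is 2

naive = y[t == 1].mean() - y[t == 0].mean()   # confounded comparison
ate = ipw_ate(y, t, e)
print(f"naive difference: {naive:.2f}, weighted estimate: {ate:.2f}")
```

<p><SPAN style="FONT-SIZE: 10pt">In this simulation the naive difference in means is biased upward because units with high <code>x</code> are both more likely to be treated and have higher outcomes, while the weighted estimate should land close to the true effect of 2.</SPAN></p>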
<p><SPAN style="FONT-SIZE: 10pt">So propensity score estimates are used widely. There is a catch though – the propensity score must be estimated. And there is no theoretical guidance over how best to do this. Practitioners usually estimate a logit or probit to predict treatment assignment and then check the covariate balance given by the resulting propensity score. If the researcher isn’t satisfied with the balance, then she will likely re-estimate with a somewhat different specification. The drawback with this approach, however, is that different specifications of the propensity score can result in very different estimates of the treatment effect (for one example, <A href="http://www.sciencedirect.com/science/article/pii/S030440760400082X">the 2005 paper by Jeffrey Smith and Petra Todd</A> revisits a seminal labor training experiment <A href="http://www.jstor.org/discover/10.2307/1806062?uid=3738344&uid=2&uid=4&sid=21101102166263">first discussed by Robert Lalonde in 1986</A>, and finds the matching estimate of impact highly sensitive to specification).</SPAN></p>
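<p><SPAN style="FONT-SIZE: 10pt">The usual estimate-then-check routine can be sketched as follows. This is a minimal illustration, assuming simulated data, a logit fitted by Newton-Raphson, and a standardized mean difference as the balance diagnostic. The helper names <code>fit_logit</code> and <code>std_mean_diff</code> are hypothetical, and practitioners would typically use a statistics package rather than hand-rolled code.</SPAN></p>

```python
import numpy as np

def fit_logit(X, t, iters=25):
    """Maximum-likelihood logit via Newton-Raphson; X must include a constant column."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (t - p)                        # score of the log-likelihood
        hess = (X * (p * (1 - p))[:, None]).T @ X   # negative Hessian
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def std_mean_diff(x, t, w):
    """Weighted treated-minus-control mean of x, in units of the overall SD of x."""
    m1 = np.average(x[t == 1], weights=w[t == 1])
    m0 = np.average(x[t == 0], weights=w[t == 0])
    return (m1 - m0) / x.std()

# Simulated data (hypothetical): two covariates drive treatment take-up.
rng = np.random.default_rng(1)
n = 20000
x1, x2 = rng.normal(size=n), rng.normal(size=n)
t = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.8 * x1 - 0.5 * x2))))
X = np.column_stack([np.ones(n), x1, x2])

beta_hat = fit_logit(X, t)
e_hat = 1.0 / (1.0 + np.exp(-X @ beta_hat))
w = np.where(t == 1, 1.0 / e_hat, 1.0 / (1.0 - e_hat))   # inverse-probability weights

before = std_mean_diff(x1, t, np.ones(n))   # raw imbalance in x1
after = std_mean_diff(x1, t, w)             # imbalance after weighting
print(f"x1 imbalance before weighting: {before:.3f}, after: {after:.3f}")
```

<p><SPAN style="FONT-SIZE: 10pt">Here the logit is correctly specified by construction, so weighting removes most of the imbalance. With real data the specification is unknown, which is where the back-and-forth search begins.</SPAN></p>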
<p><SPAN style="FONT-SIZE: 10pt">This back and forth specification search underscores the dual purpose of the propensity score: 1. It is meant to predict treatment assignment among the study subjects, i.e. it estimates the likelihood of treatment as a function of observable information. 2. It is meant to balance covariates so that two study subjects with the same propensity score are appreciably similar in observed dimensions. So in our everyday practice, we look for a specification that by design maximizes (1) and by hope satisfies (2).</SPAN></p>
<p><SPAN style="FONT-SIZE: 10pt">A new paper by Kosuke Imai and Marc Ratkovic introduces some useful structure to the propensity score estimation by <A href="http://www.princeton.edu/~ratkovic/CBPS.pdf"><FONT color=#0000ff>formally combining the dual purposes of the propensity score in one estimation framework</FONT></A> and appropriately enough calls this new approach the covariate balancing propensity score (CBPS). With CBPS, a single estimate determines both the treatment assignment mechanism and the covariate balancing weights. (A side note: this is not the only method to automate covariate balancing but these other methods, <A href="http://web.mit.edu/~jhainm/www/Paper/eb.pdf">such as Hainmueller (2012)</A>, do not explicitly link the covariate balancing weights with the propensity score).</SPAN></p>
<p><SPAN style="FONT-SIZE: 10pt">The CBPS estimation details are described in the linked paper but, in brief, the authors stipulate the covariate balancing condition as well as the first order conditions from the propensity score likelihood function. This creates a system of equations that can be estimated jointly by generalized method of moments (GMM) since the number of moment conditions exceeds the number of parameters to be estimated. (It’s important to note that the balancing condition here is generalizable to higher moments of the covariate distribution, not only the first moment. Thus it can accommodate balance in the variances of the covariates as well as the means).</SPAN></p>
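<p><SPAN style="FONT-SIZE: 10pt">To make the balancing condition concrete, here is a minimal Python sketch of a simplified, just-identified variant that uses only the covariate balancing moments for a logistic propensity score and drops the likelihood score conditions, so the number of equations equals the number of parameters and Newton's method can solve them directly. It is not the authors' joint GMM estimator or their software. The function name <code>cbps_just_identified</code> and the simulated data are assumptions for illustration.</SPAN></p>

```python
import numpy as np

def cbps_just_identified(X, t, iters=50, tol=1e-10):
    """Solve (1/n) * sum_i [t_i/pi_i - (1-t_i)/(1-pi_i)] * X_i = 0 by Newton's method,

    where pi_i = 1 / (1 + exp(-X_i @ beta)) and X includes a constant column.
    """
    n, k = X.shape
    beta = np.zeros(k)
    for _ in range(iters):
        pi = 1.0 / (1.0 + np.exp(-X @ beta))
        w = t / pi - (1 - t) / (1 - pi)      # signed balancing weights
        g = X.T @ w / n                       # sample balance moments
        if np.max(np.abs(g)) < tol:
            break
        # d(w)/d(linear index) = -d, so the Jacobian of g is -(X' diag(d) X) / n
        d = t * (1 - pi) / pi + (1 - t) * pi / (1 - pi)
        jac = (X * d[:, None]).T @ X / n
        beta = beta + np.linalg.solve(jac, g)   # Newton step
    return beta

# Simulated data (hypothetical), with a logistic assignment rule.
rng = np.random.default_rng(2)
n = 20000
x1, x2 = rng.normal(size=n), rng.normal(size=n)
t = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.8 * x1 - 0.5 * x2))))
X = np.column_stack([np.ones(n), x1, x2])

beta_cbps = cbps_just_identified(X, t)
pi_hat = 1.0 / (1.0 + np.exp(-X @ beta_cbps))
balance = X.T @ (t / pi_hat - (1 - t) / (1 - pi_hat)) / n
print("coefficients:", beta_cbps)
print("remaining weighted imbalance:", balance)
```

<p><SPAN style="FONT-SIZE: 10pt">Because the balance moments are solved exactly, the weighted means of the constant and both covariates coincide across treated and control units in the sample. The over-identified version described in the paper adds the likelihood score conditions and weighs the two sets of moments against each other.</SPAN></p>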
<p><SPAN style="FONT-SIZE: 10pt">In standard propensity score matching, the empirical fit of the likelihood function is maximized so that it does the best possible job of predicting treatment status, but covariate balance is not explicitly addressed. In essence, the CBPS framework works by trading off some of this accuracy of prediction (the “likelihood”) to ensure a better balance of covariates.</SPAN></p>
<p><SPAN style="FONT-SIZE: 10pt">So how does the CBPS perform in relation to standard matching vis-à-vis estimates of causal impact? Imai and Ratkovic work through two empirical examples where the CBPS does a substantially better job at minimizing bias and the root mean squared error (RMSE) – a summary of bias and variance – than the standard propensity score. For example, in revisiting the Lalonde labor experiment, the CBPS estimate comes much closer to the experimental estimate of impact (an $886 gain in annual earnings from training) than standard matching does. With a 1-to-N matching estimator, the standard propensity score understates the true (experimental) income gain by $805 while CBPS understates it by only $93.</SPAN></p>
<p><SPAN style="FONT-SIZE: 10pt">CBPS can be extended to non-binary treatment outcomes, longitudinal data, and other common cases. And I am sure that further investigation of this method, and the conditions where it is most applicable, awaits. There is one caveat: this method assumes that no propensity score falls at either extreme of zero or one. The authors do not discuss the implications if this assumption is violated (and it speaks to the need to carefully apply this method, perhaps on a truncated data set) but, as practitioners know, propensity scores fall at either extreme at a disturbingly high frequency.</SPAN></p>
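<p><SPAN style="FONT-SIZE: 10pt">One common way to work on a truncated data set is to drop units whose estimated propensity score is too close to zero or one. The sketch below assumes cutoffs of 0.05 and 0.95, which are arbitrary illustrative values rather than a recommendation from the authors. The function name <code>trim_mask</code> is hypothetical.</SPAN></p>

```python
import numpy as np

def trim_mask(e, lower=0.05, upper=0.95):
    """Boolean mask keeping units whose propensity score lies in [lower, upper]."""
    e = np.asarray(e, dtype=float)
    return (e >= lower) & (e <= upper)

# Hypothetical estimated propensity scores for five units.
e_hat = np.array([0.001, 0.20, 0.50, 0.93, 0.999])
keep = trim_mask(e_hat)
print(keep)              # units 1 and 5 fall outside the cutoffs
print(1.0 / e_hat[0])    # the weight a treated unit with a score of 0.001 would receive
```

<p><SPAN style="FONT-SIZE: 10pt">Trimming changes the population the estimate refers to, so the cutoffs and the share of observations dropped are worth reporting alongside the results.</SPAN></p>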
<p><SPAN style="FONT-SIZE: 10pt">CBPS appears to be an interesting and promising new extension of familiar propensity score matching methods. If you review the paper and wish to implement CBPS, the authors have generously made <A href="http://imai.princeton.edu/software/CBPS.html"><FONT color=#0000ff>a stats package for CBPS available in R</FONT></A>.</SPAN></p>
<p><SPAN style="FONT-SIZE: 10pt">Jed Friedman, Wed, 03 Oct 2012 08:53:24 -0400</SPAN></p>