World Bank Blogs
http://blogs.worldbank.org/planet.xml
IBRD and IDA: Working for a World Free of Poverty.
A Curated List of Our Postings on Technical Topics – Your One-Stop Shop for Methodology
http://blogs.worldbank.org/impactevaluations/curated-list-our-postings-technical-topics-your-one-stop-shop-methodology
Rather than the usual list of Friday links, this week I thought I’d follow up on <a href="http://blogs.worldbank.org/impactevaluations/introducing-ask-guido">our post by Guido Imbens</a> yesterday on clustering and the post earlier this week by <a href="http://blogs.worldbank.org/impactevaluations/hawthorne-effect-what-do-we-really-learn-watching-teachers-and-others">Dave Evans on Hawthorne effects</a> with a curated list of our technical postings, to serve as a one-stop shop for your technical reading. I’ve focused here on our posts on methodological issues in impact evaluation – we also have a whole lot of posts on how to conduct surveys and measure certain concepts that I’ll leave for another time.<br />
<strong>Random Assignment, Registration and Reporting</strong><br />
<a href="http://blogs.worldbank.org/impactevaluations/tools-of-the-trade-doing-stratified-randomization-with-uneven-numbers-in-some-strata">Doing stratified randomization with uneven numbers in the Strata</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/how-randomize-using-many-baseline-variables-guest-post-thomas-barrios">How to randomize using many baseline variables</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/public-randomization-ceremonies">Public randomization ceremonies</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/designing-experiments-to-measure-spillover-effects">Designing experiments to measure spillover effects</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/what-are-mechanism-experiments-and-should-we-be-doing-more-of-them">Mechanism experiments</a> and <a href="http://blogs.worldbank.org/impactevaluations/inside-the-black-box-why-do-things-work-0">opening up the black box</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/sampling-weights-matter-for-rct-design">Sample weights and RCT design</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/a-pre-analysis-plan-checklist">A pre-analysis plan check-list</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/trying-out-new-trial-registries">The New Trial Registries</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/what-isn-t-reported-impact-evaluations">What isn’t reported in impact evaluations but maybe should be</a><br />
<strong>Propensity Score Matching</strong><br />
Guido Imbens on <a href="https://blogs.worldbank.org/impactevaluations/introducing-ask-guido">clustering standard errors with matching</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/tools-trade-recent-tests-matching-estimators-through-evaluation-job-training-programs">Testing different matching estimators as applied to job training programs</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/tools-of-the-trade-the-covariate-balanced-propensity-score">The covariate balanced propensity score</a><br />
<strong>Difference-in-Differences</strong><br />
<a href="http://blogs.worldbank.org/impactevaluations/often-unspoken-assumptions-behind-difference-difference-estimator-practice">The often unspoken assumptions behind diff-in-diff</a><br />
<strong>Other Evaluation Methods</strong><br />
The <a href="https://blogs.worldbank.org/impactevaluations/evaluating-regulatory-reforms-using-the-synthetic-control-method">synthetic control method</a>, as applied to regulatory reforms<br />
<a href="http://blogs.worldbank.org/impactevaluations/guest-post-by-alan-de-brauw-regression-discontinuity-impacts-with-an-implicit-index-evaluating-el-sa">Regression discontinuity with an implicit index</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/using-spatial-variation-program-performance-identify-causal-impact-0">Using spatial variation</a> in program performance to identify impacts<br />
<a href="http://blogs.worldbank.org/impactevaluations/guest-post-by-howard-white-can-we-do-small-n-impact-evaluations">Small n impact evaluation methods</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/can-we-trust-shoestring-evaluations">Can we trust shoestring evaluations?</a><br />
<strong>Analysis</strong><br />
Regression adjustment in randomized experiments (<a href="http://blogs.worldbank.org/impactevaluations/regression-adjustment-in-randomized-experiments-is-the-cure-really-worse-than-the-disease">part one</a>, <a href="http://blogs.worldbank.org/impactevaluations/guest-post-by-winston-lin-regression-adjustment-in-randomized-experiments-is-the-cure-really-worse-0">part two</a>)<br />
<a href="http://blogs.worldbank.org/impactevaluations/tools-of-the-trade-when-to-use-those-sample-weights">When to use survey weights</a> in analysis<br />
<a href="http://blogs.worldbank.org/impactevaluations/tools-of-the-trade-a-quick-adjustment-for-multiple-hypothesis-testing">Adjustments for multiple hypothesis testing</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/help-for-attrition-is-just-a-phone-call-away-a-new-bounding-approach-to-help-deal-with-non-response">Bounding approaches to deal with attrition</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/whether-to-probit-or-to-probe-it-in-defense-of-the-linear-probability-model">Linear probability models versus probits</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/tools-of-the-trade-dealing-with-multiple-lotteries">Dealing with multiple lotteries</a><br />
Estimating standard errors with small clusters (<a href="http://blogs.worldbank.org/impactevaluations/annals-of-good-ie-practice-getting-those-standard-errors-correct-in-small-sample-clustered-studies">part one</a>, <a href="http://blogs.worldbank.org/impactevaluations/tools-of-the-trade-estimating-correct-standard-errors-in-small-sample-cluster-studies-another-take">part two</a>)<br />
<a href="http://blogs.worldbank.org/impactevaluations/tools-of-the-trade-beyond-mean-decompositions-with-an-application-to-the-gender-wage-gap-in-china">Decomposition methods</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/you-think-randomized-controlled-trials-are-great-actually-they-are-even-better-than-that-guest-post">Estimation of treatment effects with incomplete compliance</a><br />
<strong>Power Calculations and Improving Power</strong><br />
<a href="http://blogs.worldbank.org/impactevaluations/does-the-intra-class-correlation-matter-for-power-calculations-if-i-am-going-to-cluster-my-standard">Does the intra-cluster correlation matter for power calculations if I am going to cluster my standard errors?</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/power-calculations-for-propensity-score-matching">Power calculations for propensity score matching</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/power-calculations-101-dealing-with-incomplete-take-up">Power calculations 101: dealing with incomplete take-up</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/collecting-more-rounds-of-data-to-boost-power-the-new-stuff">Collecting more rounds of data to boost power</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/on-improving-power-in-small-sample-studies">Improving power in small samples</a><br />
<strong>On External Validity</strong><br />
<a href="http://blogs.worldbank.org/impactevaluations/weighting-for-external-validity-then-waiting-for-election-results">Weighting for external validity</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/will-successful-intervention-over-there-get-results-over-here-we-can-never-answer-full-certainty-few">Will that successful intervention over there get results here?</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/learn-live-without-external-validity">Learn to live without external validity</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/questioning-external-validity-regression-estimates-why-they-can-be-less-representative-you-think">Why the external validity of regression estimates can be less than you think</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/why-similarity-wrong-concept-external-validity">Why similarity is the wrong concept for external validity</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/a-rant-on-the-external-validity-double-double-standard">A rant on the external validity double standard</a><br />
<strong>Jargony Terms in Impact Evaluations</strong><br />
<a href="http://blogs.worldbank.org/impactevaluations/hawthorne-effect-what-do-we-really-learn-watching-teachers-and-others">The Hawthorne Effect</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/are-john-henry-effects-as-apocryphal-as-their-eponym">The John Henry Effect</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/is-it-the-program-or-is-it-participation-randomization-and-placebos">Placebo effects</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/how-should-we-understand-clinical-equipoise-when-doing-rcts-development">Clinical Equipoise</a><br />
<strong>Stata Tricks</strong><br />
<a href="http://blogs.worldbank.org/impactevaluations/tools-trade-graphing-impacts-standard-error-bars">Graphing impacts with Standard Error Bars</a><br />
<a href="http://blogs.worldbank.org/impactevaluations/tools-of-the-trade-intra-cluster-correlations">Calculating the intra-cluster correlation</a><br />
Fri, 21 Feb 2014 07:46:26 -0500
David McKenzie
Tools of the trade: The covariate balanced propensity score
http://blogs.worldbank.org/impactevaluations/tools-of-the-trade-the-covariate-balanced-propensity-score
<p><SPAN style="FONT-SIZE: 10pt">The primary goal of an impact evaluation study is to estimate the causal effect of a program, policy, or intervention. Randomized assignment of treatment enables the researcher to draw causal inference in a relatively assumption-free manner. If randomization is not feasible, there are more assumption-driven methods, termed quasi-experimental, such as regression discontinuity or propensity score matching. For many of our readers this summary is nothing new. But fortunately in our “community of practice” new statistical tools are developed at a rapid rate. Here is another exciting tool, a methodological extension of matching, termed:</SPAN></p>
<p><I><SPAN style="FONT-SIZE: 10pt">The covariate balanced propensity score</SPAN></I></p>
<p><SPAN style="FONT-SIZE: 10pt">Now a matching estimator is considered by many to be the least preferred quasi-experimental IE method because of the strong identifying assumptions that this method requires, especially in settings where participants have a choice to participate. I share this view. However, the machinery that facilitates matching – the propensity score – is elegant and can be useful in a variety of settings. The propensity score makes matching a practical exercise as it reduces the likely insurmountable problem of matching on many dimensions to a straightforward match on only one dimension (first demonstrated in the seminal <A href="http://www.jstor.org/discover/10.2307/2335942?uid=3738344&uid=2&uid=4&sid=21101102166263">1983 Rosenbaum and Rubin paper</A>). Many neat applications of the propensity score have been worked out, including the propensity-weighted regression estimate by <A href="http://www.jstor.org/discover/10.2307/1555493?uid=3738344&uid=2&uid=4&sid=21101099490363"><FONT color=#0000ff>Hirano, Imbens, and Ridder</FONT></A> that yields an estimate of the average treatment effect.</SPAN></p>
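<p><SPAN style="FONT-SIZE: 10pt">To make the weighting idea concrete, here is a minimal sketch in Python of an inverse-propensity-weighted difference in means, which is the number a weighted regression of the outcome on a treatment dummy would return. It assumes the propensity scores have already been estimated and lie strictly between zero and one. The function name and the toy data are illustrative and are not taken from the Hirano, Imbens, and Ridder paper.</SPAN></p>

```python
def ipw_ate(outcomes, treated, pscores):
    """Normalized inverse-propensity-weighted estimate of the average treatment effect.

    outcomes: observed outcome for each unit
    treated:  1 if the unit was treated, 0 otherwise
    pscores:  estimated probability of treatment for each unit (assumed given)
    """
    if any(p <= 0.0 or p >= 1.0 for p in pscores):
        raise ValueError("propensity scores must lie strictly between 0 and 1")

    # Treated units are weighted by 1/p, controls by 1/(1 - p).
    w_treated = [t / p for t, p in zip(treated, pscores)]
    w_control = [(1 - t) / (1 - p) for t, p in zip(treated, pscores)]

    # Weighted mean outcome in each arm, normalizing by the sum of the weights.
    mean_treated = sum(w * y for w, y in zip(w_treated, outcomes)) / sum(w_treated)
    mean_control = sum(w * y for w, y in zip(w_control, outcomes)) / sum(w_control)
    return mean_treated - mean_control


# Toy data (hypothetical): two treated and two control units.
estimate = ipw_ate(outcomes=[3, 5, 2, 4], treated=[1, 1, 0, 0], pscores=[0.5, 0.5, 0.5, 0.5])
print(estimate)
```

<p><SPAN style="FONT-SIZE: 10pt">With a constant propensity score, as in the toy data, the estimate collapses to the simple difference in group means. The weights only start to matter once the scores vary across units.</SPAN></p>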
<p><SPAN style="FONT-SIZE: 10pt">So propensity score estimates are used widely. There is a catch though – the propensity score must be estimated. And there is no theoretical guidance on how best to do this. Practitioners usually estimate a logit or probit to predict treatment assignment and then check the covariate balance given by the resulting propensity score. If the researcher isn’t satisfied with the balance, then she will likely re-estimate with a somewhat different specification. The drawback with this approach, however, is that different specifications of the propensity score can result in very different estimates of the treatment effect (for one example, <A href="http://www.sciencedirect.com/science/article/pii/S030440760400082X">the 2005 paper by Jeffrey Smith and Petra Todd</A> revisits a seminal labor training experiment <A href="http://www.jstor.org/discover/10.2307/1806062?uid=3738344&uid=2&uid=4&sid=21101102166263">first discussed by Robert Lalonde in 1986</A>, and finds the matching estimate of impact highly sensitive to specification).</SPAN></p>
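<p><SPAN style="FONT-SIZE: 10pt">The balance check in this loop is often summarized, one covariate at a time, with a standardized mean difference. The sketch below is one common way to compute it and is an assumption on my part, since the post does not say which diagnostic to use. The optional weights stand in for whatever matching or propensity weights the researcher has built, and the function name is hypothetical.</SPAN></p>

```python
import math
import statistics


def standardized_difference(x, treated, weights=None):
    """Weighted difference in covariate means between treated and control units,
    scaled by the pooled (unweighted) standard deviation of the two groups."""
    if weights is None:
        weights = [1.0] * len(x)  # no weighting: raw imbalance

    x_t = [v for v, t in zip(x, treated) if t == 1]
    w_t = [w for w, t in zip(weights, treated) if t == 1]
    x_c = [v for v, t in zip(x, treated) if t == 0]
    w_c = [w for w, t in zip(weights, treated) if t == 0]

    mean_t = sum(w * v for w, v in zip(w_t, x_t)) / sum(w_t)
    mean_c = sum(w * v for w, v in zip(w_c, x_c)) / sum(w_c)

    # Scale by the unweighted sample variances so the yardstick does not
    # move every time the propensity score is re-specified.
    pooled_sd = math.sqrt((statistics.variance(x_t) + statistics.variance(x_c)) / 2)
    return (mean_t - mean_c) / pooled_sd


# Toy covariate (hypothetical): treated units have lower values than controls.
print(standardized_difference([1, 2, 3, 4], [1, 1, 0, 0]))
```

<p><SPAN style="FONT-SIZE: 10pt">A value near zero after weighting suggests the means of that covariate are balanced. The researcher would recompute it for every covariate after each new specification.</SPAN></p>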
<p><SPAN style="FONT-SIZE: 10pt">This back-and-forth specification search underscores the dual purpose of the propensity score: 1. It is meant to predict treatment assignment among the study subjects, i.e. it estimates the likelihood of treatment as a function of observable information. 2. It is meant to balance covariates so that two study subjects with the same propensity score are appreciably similar in observed dimensions. So in our everyday practice, we look for a specification that by design maximizes (1) and by hope satisfies (2).</SPAN></p>
<p><SPAN style="FONT-SIZE: 10pt">A new paper by Kosuke Imai and Marc Ratkovic introduces some useful structure to the propensity score estimation by <A href="http://www.princeton.edu/~ratkovic/CBPS.pdf"><FONT color=#0000ff>formally combining the dual purposes of the propensity score in one estimation framework</FONT></A> and appropriately enough calls this new approach the covariate balancing propensity score (CBPS). With CBPS, a single estimate determines both the treatment assignment mechanism and the covariate balancing weights. (A side note: this is not the only method to automate covariate balancing, but other methods, <A href="http://web.mit.edu/~jhainm/www/Paper/eb.pdf">such as Hainmueller (2012)</A>, do not explicitly link the covariate balancing weights with the propensity score.)</SPAN></p>
<p><SPAN style="FONT-SIZE: 10pt">The CBPS estimation details are described in the linked paper but, in brief, the authors stipulate the covariate balancing condition as well as the first order conditions from the propensity score likelihood function. This creates a system of equations that can be estimated jointly by generalized method of moments (GMM) since the number of moment conditions exceeds the number of parameters to be estimated. (It’s important to note that the balancing condition here is generalizable to higher moments of the covariate distribution, not only the first moment. Thus it can accommodate balance in the variances of the covariates as well as the means).</SPAN></p>
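<p><SPAN style="FONT-SIZE: 10pt">For readers who want to see the two sets of conditions side by side, here is a sketch for the logit case. The notation is mine and may differ from the paper's: T_i is the treatment indicator, X_i the covariates entering the propensity score, \pi_\beta(X_i) the logit propensity score, \tilde{X}_i the covariate functions to be balanced (means, or higher moments), and W a GMM weighting matrix. See the linked paper for the exact statement.</SPAN></p>

```latex
\begin{align}
  % (1) Score condition: first-order condition of the logit likelihood
  \frac{1}{N}\sum_{i=1}^{N}\bigl(T_i - \pi_\beta(X_i)\bigr)\,X_i &= 0 \\
  % (2) Balance condition: inverse-propensity-weighted covariates agree across arms
  \frac{1}{N}\sum_{i=1}^{N}\left[\frac{T_i}{\pi_\beta(X_i)}
      - \frac{1 - T_i}{1 - \pi_\beta(X_i)}\right]\tilde{X}_i &= 0 \\
  % GMM: stack the sample moments (1) and (2) into \bar{g}(\beta) and minimize
  \hat{\beta} &= \arg\min_{\beta}\; \bar{g}(\beta)^{\top}\, W\, \bar{g}(\beta)
\end{align}
```

<p><SPAN style="FONT-SIZE: 10pt">If \tilde{X}_i is simply X_i with K elements, stacking (1) and (2) gives 2K moment conditions for K parameters, which is the over-identification that calls for GMM.</SPAN></p>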
<p><SPAN style="FONT-SIZE: 10pt">In standard propensity score matching, the empirical fit of the likelihood function is maximized so that it does the best possible job of predicting treatment status, but covariate balance is not explicitly addressed. In essence, the CBPS framework works by trading off some of this accuracy of prediction (the “likelihood”) to ensure a better balance of covariates.</SPAN></p>
<p><SPAN style="FONT-SIZE: 10pt">So how does the CBPS perform in relation to standard matching vis-à-vis estimates of causal impact? Imai and Ratkovic work through two empirical examples where the CBPS does a substantially better job at minimizing bias and the root mean squared error (RMSE) – a summary of bias and variance – than the standard propensity score. For example, in revisiting the Lalonde labor experiment, the CBPS estimate comes much closer to the experimental estimate of impact (an $886 gain in annual earnings from training) than standard matching does. With a 1 to N matching estimator, the standard propensity score understates the true (experimental) income gain by $805 while CBPS understates it by only $93.</SPAN></p>
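<p><SPAN style="FONT-SIZE: 10pt">A quick back-of-the-envelope calculation shows what these shortfalls imply. The point estimates below are derived from the three figures quoted above and are not numbers quoted directly from the paper.</SPAN></p>

```python
# Figures quoted in the post (US dollars of annual earnings).
experimental_gain = 886      # experimental benchmark
shortfall_standard = 805     # understatement with the standard propensity score
shortfall_cbps = 93          # understatement with CBPS

# Implied point estimates: benchmark minus the understatement.
implied_standard = experimental_gain - shortfall_standard   # 81
implied_cbps = experimental_gain - shortfall_cbps           # 793

# Share of the experimental gain that each estimator recovers.
share_standard = implied_standard / experimental_gain       # roughly 9 percent
share_cbps = implied_cbps / experimental_gain               # roughly 90 percent

print(implied_standard, implied_cbps)
print(round(share_standard, 3), round(share_cbps, 3))
```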
<p><SPAN style="FONT-SIZE: 10pt">CBPS can be extended to non-binary treatments, longitudinal data, and other common cases. And I am sure that further investigation of this method, and the conditions where it is most applicable, awaits. There is one caveat: this method assumes that no propensity score falls at either extreme of zero or one. The authors do not discuss the implications if this assumption is violated (and it speaks to the need to carefully apply this method, perhaps on a truncated data set) but, as practitioners know, propensity scores fall at either extreme with disturbingly high frequency.</SPAN></p>
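<p><SPAN style="FONT-SIZE: 10pt">As an illustration of what working on a truncated data set might look like, the sketch below drops units whose estimated propensity score falls outside a chosen interval. The 0.1 and 0.9 cutoffs are an illustrative rule of thumb that I am assuming here. They come neither from this post nor from the CBPS paper, and the function name is hypothetical.</SPAN></p>

```python
def trim_by_propensity(rows, pscores, lower=0.1, upper=0.9):
    """Keep only the units whose estimated propensity score lies in [lower, upper].

    rows:    one record per unit (any type)
    pscores: estimated propensity score for each unit, in the same order
    lower, upper: illustrative cutoffs (an assumption, not from the post)
    """
    kept_rows, kept_scores = [], []
    for row, p in zip(rows, pscores):
        if lower <= p <= upper:
            kept_rows.append(row)
            kept_scores.append(p)
    return kept_rows, kept_scores


# Toy example (hypothetical unit labels): units "a" and "c" are dropped.
rows, scores = trim_by_propensity(["a", "b", "c", "d"], [0.0, 0.5, 0.95, 0.1])
print(rows, scores)
```

<p><SPAN style="FONT-SIZE: 10pt">Trimming changes the population the estimate refers to, so any treatment effect computed afterwards applies only to units with overlap.</SPAN></p>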
<p><SPAN style="FONT-SIZE: 10pt">CBPS appears to be an interesting and promising new extension of familiar propensity score matching methods. If you review the paper and wish to implement CBPS, the authors have generously made <A href="http://imai.princeton.edu/software/CBPS.html"><FONT color=#0000ff>a stats package for CBPS available in R</FONT></A>.</SPAN></p>
Wed, 03 Oct 2012 08:53:24 -0400
Jed Friedman