Tools of the trade: The covariate balanced propensity score



The primary goal of an impact evaluation study is to estimate the causal effect of a program, policy, or intervention. Randomized assignment of treatment enables the researcher to draw causal inference in a relatively assumption-free manner. If randomization is not feasible, there are more assumption-driven methods, termed quasi-experimental, such as regression discontinuity or propensity score matching. For many of our readers this summary is nothing new. Fortunately, in our “community of practice” new statistical tools are developed at a rapid rate. Here is another exciting tool, a methodological extension of matching, termed:

The covariate balanced propensity score

Now a matching estimator is considered by many to be the least preferred quasi-experimental IE method because of the strong identifying assumptions that it requires, especially in settings where participants can choose whether to participate. I share this view. However, the machinery that facilitates matching, the propensity score, is elegant and can be useful in a variety of settings. The propensity score makes matching a practical exercise, as it reduces the likely insurmountable problem of matching on many dimensions to a straightforward match on only one dimension (first demonstrated in the seminal 1983 Rosenbaum and Rubin paper). Many neat applications of the propensity score have been worked out, including the propensity weighted regression estimate by Hirano, Imbens, and Ridder that yields an estimate of the average treatment effect.
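To make the weighting idea concrete, here is a minimal Python sketch of a normalized inverse-propensity-weighted estimate of the average treatment effect. It is a simple weighting estimator in the spirit of the propensity weighted approach, not the exact regression estimator from Hirano, Imbens, and Ridder. The function name ipw_ate and the toy data are made up for illustration.

```python
def ipw_ate(outcomes, treated, scores):
    """Normalized inverse-propensity-weighted estimate of the ATE.

    outcomes: observed outcome for each unit
    treated:  1 if the unit was treated, 0 otherwise
    scores:   propensity score for each unit, strictly between 0 and 1
    """
    # Treated units get weight 1/p; control units get weight 1/(1 - p).
    w1 = [t / p for t, p in zip(treated, scores)]
    w0 = [(1 - t) / (1 - p) for t, p in zip(treated, scores)]
    # Weighted outcome means, each normalized by its own sum of weights.
    mean1 = sum(w * y for w, y in zip(w1, outcomes)) / sum(w1)
    mean0 = sum(w * y for w, y in zip(w0, outcomes)) / sum(w0)
    return mean1 - mean0


# Toy data, invented for illustration only.
y = [10.0, 12.0, 7.0, 8.0]
t = [1, 1, 0, 0]
p = [0.5, 0.5, 0.5, 0.5]
ate = ipw_ate(y, t, p)
```

With a constant propensity score, as in this toy data, the estimate collapses to the simple difference in group means. The weights only matter when treatment probability varies with the covariates.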

So propensity score estimates are used widely. There is a catch though: the propensity score must be estimated, and there is no theoretical guidance on how best to do this. Practitioners usually estimate a logit or probit to predict treatment assignment and then check the covariate balance given by the resulting propensity score. If the researcher isn’t satisfied with the balance, then she will likely re-estimate with a somewhat different specification. The drawback with this approach, however, is that different specifications of the propensity score can result in very different estimates of the treatment effect (for one example, the 2005 paper by Jeffrey Smith and Petra Todd revisits a seminal labor training experiment first discussed by Robert LaLonde in 1986, and finds the matching estimate of impact highly sensitive to specification).

This back-and-forth specification search underscores the dual purpose of the propensity score. (1) It is meant to predict treatment assignment among the study subjects, i.e. it estimates the likelihood of treatment as a function of observable information. (2) It is meant to balance covariates, so that two study subjects with the same propensity score are appreciably similar in observed dimensions. So in our everyday practice, we look for a specification that by design maximizes (1) and, we hope, satisfies (2).
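The balance check in this workflow is often summarized with a standardized difference in covariate means between treated and control units after weighting. The Python sketch below is one common way to compute it, assuming inverse-propensity weights and an unweighted pooled standard deviation as the scale. Both choices, the function names, and the toy data are illustrative assumptions.

```python
import math
import statistics


def weighted_mean(values, weights):
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def standardized_difference(x, treated, scores):
    """Weighted treated-minus-control mean of x, in pooled-SD units."""
    # Inverse-propensity weights for the treated and control groups.
    w1 = [t / p for t, p in zip(treated, scores)]
    w0 = [(1 - t) / (1 - p) for t, p in zip(treated, scores)]
    diff = weighted_mean(x, w1) - weighted_mean(x, w0)
    # Scale by the unweighted pooled standard deviation of the two groups.
    x1 = [v for v, t in zip(x, treated) if t == 1]
    x0 = [v for v, t in zip(x, treated) if t == 0]
    pooled_sd = math.sqrt((statistics.variance(x1) + statistics.variance(x0)) / 2)
    return diff / pooled_sd


# Toy data: a binary covariate that is more common among the treated.
x = [1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0]
treated = [1, 1, 1, 1, 0, 0, 0, 0]

# A constant score ignores x, so the raw imbalance remains.
flat_scores = [0.5] * len(x)
# A score equal to the share treated within each value of x removes it.
good_scores = [0.75 if v == 1.0 else 0.25 for v in x]

raw_gap = standardized_difference(x, treated, flat_scores)
weighted_gap = standardized_difference(x, treated, good_scores)
```

In this toy data the constant score leaves a gap of one pooled standard deviation, while the score that tracks the covariate brings the weighted gap to zero (up to rounding). A specification search is, in effect, a hunt for scores that behave like the second set.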

A new paper by Kosuke Imai and Marc Ratkovic introduces some useful structure to propensity score estimation by formally combining the dual purposes of the propensity score in one estimation framework. Appropriately enough, the authors call this new approach the covariate balancing propensity score (CBPS). With CBPS, a single estimate determines both the treatment assignment mechanism and the covariate balancing weights. (A side note: this is not the only method to automate covariate balancing, but the other methods, such as Hainmueller (2012), do not explicitly link the covariate balancing weights with the propensity score.)

The CBPS estimation details are described in the linked paper but, in brief, the authors stipulate the covariate balancing condition as well as the first order conditions from the propensity score likelihood function. This creates a system of equations that can be estimated jointly by generalized method of moments (GMM) since the number of moment conditions exceeds the number of parameters to be estimated. (It’s important to note that the balancing condition here is generalizable to higher moments of the covariate distribution, not only the first moment. Thus it can accommodate balance in the variances of the covariates as well as the means).
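As a rough illustration of the two sets of conditions, the Python sketch below evaluates, for a logit propensity model, the first-order (score) conditions and a mean-balance condition for the average treatment effect, then stacks them into a simple quadratic objective. The identity weighting in gmm_objective is a simplifying assumption made here, not necessarily the weighting the authors use, and all function names are hypothetical.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def cbps_moments(beta, covariates, treated):
    """Return (score, balance) sample moments for a logit propensity model.

    covariates: one row per unit; include a leading 1.0 for the intercept.
    beta:       logit coefficients, one per column of covariates.
    """
    n = len(treated)
    k = len(beta)
    score = [0.0] * k
    balance = [0.0] * k
    for row, t in zip(covariates, treated):
        p = logistic(sum(b * x for b, x in zip(beta, row)))
        for j, x in enumerate(row):
            # Logit first-order condition: (T - p) * X.
            score[j] += (t - p) * x / n
            # Mean balance between inverse-propensity-weighted groups.
            balance[j] += (t / p - (1 - t) / (1 - p)) * x / n
    return score, balance


def gmm_objective(beta, covariates, treated):
    """Sum of squared stacked moments (identity weighting: an assumption)."""
    score, balance = cbps_moments(beta, covariates, treated)
    return sum(m * m for m in score + balance)


# Intercept-only toy example: 3 of 8 units are treated.
treated = [1, 0, 0, 1, 1, 0, 0, 0]
covariates = [[1.0] for _ in treated]
share = sum(treated) / len(treated)
beta_fit = [math.log(share / (1 - share))]

objective_at_fit = gmm_objective(beta_fit, covariates, treated)
objective_at_zero = gmm_objective([0.0], covariates, treated)
```

With k coefficients there are 2k stacked moments, which is why the system is over-identified. In the intercept-only toy case both sets of conditions happen to hold at the same coefficient, so the objective is zero there; with real covariates they generally conflict, and the estimator has to trade one against the other.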

In standard propensity score matching, the empirical fit of the likelihood function is maximized so that it does the best possible job of predicting treatment status, but covariate balance is not explicitly addressed. In essence, the CBPS framework works by trading off some of this accuracy of prediction (the “likelihood”) to ensure a better balance of covariates.

So how does the CBPS perform in relation to standard matching vis-à-vis estimates of causal impact? Imai and Ratkovic work through two empirical examples where the CBPS does a substantially better job than the standard propensity score at minimizing bias and the root mean squared error (RMSE), a summary of bias and variance. For example, in revisiting the LaLonde labor experiment, the CBPS estimate comes much closer to the experimental estimate of impact (an $886 gain in annual earnings from training) than the standard matching estimate. With a 1 to N matching estimator, the standard propensity score understates the true (experimental) income gain by $805 while CBPS understates it by only $93.
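For readers who want to see how RMSE summarizes bias and variance, the short Python sketch below computes all three from a set of estimates and a benchmark value. Only the $886 benchmark comes from the text; the four estimates are hypothetical numbers chosen to keep the arithmetic easy, not results from the paper.

```python
import math


def bias_variance_rmse(estimates, truth):
    """Return (bias, variance, rmse) of estimates around a benchmark."""
    n = len(estimates)
    mean_estimate = sum(estimates) / n
    bias = mean_estimate - truth
    # Variance of the estimates around their own mean.
    variance = sum((e - mean_estimate) ** 2 for e in estimates) / n
    # Root mean squared error around the benchmark.
    rmse = math.sqrt(sum((e - truth) ** 2 for e in estimates) / n)
    return bias, variance, rmse


# Hypothetical estimates from repeated samples (illustration only).
estimates = [700.0, 850.0, 900.0, 750.0]
bias, variance, rmse = bias_variance_rmse(estimates, truth=886.0)

# The squared RMSE equals squared bias plus variance.
decomposition_gap = rmse ** 2 - (bias ** 2 + variance)
```

An estimator can therefore lower its RMSE either by sitting closer to the benchmark on average or by varying less from sample to sample.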

CBPS can be extended to non-binary treatments, longitudinal data, and other common cases. I am sure that further investigation of this method, and of the conditions where it is most applicable, awaits. There is one caveat: the method assumes that no propensity score falls at either extreme of zero or one. The authors do not discuss the implications if this assumption is violated, yet, as practitioners know, propensity scores fall at either extreme with disturbing frequency. This speaks to the need to apply the method carefully, perhaps on a truncated data set.
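One simple way to build such a truncated data set is to drop units whose estimated propensity score lies too close to zero or one. The Python sketch below does this with cutoffs of 0.05 and 0.95. Those cutoffs and the function name are illustrative assumptions, not a recommendation from the paper.

```python
def trim_to_overlap(units, scores, lower=0.05, upper=0.95):
    """Keep units whose propensity score lies strictly inside (lower, upper).

    The default cutoffs are illustrative; the right choice depends on the data.
    """
    kept = [(u, p) for u, p in zip(units, scores) if lower < p < upper]
    kept_units = [u for u, _ in kept]
    kept_scores = [p for _, p in kept]
    return kept_units, kept_scores


# Toy example: units "a", "c", and "e" sit at or near the extremes.
unit_ids = ["a", "b", "c", "d", "e"]
scores = [0.01, 0.30, 0.97, 0.60, 1.00]
kept_ids, kept_scores = trim_to_overlap(unit_ids, scores)
```

Trimming changes the population the estimate refers to, so any result on the trimmed sample should be reported as applying only to units with overlapping propensity scores.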

CBPS appears to be an interesting and promising new extension of familiar propensity score matching methods. If you review the paper and wish to implement CBPS, the authors have generously made a stats package for CBPS available in R.



Jed Friedman

Senior Economist, Development Research Group, World Bank

Join the Conversation

Jed Friedman
October 04, 2012

My first thought is that the concern you mention still applies here. The CBPS is an alternative method to estimate the propensity score, but it is still an estimation technique, and when applying the propensity score to the outcome model, the same issues will apply. However, I will dig a little further and see if I can learn anything new! Best, Jed.

Aaka Pande
October 03, 2012

Thanks for sharing this new extension of propensity score matching, which is very relevant to the body of quasi-experimental impact evaluations that the Bank is undertaking with increasing frequency.

My question regards calculating standard errors via this method. In the "standard" propensity score method, since the propensity score is estimated, standard errors are too, which often leads to very large confidence intervals. One way to overcome this has been to bootstrap SEs, but as pointed out by Imbens (2004), "justification for bootstrap estimators is limited; however, because the estimators are asymptotically linear, bootstrapping will likely lead to valid standard errors and confidence intervals."

I would appreciate your thoughts on whether this new method addresses this.

Thank you

Marc Ratkovic
October 06, 2012


Thank you for the insightful question (and, Jed, thank you for the informative, clear, and fair posting!).

Your question concerns the calculation of valid standard errors of treatment effects, say in a program evaluation, when the effect estimates are constructed using estimated propensity scores. I would recommend that you use the function IPW in our package CBPS to estimate treatment effects and their standard errors. Details follow.

There are two commonly encountered means of constructing effect estimates: through weighting and matching. Weighting estimators are analytically simpler, and our R package, CBPS, contains a function (IPW) that takes an outcome variable, treatment variable, and propensity scores and returns asymptotically valid standard error estimates. We implement the method from Lunceford and Davidian (2004), citation below.

Standard errors with matching are a bit more difficult. Matching can be thought of as estimating balancing weights, though the weights are constrained to be 0 or 1. Bootstrapping standard errors with matching estimates is not valid. Basically, validity of the bootstrap requires at least one derivative, and since matching generates weights that are discontinuous (0 or 1), the bootstrap fails. For a similar reason, the delta method, which is commonly used to estimate variances, also fails.

Abadie and Imbens have a means of calculating consistent standard errors with matching estimators, and though I have coded it up, we have not yet released it in CBPS. I have found these standard errors to be a bit too large to be practical, so I want to run some further simulations and assessments before we include them. The Abadie-Imbens standard errors are the ones reported in our paper when we use matching estimates in the LaLonde data.

I hope this helps, and provides some rough guidance.

If you have any further questions about CBPS, or propensity scores in general, please don't hesitate to ask.



Lunceford and Davidian (2004). Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Statistics in Medicine.