This week we're introducing our new series that we decided to call 'Ask Guido.' Guido Imbens has kindly agreed to answer technical questions every so often and we are thrilled. For this first installment, Guido starts by answering a question about standard errors and the appropriate level of clustering in matching.
One question that often comes up in empirical work concerns the appropriate way to calculate standard errors, and in particular the correct level of clustering. Here is a specific version of the question that someone posed, slightly paraphrased:
Let's consider a situation where you would like to evaluate a program that is taking place in a single state, but you have available individual-level outcomes. You decide to individually match program beneficiaries with similar people outside of the state. You assume each person is an individual unit and their errors are iid, and your power is going to steadily increase in the number of observations. Now consider a similar situation where you have 10 treatment states scattered about. In this case, assuming iid errors does not seem right: instead, one might cluster the errors by state. Now, beyond a certain point, increasing the within-state sample size won't do much for power. But this is where I get confused. Consider an exercise where we start with one treatment state and then increase the number of states slowly. At some point, maybe N=2 or N=3, we would start clustering the standard errors, and we would be in the weird situation where increasing the number of clusters decreases power. This clearly can't be right, but I can't quite figure out where the logic is incorrect.
To make sense of this it is useful to separate two questions. First, how should we estimate the estimand, say the average effect of the program? Second, how credible/precise is that estimate?
It is also useful to embed the problem in a single model. Suppose we think of the outcomes as being affected by individual-level characteristics as well as state effects, with the latter modeled as random effects, independent of the treatment.
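One way to write such a model down is sketched below. The notation is an assumption made for illustration, not something taken from the question: Y_{is} is the outcome for individual i in state s, X_{is} the individual-level covariates, W_s the treatment indicator for state s, \alpha_s the random state effect, and \varepsilon_{is} an idiosyncratic error. The linear, additive form is also just a convenient simplification.

```latex
Y_{is} = \tau W_s + X_{is}'\beta + \alpha_s + \varepsilon_{is},
\qquad \alpha_s \sim (0, \sigma_\alpha^2), \quad
\varepsilon_{is} \sim (0, \sigma_\varepsilon^2), \quad
\alpha_s \perp W_s .

% One treated state (s = 1) and one control state (s = 0), n individuals each,
% with covariates already adjusted for:
\operatorname{Var}\left( \bar{Y}_1 - \bar{Y}_0 \right)
  = 2\sigma_\alpha^2 + \frac{2\sigma_\varepsilon^2}{n} .
```

Under these assumptions the second variance term shrinks as n grows, while the first does not depend on n at all.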
In the first scenario with a single treated state and a single control state, matching individuals from the treated and control states in terms of individual-level covariates will take care of differences in the individual-level characteristics, but that will not do anything about the random state effects. If we have large samples in both states, the covariate-adjusted difference in average outcomes will capture both the random state effects and the treatment effect of interest. The first approach outlined in the quote suggests ignoring the random state effects, in which case we estimate the treatment effect precisely in large samples. This is partly motivated by the fact that the treatment effect is not identified in this scenario once state effects are allowed for: having more individuals does not pin down the estimand. However, the fact that the estimand is not identified is of course not a reason to make assumptions that are not plausible, a point Manski has stressed. So, in this scenario matching treated and control individuals may lead to an estimate that is reasonable and unbiased (assuming that which state got treated is not related to the state effects, which is itself a big assumption), but the variance estimates based on assuming there are no state effects would underestimate the true variance if in fact there were such state effects. So, while it may be common practice to ignore state effects for the purpose of calculating standard errors if one has only data from a single treated and a single control state, that does not mean that those standard errors are credible.
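A small simulation can make the gap between the naive and the true variance concrete. The sketch below is illustrative only: the function name, the parameter values, and the normal distributions are all assumptions, and covariates are left out, as if matching had already removed individual-level differences. It repeatedly draws one treated and one control state, each with its own random state effect, and compares the spread of the estimates across draws with the standard error one would compute under the iid assumption.

```python
import numpy as np


def simulate(n_per_state=2000, sigma_alpha=0.5, sigma_eps=1.0, tau=1.0,
             n_reps=2000, seed=0):
    """Draw one treated and one control state n_reps times.

    Returns the difference-in-means estimates and the naive standard
    errors that ignore the state effects. All defaults are hypothetical.
    """
    rng = np.random.default_rng(seed)
    estimates = np.empty(n_reps)
    naive_se = np.empty(n_reps)
    for r in range(n_reps):
        # One random effect per state, independent of which state is treated.
        alpha_treated, alpha_control = rng.normal(0.0, sigma_alpha, size=2)
        y_treated = tau + alpha_treated + rng.normal(0.0, sigma_eps, n_per_state)
        y_control = alpha_control + rng.normal(0.0, sigma_eps, n_per_state)
        estimates[r] = y_treated.mean() - y_control.mean()
        # Standard error of a difference in means under the iid assumption.
        naive_se[r] = np.sqrt(y_treated.var(ddof=1) / n_per_state
                              + y_control.var(ddof=1) / n_per_state)
    return estimates, naive_se


estimates, naive_se = simulate()
print("mean of estimates:       ", estimates.mean())
print("std. dev. of estimates:  ", estimates.std(ddof=1))
print("average naive std. error:", naive_se.mean())
```

With these made-up values the estimates are centered near the true effect, but their spread across draws is far larger than the naive standard error suggests, and increasing n_per_state does not close that gap.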
An alternative way of thinking about the problem is from a Bayesian perspective (e.g., Gelman and Hill, 2006). Assuming normality, which is probably not a big deal in this case, the key issue is the choice of prior distribution for the variance of the random state effect. If one has data for many states, using a conventional inverse gamma prior would lead to results where the posterior distribution would likely be dominated by the data rather than the prior distribution. However, with only two states, one treated and one control, the posterior for this variance would in many cases be identical to the prior distribution, and so the choice of prior will matter substantially. The suggestion of ignoring the state effects amounts to fixing the prior to have point mass at zero, which a priori seems implausible.
Back to the problem at hand
Given that valid confidence intervals would seem to require a consistent estimator for the variance of the state effects, which is not available, what should one do in this case with few states? I want to make three suggestions.
First, one can assess the sensitivity of the variance estimates to postulated values for the variance of the state effects. That is, pick a range of possible values for the variance of the state effects, and inspect the corresponding range of confidence intervals. If one is unwilling or unable to pick a plausible range of possible values for the variance, this will not limit the range of confidence intervals much. In the Bayesian approach this would correspond to considering a range of prior distributions, in the spirit of Leamer (1981).
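As a minimal sketch of this first suggestion, the hypothetical helper below takes a point estimate and its naive (iid) standard error and recomputes a normal-approximation confidence interval for each postulated standard deviation of the state effects. It assumes one treated and one control state with independent state effects, so that the state effects add twice their variance to the variance of the estimate. The function name and the example numbers are made up for illustration.

```python
import math


def sensitivity_intervals(estimate, naive_se, sigma_alpha_values, z=1.96):
    """Confidence intervals under postulated state-effect standard deviations.

    Assumes one treated and one control state, so the state effects add
    2 * sigma_alpha**2 to the variance of the estimated effect.
    Returns a list of (sigma_alpha, lower, upper) tuples.
    """
    rows = []
    for sigma_alpha in sigma_alpha_values:
        total_se = math.sqrt(naive_se ** 2 + 2.0 * sigma_alpha ** 2)
        rows.append((sigma_alpha,
                     estimate - z * total_se,
                     estimate + z * total_se))
    return rows


# Hypothetical estimate and naive standard error, for illustration only.
for sigma_alpha, lower, upper in sensitivity_intervals(1.0, 0.03,
                                                       [0.0, 0.1, 0.25, 0.5]):
    print(f"sigma_alpha = {sigma_alpha:.2f}: [{lower:.3f}, {upper:.3f}]")
```

The first row, with the state-effect standard deviation set to zero, reproduces the naive interval; the later rows show how quickly the interval widens once even modest state effects are allowed.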
Second, one can try to assess directly the assumption that the covariates are sufficiently rich to eliminate differences between the two states in the absence of the treatment. Suppose for example one has lagged outcomes. One can test whether adjusting for the remaining covariates removes the average difference in lagged outcomes between the treated and control states. If one rejects that hypothesis, this would cast doubt on the assumption that there are no differences between the two states in the absence of the treatment. This type of falsification test is becoming more common in many empirical studies.
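A falsification test of this kind could be sketched as follows. This is one simple way to implement it, assuming a linear regression adjustment: regress the lagged outcome on a treated-state indicator and the remaining covariates, and look at the coefficient on the indicator. The function name and the simulated data are hypothetical, and the t-statistic uses conventional iid standard errors, which is the appropriate benchmark under the hypothesis being tested (no state-level differences).

```python
import numpy as np


def placebo_test(lagged_y, treated_state, covariates):
    """OLS of the lagged outcome on a treated-state dummy and covariates.

    Returns the coefficient on the dummy and its conventional (iid)
    t-statistic. A large t-statistic casts doubt on the assumption that
    the covariates remove pre-treatment differences between the states.
    """
    n = len(lagged_y)
    X = np.column_stack([np.ones(n), treated_state, covariates])
    coef, *_ = np.linalg.lstsq(X, lagged_y, rcond=None)
    resid = lagged_y - X @ coef
    sigma2 = resid @ resid / (n - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return coef[1], coef[1] / np.sqrt(cov[1, 1])


# Simulated data with a pre-existing state gap of 0.3 that the covariate
# does not explain (all values are made up for illustration).
rng = np.random.default_rng(1)
n = 1000
treated_state = np.repeat([0.0, 1.0], n)
x = rng.normal(size=2 * n)
lagged_y = 0.5 * x + 0.3 * treated_state + rng.normal(size=2 * n)

gap, t_stat = placebo_test(lagged_y, treated_state, x)
print("adjusted gap in lagged outcomes:", gap)
print("t-statistic:", t_stat)
```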
Third, suppose one has at least two control states, even if one only has a single treated state. Then one can compare outcomes for the two control states, adjusting for covariates. The assumption that there are no state effects implies that the average adjusted outcomes should be the same in the two control states, and this can be a powerful way to assess that assumption directly. If one finds that there are differences, it will provide some evidence regarding the magnitude of the state effects that can help with implementing the first suggestion. With multiple control states one might also wish to consider the Abadie-Diamond-Hainmueller synthetic control group approach.
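To illustrate how a comparison of two control states might feed back into the first suggestion, the sketch below turns the covariate-adjusted gap between two control states into a rough guess at the standard deviation of the state effects. This is a crude method-of-moments device introduced here for illustration, not a procedure from the text: it assumes independent state effects with common variance, so that the expected squared gap equals twice that variance plus the squared sampling standard error of the gap. With a single pair of states the result is very noisy and is best read as an order of magnitude.

```python
import math


def rough_state_effect_sd(adjusted_gap, gap_se):
    """Crude guess at the sd of the state effects from two control states.

    Uses E[gap**2] = 2 * sigma_alpha**2 + gap_se**2 (an assumption of
    independent, equal-variance state effects) and truncates at zero.
    """
    return math.sqrt(max(0.0, (adjusted_gap ** 2 - gap_se ** 2) / 2.0))


# Hypothetical adjusted gap and standard error, for illustration only.
print(rough_state_effect_sd(adjusted_gap=0.4, gap_se=0.05))
```

A value obtained this way could serve as one of the postulated values in a sensitivity analysis, alongside smaller and larger alternatives.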
In some cases, there may be data available on individuals from the same state who are not treated. Using such individuals as controls would eliminate the problem with the state effects that plagues the comparisons of individuals in different states, and clustering adjustments would not matter much. However, different concerns might arise with such within-state comparisons. Now the question is why individuals who had the choice to participate in the program did not do so. This might be indicative of other, unobserved differences that make them inappropriate comparisons.
Abadie, Diamond, and Hainmueller (2010), "Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California
Gelman and Hill (2006), Data Analysis Using Regression and Multilevel/Hierarchical Models, Cambridge University Press.
Leamer, E. (1981), "Sets of Posterior Means with Bounded Variance Priors," Econometrica, Vol. 50(3): 725-736.
Guido Imbens is Professor of Economics at the Graduate School of Business at Stanford University.