*This week we're introducing our new series that we decided to call 'Ask Guido.' Guido Imbens has kindly agreed to answer technical questions every so often and we are thrilled. For this first installment, Guido starts by answering a question about standard errors and the appropriate level of clustering in matching.*

One question that often comes up in empirical work concerns the appropriate way to calculate standard errors, and in particular the correct level of clustering. Here is a specific version of the question that someone posed, slightly paraphrased:

Let's consider a situation where you would like to evaluate a program that is taking place in a single state, but you have individual-level outcomes available. You decide to individually match program beneficiaries with similar people outside of the state. You assume each person is an individual unit and their errors are iid, and your power is going to steadily increase in the number of observations. Now consider a similar situation where you have 10 treatment states scattered about. In this case, assuming iid errors does not seem right: instead, one might cluster the errors by state. Now, above a certain point, increasing the within-state sample size won't do much for power. But this is where I get confused. Consider an exercise where we start with one treatment state and then increase the number of states slowly. At some point, maybe N=2 or N=3, we would start clustering the standard errors, and we would be in the weird situation where increasing the number of clusters decreases power. This clearly can't be right, but I can't quite figure out where the logic is incorrect.

To make sense of this it is useful to separate two questions. First, how should we estimate the estimand, say the average effect of the program? Second, how credible/precise is that estimate?

It is also useful to embed the problem into a single model. Suppose we think of the outcomes being affected by individual level characteristics as well as state effects, with the latter modeled as random effects, independent of the treatment.
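One way to write such a model down, as a sketch in notation that is assumed here rather than taken from the question: let $Y_{is}$ be the outcome for individual $i$ in state $s$, $W_s$ the state-level treatment indicator, $X_{is}$ the individual covariates, $\eta_s$ the random state effect, and $\varepsilon_{is}$ the individual error.

$$
Y_{is} = \alpha + \tau W_s + X_{is}'\beta + \eta_s + \varepsilon_{is}, \qquad \eta_s \sim (0, \sigma_\eta^2), \quad \varepsilon_{is} \sim (0, \sigma_\varepsilon^2).
$$

If the covariates were perfectly balanced and each of $S_1$ treated and $S_0$ control states contributed $n$ individuals, the difference in average outcomes $\hat\tau$ would have variance

$$
\mathrm{Var}(\hat\tau) = \left(\frac{1}{S_1} + \frac{1}{S_0}\right)\left(\sigma_\eta^2 + \frac{\sigma_\varepsilon^2}{n}\right).
$$

The $\sigma_\varepsilon^2/n$ part vanishes as $n$ grows but the $\sigma_\eta^2$ part does not, and the whole expression falls as states are added. Under these simplifying assumptions, moving from one treated state to several never raises the true variance; what changes is whether the reported standard error acknowledges $\sigma_\eta^2$.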

In the first scenario with a single treated state and a single control state, matching individuals from the treated and control states in terms of individual level covariates will take care of differences in the individual level characteristics, but that will not do anything about the random state effects. If we have large samples in both states, the covariate-adjusted difference in average outcomes will capture both the difference in random state effects and the treatment effect of interest. The first approach outlined in the quote suggests ignoring the random state effects, in which case we estimate the treatment effect precisely in large samples. This is partly motivated by the fact that the treatment effect is not identified in this scenario: having more individuals does not pin down the estimand. However, the fact that the estimand is not identified is of course not a reason to make assumptions that are not plausible, a point Manski has stressed. So, in this scenario matching treated and control individuals may lead to an estimate that is reasonable and unbiased (assuming that which state got treated is not related to the state effects, which itself is a big assumption), but variance estimates based on assuming there are no state effects would underestimate the true variance if in fact there were such state effects. So, while it may be common practice to ignore state effects for the purpose of calculating standard errors if one has only data from a single treated and a single control state, that does not mean those standard errors are credible.
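A small simulation can make the gap between the two variances concrete. The sketch below is only illustrative: it assumes matching has already balanced the individual covariates perfectly (so they are omitted), that state effects and individual errors are normal, and that the function name and all parameter values are hypothetical choices, not anything from the original question.

```python
import numpy as np

def simulate_two_state_estimates(n_per_state, sigma_state, sigma_indiv=1.0,
                                 tau=0.0, n_sims=2000, seed=0):
    """Simulate many one-treated-state / one-control-state studies.

    Returns the estimated effects and the naive (iid) standard errors,
    one pair per simulated study.
    """
    rng = np.random.default_rng(seed)
    estimates = np.empty(n_sims)
    naive_ses = np.empty(n_sims)
    for r in range(n_sims):
        # One random state effect per state, shared by everyone in that state.
        eta_treated, eta_control = rng.normal(0.0, sigma_state, size=2)
        y_treated = tau + eta_treated + rng.normal(0.0, sigma_indiv, size=n_per_state)
        y_control = eta_control + rng.normal(0.0, sigma_indiv, size=n_per_state)
        estimates[r] = y_treated.mean() - y_control.mean()
        # Standard error that treats every individual as an independent draw.
        naive_ses[r] = np.sqrt(y_treated.var(ddof=1) / n_per_state
                               + y_control.var(ddof=1) / n_per_state)
    return estimates, naive_ses

for n in (100, 1000):
    est, se = simulate_two_state_estimates(n_per_state=n, sigma_state=0.3)
    print(f"n={n}: sd of estimates={est.std(ddof=1):.3f}, "
          f"mean naive SE={se.mean():.3f}")
```

With a nonzero `sigma_state`, the spread of the estimates across simulated studies barely moves as `n_per_state` grows, while the naive standard error keeps shrinking. Setting `sigma_state=0.0` brings the two back into line.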

An alternative way of thinking about the problem is from a Bayesian perspective (e.g., Gelman and Hill, 2006). Assuming normality, which is probably not a big deal in this case, the key issue is the choice of prior distribution for the variance of the random state effect. If one has data for many states, using a conventional inverse gamma prior would lead to results where the posterior distribution would likely be dominated by the data rather than the prior distribution. However, with only two states, one treated and one control, the posterior for this variance would in many cases be identical to the prior distribution, and so the choice of prior will matter substantially. The suggestion of ignoring the state effects amounts to fixing the prior to have point mass at zero, which a priori seems implausible.
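A stylized calculation shows why the data can be silent about this variance. It rests on assumptions that are not in the text above: the two states are summarized by their covariate-adjusted difference in means $D$, the sampling variance $v$ of that difference is treated as known, and the prior on the treatment effect $\tau$ is flat and independent of the prior on $\sigma_\eta^2$.

$$
D \mid \tau, \sigma_\eta^2 \sim N\left(\tau,\; 2\sigma_\eta^2 + v\right)
$$

$$
p(D \mid \sigma_\eta^2) \propto \int N\left(D;\, \tau,\, 2\sigma_\eta^2 + v\right) d\tau = 1
$$

$$
p(\sigma_\eta^2 \mid D) \propto p(\sigma_\eta^2)\, p(D \mid \sigma_\eta^2) \propto p(\sigma_\eta^2)
$$

Integrating out $\tau$ leaves a marginal likelihood that does not depend on $\sigma_\eta^2$, so the posterior for the state-effect variance reproduces the prior. With an informative prior on $\tau$ the two would differ somewhat, which is one reading of "in many cases."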

*Back to the problem at hand*

Given that valid confidence intervals would seem to require a consistent estimator for the variance of the state effects, which is not available, what should one do in this case with few states? I want to make three suggestions.

**First**, one can assess the sensitivity of the variance estimates to postulated values for the variance of the state effects. That is, pick a range of possible values for the variance of the state effects, and inspect the corresponding range of confidence intervals. If one is not willing or unable to pick a plausible range of possible values for the variance, this will not limit the range of confidence intervals much. In the Bayesian approach this would correspond to considering a range of prior distributions, in the spirit of Leamer (1981).
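As a rough sketch of this first suggestion, the snippet below widens a conventional confidence interval for a grid of postulated state-effect standard deviations. It assumes one treated and one control state with independent state effects of common variance, so the difference in means picks up an extra variance of twice that amount. The function name, the point estimate, and the naive standard error are all hypothetical.

```python
import math

def sensitivity_intervals(estimate, naive_se, state_sds, z=1.96):
    """Confidence intervals under postulated state-effect standard deviations.

    Each row is (state_sd, total_se, lower, upper). The factor 2 reflects
    one treated and one control state, each with its own state effect.
    """
    rows = []
    for sd in state_sds:
        total_se = math.sqrt(naive_se ** 2 + 2.0 * sd ** 2)
        rows.append((sd, total_se, estimate - z * total_se, estimate + z * total_se))
    return rows

# Hypothetical numbers, for illustration only.
for sd, total_se, lower, upper in sensitivity_intervals(0.20, 0.05,
                                                        [0.0, 0.05, 0.10, 0.20]):
    print(f"state sd={sd:.2f}: se={total_se:.3f}, CI=({lower:.3f}, {upper:.3f})")
```

Reading down the output shows how quickly an interval that excludes zero at a state-effect standard deviation of zero can come to include it as that value rises.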

**Second**, one can try to assess directly the assumption that the covariates are sufficiently rich to eliminate differences between the two states in the absence of the treatment. Suppose for example one has lagged outcomes. One can test whether adjusting for the remaining covariates removes the average difference in lagged outcomes between the treated and control states. If one rejects that hypothesis, this would cast doubt on the assumption that there are no differences between the two states in the absence of the treatment. This type of falsification test is becoming more common in many empirical studies.
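One simple way to run such a check is sketched below. Note that it swaps in linear regression adjustment for matching, purely to keep the example short. The function name and the simulated data (including the built-in gap of 0.5 in lagged outcomes) are hypothetical. The standard error ignores state effects, which is appropriate under the null hypothesis being tested, and it also assumes homoskedastic errors.

```python
import numpy as np

def placebo_gap(lagged_outcome, in_treated_state, covariates):
    """OLS of the lagged outcome on a treated-state dummy plus covariates.

    Returns (gap, se, t_stat) for the coefficient on the dummy.
    """
    y = np.asarray(lagged_outcome, dtype=float)
    d = np.asarray(in_treated_state, dtype=float)
    x = np.asarray(covariates, dtype=float).reshape(len(y), -1)
    design = np.column_stack([np.ones(len(y)), d, x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    sigma2 = resid @ resid / (len(y) - design.shape[1])
    cov = sigma2 * np.linalg.inv(design.T @ design)
    se = float(np.sqrt(cov[1, 1]))
    gap = float(coef[1])
    return gap, se, gap / se

# Hypothetical data: the treated state has lagged outcomes that are higher
# by 0.5 for reasons the single covariate (age) does not explain.
rng = np.random.default_rng(1)
n = 500
in_treated = np.repeat([1.0, 0.0], n)
age = rng.normal(40.0, 10.0, size=2 * n)
lagged_y = 0.02 * age + 0.5 * in_treated + rng.normal(0.0, 1.0, size=2 * n)

gap, se, t_stat = placebo_gap(lagged_y, in_treated, age)
print(f"adjusted gap in lagged outcomes={gap:.3f}, se={se:.3f}, t={t_stat:.2f}")
```

A large t-statistic here would be a warning that the covariates do not remove pre-existing differences between the states.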

**Third**, suppose one has at least two control states, even if one only has a single treated state. Then one can compare outcomes for the two control states, adjusting for covariates. The assumption that there are no state effects implies that the average adjusted outcomes should be the same in the two control states, and this can be a powerful way to assess that assumption directly. If one finds that there are differences, it will provide some evidence regarding the magnitude of the state effects that can help with implementing the first suggestion. With multiple control states one might also wish to consider the Abadie-Diamond-Hainmueller synthetic control group approach.
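To illustrate how a gap between two control states might feed back into the first suggestion, here is a crude method-of-moments sketch. It assumes the covariate-adjusted gap between the control states equals the difference of two independent state effects plus sampling noise, so its expected square is twice the state-effect variance plus the squared standard error. The function name and the numbers are hypothetical, and an estimate based on a single gap is extremely noisy.

```python
import math

def implied_state_sd(control_gap, gap_se):
    """Back out a rough state-effect sd from the gap between two control states.

    Uses E[gap^2] = 2 * sigma_state^2 + gap_se^2 and truncates at zero
    when the observed gap is smaller than its sampling noise.
    """
    implied_var = (control_gap ** 2 - gap_se ** 2) / 2.0
    return math.sqrt(max(implied_var, 0.0))

# Hypothetical adjusted gap of 0.30 between the two control states,
# with a conventional standard error of 0.10.
print(f"implied state-effect sd: {implied_state_sd(0.30, 0.10):.3f}")
```

A value obtained this way is best treated as one plausible entry in the range of postulated values, not as a reliable estimate.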

In some cases, there may be data available on individuals from the same state who are not treated. Using such individuals as controls would eliminate the problem with the state effects that plagues the comparisons of individuals in different states, and clustering adjustments would not matter much. However, different concerns might arise with such within-state comparisons. Now the question is why individuals who had the choice to participate in the program did not do so. This might be indicative of other, unobserved differences that make them inappropriate comparisons.

**References:**

Abadie, Diamond, and Hainmueller (2010) ``Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California

Gelman and Hill, Data Analysis Using Regression and Multilevel/Hierarchical Models, Cambridge University Press, 2006

Leamer, E. (1981), "Sets of Posterior Means with Bounded Variance Priors," Econometrica, Vol. 50(3): 725-736.

Guido Imbens is Professor of Economics at the Graduate School of Business at Stanford University.

http://www.gsb.stanford.edu/users/imbens

## Comments

## Thank you, Guido, Berk et al.

Thank you, Guido, Berk et al., for the great idea of this series and for this superb post in particular. It happens to be immediately useful to me. Do you have a submission mechanism for future editions?

## Hi Michael,

Hi Michael,

Thanks. You can email [email protected]. We can't guarantee success, but we will pass on good questions...

## This is very helpful, and an

This is very helpful, and an issue I'm struggling with currently. My program is evaluating individuals nested within village clusters within districts, and in our particular setting we believe there is clustering at both the village cluster and district levels. Our comparison individuals are both within the treatment district and in neighboring untreated districts, so it's nice to hear that "clustering adjustments may not matter much."

For a typical impact measure in our data, the normal and robust standard errors are identical at .120, the village-clustered standard errors go up to .157, and the district-clustered standard errors go up to .178 (which pushes the statistical significance of the result from .001 to .014 to .081, effect size .2). Does this count as not mattering much? If I am unsure from a theoretical / program setting point of view, how do I decide which to use from there? Should I choose the clustering that gives the largest standard errors to play it safe?

I've tentatively decided to use village-clustered errors based on our best intuition of the program setting, but is that just one more researcher degree of freedom that I'm tweaking for my own benefit?

## Thank you guys, this is

Thank you guys, this is really helpful and interesting! For the other Dan above, I think a good rule of thumb is to cluster on the level at which the intervention takes place. Given that within a treatment district you have both control and treatment individuals (which I assume correspond to treatment and control villages?), I think clustering at the village level is appropriate.

## I'm so happy to read this

Great to read this post. I've been worried about this issue since I read the interesting discussion in this post:

http://chrisblattman.com/2011/11/29/the-millennium-villages-evaluated-a-skeptical-view/

With only two states, one cannot separately identify treatment and state effects, as Prof Imbens points out.

For more information about selecting prior distributions for the state effects variance parameters, see "Prior distributions for variance parameters in hierarchical models" A Gelman, Bayesian Analysis, 2006, 1(3) p.515-533. With small numbers of states it is better to use Uniform priors on these variance parameters, if one wants a noninformative prior.

I think that the multilevel modeling approach is to model until units are exchangeable. In other words, in Dan Killian's example, one would add levels for both the villages and districts, because villages may not be considered exchangeable without conditioning on district. I think that framing the question of "when to cluster" in terms of exchangeability is useful. This is discussed a lot in both the Gelman-Hill book cited in this post, and Bayesian Data Analysis by Gelman, Carlin, Stern, Dunson, Vehtari, and Rubin.
