Published on Development Impact

When should you cluster standard errors? New wisdom from the econometrics oracle

In ancient Greek times, important decisions were never made without consulting the high priestess at the Oracle of Delphi. She would deliver wisdom from the gods, although this advice was sometimes vague or confusing, and was often misinterpreted by mortals. Today I bring word that the high priestess and priests (Athey, Abadie, Imbens and Wooldridge) have delivered new wisdom from the god of econometrics on the important decision of when to cluster standard errors. This is definitely one of life’s most important questions, as any keen player of seminar bingo can surely attest. In case their paper is all Greek to you (half of it literally is), I will attempt to summarize their recommendations, so that your standard errors may be heavenly.

The authors argue that there are two reasons for clustering standard errors: a sampling design reason, which arises because you have sampled data from a population using clustered sampling, and want to say something about the broader population; and an experimental design reason, where the assignment mechanism for some causal treatment of interest is clustered. Let me go through each in turn, by way of examples, and end with some of their takeaways.

The Sampling Design reason for clustering
Consider running a simple Mincer earnings regression of the form:
Log(wages) = a + b*years of schooling + c*experience + d*experience^2 + e

You present this model, and are deciding whether to cluster the standard errors. Referee 1 tells you “the wage residual is likely to be correlated within local labor markets, so you should cluster your standard errors by state or village.” But referee 2 argues “the wage residual is likely to be correlated for people working in the same industry, so you should cluster your standard errors by industry”, and referee 3 argues that “the wage residual is likely to be correlated by age cohort, so you should cluster your standard errors by cohort”. What should you do?

You could try estimating your model with these three different clustering approaches, and see what difference this makes.

Their advice: whether or not clustering makes a difference to the standard errors should not be the basis for deciding whether or not to cluster. They note there is a misconception that if clustering matters, one should cluster.

Instead, under the sampling perspective, what matters for clustering is how the sample was selected and whether there are clusters in the population of interest that are not represented in the sample. So, we can imagine different scenarios here:

  1. You want to say something about the association between schooling and wages in a particular population, and are using a random sample of workers from this population. Then there is no need to adjust the standard errors for clustering at all, even if clustering would change the standard errors.
  2. The sample was selected by randomly sampling 100 towns and villages from within the country, and then randomly sampling people in each; and your goal is to say something about the return to education in the overall population. Here you should cluster standard errors by village, since there are villages in the population of interest beyond those seen in the sample.
  3. This same logic makes it clear why you generally wouldn’t cluster by age cohort (it seems unlikely that we would randomly sample some age cohorts and not others, and then try and say something about all ages); and that we would only want to cluster by industry if the sample was drawn by randomly selecting a sample of industries, and then sampling individuals from within each.
Even in the second case, Abadie et al. note that the usual robust (Eicker-Huber-White or EHW) standard errors and the clustered standard errors (which they call Liang-Zeger or LZ standard errors) can both be correct; they are simply correct for different estimands. That is, if you are content with saying something about the particular sample of individuals you have, without trying to generalize to the population, the EHW standard errors are all you need; but if you want to say something about the broader population, the LZ standard errors are necessary.

Special case: even when the sampling is clustered, the EHW and LZ standard errors will be the same if there is no heterogeneity in the treatment effects.
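
To make the mechanical difference between the two estimators concrete, here is a minimal sketch in plain Python. It assumes a stripped-down version of the wage regression with schooling as the only regressor, uses made-up data, and applies no small-sample corrections, so the numbers will not match what a packaged regression routine reports. The function name slope_ses and the village labels are purely illustrative. The only difference between EHW and LZ is whether each observation's score is squared on its own, or first summed within its cluster and then squared.

```python
from collections import defaultdict
from math import sqrt


def slope_ses(y, x, cluster):
    """OLS slope of y on x (with intercept), plus EHW and LZ standard errors.

    Illustrative only: no small-sample corrections are applied.
    """
    n = len(y)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar

    # Each observation's score: demeaned regressor times its residual.
    scores = [(xi - x_bar) * (yi - intercept - slope * xi) for xi, yi in zip(x, y)]

    # EHW: square each score, then add up.
    ehw_var = sum(s ** 2 for s in scores) / sxx ** 2

    # LZ: add up the scores within each cluster first, then square.
    totals = defaultdict(float)
    for s, g in zip(scores, cluster):
        totals[g] += s
    lz_var = sum(t ** 2 for t in totals.values()) / sxx ** 2

    return slope, sqrt(ehw_var), sqrt(lz_var)


# Made-up data: log wages, years of schooling, and a village identifier.
log_wage = [1.9, 2.1, 2.4, 2.2, 2.9, 3.1, 2.6, 2.8, 3.4, 3.6]
schooling = [6, 8, 9, 7, 12, 14, 10, 11, 15, 16]
village = ["a", "a", "a", "a", "b", "b", "b", "c", "c", "c"]

slope, se_ehw, se_lz = slope_ses(log_wage, schooling, village)
print(f"slope={slope:.3f}  EHW se={se_ehw:.3f}  LZ se={se_lz:.3f}")
```

Passing a different list as the cluster argument (industry, cohort) gives the standard errors the three referees asked for. Putting every observation in its own cluster reproduces the EHW number exactly. None of this tells you which one to report: that depends on how the sample was drawn and what population you want to speak about.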

Sidenote 1: this reminds me also of the nearest-neighbor matching command nnmatch of Abadie (with a different et al.), where you can get the narrower SATE standard errors for the sample, or the wider PATE standard errors for the population.

Sidenote 2: This reason is hardly ever a rationale for clustering in an impact evaluation. But Rosenzweig and Udry’s paper on external validity does make the point that we only observe treatment effects for specific points in time, and that if we want to say something more general about how our treatment behaves in other points in time, we need wider standard errors than we use for just saying something about our specific sample – which is very related to the point here about being very clear what your estimand is.

The Experimental Design Reason for Clustering
The second reason for clustering is the one we are probably more familiar with, which is when clusters of units, rather than individual units, are assigned to a treatment. Let’s take the same equation as above, but assume that we have a binary treatment that assigns more schooling to people. So now we have:
Log(wages) = a + b*Treatment + e

Then if the treatment is assigned at the individual level, there is no need to cluster (*). There has been much confusion about this, as Chris Blattman explored in two earlier posts about this issue (the fabulously titled clusterjerk and clusterjerk the sequel), and I still occasionally get referees suggesting I try clustering by industry or something similar in an individually-randomized experiment. This Abadie et al. paper is now finally a good reference to explain why this is not necessary.
(*) unless you are using multiple time periods, and then you will want to cluster by individual, since the unit of randomization is individual, and not individual-time period.

What about if your treatment is assigned at the village level? Then cluster by village. This is also why you want to cluster difference-in-differences at the state-level when you have a source of variation that comes from differences across states, and why a “treatment” like being on one side of a border vs the other is problematic (because you have only 2 clusters).
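
A toy simulation can illustrate why village-level assignment calls for village-level clustering. The sketch below is my own illustration, not code from the paper: it assumes 40 villages of 20 people each, outcomes that share a village-level shock, and a true treatment effect of zero. The outcomes are held fixed and only the village-level assignment is re-drawn, so the spread of the estimates across re-randomizations is the uncertainty the standard error is supposed to capture. All names and parameter values are made up.

```python
import random
from collections import defaultdict
from math import sqrt
from statistics import mean, pstdev


def diff_in_means_ses(y, w, cluster):
    """Treated-minus-control mean with EHW and LZ standard errors (no small-sample correction)."""
    n = len(y)
    w_bar = sum(w) / n
    y_bar = sum(y) / n
    sww = sum((wi - w_bar) ** 2 for wi in w)
    effect = sum((wi - w_bar) * (yi - y_bar) for wi, yi in zip(w, y)) / sww
    intercept = y_bar - effect * w_bar
    scores = [(wi - w_bar) * (yi - intercept - effect * wi) for wi, yi in zip(w, y)]
    ehw = sqrt(sum(s ** 2 for s in scores)) / sww
    totals = defaultdict(float)
    for s, g in zip(scores, cluster):
        totals[g] += s
    lz = sqrt(sum(t ** 2 for t in totals.values())) / sww
    return effect, ehw, lz


rng = random.Random(0)
n_villages, n_per_village = 40, 20

# Fixed outcomes with a common village-level shock; the true treatment effect is zero.
village, y = [], []
for g in range(n_villages):
    shock = rng.gauss(0, 1)
    for _ in range(n_per_village):
        village.append(g)
        y.append(shock + rng.gauss(0, 1))

effects, ehw_ses, lz_ses = [], [], []
for _ in range(300):
    # Re-randomize: half of the villages are treated, and everyone in a village shares the assignment.
    treated = set(rng.sample(range(n_villages), n_villages // 2))
    w = [1 if g in treated else 0 for g in village]
    effect, ehw, lz = diff_in_means_ses(y, w, village)
    effects.append(effect)
    ehw_ses.append(ehw)
    lz_ses.append(lz)

sd_of_estimates = pstdev(effects)
print(f"sd of estimates across re-randomizations: {sd_of_estimates:.3f}")
print(f"average EHW se: {mean(ehw_ses):.3f}   average LZ se: {mean(lz_ses):.3f}")
```

With this setup the EHW standard error should come out well below the actual spread of the estimates, while the village-clustered LZ standard error should be much closer to it. The exact gap depends on the made-up village shock and village size.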

Adding fixed effects
What if we sample at the level of cities, but then add city fixed effects to our Mincer regression? Or we randomize at the city level, but add city fixed effects? Do we still need to cluster at the city level?
The authors note that there is a lot of confusion about using clustering with fixed effects. The general rule is that you still need to cluster if either the sampling or assignment to treatment was clustered. However, the authors show that, once fixed effects are included, the cluster adjustment will only change the standard errors if there is heterogeneity in treatment effects.
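
The role of heterogeneity can be seen in a small simulation. This is a sketch under assumptions of my own, not the authors' derivation: 50 cities of 40 people, treatment assigned to individuals with probability one half within every city, city fixed effects swept out by demeaning within city, and a city-specific treatment effect whose spread is controlled by a hypothetical effect_sd parameter. It only shows when the LZ and EHW numbers diverge, not which of them is appropriate for a given design.

```python
import random
from collections import defaultdict
from math import sqrt


def fe_ses(y, w, cluster):
    """Slope on w after sweeping out cluster fixed effects, with EHW and LZ standard errors."""
    # Demean y and w within each cluster (equivalent to including cluster dummies).
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for yi, wi, g in zip(y, w, cluster):
        s = sums[g]
        s[0] += yi
        s[1] += wi
        s[2] += 1
    yd = [yi - sums[g][0] / sums[g][2] for yi, g in zip(y, cluster)]
    wd = [wi - sums[g][1] / sums[g][2] for wi, g in zip(w, cluster)]

    sww = sum(v ** 2 for v in wd)
    effect = sum(a * b for a, b in zip(wd, yd)) / sww
    scores = [a * (b - effect * a) for a, b in zip(wd, yd)]

    ehw = sqrt(sum(s ** 2 for s in scores)) / sww
    totals = defaultdict(float)
    for s, g in zip(scores, cluster):
        totals[g] += s
    lz = sqrt(sum(t ** 2 for t in totals.values())) / sww
    return effect, ehw, lz


def simulate(effect_sd, seed, n_cities=50, n_per_city=40):
    """Individual-level assignment within cities; city-specific effect drawn from N(1, effect_sd)."""
    rng = random.Random(seed)
    y, w, city = [], [], []
    for g in range(n_cities):
        alpha = rng.gauss(0, 1)        # city fixed effect
        tau = rng.gauss(1, effect_sd)  # city-specific treatment effect
        for _ in range(n_per_city):
            wi = 1 if rng.random() < 0.5 else 0
            y.append(alpha + tau * wi + rng.gauss(0, 1))
            w.append(wi)
            city.append(g)
    return fe_ses(y, w, city)


_, ehw_hom, lz_hom = simulate(effect_sd=0.0, seed=1)
_, ehw_het, lz_het = simulate(effect_sd=2.0, seed=1)
print(f"homogeneous effects:   LZ/EHW = {lz_hom / ehw_hom:.2f}")
print(f"heterogeneous effects: LZ/EHW = {lz_het / ehw_het:.2f}")
```

When every city has the same treatment effect, the ratio should sit close to one, so clustering makes little difference. When effects vary across cities, the within-city scores no longer cancel out and the LZ standard error should be noticeably larger.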

How to cluster?
This is largely a paper about when to cluster, not how to cluster. There is of course a whole other debate about when you can rely on asymptotics, vs bootstrapping, vs randomization inference approaches. They show with asymptotic approximations that the standard Liang-Zeger cluster adjustment is generally conservative, and offer an alternative cluster-adjusted variance estimator that can be used if there is variation in treatment assignment within clusters and you know the fraction of clusters sampled. But with the sample sizes used in many experiments, the more common concern is that asymptotic standard errors may not be conservative enough, so you should be careful about using such an adjustment with typical sample sizes.
 

Authors

David McKenzie

Lead Economist, Development Research Group, World Bank
