When should you cluster standard errors? New wisdom from the econometrics oracle



In ancient Greek times, important decisions were never made without consulting the high priestess at the Oracle of Delphi. She would deliver wisdom from the gods, although this advice was sometimes vague or confusing, and was often misinterpreted by mortals. Today I bring word that the high priestess and priests (Athey, Abadie, Imbens and Wooldridge) have delivered new wisdom from the god of econometrics on the important decision of when you should cluster standard errors. This is definitely one of life’s most important questions, as any keen player of seminar bingo can surely attest. In case their paper is all Greek to you (half of it literally is), I will attempt to summarize their recommendations, so that your standard errors may be heavenly.

The authors argue that there are two reasons for clustering standard errors: a sampling design reason, which arises because you have sampled data from a population using clustered sampling, and want to say something about the broader population; and an experimental design reason, where the assignment mechanism for some causal treatment of interest is clustered. Let me go through each in turn, by way of examples, and end with some of their takeaways.

The Sampling Design reason for clustering
Consider running a simple Mincer earnings regression of the form:
Log(wages) = a + b*years of schooling + c*experience + d*experience^2 + e

You present this model, and are deciding whether to cluster the standard errors. Referee 1 tells you “the wage residual is likely to be correlated within local labor markets, so you should cluster your standard errors by state or village.” But referee 2 argues “the wage residual is likely to be correlated for people working in the same industry, so you should cluster your standard errors by industry”, and referee 3 argues that “the wage residual is likely to be correlated by age cohort, so you should cluster your standard errors by cohort”. What should you do?

You could try estimating your model with these three different clustering approaches, and see what difference this makes.

Their advice: whether or not clustering makes a difference to the standard errors should not be the basis for deciding whether or not to cluster. They note there is a misconception that if clustering matters, one should cluster.

Instead, under the sampling perspective, what matters for clustering is how the sample was selected and whether there are clusters in the population of interest that are not represented in the sample. So, we can imagine different scenarios here:

  1. You want to say something about the association between schooling and wages in a particular population, and are using a random sample of workers from this population. Then there is no need to adjust the standard errors for clustering at all, even if clustering would change the standard errors.
  2. The sample was selected by randomly sampling 100 towns and villages from within the country, and then randomly sampling people in each; and your goal is to say something about the return to education in the overall population. Here you should cluster standard errors by village, since there are villages in the population of interest beyond those seen in the sample.
  3. This same logic makes it clear why you generally wouldn’t cluster by age cohort (it seems unlikely that we would randomly sample some age cohorts and not others, and then try to say something about all ages); and that we would only want to cluster by industry if the sample was drawn by randomly selecting a sample of industries, and then sampling individuals from within each.

Even in the second case, Abadie et al. note that the usual robust (Eicker-Huber-White or EHW) standard errors and the clustered standard errors (which they call Liang-Zeger or LZ standard errors) can both be correct; they are just correct for different estimands. That is, if you are content with just saying something about the particular sample of individuals you have, without trying to generalize to the population, the EHW standard errors are all you need; but if you want to say something about the broader population, the LZ standard errors are necessary.
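To make the EHW versus LZ contrast concrete, here is a minimal sketch in plain Python (my own illustration, not code from the paper). It assumes a one-regressor version of the Mincer equation and a made-up sample of 50 villages with 20 workers each; the function name `slope_with_ses` and all the numbers are hypothetical, and no small-sample corrections are applied, so packaged software would report slightly different values.

```python
import math
import random
from collections import defaultdict


def slope_with_ses(x, y, cluster):
    """OLS slope of y on x (with intercept), plus EHW and LZ standard errors.

    No small-sample corrections are applied.
    """
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    xd = [xi - x_bar for xi in x]                 # demeaned regressor
    sxx = sum(v * v for v in xd)
    b = sum(v * (yi - y_bar) for v, yi in zip(xd, y)) / sxx
    a = y_bar - b * x_bar
    resid = [yi - a - b * xi for xi, yi in zip(x, y)]
    scores = [v * e for v, e in zip(xd, resid)]   # demeaned x times residual

    # EHW: treat every observation's score as independent.
    ehw_var = sum(s * s for s in scores) / sxx ** 2

    # LZ: add up the scores within each cluster first, then square.
    cluster_sums = defaultdict(float)
    for s, g in zip(scores, cluster):
        cluster_sums[g] += s
    lz_var = sum(s * s for s in cluster_sums.values()) / sxx ** 2
    return b, math.sqrt(ehw_var), math.sqrt(lz_var)


# Hypothetical data: 50 villages, 20 workers each. Schooling and the wage
# residual both have a village-level component.
random.seed(0)
schooling, log_wage, village = [], [], []
for g in range(50):
    school_g, shock_g = random.gauss(0, 1), random.gauss(0, 1)
    for _ in range(20):
        s = 10 + school_g + random.gauss(0, 1)
        schooling.append(s)
        log_wage.append(1.0 + 0.08 * s + 0.3 * shock_g + random.gauss(0, 0.3))
        village.append(g)

b, ehw_se, lz_se = slope_with_ses(schooling, log_wage, village)
print(f"slope={b:.3f}  EHW se={ehw_se:.4f}  LZ se={lz_se:.4f}")
```

The two numbers differ only in whether the scores are summed within villages before squaring. Seeing that they differ does not by itself tell you which one to report; that still depends on how the sample was drawn and which population you want to talk about.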

Special case: even when the sampling is clustered, the EHW and LZ standard errors will be the same if there is no heterogeneity in the treatment effects.

Sidenote 1: this also reminds me of the nearest-neighbor matching command nnmatch of Abadie (with a different et al.), where you can get the narrower SATE standard errors for the sample, or the wider PATE standard errors for the population.

Sidenote 2: This reason is hardly ever a rationale for clustering in an impact evaluation. But Rosenzweig and Udry’s paper on external validity does make the point that we only observe treatment effects for specific points in time, and that if we want to say something more general about how our treatment behaves at other points in time, we need wider standard errors than we use for just saying something about our specific sample. This is closely related to the point here about being very clear about what your estimand is.

The Experimental Design Reason for Clustering
The second reason for clustering is the one we are probably more familiar with, which is when clusters of units, rather than individual units, are assigned to a treatment. Let’s take the same equation as above, but assume that we have a binary treatment that assigns more schooling to people. So now we have:
Log(wages) = a + b*Treatment + e

Then if the treatment is assigned at the individual level, there is no need to cluster (*). There has been much confusion about this, as Chris Blattman explored in two earlier posts about this issue (the fabulously titled clusterjerk and clusterjerk the sequel), and I still occasionally get referees suggesting I try clustering by industry or something similar in an individually-randomized experiment. This Abadie et al. paper is now finally a good reference to explain why this is not necessary.
(*) unless you are using multiple time periods, and then you will want to cluster by individual, since the unit of randomization is individual, and not individual-time period.

What if your treatment is assigned at the village level? Then cluster by village. This is also why you want to cluster difference-in-differences at the state level when you have a source of variation that comes from differences across states, and why a “treatment” like being on one side of a border vs the other is problematic (because you have only 2 clusters).
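A small simulation can illustrate why village-level assignment calls for village-level clustering. The sketch below is my own illustration, with hypothetical numbers: it fixes a population of 40 villages with 25 people each, sets the true treatment effect to zero, and re-randomizes which 20 villages are treated 300 times. It then compares how much the estimate actually varies across re-randomizations with the average EHW and LZ standard errors (computed without small-sample corrections).

```python
import math
import random
import statistics
from collections import defaultdict


def diff_in_means_with_ses(treat, y, cluster):
    """Regress y on a 0/1 treatment; return (estimate, EHW se, LZ se)."""
    n = len(y)
    t_bar, y_bar = sum(treat) / n, sum(y) / n
    td = [t - t_bar for t in treat]
    stt = sum(v * v for v in td)
    b = sum(v * (yi - y_bar) for v, yi in zip(td, y)) / stt
    a = y_bar - b * t_bar
    scores = [v * (yi - a - b * t) for v, yi, t in zip(td, y, treat)]
    ehw_var = sum(s * s for s in scores) / stt ** 2
    sums = defaultdict(float)
    for s, g in zip(scores, cluster):
        sums[g] += s
    lz_var = sum(s * s for s in sums.values()) / stt ** 2
    return b, math.sqrt(ehw_var), math.sqrt(lz_var)


random.seed(1)
n_villages, n_per_village = 40, 25

# A fixed, hypothetical population: outcomes share a common village shock,
# and the true treatment effect is zero.
village, y = [], []
for g in range(n_villages):
    shock = random.gauss(0, 1)
    for _ in range(n_per_village):
        village.append(g)
        y.append(shock + random.gauss(0, 1))

estimates, ehw_ses, lz_ses = [], [], []
for _ in range(300):
    # Assign treatment to half of the villages, not to individuals.
    treated = set(random.sample(range(n_villages), n_villages // 2))
    treat = [1 if g in treated else 0 for g in village]
    b, ehw_se, lz_se = diff_in_means_with_ses(treat, y, village)
    estimates.append(b)
    ehw_ses.append(ehw_se)
    lz_ses.append(lz_se)

actual_sd = statistics.stdev(estimates)
print(f"sd of estimate across re-randomizations: {actual_sd:.3f}")
print(f"average EHW se: {statistics.mean(ehw_ses):.3f}")
print(f"average LZ se:  {statistics.mean(lz_ses):.3f}")
```

Under these assumptions the average LZ standard error should sit close to the actual spread of the estimate, while the EHW standard error should be far too small, because it treats the 1,000 individuals as if they had been randomized separately.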

Adding fixed effects
What if we sample at the level of cities, but then add city fixed effects to our Mincer regression? Or we randomize at the city level, but add city fixed effects? Do we still need to cluster at the city level?
The authors note that there is a lot of confusion about using clustering with fixed effects. The general rule is that you still need to cluster if either the sampling or assignment to treatment was clustered. However, the authors show that, once fixed effects are included, the cluster adjustment will only change the standard errors if there is heterogeneity in treatment effects.
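The role of heterogeneity can be seen in a rough sketch (again my own illustration with made-up numbers, not the authors' code). It assumes 100 cities with 50 individuals each, a treatment that varies within city, and city fixed effects implemented by demeaning within city. The only thing that changes between the two runs is `effect_sd`, a hypothetical parameter for how much the treatment effect varies across cities.

```python
import math
import random
from collections import defaultdict


def fe_slope_with_ses(treat, y, city):
    """Slope on treat with city fixed effects; returns (b, EHW se, LZ se)."""
    # City fixed effects are equivalent to demeaning treat and y within city.
    t_sum, y_sum, count = defaultdict(float), defaultdict(float), defaultdict(int)
    for t, yi, c in zip(treat, y, city):
        t_sum[c] += t
        y_sum[c] += yi
        count[c] += 1
    td = [t - t_sum[c] / count[c] for t, c in zip(treat, city)]
    yd = [yi - y_sum[c] / count[c] for yi, c in zip(y, city)]
    stt = sum(v * v for v in td)
    b = sum(v * w for v, w in zip(td, yd)) / stt
    scores = [v * (w - b * v) for v, w in zip(td, yd)]
    ehw_var = sum(s * s for s in scores) / stt ** 2
    sums = defaultdict(float)
    for s, c in zip(scores, city):
        sums[c] += s
    lz_var = sum(s * s for s in sums.values()) / stt ** 2
    return b, math.sqrt(ehw_var), math.sqrt(lz_var)


def simulate(effect_sd, seed, n_cities=100, n_per_city=50):
    """Treatment varies within city; the effect varies across cities with
    standard deviation effect_sd (0 means a homogeneous effect)."""
    rng = random.Random(seed)
    treat, y, city = [], [], []
    for c in range(n_cities):
        city_level = rng.gauss(0, 1)                  # absorbed by the city FE
        city_effect = 0.5 + rng.gauss(0, effect_sd)   # city-specific effect
        for _ in range(n_per_city):
            t = rng.randint(0, 1)
            treat.append(t)
            city.append(c)
            y.append(city_level + city_effect * t + rng.gauss(0, 1))
    return fe_slope_with_ses(treat, y, city)


for label, effect_sd in [("homogeneous effect", 0.0),
                         ("heterogeneous effect", 1.0)]:
    b, ehw_se, lz_se = simulate(effect_sd, seed=2)
    print(f"{label}: b={b:.3f}  EHW se={ehw_se:.4f}  LZ se={lz_se:.4f}")
```

With a homogeneous effect the two standard errors should come out close to each other, and with city-specific effects the LZ standard error should be noticeably larger. This only shows when the two diverge; whether the clustered number is the one you need still depends on whether the sampling or the assignment was clustered.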

How to cluster?
This is largely a paper about when to cluster, not how to cluster. There is of course a whole other debate about when you can rely on asymptotics, vs bootstrapping, vs randomization inference approaches. They show with asymptotic approximations that the standard Liang-Zeger cluster adjustment is generally conservative, and offer an alternative cluster-adjusted variance estimator that can be used if there is variation in treatment assignment within clusters and you know the fraction of clusters sampled. But since with the sample sizes used in many experiments the concern is now that asymptotic standard errors may not be conservative enough, you should be careful about using such an adjustment with typical sample sizes.


David McKenzie

Lead Economist, Development Research Group, World Bank

October 16, 2017

a very useful source

Dimitriy Masterov
October 17, 2017

This is an excellent summary of this paper. I have a follow-up question about DDD. In Jeff Wooldridge's Econometric Analysis (2nd edition), he gives an example of a difference-in-difference-in-differences (DDD) estimator on page 151 for the two period case where state B implements a health care policy change aimed at the elderly. If I want to extrapolate to what this would do to the elderly in other states, how should I cluster?

October 18, 2017

This is a fantastic blog. (When to) Cluster made easy. Thank you!

February 13, 2018

This is incredibly useful--and the first paragraph is a work of art! Thanks.

March 17, 2018

Thanks for this and all your other posts - they have been a great help!

November 21, 2018

Thank you for the very comprehensible summary. So rare to find something which non-econometricians can understand.

May 21, 2019

such an intuitively communicated summary!

May 22, 2019

Such a useful summary. Thanks

December 02, 2019

Thanks for sharing this, Dr. McKenzie. This blog helps a lot when I am struggling with the clustered standard error in my paper.

December 12, 2019

Hi David, this is a very stupid question from me. Do you need to use clustered standard errors when running a regression on census data? Thanks

December 12, 2019

Hi Imam,
It depends on what sort of regression equation you are trying to estimate. See this paper by the same co-authors https://www.nber.org/papers/w20325 which discusses how to interpret standard errors in census data. Basically, if you are trying to estimate a causal effect, and the source of variation is at a clustered level then you still need to cluster. E.g. if you use U.S. census data on wages, and examine state-level minimum wage policies, your hypothetical experiment is one in which treatment is varying at the state level, and so you would still cluster standard errors by state.

May 03, 2021

Dear David,
Thanks for all your explanations, a great service to the community. Following up on your example, I assume that the example you had in mind (in the above answer) was cross-sectional? Should we cluster at the state-period level if this example were in a panel format?

May 03, 2021

Hi Carlo. The classic “How much should we trust differences-in-differences estimates?” paper (https://economics.mit.edu/files/750) makes clear the need to cluster at the state level, not the state-period level, when you have panel data. This is because the policies are typically not randomly re-assigned each period, but instead are correlated over time in a state. If somehow you were in a setting where every year states all randomly chose their policies, completely ignoring what their policies were in the years before, then clustering at the state-period level would make sense. I cannot think of any applications where this would be the case.

February 23, 2021

Hi David,

Thanks for this summary!

I was wondering, what happens when both cases appear at once:
I am working with microdata where the sample was selected by randomly sampling clusters (=villages) and then selecting respondents within these clusters. I would therefore cluster SEs at the village/cluster level.

But I run a diff-in-diff and evaluate a policy that was implemented at district-level, which would suggest clustering at district-level.

Is there a suggestion on how to proceed in such a case?

February 23, 2021

I would use the sampling weights to reweight the data if you want it to be representative of a broader population rather than just the sample you have, but then since your question of interest is a causal one relying on a district-level policy, cluster at the district level.

February 24, 2021

Great, thanks for your answer!

June 01, 2021

Hi David, thanks for a very nice summary.
You state about clustering at the village level:
"This is also why you want to cluster difference-in-differences at the state-level when you have a source of variation that comes from differences across states, and why a “treatment” like being on one side of a border vs the other is problematic (because you have only 2 clusters)."

Are you implying you wouldn't cluster at the state level if you have only 2 clusters? I am performing a policy evaluation where I run my regression with only 2 states (one treated, one non-treated), and as of now I am getting really small standard errors.

August 18, 2021

Very useful! Thanks for sharing!

March 14, 2022

Dear David! Thanks for your post. I use a DiD on the European Social Survey, where I build a treatment and a control group from (external) country characteristics. However, if I cluster my standard errors by country, my estimates lose significance and the standard errors increase by up to a factor of 40. Can you recommend a good paper on how to interpret such a difference? I have never experienced a case where clustering changed the results like that. Thanks and keep up this interesting blog!

March 14, 2022

Hi Martin,
This suggests a really high intra-cluster correlation. You could look and see whether this is the case - see e.g. https://blogs.worldbank.org/impactevaluations/tools-of-the-trade-intra-…

March 14, 2022

This is such a great summary! Thanks David. I have a question regarding clustering for DiD model. I am running a DiD regression for two periods and for two states but using ZIP code level data. The policy differs at the state level. Should I cluster standard errors at the zip code level? Thank you!

March 14, 2022

Hi Zoey, this is a tricky case, since your variation in policy is actually at the state level, which would suggest clustering at that level - but with only 2 states, this won't work. You are in a similar situation to the original minimum wage work - see the discussion here https://blogs.worldbank.org/impactevaluations/explaining-why-we-should-… and Roth (2022) - basically a design-based approach won't work, and you need to make modelling assumptions that there are no state-level shocks apart from the policy change - and just be clear about what you are assuming.