Beware of studies with a small number of clusters



While some of us get to conduct individually randomized trials, I’d say that cluster randomized trials are pretty much the norm in field experiments in economics. Add to that our recently acquired ambition to test interventions with multiple treatment arms (rather than one treatment and one control group), mix in a pinch of logistical and budgetary constraints, and we have a non-negligible number of trials with small numbers of clusters (schools, clinics, villages, etc.). While we’re generally aware that high intra-cluster correlation combined with a small number of clusters leads to low statistical power, it is likely the case that the confidence intervals in many such studies are still too narrow. In practice, this means that we’re getting too many parameter estimates that are statistically different from zero or from each other. Worse, I suspect that the packages that we use ex-ante to calculate statistical power are also too optimistic.

Before I get into the details, let me first say that none of this is very new. The paper I will mainly draw from here was published in 2008. The reason I want to talk about this issue here in 2012 is that (a) there are probably many studies out there that need finite sample corrections but don’t apply them and (b) there is likely a gap between what we do to estimate statistical power ex-ante vs. what we can now do ex-post. So, if you’re designing a study with a small number of clusters, say fewer than 30, analyzing cross-sectional or panel data that have this feature, or refereeing a paper that presents results from such a study, you have to pay a little more attention to make sure that the standard errors are correct.

In a paper (gated, ungated preprint here) published in the Review of Economics and Statistics in 2008 titled “Bootstrap-based improvements for inference with clustered errors,” Cameron, Gelbach, and Miller (CGM from hereon) nicely summarize the problem with OLS estimation when errors are correlated within a cluster (but independent across clusters). Ignoring such clustering usually underestimates the true standard errors, meaning that we tend to over-reject the null hypotheses. The common practice to deal with this problem, which is very well known and almost universally adopted in economics, is to use what are known in statistics as sandwich estimators, which permit the errors to be heteroskedastic and also to be correlated quite flexibly with each other within clusters. One likely reason that this is so commonly used is the fact that statistical packages like Stata or SAS have commands that implement these adjustments.

However, there is a small problem that comes with this convenient and flexible approach: it is technically only correct as the number of clusters approaches infinity. That’s not as bad as it sounds: in practice, having 30 to 40 clusters is like approaching infinity (in fact, even when the number of clusters is not small, this standard error can still be biased, but there are corrections suggested in the literature, such as jackknife estimation).

So far so good: if you designed a cluster randomized trial (or are analyzing clustered data – cross-sectional or panel) with a sufficient number of clusters, you can use standard commands in Stata, such as “cluster” or “jackknife” to calculate cluster-robust variance estimates. You can also, with confidence, conduct two-sided Wald tests of the kind: H0: β = 0; Ha: β ≠ 0. But if the number of clusters is too small, because this Wald statistic is correct only asymptotically, the critical values used for rejecting the null hypothesis will be a poor approximation to the correct finite sample critical values for this test. This is true “even if an unbiased variance matrix estimator is used in calculating s_β̂” (CGM, p. 416).

This is a problem. Why? CGM describe it clearly: “In practice, as a small-sample correction some programs use a T-distribution to form critical values and p-values. STATA uses the T(G - 1) distribution, which may be better than the standard normal, but may still not be conservative enough to avoid over-rejection.” At least one researcher I talked to confirmed this to be the case in her data: in their study (number of clusters less than 30), moving from cluster-robust standard errors to using a T-distribution made the standard errors larger but nowhere near what they became once they used the bootstrap correction procedure suggested by CGM.
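To get a feel for how much these reference distributions differ, here is a minimal Python sketch (an illustration, not taken from CGM) that computes two-sided 5% critical values under the standard normal, T(G-1), and T(G-2) for a few illustrative cluster counts. The helper names `t_pdf`, `t_cdf`, and `t_critical` are hypothetical, and the t quantile is found by numerical integration plus bisection so that only the standard library is needed.

```python
import math
from statistics import NormalDist


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x, df, steps=1000):
    """CDF for x >= 0: Simpson's rule on [0, x] plus 0.5 by symmetry."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_critical(df, alpha=0.05):
    """Two-sided critical value: the (1 - alpha/2) quantile, by bisection."""
    lo, hi = 0.0, 50.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < 1 - alpha / 2:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


z = NormalDist().inv_cdf(0.975)
# Columns: clusters G, normal, T(G-1), T(G-2) with k = 2 regressors.
for G in (6, 10, 30):
    print(G, round(z, 3), round(t_critical(G - 1), 3), round(t_critical(G - 2), 3))
```

The gap between the normal and the t critical values is sizable with six clusters and nearly gone with thirty, which is one way to read the "30 to 40 clusters" rule of thumb. It says nothing about whether either t distribution is conservative enough, which is the point CGM raise.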

CGM focus on a bootstrap-t procedure (apparently emphasized by theoretical econometricians and statisticians), in particular what they call a wild cluster bootstrap-t procedure. It’s a bootstrap that relaxes some restrictions of the more obvious resampling with replacement procedures and the details are in their paper, which is very clearly written and quite accessible to a non-specialist like me. They show, using Monte Carlo simulations as well as real data, that this procedure performs quite well even when the number of clusters is as few as six. And, not to worry, someone made sure to write the Stata program to implement CGM’s wild cluster bootstrap-t procedure, called cgmwildboot.ado. So, if you have a study with too few clusters, you can use it to correct your standard errors (if you’re a referee of such a paper, you can suggest that the authors utilize it if they have not). The paper has already been cited hundreds of times in the past four years according to Google Scholar (and 49 times according to CrossRef).
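For readers who want to see the mechanics, below is a rough Python sketch of a wild cluster bootstrap-t test of H0: β = 0 in a regression of an outcome on a constant and one regressor. It is a simplified illustration under stated assumptions, not a port of cgmwildboot.ado: it imposes the null when generating bootstrap samples, uses Rademacher (+1/-1) weights drawn once per cluster, and omits finite-sample scaling of the variance (which cancels when comparing t-statistics). The function names and the simulated data are hypothetical.

```python
import random


def cluster_t(y, x, cluster):
    """OLS slope of y on x (with intercept) and its cluster-robust t-statistic."""
    n = len(y)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    alpha = y_bar - beta * x_bar
    # Sum the score (x_i - x_bar) * residual_i within each cluster.
    scores = {}
    for yi, xi, g in zip(y, x, cluster):
        resid = yi - alpha - beta * xi
        scores[g] = scores.get(g, 0.0) + (xi - x_bar) * resid
    # Sandwich standard error of the slope, without finite-sample scaling.
    se = sum(s * s for s in scores.values()) ** 0.5 / sxx
    return beta, beta / se


def wild_cluster_boot_p(y, x, cluster, reps=999, seed=0):
    """Two-sided bootstrap p-value for H0: slope = 0."""
    rng = random.Random(seed)
    _, t_obs = cluster_t(y, x, cluster)
    # Impose the null: the restricted model is y = constant + error.
    y_bar = sum(y) / len(y)
    resid0 = [yi - y_bar for yi in y]
    groups = sorted(set(cluster))
    extreme = 0
    for _ in range(reps):
        # One Rademacher weight per cluster keeps within-cluster correlation.
        w = {g: rng.choice((-1.0, 1.0)) for g in groups}
        y_star = [y_bar + w[g] * r for r, g in zip(resid0, cluster)]
        _, t_star = cluster_t(y_star, x, cluster)
        if abs(t_star) >= abs(t_obs):
            extreme += 1
    return extreme / reps


def simulate(n_clusters=10, m=20, effect=1.0, icc=0.3, seed=1):
    """Toy clustered data: treatment alternates by cluster (illustrative only)."""
    rng = random.Random(seed)
    y, x, cluster = [], [], []
    for g in range(n_clusters):
        treat = float(g % 2)
        u = rng.gauss(0, icc ** 0.5)  # shock shared by the whole cluster
        for _ in range(m):
            y.append(effect * treat + u + rng.gauss(0, (1 - icc) ** 0.5))
            x.append(treat)
            cluster.append(g)
    return y, x, cluster


y, x, cluster = simulate(n_clusters=6, effect=0.5)
print(wild_cluster_boot_p(y, x, cluster))
```

One thing the sketch makes visible: with G clusters, Rademacher weights can produce only 2^G distinct bootstrap samples, so with very few clusters the bootstrap p-value can take only a limited set of values.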

So, even if you have a study with a small number of clusters, there is no excuse to not calculate the standard errors correctly. But that’s little solace when these corrections make your standard errors so large that you can’t detect even large and economically meaningful program effects with any sort of statistical confidence. What would prevent that from happening is to have ex-ante power calculations that told you this would be the case so that you designed the study with more clusters or designed it differently (fewer treatment arms or more baseline data predictive of the outcomes, etc.).

However, my sense is that the packages that we use to do power calculations use T- rather than normal distributions, which, as discussed above, may not be conservative enough. In such circumstances, it may be wise to use a T-distribution with lower degrees of freedom. CGM talk about using T(G-k) (where k is the number of regressors, usually equal to two – the constant and the regressor of interest). To force yourself to be conservative, you could also increase your guess of the intra-cluster correlation coefficient that goes into these calculations. It seems better to spend marginally more at the outset to have more clusters (or to collect more data) than to end up having not enough power to detect meaningful impacts at the end.
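To see how sensitive a design is to the guessed intra-cluster correlation, a quick sketch using the standard design effect, 1 + (m - 1)ρ for m observations per cluster, can help. The function names and the numbers (20 clusters of 50 observations) are purely illustrative assumptions.

```python
def design_effect(m, icc):
    """Variance inflation from clustering with m observations per cluster."""
    return 1 + (m - 1) * icc


def effective_sample_size(n_clusters, m, icc):
    """Number of independent observations the clustered sample is worth."""
    return n_clusters * m / design_effect(m, icc)


# 20 clusters of 50 observations under three guesses of the ICC.
for icc in (0.05, 0.10, 0.20):
    print(icc, round(effective_sample_size(20, 50, icc), 1))
```

In this example, doubling the guessed ICC from 0.05 to 0.10 cuts the effective sample size from about 290 to about 170, so a pessimistic guess builds in a real margin.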

As I am not at all an expert on the topic, I’d like readers who are more adept at statistics and econometrics to chime in, especially concerning the power calculations. Is it true that we don’t have the tools available to us to correctly calculate the number of clusters needed for a study – assuming that we are correctly guessing the other necessary parameters, like the intra-cluster correlation? If not, can you point us in the right direction?


Berk Özler

Lead Economist, Development Research Group, World Bank

June 21, 2012

I agree that ex-ante power calculations are vital in just these circumstances, but generally not used as much as they should be. Just out of curiosity. What packages are you using to calculate power? Stata's sampsi + sampclus command? Optimal Design?

June 22, 2012

Would it not be feasible to estimate power using a simulation approach? Or is it the case that the distribution of errors tends to be so weird and idiosyncratic that you really wouldn't know what sort of distribution to assume for the errors?

June 22, 2012

Good points, Berk. It got me to think through some things that I've posted to my blog (blog is accessed via the link associated with my name above).

I don't quite share the enthusiasm regarding bootstrapped methods. The accuracy of the bootstrap depends on how representative your sample is of the target population, and so this too is only reliable with large sample sizes. So, it's not generally the case that bootstrapped confidence intervals are a conservative alternative to analytic ones. It depends, although in large samples they should agree so long as the analytic one is a good approximation (which is the case in this context; see my blog post for more details). Also, if you notice, in Cameron, Gelbach, and Miller (2008), the results when using t(G-k) were actually pretty good, at least for the case where they looked at them (see Table 4, estimator 4, column 2). This is very close to what Stata does by default, as you mention.

I agree that a fruitful thing to do would be to adjust the reference distribution used in canned power calc routines--e.g., using t(G-k); sampclus doesn't make this adjustment as far as I understand. Also, there are ways to incorporate simulation---e.g. by simulating sampling and randomization designs with different sample sizes on auxiliary data (e.g., census data or from other large studies). This is an underused approach to evaluating design alternatives.
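A bare-bones version of the simulation idea might look like the Python sketch below. It is an assumption-laden illustration, not the auxiliary-data approach described above: outcomes are drawn from a normal random-effects model with unit total variance, half the clusters are treated, and each simulated trial is analyzed with a two-sample t-test on cluster means. The critical value is passed in by the user (2.306 is the 97.5th percentile of a t distribution with 8 degrees of freedom, i.e. G - 2 for G = 10). All names are hypothetical.

```python
import random
from statistics import fmean, variance


def simulated_power(n_clusters, m, effect, icc, crit, sims=2000, seed=0):
    """Share of simulated trials in which the cluster-level t-test rejects."""
    rng = random.Random(seed)
    half = n_clusters // 2
    rejections = 0
    for _ in range(sims):
        means = []
        for g in range(n_clusters):
            u = rng.gauss(0, icc ** 0.5)  # shock shared within the cluster
            ys = [u + rng.gauss(0, (1 - icc) ** 0.5) for _ in range(m)]
            # The first `half` clusters are treated; effect is in SD units.
            means.append(fmean(ys) + (effect if g < half else 0.0))
        treated, control = means[:half], means[half:]
        pooled = ((len(treated) - 1) * variance(treated)
                  + (len(control) - 1) * variance(control)) / (n_clusters - 2)
        se = (pooled * (1 / len(treated) + 1 / len(control))) ** 0.5
        if abs(fmean(treated) - fmean(control)) / se > crit:
            rejections += 1
    return rejections / sims


# 10 clusters of 20, a 0.5 SD effect, ICC of 0.2, t(G-2) critical value.
print(simulated_power(10, 20, effect=0.5, icc=0.2, crit=2.306))
```

Swapping the normal draws for resampled clusters from census or survey data, and the t-test for whatever estimator the study will actually use, would bring this closer to the design evaluation described in the comment.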

Berk Ozler
June 22, 2012

Hi Michael,

Thanks for the comment/question. I have primarily used OD in the past (as most of my studies are multi-site randomized cluster trials these days) but I also use Stata from time to time.


Berk Ozler
June 22, 2012

Hi Doug,

Thanks for the comment/question. I think that Cyrus Samii agrees with you in his post (please click on his name below to take you to his very useful blog).


Berk Ozler
June 22, 2012

Hi Cyrus,

Thanks so much for the comments and the excellent post on your blog in response to our post. I found it very useful, so have some on Twitter, and I will link to it in the Friday links.

I will keep both the t(G-k) and the simulation approaches in mind next time I am doing power calculations (although I have to admit that most of the studies I am involved in designing draw stratified random samples from the target population and are large enough to not warrant small sample corrections). One question, however:

You mention that you're not a big fan of the bootstrap methods and that if you play with cgmwildboot.ado, you can sometimes get narrower CIs than if you were to use "regress Y X, cluster(ID)". Does this mean that CGM did not present enough examples in their paper (they do Monte Carlos and a couple of replications) or is your objection independent of their work (perhaps something to do with representative samples from the target population as opposed to a convenience sample)?

Thanks again for the response. Cheers,


June 25, 2012

I think we need more examples than what CGM present. I am mostly interested in confidence intervals, not null hypothesis tests, and so we have to make assumptions about how their results for rejection rates on null hypothesis tests will translate with respect to confidence interval coverage. By my understanding, the key results for extrapolating to intervals are the wild cluster-se results, for which the nominal test sizes are sometimes very low. Also, only on Table 4 is the t(C-k) procedure evaluated as far as I can tell, and there, the test with CR3 has good performance, but again, I can't be sure how well this translates to the intervals scenario. It would be better to have more direct evidence for the coverage properties of intervals.

Some other relevant points were raised in a detailed comment on my blog post as well---may be of interest.

Aprajit Mahajan
June 28, 2012

Two other approaches (that we used in a recent paper) are:

(a) using permutation tests -- these will have exact size irrespective of the number of clusters (or the number of treatment units within a cluster). In addition, permutation tests impose very weak conditions on the correlation across treatment units within a cluster, which is something else that is sometimes a concern in cluster trials. There is also recent work that sets out the use of permutation tests in a multiple outcome setting in a cluster randomized trial (Shaikh and Soo (2012)).

(b) using a procedure proposed by Ibragimov and Mueller that also works for any number of clusters but requires that there be a reasonable number of observations within a cluster that are not too strongly correlated with each other. their critical values are from a t-distribution with d.f. equal to min(T clusters,C clusters)

Also, the wild bootstrap can also be used in models (e.g. probit or logit) without conventional residuals
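As a concrete illustration of approach (a), here is a minimal Python sketch of an exact permutation test on cluster means. It assumes complete randomization with a fixed number of treated clusters and tests the sharp null of no effect for any cluster; a stratified or blocked design would need to restrict the enumeration accordingly. The function name and the toy numbers are hypothetical.

```python
from itertools import combinations


def permutation_p(cluster_means, treated_idx):
    """Exact two-sided p-value for the sharp null of no effect on any cluster."""
    n = len(cluster_means)
    k = len(treated_idx)

    def diff(idx):
        chosen = set(idx)
        t = [cluster_means[i] for i in chosen]
        c = [cluster_means[i] for i in range(n) if i not in chosen]
        return sum(t) / len(t) - sum(c) / len(c)

    observed = abs(diff(treated_idx))
    # Enumerate every way the k treated clusters could have been drawn.
    draws = [abs(diff(idx)) for idx in combinations(range(n), k)]
    # Small tolerance so ties with the observed value are counted.
    return sum(d >= observed - 1e-12 for d in draws) / len(draws)


# Six clusters, the first three treated: 2 of the 20 possible
# assignments are at least this extreme, so the p-value is 0.1.
print(permutation_p([5, 6, 7, 1, 2, 3], treated_idx=(0, 1, 2)))
```

The example also shows the flip side of exactness: with three treated clusters out of six, the smallest attainable two-sided p-value is 0.1, so a 5% test can never reject.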

June 30, 2012

I discuss limitations of permutation tests in the "power for cluster randomized studies" blog post that I posted in response to Berk's post above. There is also an older post that included an exchange with Guido Imbens on this issue (see especially the comments). I am not yet convinced that they provide a solution to the problem, given that the "sharp null" is not usually what we want to test. However, there is current research happening on adjusted permutation tests that may provide more flexible solutions.

Procedure (b) sounds equivalent to a conservative Welch approximation for heteroskedastic normal data, a point that was raised in a comment to the "power for cluster randomized studies" blog post. It may nonetheless be a good alternative to t(C-k).

Weng Balino
June 12, 2019

What is the rationale that we use at least 30 clusters?