While some of us get to conduct individually randomized trials, I’d say that cluster randomized trials are pretty much the norm in field experiments in economics. Add to that our recently acquired ambition to test interventions with multiple treatment arms (rather than one treatment and one control group), mix in a pinch of logistical and budgetary constraints, and we have a non-negligible number of trials with small numbers of clusters (schools, clinics, villages, etc.). While we’re generally aware that high intra-cluster correlation combined with a small number of clusters leads to low statistical power, it is likely that the confidence intervals in many such studies are still too narrow. In practice, this means that we’re getting too many parameter estimates that are statistically different from zero or from each other. Worse, I suspect that the packages we use ex-ante to calculate statistical power are also too optimistic.
Before I get into the details, let me first say that none of this is very new. The paper I will mainly draw from here was published in 2008. The reason I want to talk about this issue here in 2012 is that (a) there are probably many studies out there that need finite-sample corrections but don’t apply them and (b) there is likely a gap between what we do to estimate statistical power ex-ante vs. what we can now do ex-post. So, if you’re designing a study with a small number of clusters, say fewer than 30, analyzing cross-sectional or panel data that have this feature, or refereeing a paper that presents results from such a study, you have to pay a little more attention to make sure that the standard errors are correct.
In a paper (gated, ungated preprint here) published in the Review of Economics and Statistics in 2008 titled “Bootstrap-based improvements for inference with clustered errors,” Cameron, Gelbach, and Miller (CGM from here on) nicely summarize the problem with OLS estimation when errors are correlated within a cluster (but independent across clusters). Ignoring such clustering usually underestimates the true standard errors, meaning that we tend to over-reject the null hypotheses. The common practice to deal with this problem, which is very well known and almost universally adopted in economics, is to use what are known in statistics as sandwich estimators, which allow the errors to be heteroskedastic and to be correlated quite flexibly with each other within clusters. One likely reason that this approach is so commonly used is that statistical packages like Stata or SAS have commands that implement these adjustments.
However, there is a small problem that comes with this convenient and flexible approach: it is technically only correct as the number of clusters approaches infinity. That’s not as bad as it sounds: in practice, having 30 to 40 clusters is like approaching infinity (in fact, even when the number of clusters is not small, this standard error can still be biased, but there are corrections suggested in the literature, such as jackknife estimation).
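To make the sandwich estimator mentioned above a bit more concrete, here is a minimal sketch in plain Python. It is my own illustration under stated assumptions, not code from CGM or from any statistical package: the data are simulated with no true treatment effect, treatment is assigned at the cluster level, the regression has only a constant and a treatment dummy, and the function names (simulate, slope_and_ses) are made up for this example. The small-sample factor G/(G-1) × (N-1)/(N-k) is the one Stata applies to its cluster-robust variance.

```python
import math
import random

def simulate(n_clusters=20, cluster_size=30, icc=0.5, seed=1):
    """Simulate outcomes with a cluster-level shock and no true treatment effect."""
    rng = random.Random(seed)
    rows = []  # each row: (cluster id, treatment dummy, outcome)
    for g in range(n_clusters):
        treat = g % 2  # alternate clusters between control and treatment
        shock = rng.gauss(0, math.sqrt(icc))  # shared by everyone in cluster g
        for _ in range(cluster_size):
            rows.append((g, treat, shock + rng.gauss(0, math.sqrt(1 - icc))))
    return rows

def slope_and_ses(rows):
    """OLS of y on a constant and x: slope, naive SE, cluster-robust SE."""
    n = len(rows)
    xbar = sum(x for _, x, _ in rows) / n
    ybar = sum(y for _, _, y in rows) / n
    sxx = sum((x - xbar) ** 2 for _, x, _ in rows)
    b = sum((x - xbar) * (y - ybar) for _, x, y in rows) / sxx
    a = ybar - b * xbar

    # Naive variance assumes independent, homoskedastic errors: s^2 / Sxx.
    s2 = sum((y - a - b * x) ** 2 for _, x, y in rows) / (n - 2)
    se_naive = math.sqrt(s2 / sxx)

    # Cluster-robust "meat": sum the scores (x - xbar) * residual within each
    # cluster first, then square, so within-cluster correlation is retained.
    score = {}
    for g, x, y in rows:
        score[g] = score.get(g, 0.0) + (x - xbar) * (y - a - b * x)
    n_g = len(score)
    meat = sum(s * s for s in score.values())
    correction = (n_g / (n_g - 1)) * ((n - 1) / (n - 2))  # k = 2 regressors
    se_cluster = math.sqrt(correction * meat) / sxx
    return b, se_naive, se_cluster

b, se_naive, se_cluster = slope_and_ses(simulate())
print(f"slope={b:.3f}  naive SE={se_naive:.3f}  cluster-robust SE={se_cluster:.3f}")
```

With a high intra-cluster correlation like the one assumed here, the cluster-robust standard error should come out well above the naive one. That gap is exactly the under-estimation that ignoring clustering produces.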
So far so good: if you designed a cluster randomized trial (or are analyzing clustered data, cross-sectional or panel) with a sufficient number of clusters, you can use standard commands in Stata, such as “cluster” or “jackknife”, to calculate cluster-robust variance estimates. You can also, with confidence, conduct two-sided Wald tests of the kind: H0: β = 0; Ha: β ≠ 0. But if the number of clusters is too small, because this Wald statistic is correct only asymptotically, the critical values used for rejecting the null hypothesis will be a poor approximation to the correct finite-sample critical values for this test. This is true “even if an unbiased variance matrix estimator is used in calculating s_β̂” (CGM, p. 416).
This is a problem. Why? CGM describe it clearly: “In practice, as a small-sample correction some programs use a T-distribution to form critical values and p-values. STATA uses the T(G - 1) distribution, which may be better than the standard normal, but may still not be conservative enough to avoid over-rejection.” At least one researcher I talked to confirmed this to be the case in her data: in their study (fewer than 30 clusters), moving from cluster-robust standard errors to using a T-distribution made the standard errors larger, but nowhere near what they became once they used the bootstrap correction procedure suggested by CGM.
CGM focus on a bootstrap-t procedure (apparently emphasized by theoretical econometricians and statisticians), in particular what they call a wild cluster bootstrap-t procedure. It’s a bootstrap that relaxes some restrictions of the more obvious resampling-with-replacement procedures, and the details are in their paper, which is very clearly written and quite accessible to a non-specialist like me. They show, using Monte Carlo simulations as well as real data, that this procedure performs quite well even with as few as six clusters. And, not to worry, someone made sure to write the Stata program to implement CGM’s wild cluster bootstrap-t procedure, called cgmwildboot.ado. So, if you have a study with too few clusters, you can use it to correct your standard errors (if you’re a referee of such a paper, you can suggest that the authors use it if they have not). The paper has already been cited hundreds of times in the past four years according to Google Scholar (and 49 times according to CrossRef).
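For those who want to see the moving parts, here is a rough sketch of a wild cluster bootstrap-t test in plain Python. It is my own simplified illustration, not the cgmwildboot.ado code. It assumes a regression with just a constant and one regressor, tests H0: β = 0 with the null imposed when forming residuals, uses Rademacher weights (+1 or -1 with equal probability, one draw per cluster), and reports the simple share of bootstrap t-statistics at least as large in absolute value as the original one. The function names and the simulated village data are hypothetical.

```python
import math
import random

def cluster_t(ids, x, y):
    """Slope and cluster-robust t-statistic from OLS of y on a constant and x."""
    n = len(y)
    xbar = sum(x) / n
    ybar = sum(y) / n
    xd = [xi - xbar for xi in x]
    sxx = sum(v * v for v in xd)
    b = sum(v * (yi - ybar) for v, yi in zip(xd, y)) / sxx
    # Sum the scores (x - xbar) * residual within each cluster.
    score = {}
    for g, v, yi in zip(ids, xd, y):
        resid = (yi - ybar) - b * v
        score[g] = score.get(g, 0.0) + v * resid
    n_g = len(score)
    meat = sum(s * s for s in score.values())
    correction = (n_g / (n_g - 1)) * ((n - 1) / (n - 2))
    se = math.sqrt(correction * meat) / sxx
    return b, b / se

def wild_cluster_boot_p(ids, x, y, reps=999, seed=0):
    """Two-sided bootstrap p-value for H0: slope = 0 (null imposed)."""
    rng = random.Random(seed)
    _, t_obs = cluster_t(ids, x, y)
    ybar = sum(y) / len(y)
    restricted = [yi - ybar for yi in y]  # residuals from the constant-only model
    clusters = sorted(set(ids))
    exceed = 0
    for _ in range(reps):
        # One Rademacher draw per cluster flips the sign of all its residuals.
        w = {g: rng.choice((-1.0, 1.0)) for g in clusters}
        y_star = [ybar + w[g] * u for g, u in zip(ids, restricted)]
        _, t_star = cluster_t(ids, x, y_star)
        if abs(t_star) >= abs(t_obs):
            exceed += 1
    return exceed / reps

# Hypothetical example: 10 villages of 20 people, treatment assigned by village.
rng = random.Random(42)
ids, x, y = [], [], []
for g in range(10):
    shock = rng.gauss(0, 0.5)  # village-level shock
    for _ in range(20):
        ids.append(g)
        x.append(g % 2)
        y.append(0.3 * (g % 2) + shock + rng.gauss(0, 1))

b, t = cluster_t(ids, x, y)
p = wild_cluster_boot_p(ids, x, y)
print(f"slope={b:.3f}  t={t:.2f}  wild cluster bootstrap p-value={p:.3f}")
```

The point of flipping signs cluster by cluster is that the within-cluster correlation pattern of the residuals is preserved in every bootstrap sample. The t-statistic is then compared with its own bootstrap distribution rather than with a normal or T(G - 1) table.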
So, even if you have a study with a small number of clusters, there is no excuse not to calculate the standard errors correctly. But that’s little solace when these corrections make your standard errors so large that you can’t detect even large and economically meaningful program effects with any sort of statistical confidence. What would prevent that from happening is to have ex-ante power calculations that told you this would be the case, so that you designed the study with more clusters or designed it differently (fewer treatment arms or more baseline data predictive of the outcomes, etc.).
However, my sense is that the packages that we use to do power calculations use T- rather than normal distributions, which, as discussed above, may not be conservative enough. In such circumstances, it may be wise to use a T-distribution with lower degrees of freedom. CGM talk about using T(G-k), where k is the number of regressors, usually equal to two: the constant and the regressor of interest. To force yourself to be conservative, you could also increase your guess of the intra-cluster correlation coefficient that goes into these calculations. It seems better to spend marginally more at the outset to have more clusters (or to collect more data) than to end up without enough power to detect meaningful impacts at the end.
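To get a feel for how much these choices matter, here is a back-of-the-envelope sketch in plain Python. It is an illustration only: I am assuming the textbook minimum detectable effect (MDE) formula for a two-arm cluster randomized design with equal cluster sizes and no covariates, MDE = (t_{1-α/2} + t_{power}) × σ × sqrt[(ρ + (1-ρ)/m) / (P(1-P)G)], where G is the number of clusters, m the cluster size, ρ the intra-cluster correlation, and P the share of clusters treated. The T quantiles are computed numerically so that the sketch needs nothing beyond the standard library, and the function names are made up for this example.

```python
import math

def t_cdf(t, df, steps=2000):
    """Student's t CDF via Simpson's rule on the density (fine for a sketch)."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    c = math.exp(log_c)

    def pdf(z):
        return c * (1 + z * z / df) ** (-(df + 1) / 2)

    upper = abs(t)
    h = upper / steps
    total = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3  # integral of the density from 0 to |t|
    return 0.5 + area if t >= 0 else 0.5 - area

def t_quantile(p, df):
    """Invert the CDF by bisection."""
    lo, hi = -200.0, 200.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def mde(n_clusters, cluster_size, icc, df, alpha=0.05, power=0.80,
        sigma=1.0, share_treated=0.5):
    """Minimum detectable effect in units of the outcome's standard deviation."""
    multiplier = t_quantile(1 - alpha / 2, df) + t_quantile(power, df)
    variance = (icc + (1 - icc) / cluster_size) / (
        share_treated * (1 - share_treated) * n_clusters)
    return multiplier * sigma * math.sqrt(variance)

# Hypothetical design: 25 people per cluster; vary clusters, ICC, and the
# degrees of freedom used for the critical values.
for n_clusters in (6, 12, 30):
    for icc in (0.05, 0.10):
        near_normal = mde(n_clusters, 25, icc, df=1000)
        g_minus_1 = mde(n_clusters, 25, icc, df=n_clusters - 1)
        g_minus_k = mde(n_clusters, 25, icc, df=n_clusters - 2)
        print(f"G={n_clusters:2d} icc={icc:.2f}  MDE: near-normal={near_normal:.3f}  "
              f"T(G-1)={g_minus_1:.3f}  T(G-2)={g_minus_k:.3f}")
```

None of this captures the bootstrap correction itself, so treat the numbers as a lower bound on how worried you should be. Even so, the sketch makes it easy to see how the detectable effect grows as you lower the degrees of freedom or raise your guess of the intra-cluster correlation.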
As I am not at all an expert on the topic, I’d like readers who are more adept at statistics and econometrics to chime in, especially concerning the power calculations. Is it true that we don’t have the tools available to us to correctly calculate the number of clusters needed for a study – assuming that we are correctly guessing the other necessary parameters, like the intra-cluster correlation? If not, can you point us in the right direction?