Tools of the Trade: estimating correct standard errors in small sample cluster studies, another take


For many years, researchers have recognized the need to correct standard error estimates for observational dependence within clusters. An earlier post contrasted the typical approach to this matter, the cluster robust standard error (CRSE), with various methods of cluster bootstrapping the standard error. One long-recognized problem with the CRSE is that when the clusters are few in number (say 30 or fewer), it is biased downward and tends to over-reject the null of no effect. Fortunately, various forms of cluster bootstrapped SEs exhibit significantly less finite-sample bias than the CRSE. In this post I discuss another technique for valid inference in finite samples, one based on a method called “randomization inference”.

Randomization inference (RI) switches the inferential basis of statistical testing. The standard thought experiment imagines (repeatedly) drawing samples of observations from a larger population. RI instead fixes the population to what is observed in the data and imagines the treatment assignment itself being repeatedly redrawn. In this approach, the treatment assignment is the only random variable in the data, and randomization creates the reference distribution of test statistics. All observed outcomes and covariates are assumed fixed.

How might this work in practice? For starters, we define our test statistic of interest, such as the difference in mean outcomes between treatment and control or a Wilcoxon rank sum statistic (which may be more robust to the presence of outliers than the difference in means). The exact distribution of the test statistic is then derived by calculating the test statistic under each possible permutation of the randomized assignment.
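To make the two candidate test statistics concrete, here is a minimal sketch in Python. The function names and the outcome values are made up for illustration and do not come from any of the studies discussed here. The rank sum is computed on the pooled sample, with tied values sharing their average rank.

```python
from statistics import mean


def diff_in_means(treated, control):
    """Difference in mean outcomes, treatment minus control."""
    return mean(treated) - mean(control)


def rank_sum(treated, control):
    """Wilcoxon rank sum: sum of the treated units' ranks in the pooled sample.

    Tied values receive the average of the ranks they occupy.
    """
    pooled = sorted(list(treated) + list(control))
    ranks = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        # Positions i..j-1 hold ranks i+1..j; their average is (i + 1 + j) / 2.
        ranks[pooled[i]] = (i + 1 + j) / 2
        i = j
    return sum(ranks[y] for y in treated)


# Hypothetical outcomes; the last treated value is an outlier.
treated = [5.0, 6.0, 7.0, 100.0]
control = [1.0, 2.0, 3.0, 4.0]

print(diff_in_means(treated, control))  # 27.0
print(rank_sum(treated, control))       # 26.0
```

Replacing the outlier 100.0 with 8.0 would change the difference in means substantially but leave the rank sum at 26.0, which is the sense in which the rank-based statistic is less sensitive to outliers.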

In the example of the evaluation of pharmaceutical supply chain interventions that I previously discussed when exploring the challenges of evaluations with few clusters, there were 9 matched pairs of districts, and in each pair one district was randomly selected for treatment and the other for control. Applying RI to this example requires calculating the test statistic (the mean difference in clinic drug availability between treatment and control) for each of the 512 possible permutations of randomized assignment: each of the nine pairs can be assigned in choose(2,1) = 2 ways, so there are 2^9 = 512 permutations in total.

The actual test statistic observed in the evaluation is then compared against the distribution of all conceivable test statistics, and where the actual statistic falls in this distribution determines the exact p-value. This exact p-value is interpreted as simply the proportion of possible treatment assignments that yield a test statistic greater than or equal to the observed test statistic. This one-tailed hypothesis test is termed an exact test because it does not require a large-sample approximation since randomization is the basis for inference and we have calculated all possible permutations. An exact test has the added benefit that it does not impose distributional assumptions that are often behind approximations of reference distributions in standard hypothesis testing.
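For a matched-pair design like the one above, the full enumeration can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name is mine, the nine pair differences are invented numbers (not the evaluation's data), and the null is taken to be no effect for any district, so that swapping treatment and control within a pair simply flips the sign of that pair's difference.

```python
import math
from itertools import product


def exact_p_value(pair_diffs):
    """One-tailed exact p-value for a matched-pair randomized design.

    pair_diffs[i] is the observed (treated minus control) outcome
    difference in pair i. Returns the p-value and the number of
    assignments enumerated.
    """
    n = len(pair_diffs)
    observed = sum(pair_diffs) / n
    at_least_as_large = 0
    total = 0
    # Each pair has two possible assignments, so there are 2**n in all.
    for signs in product((1, -1), repeat=n):
        stat = sum(s * d for s, d in zip(signs, pair_diffs)) / n
        total += 1
        # isclose guards against float rounding when statistics tie.
        if stat >= observed or math.isclose(stat, observed):
            at_least_as_large += 1
    return at_least_as_large / total, total


# Hypothetical pair differences in drug availability (percentage points).
diffs = [12, 5, -3, 8, 10, -1, 7, 4, 9]
p, n_assignments = exact_p_value(diffs)
print(n_assignments)  # 512
print(p)              # 0.009765625, i.e. 5 of the 512 assignments
```

With these invented numbers, only 5 of the 512 assignments produce a mean difference at least as large as the observed one, so the exact one-tailed p-value is 5/512.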

You can imagine that in certain study settings, the number of permutations of treatment assignment can be quite large. When the number of permutations is very large, the test statistic does not need to be calculated for every one of them: a random sub-sample of permutations can generate “close-to-exact” p-values.
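A sampled version of the same idea might look like the sketch below, again for a matched-pair design with invented data and a function name of my own choosing. Instead of enumerating every assignment, it draws assignments at random. Adding one to both the numerator and the denominator (which counts the observed assignment itself) is a common convention rather than something prescribed in this post; it keeps the estimated p-value from ever being exactly zero.

```python
import math
import random


def sampled_p_value(pair_diffs, draws=10_000, seed=0):
    """Approximate one-tailed randomization p-value from random assignments.

    pair_diffs[i] is the observed (treated minus control) difference in
    pair i; each draw flips the sign of each pair with probability 1/2.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(pair_diffs)
    observed = sum(pair_diffs) / n
    hits = 0
    for _ in range(draws):
        stat = sum(d if rng.random() < 0.5 else -d for d in pair_diffs) / n
        if stat >= observed or math.isclose(stat, observed):
            hits += 1
    # The +1 terms count the observed assignment as one of the draws.
    return (hits + 1) / (draws + 1)


# Hypothetical example: 40 pairs means 2**40 (about 1.1 trillion)
# possible assignments, far too many to enumerate.
diffs = [3, -1, 4, 1, -5, 9, 2, -6, 5, 3] * 4
print(sampled_p_value(diffs, draws=10_000))
```

More draws bring the estimate closer to the exact p-value, at the cost of computing time.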

Right now there are relatively few examples in the discipline that base their inference on RI, and several of these studies report both the exact p-value (from RI) as well as cluster-adjusted standard errors. Here is a brief summary of the recent work that I have found:

- Bloom and co-authors report permutation-based standard errors from a health contracting experiment at the district level conducted over a small number of districts in Cambodia. The study reports both CRSE and RI-derived p-values. In general it finds the RI p-values to be more conservative than CRSE (although, as discussed previously, these CRSEs are possibly biased).

- Bloom and co-authors conduct a field experiment that evaluates the introduction of management practices at Indian textile firms. The study collects data on 28 plants across 17 firms. The randomization inference p-values for the intention-to-treat estimates correspond quite well to the bootstrapped clustered standard errors.

- Ho and Imai investigate the effect of ballot order (specifically whether the candidate was listed on the first page of a long ballot) on vote outcomes in the 2003 California gubernatorial election. The authors take advantage of the random order of candidate listing across districts, as mandated by California election law, to demonstrate that being listed on the first page of the ballot results in more votes for lesser-known candidates, but not for the most popular candidates. The authors use RI to generate candidate-specific exact confidence intervals.

- Small, Ten Have, and Rosenbaum employ RI in their investigation of the effectiveness of a depression treatment introduced on a matched randomized basis across 10 pairs of health clinics. They also demonstrate how RI methods can be adapted to handle covariate adjustment and non-compliance with treatment.

- Barrios and co-authors use RI to explore the inferential consequences of spatial correlation across clusters.

I have yet to apply RI in my own work, but I hope to do so fairly soon. I’ll post the Stata code in this blog if I succeed. In the interim, here is an alpha-version of RI tools from Ben Hansen, Jake Bowers, and Mark Fredrickson that has been developed for the programming language R. If anyone is aware of other publicly available code that facilitates RI, please let our readers know.




Jed Friedman

Senior Economist, Development Research Group, World Bank

Join the Conversation

Mark M. Fredrickson
January 25, 2012

Thanks for mentioning RItools. A more direct link to alpha code can be found at:…

While the documentation is lacking, perhaps the current best documentation (though I hesitate to call it that) can be found in the tests. Here is an example that recreates R's built-in Wilcoxon test:…

In addition to your excellent write up, I'd like to add that this form of analysis can be used for large studies as well. Some test statistics (such as the statistic in the Wilcoxon test) have nice large sample approximations. RItools uses these approximations on-demand when available, though the results are not exact in the way the complete enumeration or sampled-enumeration results would be. I have a brief piece on using the released version of RItools to compute outcomes using asymptotic approximations:

Two other R packages that may be of some use:

- COIN has some built-in tests
- permute could be used for more complex randomization schemes for roll-your-own analysis (RItools will likely support this package in the future)

- Mark M. Fredrickson

Jed Friedman
January 25, 2012

... I completely forgot about the Cohen and Dupas paper (apologies to Jessica!). I'm indeed familiar with the paper and was remiss to not mention it above. It is a very good example. They implement a "standard" RI and call it non-parametric because all RI are by definition non-parametric - there are no parametric or distributional assumptions.

January 25, 2012

Hi Jed,

With randomization inference, do you have to specify the exact outcome for each unit in the null hypothesis? (e.g. -- rather than just specifying that the average treatment effect is 0 instead specifying that the treatment effect for each unit is 0.) If so, that seems kind of weird to me. (In most cases, I am more interested in the average treatment effect.)

Also, are there any guidelines for how many units you can do this for before it becomes computationally infeasible?

Jake Bowers
January 27, 2012

Hi Jed and Doug,

Just chiming in to confirm that constant treatment effects are not required. My paper with Ben Hansen, for example, allowed each individual to have his or her own unique effect, but focused inference on sums of those effects (see the cites therein to Rosenbaum's idea of "attributable effects"):



Jed Friedman
January 27, 2012

Jake, thanks very much indeed for the clarification, and the directions to your further work with Ben... very valuable.

Winston Lin
January 27, 2012

Jed, thanks for the very useful post and examples.

Gail et al. (1996, Stat Med) is very relevant to Doug's question.…

Chung and Romano (and the Janssen papers they cite) show how to get a test that's exact for the strong null (no treatment effects) and asymptotically valid for the weak null (zero avg treatment effect).

There are two traditions of RI, Fisher's (exact inference under a specific model, e.g. constant treatment effects) and Neyman's (asymptotically valid inference for avg treatment effects or related estimands). Jake's very good paper with Ben Hansen is mostly in the Neyman tradition.

Jed Friedman
January 29, 2012

These collective responses really inform us (bloggers and readers) on these relatively "new" topics and applications.

Juan Jose
January 25, 2012

I think another interesting paper using randomization inference is by Cohen and Dupas (2010), "Free Distribution or Cost Sharing? Evidence from a Randomized Malaria Prevention Experiment", since they had a very low number of clusters (16). See pages 15-16. I would like your thoughts regarding their way of implementing their non-parametric inference.

Jed Friedman
January 26, 2012

Hi Doug, thanks so much for the questions. I believe there are only two main assumptions/assertions for RI: you need to specify a null hypothesis (typically that the test statistic equals zero) and that any treatment effect is constant across units. Perhaps this last assumption of constant treatment effects can be relaxed (and RI can certainly be used to test a hypothesis of heterogeneity of treatment across strata). But I'm also new to this method, so I will keep learning/confirming.

In terms of computational infeasibility, I suppose it's a function of two arguments: 1. your computing power at hand, and 2. your level of patience :)) I imagine that with standard desktops and a reasonably sized data set, calculating RI standard errors off of 100,000 permutations wouldn't take TOO long...

Jed Friedman
January 26, 2012

... the generous sharing of code. It's very useful and I trust many readers will agree. Much appreciation and look forward to further discussion.

January 31, 2012

Is RI synonymous with a non-parametric approach and some sort of one-sided "Fisher's" exact test for "underpowered" clustered studies?