For many years, researchers have recognized the need to correct standard error estimates for dependence among observations within clusters. An earlier post contrasted the typical approach to this problem, the cluster robust standard error (CRSE), with various methods for cluster bootstrapping the standard error. One long-recognized problem with the CRSE is that when clusters are few in number (say 30 or fewer), it is biased downward and tends to over-reject the null of no effect. Fortunately, various forms of cluster-bootstrapped SEs show significantly less finite-sample bias than the CRSE. In this post I discuss another approach to inference that remains valid in finite samples, one based on a method called “randomization inference”.
Randomization inference (RI) switches the inferential basis of statistical testing. The standard thought experiment imagines (repeatedly) drawing samples of observations from a larger population. RI instead fixes the population to what is observed in the data and imagines that the treatment assignment itself is repeatedly re-drawn. In this approach, the treatment assignment is the only random variable in the data, and randomization creates the reference distribution of test statistics. All observed outcomes and covariates are treated as fixed.
How might this work in practice? For starters, we define our test statistic of interest, such as the difference in mean outcomes between treatment and control or a Wilcoxon rank sum statistic (which may be more robust to outliers than the difference in means). The exact distribution of the test statistic is then derived by calculating it under each possible permutation of the randomized assignment.
In the evaluation of pharmaceutical supply chain interventions that I discussed previously when exploring the challenges of evaluations with few clusters, there were 9 matched pairs of districts, and within each pair one district was randomly selected for treatment and one for control. Applying RI to this example requires calculating the test statistic (the mean difference in clinic drug availability between treatment and control) for each of the 512 possible permutations of the randomized assignment. With two possible assignments in each of nine pairs, the number of permutations is (2 choose 1)^9 = 2^9 = 512.
The actual test statistic observed in the evaluation is then compared against the distribution of all conceivable test statistics, and where the actual statistic falls in this distribution determines the exact p-value. This exact p-value is interpreted as simply the proportion of possible treatment assignments that yield a test statistic greater than or equal to the observed test statistic. This one-tailed hypothesis test is termed an exact test because it does not require a large-sample approximation since randomization is the basis for inference and we have calculated all possible permutations. An exact test has the added benefit that it does not impose distributional assumptions that are often behind approximations of reference distributions in standard hypothesis testing.
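As a rough illustration of the enumeration described above, here is a minimal Python sketch for a matched-pair design with nine pairs. It is not code from the evaluation itself: the within-pair differences are invented numbers, and the function name `exact_ri_pvalue` is hypothetical. The sketch assumes the sharp null hypothesis of no effect for any district, under which swapping treatment and control within a pair simply flips the sign of that pair's difference.

```python
from itertools import product

# Hypothetical within-pair differences (treated district minus control district).
# These values are invented for illustration; they are not the evaluation's data.
observed_diffs = [0.12, 0.05, -0.03, 0.20, 0.08, 0.15, -0.01, 0.10, 0.07]


def mean(values):
    return sum(values) / len(values)


def exact_ri_pvalue(pair_diffs):
    """One-sided exact p-value for the mean within-pair difference.

    Assumes the sharp null of no effect, so re-assigning treatment within
    a pair only flips the sign of that pair's difference.
    Returns (count at least as large as observed, total assignments, p-value).
    """
    observed = mean(pair_diffs)
    n_assignments = 0
    n_as_large = 0
    # Each pair has two possible assignments: keep the sign (+1) or flip it (-1).
    for signs in product((1, -1), repeat=len(pair_diffs)):
        stat = mean([s * d for s, d in zip(signs, pair_diffs)])
        n_assignments += 1
        # Small tolerance guards against floating-point noise in ties.
        if stat >= observed - 1e-12:
            n_as_large += 1
    return n_as_large, n_assignments, n_as_large / n_assignments


count, total, p_value = exact_ri_pvalue(observed_diffs)
print(f"{count} of {total} assignments give a statistic >= observed; p = {p_value:.4f}")
```

Because the observed assignment is itself one of the 512 possibilities, the smallest p-value this design can produce is 1/512.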
You can imagine that in certain study settings the number of possible treatment assignments can be quite large. In that case the test statistic does not need to be calculated for every permutation: a random sub-sample of permutations can generate “close-to-exact” p-values.
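A minimal Python sketch of this sub-sampling idea follows, again for a matched-pair design under the sharp null of no effect. Everything named here is an assumption made for illustration: the function `monte_carlo_ri_pvalue`, the 40 simulated pair differences, and the choice of 10,000 draws. The sketch also adds one to both the numerator and the denominator, a common convention that counts the observed assignment itself and keeps the estimated p-value from ever being exactly zero.

```python
import random


def monte_carlo_ri_pvalue(pair_diffs, n_draws=10_000, seed=0):
    """Approximate one-sided RI p-value from randomly sampled assignments.

    Assumes a matched-pair design and the sharp null of no effect, so each
    sampled assignment flips the sign of every pair difference with
    probability one half.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n_pairs = len(pair_diffs)
    observed = sum(pair_diffs) / n_pairs
    hits = 0
    for _ in range(n_draws):
        stat = sum(rng.choice((1, -1)) * d for d in pair_diffs) / n_pairs
        # Small tolerance guards against floating-point noise in ties.
        if stat >= observed - 1e-12:
            hits += 1
    # The +1 terms count the observed assignment as one of the draws.
    return (hits + 1) / (n_draws + 1)


# Hypothetical example: 40 pairs imply 2**40 (over a trillion) assignments,
# too many to enumerate comfortably. The differences are simulated, not real.
data_rng = random.Random(42)
hypothetical_diffs = [data_rng.gauss(0.05, 0.10) for _ in range(40)]
print(monte_carlo_ri_pvalue(hypothetical_diffs))
```

More draws shrink the simulation error in the p-value, so the number of draws is a trade-off between computing time and precision.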
Right now there are relatively few examples in the discipline that base their inference on RI, and several of these studies report both the exact p-value (from RI) and cluster-adjusted standard errors. Here is a brief summary of the recent work that I have found:
- Bloom and co-authors report permutation-based inference from a health contracting experiment conducted at the district level over a small number of districts in Cambodia. The study reports both CRSE and RI-derived p-values. In general it finds the RI p-values to be more conservative than those based on the CRSE (although, as discussed previously, these CRSEs are possibly biased).
- Bloom and co-authors conduct a field experiment that evaluates the introduction of management practices at Indian textile firms. The study collects data on 28 plants across 17 firms. The randomization inference p-values for the intention-to-treat estimates correspond quite well to the bootstrapped clustered standard errors.
- Ho and Imai investigate the effect of ballot order (specifically, whether the candidate was listed on the first page of a long ballot) on vote outcomes in the 2003 California gubernatorial election. The authors take advantage of the random order of candidate listing across districts, as mandated by California election law, to demonstrate that being listed on the first page of the ballot results in more votes for lesser-known candidates, but not for the most popular candidates. The authors use RI to generate candidate-specific exact confidence intervals.
- Small, Ten Have, and Rosenbaum employ RI in their investigation of the effectiveness of a depression treatment introduced on a matched randomized basis across 10 pairs of health clinics. They also demonstrate how RI methods can be adapted to handle covariate adjustment and non-compliance with treatment.
- Barrios and co-authors use RI to explore the inferential consequences of spatial correlation across clusters.
I have yet to apply RI in my own work, but I hope to do so fairly soon. I’ll post the Stata code on this blog if I succeed. In the interim, here is an alpha version of RI tools for the programming language R, developed by Ben Hansen, Jake Bowers, and Mark Fredrickson. If anyone is aware of other publicly available code that facilitates RI, please let our readers know.