Field experiments often require group-level randomization, even when the individual or household is the unit of interest. Sometimes this is necessitated by the design of the program or treatment (e.g., training programs that are delivered to groups of individuals, or infrastructure development that targets a whole village), and sometimes by administrative or ethical considerations. In some cases, researchers may in fact prefer to randomize at a “higher” level to limit spillovers to control units.
Our experience is that basic questions on whether and how to stratify always crop back up when implementing a group-level randomization. We realized that we had not addressed this head-on in the blog—so we wanted to make it our focus today. To discuss some of the ins-and-outs of stratifying in group randomized experiments, it’s useful to start with both the benefits of stratifying in experiments and potential pitfalls.
In simple terms, researchers often stratify because they’re concerned about balance. Stratification, especially when there aren’t many units of randomization (as in group randomized experiments), eliminates the possibility that all or most units of a certain type are assigned to treatment or control.
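To make the balance guarantee concrete, here is a minimal sketch of stratified assignment at the group level. The function name `stratified_assign`, the village labels, and the "remote"/"connected" strata are all hypothetical and chosen only for illustration; the rule of treating half of each stratum (with a coin flip for an odd group out) is one reasonable choice among several, not a prescribed procedure.

```python
import random
from collections import defaultdict

def stratified_assign(strata_by_group, seed=0):
    """Assign half of the groups in each stratum to treatment (1), the rest to control (0).

    strata_by_group: dict mapping group id -> stratum label (hypothetical inputs).
    If a stratum has an odd number of groups, the extra group goes to
    treatment or control with equal probability.
    """
    rng = random.Random(seed)
    members = defaultdict(list)
    for group, stratum in strata_by_group.items():
        members[stratum].append(group)

    assignment = {}
    for stratum in sorted(members):
        groups = sorted(members[stratum])
        rng.shuffle(groups)  # random order within the stratum
        n_treat = len(groups) // 2
        if len(groups) % 2 == 1 and rng.random() < 0.5:
            n_treat += 1
        for i, group in enumerate(groups):
            assignment[group] = 1 if i < n_treat else 0
    return assignment

# Hypothetical example: 8 villages, 4 "remote" and 4 "connected".
villages = {f"v{i}": ("remote" if i < 4 else "connected") for i in range(8)}
assignment = stratified_assign(villages, seed=42)

# Count treated villages per stratum: by construction, neither type can end up
# entirely in treatment or entirely in control.
treated_per_stratum = defaultdict(int)
for village, treated in assignment.items():
    treated_per_stratum[villages[village]] += treated
print(dict(treated_per_stratum))
```

With simple (unstratified) randomization of the same 8 villages, there would be a non-trivial chance that all four remote villages land in one arm; here that cannot happen.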
Stratification can also lead to gains in power when the variables chosen for stratification are correlated with the outcome. Furthermore, while some warn that stratification can reduce power in small samples when the stratifying variables are only weakly correlated with the outcome, this only applies to tests based on the t-statistic. Tests based directly on the randomization distribution are not affected and are usually preferable in small samples.
Moreover, as mentioned in Bruhn and McKenzie (2009), stratification can help improve precision and power in subgroup analyses. Specifically, if we believe that treatment effects are likely to be heterogeneous, and that part of this heterogeneity can be explained by observed variables, then stratifying based on those variables can improve precision in subgroups defined by those variables.
Finally, the survey in Bruhn and McKenzie (2009) suggests that typically the number of variables used in forming strata is small and includes geographic location. In a finite population setting, this type of stratification can offer some additional benefits. Suppose that you have a group randomized trial with villages assigned to either treatment or control, and treatment assignment is stratified at a more aggregated geographical region. Deeb and de Chaisemartin (2022) suggest that if there are region-level stochastic shocks affecting treatment and control units, and the sample includes a sufficient number of geographical regions and villages per region, then by clustering standard errors at the region level you can draw inference on the average treatment effect (ATE) net of region-level shocks. Furthermore, if you are not concerned about stochastic shocks, both the village-level clustered standard errors and region-level clustered standard errors are conservative for the variance of the ATE estimator. Which of the two is more conservative depends on whether treatment effects vary more across regions or across villages.
We now move to the potential pitfalls of stratifying and analyzing group randomized trials.
Remember that we usually stratify in order to achieve balance along certain variables we think are strongly correlated with our outcomes of interest. When a variable varies a lot across groups, a simple stratification procedure, such as splitting the groups into two strata based on whether the group average is below or above the median, might not guarantee the desired balance. To check for that, you could divide the groups into two bins and check whether there is large heterogeneity within bins.
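One rough way to run this check is sketched below. The function `median_split_diagnostics` and the example group averages are hypothetical; the diagnostic (the share of across-group variation that remains inside the two bins) is our own illustrative choice, not a formal test.

```python
import statistics

def median_split_diagnostics(group_means):
    """Split groups at the median of a baseline group average and report how
    much across-group variation remains inside each bin.

    group_means: dict mapping group id -> baseline group average (hypothetical).
    Returns (variance of group averages within each bin,
             share of total across-group variation left within bins).
    """
    values = list(group_means.values())
    cutoff = statistics.median(values)
    bins = {"below": [], "above": []}
    for v in values:
        bins["below" if v <= cutoff else "above"].append(v)

    grand_mean = statistics.fmean(values)
    total_ss = sum((v - grand_mean) ** 2 for v in values)

    within_ss = 0.0
    bin_variance = {}
    for name, vals in bins.items():
        if not vals:  # can happen if all groups share the same value
            continue
        bin_mean = statistics.fmean(vals)
        within_ss += sum((v - bin_mean) ** 2 for v in vals)
        bin_variance[name] = statistics.pvariance(vals)

    share_within = within_ss / total_ss if total_ss > 0 else 0.0
    return bin_variance, share_within

# Hypothetical baseline averages for six groups; group "f" is an outlier.
group_means = {"a": 1.0, "b": 1.2, "c": 0.9, "d": 5.0, "e": 5.1, "f": 20.0}
bin_variance, share_within = median_split_diagnostics(group_means)
print(bin_variance, round(share_within, 2))
```

In this made-up example, more than half of the across-group variation is left inside the bins, almost all of it in the "above" bin. That is the kind of pattern that would suggest a binary median split is too coarse.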
A similar concern arises when you also want to ensure balance in terms of both group-level and individual-level variables. If there is a lot of heterogeneity for a given variable within groups, then a simple stratification using group-level averages might again not guarantee the desired balance. In such cases, you might want to look into more complex procedures such as re-randomization which we touch on at the end of this blog.
Finally, de Chaisemartin and Ramirez-Cuellar also raise an important issue to keep in mind when analyzing group randomized trials with more than 10 observations per group and with small strata (fewer than 10 groups per stratum). In such cases, treatment assignment within the strata is correlated, and failing to account for this correlation can lead to incorrect inference. To see this, consider the simple case of grouping villages into pairs and assigning one village in the pair to treatment and the other to control. In this case, once the first village receives the treatment, it is impossible for the second village to receive it: their treatment statuses are perfectly negatively correlated. Practically, when regressing the outcome on the treatment and stratum fixed effects, clustering your standard errors at the group level as opposed to the stratum level and using a 5%-level t-test based on this regression will lead to a type I error rate that is much larger than 5%. David’s blog also provides an in-depth discussion of this as he enjoins us to be wary of using pairwise randomization.
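The correlation in treatment assignment within small strata is easy to see by simulation. The sketch below is illustrative only: the helper names `pearson` and `assignment_correlation` are ours, and we assume exactly half of the groups in a stratum are treated. It shows only the assignment correlation, not the resulting over-rejection of the t-test.

```python
import random

def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def assignment_correlation(n_groups, n_draws=20000, seed=0):
    """Simulated correlation between the treatment indicators of the first two
    groups in a stratum of n_groups, when exactly half are treated."""
    rng = random.Random(seed)
    base = [1] * (n_groups // 2) + [0] * (n_groups - n_groups // 2)
    first, second = [], []
    for _ in range(n_draws):
        draw = base[:]
        rng.shuffle(draw)  # one random assignment within the stratum
        first.append(draw[0])
        second.append(draw[1])
    return pearson(first, second)

corr_pairs = assignment_correlation(2)   # matched pairs
corr_ten = assignment_correlation(10)    # a stratum of 10 groups
print(round(corr_pairs, 3), round(corr_ten, 3))
```

For pairs the correlation is -1 by construction. With half of n groups treated, the correlation between any two groups' assignments is -1/(n-1), so it shrinks toward zero as strata get larger, which is why the concern is concentrated in small strata.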
Here is where we end up and what we would suggest, should you walk into our offices with this blog's namesake as your question:
- Choose wisely. Akin to David’s take on this, consider which variables you would like to stratify on. To do so, consider which variables are likely to be strongly correlated with the outcome of interest and which variables are likely to be correlated with treatment effect heterogeneity. A good example of that would be the baseline values of the outcomes. If your experiment is theory-motivated, your model may tell you what dimension of heterogeneity you should care about most.
- Assess the damage. It's a good idea to ask yourself: do I really need to stratify, and how? With the stratifying variables chosen, examine the heterogeneity across and within groups in terms of those variables. As mentioned earlier, an easy way to do so is to split the groups into bins and then examine the across-group variance within each bin, and to look at within-group variation as well. If the groups aren’t too heterogeneous, then a simple stratification procedure using a binary decision rule (above/below median) to form strata may be sufficient to guarantee balance while avoiding uneven strata sizes. For more complex cases, a re-randomization approach might be preferable to stratification.
- Layer? Stratification can also be done in layers: first stratify along a factor such as geographic region, where a binary decision rule often isn’t sufficient, then stratify along one or two additional variables within region using a simpler rule. As things get more complicated, however, don't forget to verify that you actually are randomizing assignment …
- Think ahead. With the strata formed, think carefully about how stratification can impact inference when analyzing results. When the groups are large (more than 10 observations per group) and the strata are small (fewer than 10 groups per stratum), we need to account for correlations in treatment assignment within strata by clustering standard errors at the stratum level. When we have too few strata to cluster at that level, we can turn to randomization inference. On the other hand, when there are many large strata, there might be benefits to clustering at the stratum level, as we mentioned earlier.
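For the randomization inference route mentioned in the last point, a minimal sketch might look like the following. Everything here is an assumption made for illustration: the function names, the use of group-level average outcomes, the simple difference in means as the test statistic, and the made-up data. The key design point is that treatment labels are reshuffled within strata, mirroring how assignment was actually done.

```python
import random
from collections import defaultdict

def diff_in_means(outcomes, treat):
    """Difference between treated and control average outcomes."""
    treated = [y for y, t in zip(outcomes, treat) if t == 1]
    control = [y for y, t in zip(outcomes, treat) if t == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)

def stratified_ri_pvalue(outcomes, treat, strata, n_perm=2000, seed=0):
    """Two-sided randomization-inference p-value under the sharp null of no
    effect, re-drawing treatment within strata (as in the original design)."""
    rng = random.Random(seed)
    index_by_stratum = defaultdict(list)
    for i, s in enumerate(strata):
        index_by_stratum[s].append(i)

    observed = diff_in_means(outcomes, treat)
    extreme = 0
    for _ in range(n_perm):
        perm = list(treat)
        for idx in index_by_stratum.values():
            labels = [treat[i] for i in idx]
            rng.shuffle(labels)  # keeps the number treated per stratum fixed
            for i, t in zip(idx, labels):
                perm[i] = t
        if abs(diff_in_means(outcomes, perm)) >= abs(observed) - 1e-12:
            extreme += 1
    return observed, extreme / n_perm

# Hypothetical group-level data: 3 strata, 4 groups each, 2 treated per stratum.
strata = ["A"] * 4 + ["B"] * 4 + ["C"] * 4
treat = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0]
outcomes = [5.1, 4.8, 3.0, 3.2, 7.9, 6.0, 8.3, 6.1, 1.0, 2.9, 3.1, 1.2]

observed, p_value = stratified_ri_pvalue(outcomes, treat, strata)
print(round(observed, 3), p_value)
```

With only 216 possible within-stratum assignments in this toy example, one could enumerate them all instead of sampling; sampling is used here only because it scales to larger designs.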