At the Chilean Budget Office, Ryan is starting an experiment with Eliana Carranza from the World Bank and Leonardo González from the Budget Office (with the participation of other colleagues from the World Bank, the Ministry of Finance, and the Social Security Superintendence) that aims to measure the effect of the delivery mechanism of a cash transfer scheme for workers on their welfare and labor market decisions. They will randomize a set of 5,172 employers (with a total of 31,068 workers eligible for the cash transfer program) to two treatment groups and a control group. One important characteristic of this sample is that the size of clusters is unequal and highly skewed. Most employers hire 1 or 2 eligible workers, while there are also a few large firms, including some hiring more than 1,000 eligible workers (Figure 1). Firms with 5 or fewer workers constitute 81% of firms but employ only 29% of workers, while firms with 50+ workers account for only 1% of firms but hire 34% of eligible workers. While this level of skew is extreme, it is more common than not for clusters to differ in size – cities and firms have both been observed to follow Zipf’s law in many places, and some schools have far more students than others. This blog post explains some consequences of these unequal cluster sizes for doing power calculations and experimental analysis.
Figure 1: Histogram of Cluster Sizes
Unequally-sized clusters and statistical power
Given the sample size of over 30,000 individuals in over 5,000 clusters, Ryan was confident that statistical power would be high. Using the standard formula for the minimal detectable effect (MDE) of a clustered experiment, as e.g. given in the Duflo et al. (2008) toolkit, or used in the software Optimal Design:
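In the notation used in the next paragraph – with N the total number of individuals, n the average number of individuals per cluster, P the proportion assigned to treatment, ρ the intra-cluster correlation, and σ the standard deviation of the outcome – that formula can be written as:

$$MDE = \left(t_{1-\beta} + t_{\alpha/2}\right)\sqrt{\frac{1}{P(1-P)N}}\times\sqrt{1+(n-1)\rho}\times\sigma$$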
With alpha = 0.05, power = 0.8, P (the proportion allocated to treatment) = 0.5, N = 31,068, n (the average number of individuals per cluster) = 31,068/5,172 = 6.0, and an intra-cluster correlation for worker wages of rho = 0.39, this formula gives an MDE of 0.055 s.d.
However, when he ran simulations, he was surprised to find power was much worse than this formula would suggest, with an MDE of 0.26 s.d. Put differently, while the simple formula suggests he would have at least 80% power to detect a 0.055 s.d. change in wages, the simulations show that actual power for a change of this size would be only 12% (Figure 2).
Figure 2: Power according to formula versus simulation
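To give a flavor of what such a simulation looks like, here is a stripped-down sketch in Python. It is not the actual code used for the experiment: the log-normal firm-size distribution, the number of replications, and the variable names are placeholder assumptions, whereas the real calculation uses the observed firm sizes.

```python
# A stripped-down sketch of a power-by-simulation check (not the actual code
# used for the experiment). The log-normal firm-size distribution and the
# number of replications are placeholder assumptions.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n_firms, icc, effect, n_sims = 5172, 0.39, 0.055, 500

rejections = 0
for _ in range(n_sims):
    # Draw heavily skewed cluster (firm) sizes
    sizes = np.maximum(1, rng.lognormal(0.5, 1.3, size=n_firms).astype(int))
    firm_id = np.repeat(np.arange(n_firms), sizes)

    # Worker wages with intra-cluster correlation equal to icc
    firm_effect = rng.normal(0.0, np.sqrt(icc), n_firms)
    wage = firm_effect[firm_id] + rng.normal(0.0, np.sqrt(1 - icc), firm_id.size)

    # Randomize firms (clusters) to treatment and add the hypothesized effect
    treat_firm = rng.permutation(np.arange(n_firms) % 2)
    treat = treat_firm[firm_id]
    wage = wage + effect * treat

    # Worker-level regression with standard errors clustered at the firm level
    X = sm.add_constant(treat.astype(float))
    res = sm.OLS(wage, X).fit(cov_type="cluster", cov_kwds={"groups": firm_id})
    rejections += res.pvalues[1] < 0.05

print(f"Simulated power for a {effect} s.d. effect: {rejections / n_sims:.2f}")
```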
Why do unequally sized clusters reduce power for individual-level outcomes? Eldridge et al. (2006) provide some nice intuition. They note that with clusters of different sizes, estimates of sample means from the smaller clusters will be less precise and estimates from the larger clusters more precise, but that there are diminishing gains in precision from adding additional individuals as cluster sizes increase. This means the addition of individuals to larger clusters does not compensate for the loss of precision in smaller clusters, so that as cluster sizes become more unbalanced, power decreases.
So what can be done to solve this problem? In the simple formula above, with equally-sized clusters, the design effect (DE) is the last term:
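In symbols, that last term is:

$$DE = \sqrt{1+(n-1)\rho}$$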
This is how much the standard errors need to be inflated compared to individual-level randomization. In this case it is 1.72. The biomedical literature notes that when clusters are unequally sized, this is not enough of an adjustment, and adjustment factors that also reflect the variation in cluster size are needed. For example, Eldridge et al. (2006) give the formula below, where CV is the coefficient of variation in cluster size:
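Written, like the design effect above, as a standard-error inflation factor, their adjusted design effect is:

$$DE = \sqrt{1+\left[(CV^{2}+1)\,n-1\right]\rho}$$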
Note that if the clusters are of equal size, then CV is 0, and this reduces to the design effect in the standard formula. In this firm-worker experiment, with the skew seen in Figure 1, the CV is 5.16. The design effect is then 8.08, more than four times what it would be with equal-sized clusters, and so the MDE according to this formula is 0.26 s.d.
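For readers who want to verify these numbers, here is a minimal back-of-the-envelope calculation in Python. It simply plugs the parameter values reported above into the two design-effect formulas, using normal critical values for the two-sided test:

```python
# A back-of-the-envelope check of the design effects and MDEs quoted above,
# assuming a two-sided test with alpha = 0.05 and power = 0.80.
import numpy as np
from scipy.stats import norm

alpha, power, P = 0.05, 0.80, 0.5
N, n, rho, cv = 31_068, 6.0, 0.39, 5.16

m = norm.ppf(1 - alpha / 2) + norm.ppf(power)            # multiplier, about 2.80

de_equal   = np.sqrt(1 + (n - 1) * rho)                   # about 1.72
de_unequal = np.sqrt(1 + ((cv**2 + 1) * n - 1) * rho)     # about 8.08

mde_equal   = m * de_equal   / np.sqrt(P * (1 - P) * N)   # about 0.055 s.d.
mde_unequal = m * de_unequal / np.sqrt(P * (1 - P) * N)   # about 0.26 s.d.
print(de_equal, de_unequal, mde_equal, mde_unequal)
```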
Power can be improved through the usual methods of stratifying on baseline variables that predict the outcome and controlling for baseline levels of the outcome in an ANCOVA-type estimation. But note that power here also depends critically on the CV. So excluding some of the largest clusters (that is, throwing away sample) could actually improve power. As an example, throwing away the top 1% of firms reduces the sample size by about 10,000 workers, but lowers the CV from 5.16 to 1.38, and the MDE falls to 0.09 s.d.
Balance tests and statistical inference
A second surprise for Ryan in doing his simulations was that a test of baseline balance seemed to have approximately the correct size for firm-level (cluster-level) outcomes, but over-rejected the null of no effect for worker-level (individual-level) outcomes. For example, the first row of Table 1 shows that in the full sample of firms, a test of equality of firm size has a p-value of 0.05 or under 4.5% of the time, and a p-value of 0.10 or under 8.8% of the time. This is approximately the right size. In contrast, the test rejects equality of worker wages 7.5% of the time at the 5% level and 14.6% of the time at the 10% level. This is a problem, since over-rejecting the null hypothesis of no effect makes it more likely that the experiment will find a spurious treatment effect when none is there.
What is going on here? To understand why unequal cluster size may cause issues, look at the following illustration.
Suppose you have four baskets of fruit: two baskets each contain an orange, one basket contains an apple, and the fourth basket contains eight apples. Treating one apple basket and one orange basket would give balanced characteristics at the cluster level. At the individual level, however, balance is much harder to achieve: since 8 of the 11 pieces of fruit sit in a single basket, the assignment of a small number of clusters determines the treatment status of a large share of the sample. The same thing happens with employers and workers. When a few employers hire many workers, the randomization of a small number of clusters will largely determine the degree of balance and the statistical power of the experiment.
Does it help to stratify by cluster size?
A first recommendation David made to Ryan was to stratify the randomization by cluster size. Imai et al. (2009) recommend matching on cluster size. In particular, David suggested forming matched quadruplets by firm size. This would make it less likely that all the large firms ended up in one treatment group, and instead create strata of firms of more similar sizes. The bottom of Table 1 shows that this suggestion did not help at all – again the size was approximately correct for firm-level analysis, but actually seemed worse for worker-level analysis when controlling for stratification fixed effects and clustering standard errors at the firm level – a p-value of 0.05 or below occurred in 13.9% of simulations, and one of 0.10 or below in 21.5% of simulations. One reason for this is a point made in a recent paper by de Chaisemartin and Ramirez-Cuellar: they show that in matched-pair or small-strata clustered experiments, you need to cluster the standard errors at the level of the strata. Here this means clustering at the quadruplet level, not the firm level. Doing so helps a bit, but still leads to over-rejection of the null hypothesis, though not by as much as before.
Table 1: Empirical Size of Balance Tests for Firm-level and Worker-level Outcomes
Note: results are from 3,000 simulations, except for randomization inference, which uses 1,000 simulations of 1,000 replications each.
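To make the mechanics of the quadruplet stratification concrete, here is a small illustrative sketch. The synthetic firm sizes, the wage process, and the variable names are made up for the example rather than taken from the actual data:

```python
# A sketch of the matched-quadruplet stratification by firm size and of the
# two ways of clustering the standard errors (firm level vs. quadruplet
# level). The synthetic data and variable names are purely illustrative.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n_firms = 400
firms = pd.DataFrame({
    "firm_id": np.arange(n_firms),
    "firm_size": np.maximum(1, rng.lognormal(0.5, 1.3, size=n_firms).astype(int)),
})

# Sort firms by size, group consecutive firms into quadruplets, and
# randomize two firms in each quadruplet to treatment
firms = firms.sort_values("firm_size").reset_index(drop=True)
firms["quad"] = firms.index // 4
firms["treat"] = firms.groupby("quad")["firm_id"].transform(
    lambda x: rng.permutation([1, 1, 0, 0][: len(x)]))

# Expand to the worker level; baseline wages are higher in larger firms,
# which is what makes worker-level balance hard to achieve
firms["firm_wage_fx"] = 0.3 * np.log(firms["firm_size"]) + rng.normal(0, 0.7, n_firms)
workers = firms.loc[firms.index.repeat(firms["firm_size"])].reset_index(drop=True)
workers["wage"] = workers["firm_wage_fx"] + rng.normal(0, 0.7, len(workers))

# Balance test with strata fixed effects, clustering at the firm level...
res_firm = smf.ols("wage ~ treat + C(quad)", data=workers).fit(
    cov_type="cluster", cov_kwds={"groups": workers["firm_id"]})
# ...versus clustering at the quadruplet (strata) level, following
# de Chaisemartin and Ramirez-Cuellar
res_quad = smf.ols("wage ~ treat + C(quad)", data=workers).fit(
    cov_type="cluster", cov_kwds={"groups": workers["quad"]})
print(res_firm.pvalues["treat"], res_quad.pvalues["treat"])
```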
What about just throwing away the largest clusters?
A second suggestion was to just not include the largest clusters in the experiment. For example, 95.5% of firms have 15 or fewer workers. If the experiment was just conducted on this subsample, the last two columns of Table 1 show that this would indeed give the correct size for both firm and worker-level outcomes, and would also do so for worker-level outcomes with matched quadruplets provided standard errors are clustered at the quadruplet level. So, this is one solution. But a downside is that getting rid of 5% of the firms would get rid of 49% of the workers!
Randomization inference to the rescue?
Another solution is to change the hypothesis being tested. Rather than testing whether the mean wage is the same in the treatment and control groups, one can test the sharp null that treatment had no effect on wages for any worker, and then do randomization inference. This is time-consuming to simulate, so we tested it only for the full sample with no stratification. We see that this appears to give the correct size for the worker wage comparisons, rejecting the null 4.5% of the time at the 5% level. But randomization inference makes it harder to use tools like post-double-selection lasso to choose control variables for improving power and dealing with attrition, and can be conceptually harder to interpret with multiple treatments. So this solution may not always be desirable either.
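For concreteness, here is a minimal sketch of this kind of randomization inference: a permutation test of the sharp null of no effect for any worker, re-randomizing treatment at the firm level. The data frame and column names are hypothetical:

```python
# A minimal sketch of randomization inference for the worker-level wage
# comparison: a permutation test of the sharp null that treatment had no
# effect on any worker. The data frame `df` and its columns (firm_id,
# treat, wage) are hypothetical names, not the actual study data.
import numpy as np
import pandas as pd

def ri_pvalue(df: pd.DataFrame, n_permutations: int = 1000, seed: int = 0) -> float:
    """Re-randomize treatment across firms and compare the observed
    difference in mean worker wages to the permutation distribution."""
    rng = np.random.default_rng(seed)
    firm_treat = df.groupby("firm_id")["treat"].first()  # one assignment per firm

    def diff_in_means(assignment: pd.Series) -> float:
        t = df["firm_id"].map(assignment)
        return df.loc[t == 1, "wage"].mean() - df.loc[t == 0, "wage"].mean()

    observed = diff_in_means(firm_treat)
    perm_stats = np.empty(n_permutations)
    for i in range(n_permutations):
        shuffled = pd.Series(rng.permutation(firm_treat.values),
                             index=firm_treat.index)
        perm_stats[i] = diff_in_means(shuffled)

    # Two-sided p-value: the share of re-randomizations producing a
    # difference at least as large (in absolute value) as the observed one
    return float(np.mean(np.abs(perm_stats) >= abs(observed)))
```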
The issue is not just unequal cluster sizes, but outcomes that are correlated with cluster size
The individual-level imbalance arises here because outcomes are correlated within clusters. In particular, in the case of wages, larger firms tend to pay higher wages, so having a larger cluster assigned to treatment will tend to push up the treatment group mean. If the intra-cluster correlation is close to zero, then the design effect won’t change much with more variability in cluster size, so power won’t be affected too much; likewise, cluster size won’t be much correlated with outcomes, so the size problem should also be less severe.
With very unequal clusters, you may want to change the estimand
The solution we offered above, of simply dropping some of the largest clusters, effectively changes the estimand to the average effect of treatment on the sample of clusters that aren’t extreme in size. We discussed this with Guido Imbens, who suggested that a more systematic way of implementing this idea, rather than dropping some clusters, is to weight each observation by the inverse of its cluster size. This changes the estimand from the population average effect to the cluster average effect. It stops a few large clusters from dominating the calculation of the average, since you are instead calculating the average for each cluster and then averaging those. In our context, this means calculating an average wage for each firm, averaging this across firms, and comparing the average firm wages across treatment and control groups. Guido discusses this in his lecture notes on clustered experiments, which he has graciously allowed us to share with readers.
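In practice this amounts to a weighted regression. Here is a small illustrative sketch, with made-up data and column names, of estimating the cluster-average effect by weighting each worker by the inverse of their firm’s size:

```python
# Sketch of estimating the cluster-average effect by weighting each worker
# by the inverse of their firm's size. The toy data below are purely
# illustrative; column names are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
sizes = np.maximum(1, rng.lognormal(0.5, 1.3, size=500).astype(int))  # skewed firm sizes
firm_id = np.repeat(np.arange(500), sizes)
treat_firm = rng.permutation(np.arange(500) % 2)
df = pd.DataFrame({
    "firm_id": firm_id,
    "treat": treat_firm[firm_id],
    "wage": rng.normal(size=firm_id.size),
})

# Weight each worker by 1 / (number of workers in their firm), so every firm
# gets equal total weight and the estimand is the cluster-average effect
df["w"] = 1.0 / df.groupby("firm_id")["firm_id"].transform("size")

res = smf.wls("wage ~ treat", data=df, weights=df["w"]).fit(
    cov_type="cluster", cov_kwds={"groups": df["firm_id"]})
print(res.params["treat"], res.bse["treat"])
```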
We re-ran our simulations for the full sample with no stratification to compare the size and power of estimating the cluster-average effect versus the population-average effect. For the cluster-average treatment effect, the size is correct even in the full sample using regression-based clustered standard errors (5.0% of our 3,000 simulations reject at the 5% level, and 9.7% at the 10% level). Power is also much higher for estimating the cluster-average effect than the population-average effect (Figure 3). The MDE for the cluster-average effect is about 0.08 s.d., rather than the 0.26 s.d. MDE for the population-average effect.
Figure 3: Power is much higher for estimating the cluster average effect than the population average effect
Takeaways
· Unequally-sized clusters reduce power, and can make your regression-based tests have incorrect size.
· Trimming the top tail of clusters helps a lot with both problems – if you have a few clusters that are much larger than the rest, you may want to not include them in the experiment, or at the very least create a stratum of the outlier clusters and retain the possibility of estimating the effect with and without those outliers. Alternatively, change the estimand to the cluster average effect (weighting each observation by the inverse of cluster size).
· Stratifying by cluster size did not solve the problem of incorrect size/over-rejection in balance tests.
· Power will be higher for estimating the cluster-average effect size (weighting each observation by the inverse of cluster size) than for the population-average effect size (where each observation gets equal weight).
· Randomization inference does give the correct size with unequally sized clusters, but doesn’t solve the problem of reduced power.
· Ryan would never have discovered these problems ex ante if not for doing simulations – reiterating the importance of simulations for power calculations when your data don’t exactly match the assumptions of the standard formulas. But if you have unequally sized clusters, the adaptation to the standard formula that we note above (incorporating the CV) is also useful.