Power Calculation Software for Randomized Saturation Experiments
One of the things I get asked when people are designing experiments – when they are either interested in or worried about spillover effects – is how to divvy up the clusters into treatment and control, and what share of individuals within treatment clusters to assign as within-cluster controls. The answer seems straightforward – it may look intuitive to assign a third to each group, and I have seen a few designs that have done this – but it turns out that it’s a bit more complicated than that. Until now, there was no software that I am aware of to help with such power calculations...
As a companion to our significantly revamped paper, titled “Optimal Design of Experiments in the Presence of Interference” (Baird, Bohren, McIntosh, and Özler 2016), we have developed software to help researchers conduct power calculations when their experimental design calls for a two-stage randomization (first randomize clusters into different treatment saturations or intensities, then assign individuals to treatment based on the saturations realized in the first stage):
 The dedicated webpage, courtesy of the Policy Design and Evaluation Lab of UC San Diego, comes with a graphical user interface (GUI) for ease of use.
 A video provides a tutorial on how to use it.
 We have also supplied Python, R, and MATLAB code, all of which allow the reader to replicate our findings in the paper, as well as to improve upon our code to conduct optimization that would not be easy or possible to do with the GUI.
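The two-stage randomization described above can be illustrated with a minimal Python sketch. It is not taken from the released code: the function name `randomize_saturation`, the village labels, and the saturations 0, 0.25, and 0.75 are all hypothetical, chosen only to show the mechanics of assigning clusters to saturations first and individuals to treatment second.

```python
import random


def randomize_saturation(cluster_sizes, saturations, seed=0):
    """Two-stage assignment for a randomized saturation design (illustrative).

    cluster_sizes: dict mapping cluster id -> number of individuals.
    saturations:   one saturation "slot" per cluster, e.g. [0, 0, 0.5, 0.5].
    """
    rng = random.Random(seed)
    cluster_ids = list(cluster_sizes)
    if len(saturations) != len(cluster_ids):
        raise ValueError("need exactly one saturation slot per cluster")

    # Stage 1: shuffle the saturation slots across clusters.
    slots = list(saturations)
    rng.shuffle(slots)
    cluster_saturation = dict(zip(cluster_ids, slots))

    # Stage 2: within each cluster, treat a share of individuals equal to
    # the cluster's realized saturation (rounded to a whole person).
    assignment = {}
    for cid, n in cluster_sizes.items():
        pi = cluster_saturation[cid]
        treated = set(rng.sample(range(n), round(pi * n)))
        for i in range(n):
            if pi == 0:
                status = "pure control"
            elif i in treated:
                status = "treated"
            else:
                status = "within-cluster control"
            assignment[(cid, i)] = status
    return cluster_saturation, assignment


# Hypothetical example: six villages of 20 people, three saturations.
sizes = {f"village_{k}": 20 for k in range(6)}
sat, assign = randomize_saturation(sizes, [0, 0, 0.25, 0.25, 0.75, 0.75], seed=42)
```

Each individual ends up in one of three groups: treated, within-cluster control (untreated in a cluster with positive saturation), or pure control (in a cluster with saturation 0).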
Behind the tool is a revamped paper with two new contributions. First, we set up a potential outcomes foundation for our model and map this into the regression models commonly used by applied economists to analyze randomized controlled trials. It is analogous to Athey and Imbens (2016), see Section 2.5 in particular, but for a setting with intracluster correlation and partial interference. The potential outcomes framework allows us to anchor our results firmly in the existing statistics and econometrics of experiments literatures, and provides a bridge between these literatures and the linear regression models used to analyze randomized saturation (RS) designs in practice. As Athey and Imbens (2016) state, sometimes “…it is helpful to take an explicitly causal perspective [on linear regression]. This perspective also clarifies how the assumptions underlying identification of causal effects relate to the assumptions often made in least squares approaches to estimation.”
Second, we added an application section that uses numerical simulations to illustrate the theoretical tools we develop using hypothetical and published study designs. First, we explicitly define and estimate optimal designs for objective functions that include different individual saturation, slope and pooled estimands. We demonstrate the power tradeoffs that arise – based on which estimands the researcher would like to identify and estimate, as well as the relative weights that ze puts on each estimand. We calculate MDEs for randomized saturation designs in published papers and show how these designs affect the power tradeoff between different estimands. For example, we are able to show that if we knew what we know now back in 2007, we could have designed our own Malawi cash transfer experiment (comparing CCTs to UCTs) differently, which would have produced lower MDEs for all estimands of interest.
What are some practical takeaways from the paper? These aren’t easy to summarize without getting technical, so I’ll try my best, but I suggest reading the application section for more clarity:
 If you’re equally interested in identifying the treatment and spillover effect at each saturation (treatment intensity), then you need to allocate more clusters to the extreme saturations. For example, if the treatment saturations in your study are 0, 0.2, 0.4, 0.6, and 0.8, then you need to allocate more clusters to 0.2 and 0.8 than 0.4 and 0.6. This disparity declines with the intracluster correlation (ICC). Given ICC and cluster size, numerical simulations can provide the optimal allocation.
 If you’re only interested in detecting a slope effect, you don’t need a pure control group. In this case, you should have saturations that are pretty extreme and symmetrical about 0.5 – somewhere around 0.1 and 0.9 depending on the ICC. You can add more saturations to test linearity, curvature, etc.

One of the main messages of our paper remains the same as before, but with a new insight: sometimes a researcher will ambitiously design an RS experiment, only to find that treatment or spillover effects do not vary by treatment saturation. This means that presenting the treatment effects by saturation is not that interesting – they’re all more or less the same. The instinct then is to pool all treatment (spillover) observations together to estimate an average (or pooled) treatment (spillover) effect. As the inherent heteroskedasticity of this regression model is no longer an issue when there is no heterogeneity in treatment effects (see Corollary 3 in the paper), it looks like the researcher can recover an average treatment effect at no cost – with the increased power that comes from pooling observations. Unfortunately, this is not the case:
 We show that for the pooled ITT, a partial population experiment, in which there is a pure control group and a single treatment saturation, is optimal. Any deviation from the constant treatment probability reduces power (equivalently, increases the MDE). This is true even when the errors are homoskedastic…
 If the researcher cares equally about treatment and spillover effects, the treatment probability for that single interior saturation is 0.5.
 As for the pure control group, it is never optimal to assign only a third of the clusters to pure control. The optimal range is between 0.41 and 0.5 – again depending on the ICC.
 Bottom line: if the researcher a priori believes that slope effects are small and ICC is high, she is best off selecting a partial population design. More ambitious designs allow you to identify more estimands, but come with a risk of reduced power for pooled effects – should you wish/need to estimate them ex post…
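The 0.41 to 0.5 range for the pure control share can be reproduced with a rough sketch. It assumes a textbook random-effects model (equal cluster sizes, a share `rho` of outcome variance at the cluster level, and simple difference-in-means estimators against pure control) and an objective equal to the sum of the treatment and spillover variances. These assumptions and the function names are for illustration only; this is not the paper's derivation or the released software.

```python
def variances(psi, pi, rho, n, n_clusters=100, sigma2=1.0):
    """Variances of the difference-in-means estimators of the treatment (T)
    and spillover (S) effects in a partial population design.

    psi: share of clusters in pure control.
    pi:  saturation (share treated) in the remaining clusters.
    rho: intracluster correlation; n: individuals per cluster.
    """
    tau2 = rho * sigma2        # between-cluster variance component
    w2 = (1 - rho) * sigma2    # within-cluster variance component
    j_c = psi * n_clusters         # number of pure control clusters
    j_t = (1 - psi) * n_clusters   # number of treated clusters
    control = (tau2 + w2 / n) / j_c
    var_t = (tau2 + w2 / (n * pi)) / j_t + control
    var_s = (tau2 + w2 / (n * (1 - pi))) / j_t + control
    return var_t, var_s


def best_control_share(rho, n, pi=0.5, grid=10_000):
    """Grid search for the pure control share minimizing var_t + var_s."""
    candidates = [k / grid for k in range(1, grid)]
    return min(candidates, key=lambda psi: sum(variances(psi, pi, rho, n)))


for rho in (0.0, 0.05, 0.2, 1.0):
    print(rho, round(best_control_share(rho, n=20), 3))
```

Under these assumptions the optimum is about 0.414 (that is, 1/(1+√2)) when the ICC is 0 and rises to 0.5 as the ICC approaches 1, so a one-third share is never the minimizer.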
Hi Ben, the program has been very useful, thank you. I've been working on a partial population design since we are interested in spillover effects. We are giving equal weight to both the MDE_T and MDE_S, and the program tells us that around 44% of the clusters should go to pure control and the rest to a saturation of 50%. I wanted to see the possible effects on these two MDEs if we had a lower take-up rate than expected. As expected, the MDE_T goes up as the effective saturation rate falls. To my surprise, no matter how much I lowered the saturation for the treated clusters, the MDE_S always decreases. This is, of course, leaving the proportions fixed at the initial minimizing value.
My intuition would tell me that the MDE_S in a partial population design would have an inverted U relation with the saturation rate: both extremes would imply a high variance. I went through the equations in the paper and they match what I see in the program.
Am I missing something? Before I go into calculating the MDEs myself I wanted to know if my intuition is wrong. Any insights would be very appreciated, thanks!
Hi Seb,
No worries about my name, and thanks for correcting. Your 44% for pure control also makes sense: it puts your ICC somewhere between 0 and 0.1. So, all good there...
On the issue of take-up, if I am not misunderstanding, you're lowering the PPE saturation from the optimal 0.5 to lower numbers. What happens then, I think by definition, is that the share of untreated individuals in those clusters is 1 minus that saturation, so it rises. That will lower the SE (and hence the MDE) for the spillover effect on the non-treated (SNT).
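This mechanical effect can be checked with a small sketch. It assumes a textbook random-effects model with equal cluster sizes, difference-in-means estimators against pure control, and a normal approximation for the MDE multiplier; the parameter values (ICC of 0.05, 20 individuals per cluster, 100 clusters) are hypothetical, and the formula is a simplification rather than the one implemented in the software.

```python
from statistics import NormalDist


def mde_partial_population(pi, psi=0.44, rho=0.05, n=20, n_clusters=100,
                           sigma=1.0, alpha=0.05, power=0.8):
    """Return (MDE_T, MDE_S) for a partial population design.

    pi:  saturation in treated clusters; psi: share of pure control clusters.
    Uses MDE = (z_{1-alpha/2} + z_{power}) * SE with normal critical values.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    tau2 = rho * sigma ** 2        # between-cluster variance
    w2 = (1 - rho) * sigma ** 2    # within-cluster variance
    j_c = psi * n_clusters
    j_t = (1 - psi) * n_clusters
    control = (tau2 + w2 / n) / j_c
    # Treated individuals per treated cluster: n * pi.
    se_t = ((tau2 + w2 / (n * pi)) / j_t + control) ** 0.5
    # Within-cluster controls per treated cluster: n * (1 - pi).
    se_s = ((tau2 + w2 / (n * (1 - pi))) / j_t + control) ** 0.5
    return z * se_t, z * se_s


# Hold psi fixed at 0.44 and lower the saturation from 0.5.
for pi in (0.5, 0.4, 0.3, 0.2, 0.1):
    mde_t, mde_s = mde_partial_population(pi)
    print(f"pi={pi:.1f}  MDE_T={mde_t:.3f}  MDE_S={mde_s:.3f}")
```

In this simplified model the only place the saturation enters MDE_S is through the number of within-cluster controls, n(1 − π), which grows as π falls. That is why MDE_S declines monotonically rather than following an inverted U, while MDE_T rises.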
The hidden issue here, which is not addressed in the software, is that you're trying to deal with noncompliance by changing the share of treated. These could actually be different things, in the sense that the spillovers on those randomized out could differ from the spillovers on the noncompliers. This is an issue that gets complicated fast, and if you look at earlier versions of our paper on SSRN, you'll see text that touches on noncompliance.
I hope this helps. Cheers,
Berk.
Berk, not Ben, sorry, iPhone autocorrect!
Hi Berk,
Very interesting post! I have used the program and found it very useful, thanks! I have the following question: what happens with unequal sample sizes per cluster? Say we have 10 clusters and their sizes vary from 50 to 100. Should we use an average cluster size in this case?
Hi,
Tackling this is on our agenda, but we have not dealt with it yet. Variation in cluster size reduces power, so entering the mean or median cluster size will understate the sample you need; you may want to err on the conservative side with your sample sizes. Stata, for example, has a parameter for this in "clustersampsi".
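As a rough guide to how much unequal cluster sizes matter, a commonly used approximation for standard cluster-randomized designs inflates the design effect using the coefficient of variation (cv) of cluster sizes: DEFF = 1 + ((cv² + 1)·m̄ − 1)·ICC, where m̄ is the mean cluster size. The sketch below applies it to hypothetical sizes between 50 and 100. It is not part of the randomized saturation software and ignores the saturation structure entirely.

```python
from statistics import mean, pstdev


def design_effect(cluster_sizes, icc):
    """Approximate design effect allowing for unequal cluster sizes.

    With equal sizes (cv = 0) this reduces to the familiar 1 + (m - 1) * icc.
    """
    m_bar = mean(cluster_sizes)
    cv = pstdev(cluster_sizes) / m_bar  # coefficient of variation of sizes
    return 1 + ((cv ** 2 + 1) * m_bar - 1) * icc


# Hypothetical example: 10 clusters with the same mean size (75).
equal = [75] * 10
unequal = [50, 55, 60, 65, 70, 80, 85, 90, 95, 100]

icc = 0.05  # assumed intracluster correlation
print("equal sizes:  ", round(design_effect(equal, icc), 3))
print("unequal sizes:", round(design_effect(unequal, icc), 3))
```

The ratio of the two design effects indicates roughly how much the sample would need to grow to offset the size variation, which is one way to be "conservative" when only a mean cluster size can be entered.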