Power Calculation Software for Randomized Saturation Experiments



One question I often get from people designing experiments – when they are either interested in or worried about spillover effects – is how to divvy up the clusters into treatment and control, and what share of individuals within treatment clusters to assign as within-cluster controls. The answer seems straightforward – it may look intuitive to assign a third to each group, and I have seen a few designs that do this – but it turns out to be a bit more complicated than that. Until now, there was no software that I am aware of to help with such power calculations...

As a companion to our significantly revamped paper, titled “Optimal Design of Experiments in the Presence of Interference” (Baird, Bohren, McIntosh, and Özler 2016), we have developed software to help researchers conduct power calculations when their experimental design calls for a two-stage randomization (first, randomize clusters into different treatment saturations or intensities; then assign individuals to treatment based on the realized saturations from the first stage):

  • The dedicated webpage, courtesy of the Policy Design and Evaluation Lab of UC San Diego, comes with a graphical user interface (GUI) for ease of use.
  • A video provides a tutorial on how to use it.
  • We have also supplied Python, R, and MATLAB code, all of which allow the reader to replicate our findings in the paper, as well as to improve upon our code to conduct optimization that would not be easy or possible to do with the GUI. 
Using the GUI is simple: the researcher specifies an objective (i.e., which estimands to identify and the relative importance of each) and the software calculates the optimal design. Alternatively, the researcher can supply a specific design and the software calculates the minimum detectable effects (MDEs) for the estimands that design identifies. Credit goes to Patrick Staples for creating the GUI and coding in Python and R.
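For readers who prefer code to a GUI, the two-stage structure is easy to sketch in Python. The following is a minimal illustration of the assignment mechanism only – it is not the tool's actual code, and the saturations, shares, and cluster sizes are made-up inputs:

```python
import random

def randomized_saturation_assignment(n_clusters, cluster_size,
                                     saturations, cluster_shares, seed=0):
    """Illustrative two-stage RS assignment: clusters are drawn into
    saturation bins, then individuals within each cluster are assigned
    to treatment at the cluster's realized saturation."""
    rng = random.Random(seed)
    # Stage 1: allocate clusters across the saturation bins.
    bins = []
    for sat, share in zip(saturations, cluster_shares):
        bins += [sat] * round(share * n_clusters)
    rng.shuffle(bins)
    design = []
    for c, sat in enumerate(bins):
        # Stage 2: treat a `sat` share of individuals in this cluster.
        n_treated = round(sat * cluster_size)
        statuses = ["T"] * n_treated + ["S"] * (cluster_size - n_treated)
        rng.shuffle(statuses)  # "S" = within-cluster (spillover) control
        design.append({"cluster": c, "saturation": sat, "status": statuses})
    return design

# Example: 40% of clusters as pure control, 60% at a 0.5 saturation.
design = randomized_saturation_assignment(10, 20, [0.0, 0.5], [0.4, 0.6])
```

Individuals in saturation-0 clusters form the pure control group, while the "S" individuals in treated clusters identify spillovers on the untreated.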

Behind the tool is a revamped paper with two new contributions. First, we set up a potential outcomes foundation for our model and map this into the regression models commonly used by applied economists to analyze randomized controlled trials. It is analogous to Athey and Imbens (2016) – see Section 2.5 in particular – but for a setting with intra-cluster correlation and partial interference. The potential outcomes framework allows us to anchor our results firmly in the existing statistics and econometrics of experiments literatures, and provides a bridge between these literatures and the linear regression models used to analyze randomized saturation (RS) designs in practice. As Athey and Imbens (2016) state, sometimes “…it is helpful to take an explicitly causal perspective [on linear regression]. This perspective also clarifies how the assumptions underlying identification of causal effects relate to the assumptions often made in least squares approaches to estimation.”

Second, we added an application section that uses numerical simulations to illustrate the theoretical tools we develop, using hypothetical and published study designs. First, we explicitly define and estimate optimal designs for objective functions that include different individual saturation, slope, and pooled estimands. We demonstrate the power trade-offs that arise, based on which estimands the researcher would like to identify and estimate, as well as the relative weights that she puts on each estimand. We then calculate MDEs for randomized saturation designs in published papers and show how these designs affect the power trade-off between different estimands. For example, we show that if we had known in 2007 what we know now, we could have designed our own Malawi cash transfer experiment (comparing CCTs to UCTs) differently, in a way that would have produced lower MDEs for all estimands of interest.

What are some practical takeaways from the paper? These aren’t easy to summarize without getting technical, so I’ll try my best, but I suggest reading the application section for more clarity:
  • If you’re equally interested in identifying the treatment and spillover effects at each saturation (treatment intensity), then you need to allocate more clusters to the extreme saturations. For example, if the treatment saturations in your study are 0, 0.2, 0.4, 0.6, and 0.8, then you need to allocate more clusters to 0.2 and 0.8 than to 0.4 and 0.6. This disparity declines with the intra-cluster correlation (ICC). Given the ICC and cluster size, numerical simulations can provide the optimal allocation.
  • If you’re only interested in detecting a slope effect, you don’t need a pure control group. In this case, you should have saturations that are pretty extreme and symmetrical about 0.5 – somewhere around 0.1 and 0.9 depending on the ICC. You can add more saturations to test linearity, curvature, etc.
  • One of the main messages of our paper remains the same as before, but with a new insight: sometimes a researcher will ambitiously design an RS experiment, only to find that treatment or spillover effects do not vary by treatment saturation. This means that presenting the treatment effects by saturation is not that interesting – they’re all more or less the same. The instinct then is to pool all treatment (spillover) observations together to estimate an average (or pooled) treatment (spillover) effect. As the inherent heteroskedasticity of this regression model is no longer an issue when there is no heterogeneity in treatment effects (see Corollary 3 in the paper), it looks like the researcher can recover an average treatment effect at no cost – with the increased power that comes from pooling observations. Unfortunately, this is not the case:
    • We show that for the pooled ITT, a partial population experiment, in which there is a pure control group and a single treatment saturation, is optimal. Any deviation from the constant treatment probability reduces power or increases the MDE. This is true even when the errors are homoscedastic…
    • If the researcher cares equally about treatment and spillover effects, the treatment probability for that single interior saturation is 0.5.
    • As for the pure control group, it is never optimal to assign only a third of the clusters to pure control. The optimal range is between 0.41 and 0.5 – again depending on the ICC.
    • Bottom line: if the researcher a priori believes that slope effects are small and ICC is high, she is best off selecting a partial population design. More ambitious designs allow you to identify more estimands, but come with a risk of reduced power for pooled effects – should you wish/need to estimate them ex post…
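For intuition on how cluster size and the ICC feed into these power calculations, here is a generic Python sketch using the textbook design-effect formula for a simple treated-versus-pure-control comparison. These are the standard clustered-MDE expressions, not the paper's RS-specific variance formulas, and all parameter values are illustrative:

```python
from statistics import NormalDist

def clustered_mde(k_treat, k_control, cluster_size, icc,
                  sigma=1.0, alpha=0.05, power=0.8):
    """Minimum detectable effect (in units of sigma) for a difference in
    means between k_treat treated and k_control pure-control clusters,
    inflating the variance by the design effect 1 + (m - 1) * ICC."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    deff = 1 + (cluster_size - 1) * icc
    n_t, n_c = k_treat * cluster_size, k_control * cluster_size
    return z * sigma * (deff * (1 / n_t + 1 / n_c)) ** 0.5

# The clustering penalty: the same sample detects much less at ICC = 0.1.
mde_iid = clustered_mde(50, 50, 20, icc=0.0)
mde_icc = clustered_mde(50, 50, 20, icc=0.1)
```

With 50 clusters of 20 per arm, the MDE is about 0.125σ at ICC = 0 but roughly 0.21σ at ICC = 0.1 – which is why the ICC governs so many of the trade-offs above.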
We hope that you find the software useful in designing your experiments and that the revised paper helps with the underlying ideas. If you do use the GUI, please let us know about your experience so that we can improve it in subsequent versions. Also, if you improve our code using Python or R, please share it with us so that we can include it at our website.


Berk Özler

Lead Economist, Development Research Group, World Bank

July 27, 2017

Hi Ben, the program has been very useful, thank you. I've been working on a partial population design since we are interested in spillover effects. We are giving equal weight to both the MDE_T and MDE_S, and the program gives us that the proportions should be around 44% of the clusters for pure control and the rest for saturation at a 50% rate. I wanted to see the possible effects on these two MDEs if we had a lower take-up rate than expected. As expected, the MDE_T goes up as the effective saturation rate falls. To my surprise, no matter how much I lowered the saturation for the treated clusters, the MDE_S always decreased. This is of course leaving the proportions fixed to the initial minimizing value.
My intuition would tell me that the MDE_S in a partial population design would have an inverted U relation with the saturation rate: both extremes would imply a high variance. I went through the equations in the paper and they match what I see in the program.
Am I missing something? Before I go into calculating the MDEs myself I wanted to know if my intuition is wrong. Any insights would be very appreciated, thanks!

Berk Ozler
July 27, 2017

Hi Seb,
No worries about my name - thanks for correcting. And your 44% for pure control also makes sense, which puts your ICC somewhere between 0 and 0.1. So, all good there...
On the issue of take-up, if I am not misunderstanding, you're lowering the PPE saturation from the optimal 0.5 to lower values. What happens then, I think by definition, is that the untreated share is 1 minus that saturation. So, that will lower the SE (MDE) for the SNT.
The hidden issue here, which is not addressed in the software, is that you're trying to deal with non-compliance by changing the share of treated. These could actually be different things, in the sense that the spillovers onto the randomized-out could differ from the spillovers onto the non-compliers. This is an issue that gets complicated fast - and, if you look at earlier versions of our paper on SSRN, you'll see text that touches on the issue of non-compliance.
I hope this helps. Cheers,

July 27, 2017

Berk, not Ben, sorry, iphone autocorrection!

December 26, 2017

Hi Berk,
very interesting post! I have used the program and found it very useful, thanks! I have the following question: what happens with unequal sample sizes per cluster? Say we have 10 clusters and their sizes vary from 50 to 100. Should we use an average cluster size in this case?

Berk Ozler
December 26, 2017

Tackling this is on our agenda, but we have not dealt with it yet. Variation in cluster size will reduce power, so you may want to be on the conservative side with your sample sizes: entering the mean or median cluster size will underestimate what is needed. Stata, for example, has a parameter for this in "clustersampsi".
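A rough way to quantify that penalty - an approximation from the cluster-randomized trial literature rather than something our tool computes - is the unequal-cluster-size design effect 1 + ((cv^2 + 1)*m_bar - 1)*ICC, where m_bar is the mean cluster size and cv is the coefficient of variation of the sizes. A small Python sketch with made-up sizes matching your question:

```python
from statistics import mean, pstdev

def deff_unequal(cluster_sizes, icc):
    """Approximate design effect with unequal cluster sizes:
    1 + ((cv^2 + 1) * m_bar - 1) * icc. With equal sizes this
    reduces to the usual 1 + (m - 1) * icc."""
    m_bar = mean(cluster_sizes)
    cv = pstdev(cluster_sizes) / m_bar  # coefficient of variation of sizes
    return 1 + ((cv ** 2 + 1) * m_bar - 1) * icc

# Ten clusters with sizes from 50 to 100 (illustrative), ICC = 0.05:
sizes = [50, 60, 65, 70, 75, 80, 85, 90, 95, 100]
penalty = deff_unequal(sizes, 0.05)
equal = deff_unequal([mean(sizes)] * 10, 0.05)  # equal clusters at the mean size
```

Here the unequal-size design effect (4.95) exceeds the equal-size benchmark at the mean cluster size of 77 (4.80), which is exactly why plugging in the average understates the required sample.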