I gave a virtual seminar today on practical issues in conducting power calculations for DeNeB – the Development Economics Network Berlin, joint with Göttingen University. I received some specific questions about issues that researchers are facing in doing power calculations, and thought I’d share some of these and my thoughts on them here, in case they are also useful for others.
1) Are power calculations just for experiments, or do you also conduct power calculations in the case of non-experimental impact evaluations such as DiD, matching, RDD, etc? If so, do you need to adjust the calculations in any way?
Yes, power calculations are still useful when planning non-experimental impact evaluations. I think they are most useful when you need to decide whether it is worth the time and expense of collecting data – which could be new survey data, or negotiating access to administrative or commercial databases. This applies both to ex ante impact evaluations you are planning, and to cases where the intervention has occurred and you are deciding whether to obtain outcome data. As a specific example, Miriam Bruhn and I worked in Poland to evaluate a program that used a scoring cutoff to allocate grants to consortia of firms and research entities. We had data on the scores and grants awarded, and needed to decide whether the sample size was large enough to warrant conducting a follow-up survey to measure outcomes of receiving these innovation grants.
I previously blogged about doing power calculations for propensity score matching, and a three-part post on doing power calculations for RDD (part 1, part 2, part 3). The bottom line from these discussions is that you typically need larger samples with these non-experimental methods than you would with a pure RCT. For example, with matching the problem is that you may end up with a lot of comparison group units that lie outside the common support or are not good matches for the treated units, so you effectively have to throw away some units. With RDD, there are several issues the posts discuss – e.g. fuzzy RDDs have an effect a bit like incomplete take-up, which lowers power; doing optimal bandwidth selection also effectively involves not using some units for identification, etc. I haven’t looked closely at doing power calculations for DiD, so would need to think more about the issues here.
2) When would you conduct simulations and deviate from the parametric standard power calculation methods?
Typically when I am doing power calculations I do not have a lot of data on the sample or on a similar sample, and so I find it easier to make assumptions on a few key parameters and look at sensitivity using parametric formulas than to attempt to explicitly model and simulate the full data generating process. But the simulation approach can be really helpful when you are moving beyond the simple single treatment and single control with one-level randomization to more complicated designs.
For example, the DeclareDesign packages in R are intended for this purpose, and the authors give an example in a blog post they wrote for us of deciding between a two-by-two factorial trial and a three-arm experiment where the interaction group is omitted. They note a lot of other types of cases where they have used these simulations, such as IVs, blocked experiments, etc. Indeed, I think simulations are going to be very useful for the growing use of multi-stage experiments (e.g. randomize villages or markets to treatment or control, and then individual units within these clusters), as well as for making decisions about adaptive experimentation. A final use case I see is when you are going to have small enough samples (or only a few clusters) that you are worried that small-sample standard error methods or permutation tests are going to be needed, rather than the typical standard errors.
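At its simplest, the simulation approach just codes up the assumed data generating process and counts how often the estimator rejects the null. A minimal sketch in Python (the parameter values are illustrative, and a real design would replace this simple two-arm DGP with the factorial arms, clusters, or blocking under consideration):

```python
import numpy as np
from scipy import stats

def simulated_power(n_per_arm=500, effect=100, sd=700, alpha=0.05,
                    n_sims=2000, seed=123):
    """Estimate power by simulating the experiment many times and
    counting how often a two-sample t-test rejects at level alpha."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        control = rng.normal(1000, sd, n_per_arm)
        treated = rng.normal(1000 + effect, sd, n_per_arm)
        _, p = stats.ttest_ind(treated, control)
        rejections += p < alpha
    return rejections / n_sims

print(simulated_power())  # Monte Carlo estimate of power
```

The payoff of this framing is that swapping in a more complicated design only requires changing the data generating process and the estimator, not the power logic.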
3) Let’s say you conducted an RCT on an outcome that has not been examined before. Your power calculations show that, given your sample size, you should be able to detect an effect of a certain assumed size. After collecting the data you do not find statistically significant effects, and the coefficients are much smaller than expected. Since the outcome has not been examined before, how can you tell whether you failed to detect an effect because there is no effect, versus because your assumed effect size (and hence your assumed power) was wrong? What tests would you conduct?
This question relates to the issue that we have blogged about before of whether one should do ex-post power calculations, and the issues that arise when doing so. What you do not want to do is take the coefficient size of the treatment effect you estimated and then try to calculate ex post what was the power of detecting that effect (the blog post explains why). Instead, you can use the estimated standard errors and multiply them by 2.8 to get the ex-post MDE, and think about how this compares to costs, to outcomes of other interventions, etc.
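The 2.8 multiplier is just the sum of the critical values for a 5% two-sided test and 80% power (1.96 + 0.84); a small illustrative snippet, with function and variable names of my own choosing:

```python
from scipy.stats import norm

def ex_post_mde(se, alpha=0.05, power=0.80):
    """Minimum detectable effect implied by an estimated standard error:
    MDE = (z_{1-alpha/2} + z_{power}) * SE, which is ~2.8 * SE at the
    conventional 5% significance level and 80% power."""
    return (norm.ppf(1 - alpha / 2) + norm.ppf(power)) * se

print(ex_post_mde(50))  # ~140: with an SE of 50, effects below ~140 were not detectable
```

Changing `alpha` or `power` shows how the multiplier moves, e.g. targeting 90% power raises it to about 3.24.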
4) How should I account for stratification and pre-specified controls in my power calculations?
With simple randomization, we estimate the treatment effect by running a regression like:
Y = a + b*Treat + e
Power then depends on the variability in e – a more variable outcome makes it harder to detect the changes coming from treatment.
Now if we stratify on a set of variables that we think are likely to predict the outcome of interest, then we would typically analyze the data by running the regression:
Y = a + b*Treat + c’X + u
where X is the vector of strata controls. These controls soak up some of the variation in Y, and power then depends on the variability of u, which is less than the variability of e because we have explained some of the variation in the outcome.
To account for this in our power calculations, we then need to know how much these controls reduce the residual variance. You may have baseline data or other datasets that let you estimate this directly. But in most of my applications I need to rely on experience here – I know that outcomes like employment or business profits are quite hard to predict, and so controls might only capture 10-30% of the variation in the outcome. If Var(u) = 0.7*Var(e), this means we would use sqrt(0.7) = 0.837 times the standard deviation we would otherwise have used in our power calculations.
As a very specific example, if we suppose we have a sample size of 500 treated and 500 control, a control mean of 1000, and a standard deviation of 700, then the power for detecting a 10% increase in the outcome is 0.62. Stratifying and including controls which reduce the variance by 30% would lead us to use a standard deviation of 586 instead of 700 in our calculations, giving power of 0.77.
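These numbers can be reproduced with the standard normal-approximation power formula for a two-sided comparison of means; a sketch (the function here is my own, not from a particular package):

```python
from math import sqrt
from scipy.stats import norm

def power_two_sample(effect, sd, n_per_arm, alpha=0.05):
    """Normal-approximation power for a two-sided, two-sample
    comparison of means with equal arm sizes."""
    z = effect / (sd * sqrt(2 / n_per_arm))     # standardized effect
    za = norm.ppf(1 - alpha / 2)                # two-sided critical value
    return norm.cdf(z - za) + norm.cdf(-z - za)

# 10% increase on a control mean of 1000, sd 700, 500 units per arm
print(round(power_two_sample(100, 700, 500), 2))              # 0.62
# controls absorb 30% of the variance: sd -> 700*sqrt(0.7) ~ 586
print(round(power_two_sample(100, 700 * sqrt(0.7), 500), 2))  # 0.77
```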
Two caveats on the above. First, this is how I would think about it when you have a decent sized sample and not too many strata. Otherwise there is a degrees of freedom correction in estimating variances, and if you end up overstratifying on variables not very correlated with the outcome, power using the estimated variance can actually worsen (see my paper with Miriam Bruhn on this). Second, I blogged recently about issues using matched pair randomization or small strata with clustered randomization, and simulations may be useful to deal with those issues.
5) I have a clustered RCT. Around 10 percent more clusters were randomly selected and added after baseline due to administrative reasons. I plan to use ANCOVA for treatment effect estimation. Can I include the new clusters in the model?
This is not strictly a power calculations question, but is related. One of the virtues of the ANCOVA framework is that it controls for the baseline outcomes only to the extent that they are useful in predicting the follow-up outcome, and so you can be quite flexible with regard to changes in measurement (as blogged about here). So in this case, I would treat this as a case of missing the baseline outcome variable for 10% of the sample, and so dummy it out. That is, create a dummy variable Missingbaseline which takes 1 for these 10% of clusters, and 0 otherwise. Then replace the missing baseline outcome with the sample median (or 0, it doesn’t matter) for this 10% of the sample. You would then run the ANCOVA:
Y = a + b*Treat + c*Ybaseline + d*Missingbaseline + e
6) I designed my RCT with one baseline and two end-line survey rounds to improve power/precision, as suggested by your 2012 paper. However, the first round was suspended unexpectedly due to the Covid-19 crisis, and resumed 4 months later. The sample is now divided into two parts, and the numbers of control and treated units are not balanced in each. Should I collect the second endline for all units at once, or should I follow the current pattern? The time interval between the two rounds was set to minimize autocorrelation. If I collect the entire sample all at once, I may increase autocorrelation, but I can reduce bias – is that correct?
My paper shows that you can boost power by collecting multiple follow-up rounds of an outcome, and this boost in power is greater when these multiple follow-ups are less correlated (if they were perfectly autocorrelated, additional rounds would provide no new information; if they are uncorrelated, each round is like getting another independent draw to average out noise). But this only gives you a gain in power if you are estimating a pooled treatment effect across rounds. The more time you leave between follow-up rounds, the lower the autocorrelation may be, but the less interesting a pooled average may be, since we may then start becoming interested in the dynamics of treatment effects.
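To see the size of this gain: if m follow-up rounds have pairwise autocorrelation rho, the variance of the round-averaged outcome is sigma^2*(1+(m-1)*rho)/m, so the pooled-effect power can be sketched as follows (a normal approximation with equal arm sizes and equal rounds; illustrative only, not the full formula from the paper):

```python
from math import sqrt
from scipy.stats import norm

def pooled_power(effect, sd, n_per_arm, m_rounds, rho, alpha=0.05):
    """Approximate power for a treatment effect pooled over m follow-up
    rounds whose outcomes have pairwise autocorrelation rho."""
    eff_sd = sd * sqrt((1 + (m_rounds - 1) * rho) / m_rounds)
    z = effect / (eff_sd * sqrt(2 / n_per_arm))
    za = norm.ppf(1 - alpha / 2)
    return norm.cdf(z - za) + norm.cdf(-z - za)

# two rounds: perfectly correlated rounds give no gain over one round,
# uncorrelated rounds give a substantial boost
print(pooled_power(100, 700, 500, 2, rho=1.0))
print(pooled_power(100, 700, 500, 2, rho=0.0))
```

With the illustrative numbers from question 4 above, power rises from roughly 0.62 with perfectly correlated rounds to roughly 0.89 with uncorrelated rounds.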
In this case, the issue seems to be that the first follow-up round may have imbalance in when treatment and control units were surveyed. If outcomes vary over time, then this can lead to bias in estimating the round 1 treatment effect, since it will conflate differences from treatment with differences arising from temporal effects on the outcome. This might be corrected by reweighting the sample; if this is done, one could get an unbiased round 1 estimate that is essentially the average effect over rounds 1A and 1B. But then, since we want an unbiased estimate of the round 2 effect, it is better to interview everyone all at once, even if this means some units now have more autocorrelation between round 1 and round 2 than originally anticipated.
However, I would note that the Covid-19 pandemic has been such a shock that for most interventions we will really be interested in the dynamic effects: were treatment effects different during the early part of the pandemic (when the round 1 survey started and was interrupted) than they are now, a couple of years in? So we may not want to pool the two rounds of surveys, or at least will also want to examine dynamics.