Published on Development Impact

My practical tips for designing and analyzing powerful experiments


I have a new paper coming out in a symposium on power calculations in the journal Fiscal Studies, which collects my tips for applied researchers on designing and analyzing powerful experiments. It pulls together advice I’ve offered on this blog over the years, as well as lessons from classes I teach on experimental design and from my own experience. The goal is to offer practical advice on how to improve statistical power in randomized experiments through choices and actions researchers can take at the design, implementation, and analysis stages. A key message is that it does not make sense to talk of “the” power of an experiment. A study can be well-powered for one outcome or estimand, but not others, and a fixed sample size can yield very different levels of power depending on researcher decisions.

Most discussions of power calculations take the estimand, treatment, outcome of interest, variance, and intra-cluster correlation as given, with the focus then on either calculating the minimum detectable effect (MDE) for a given sample size N, or calculating the sample size N needed to achieve a desired MDE. Instead, I argue that choices made at the design, implementation, and analysis stages of an experiment can change all of these inputs, and thereby improve statistical power. I’ll summarize some of the advice here, with the paper providing more details for researchers aiming to get more out of their sample sizes or budgets.
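
To fix ideas, here is a minimal sketch of those two standard calculations, using the usual normal-approximation formulas for a two-arm, individual-level randomization with equal allocation. The variance, significance level, and power used below are illustrative assumptions, not values from the paper.

```python
# Minimal power-calculation sketch (normal approximation, individual-level
# randomization, equal allocation). All numbers below are illustrative.
import numpy as np
from scipy.stats import norm

alpha, power = 0.05, 0.80                        # conventional significance level and power
z = norm.ppf(1 - alpha / 2) + norm.ppf(power)    # roughly 2.80 for 5% / 80%

def mde(n, sd):
    """Minimum detectable effect for a given sample size n per arm."""
    return z * sd * np.sqrt(2.0 / n)

def n_per_arm(target_mde, sd):
    """Sample size per arm needed to detect a given effect."""
    return int(np.ceil(2.0 * (z * sd / target_mde) ** 2))

sd = 1.0                           # outcome standard deviation (assumed)
print(mde(500, sd))                # MDE with 500 units per arm: about 0.18 SD
print(n_per_arm(0.15, sd))         # units per arm to detect a 0.15 SD effect: about 700
```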

1. At the Design Stage

· Change the estimand: The estimand of interest is usually the intent-to-treat effect for individuals in some specified population. But you will often have more power to estimate impacts for a subset of this population. In particular, I recommend removing in advance units that are likely to attrit, to not comply, or to be outliers. I give examples in the paper of how decreasing the sample size in this way can actually increase power (see the first sketch after this list).

· Choose outcomes that you have more power to detect: theoretical discussions of experimental methodology typically write down a single outcome Y of interest. In practice there are often many possible outcomes of interest, and the same study can be well-powered to detect impacts on some outcomes while having very little power to detect impacts on others. I suggest measuring outcomes closer to the intervention in the causal chain; considering winsorized or binary versions of continuous outcomes (the winsorizing point is also illustrated in the first sketch after this list); and using multiple rounds of post-treatment measurement. In clustered studies, favor outcomes with low intra-cluster correlations (ICCs), and keep cluster sizes from being too unequal.

· Be judicious in how many and which treatments are used: do not add too many arms, make each treatment more intensive, and where possible design arms so that they can be pooled.

· Randomize in a way that reduces the variance: use stratified or matched-quadruplet random assignment (a sketch of matched-quadruplet assignment follows this list). I also discuss how researcher choices can affect the intra-cluster correlation, and avenues to pursue here.
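
As a rough illustration of the trimming and winsorizing points above, the following sketch simulates a hypothetical right-skewed outcome (think profits or sales) and compares the minimum detectable effect under the raw, winsorized, and trimmed versions. The distribution, sample sizes, and cut-offs are made up for illustration, and the actual gains will depend on your outcome. (For clustered designs, recall that the MDE is also inflated by roughly the square root of the design effect 1 + (m − 1)ρ, where m is the cluster size and ρ the ICC, which is why low-ICC outcomes help.)

```python
# Sketch: how winsorizing or trimming a heavy-tailed outcome changes the MDE.
# The outcome is a hypothetical right-skewed variable, simulated as lognormal
# purely for illustration; sample sizes and cut-offs are also illustrative.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=1.0, size=2000)   # assumed control-group outcome

z = norm.ppf(0.975) + norm.ppf(0.80)

def mde(sd, n_per_arm):
    return z * sd * np.sqrt(2.0 / n_per_arm)

# Raw outcome
print("raw:        ", mde(y.std(), n_per_arm=1000))

# Winsorize at the 99th percentile: same N, smaller SD, smaller MDE
y_w = np.minimum(y, np.quantile(y, 0.99))
print("winsorized: ", mde(y_w.std(), n_per_arm=1000))

# Drop the top 1% before random assignment: fewer units, but the SD falls by more
y_t = y[y <= np.quantile(y, 0.99)]
print("trimmed:    ", mde(y_t.std(), n_per_arm=990))
```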
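
And here is a sketch of one way to implement matched-quadruplet assignment: sort units on a baseline variable that predicts the outcome, group them into consecutive fours, and randomly assign two of each four to treatment. The data frame and column names are hypothetical.

```python
# Sketch of matched-quadruplet random assignment. Column names are hypothetical.
import numpy as np
import pandas as pd

rng = np.random.default_rng(123)
df = pd.DataFrame({"firm_id": range(400),
                   "baseline_sales": rng.lognormal(10, 1, 400)})

# Sort on the baseline predictor and form consecutive groups of four
df = df.sort_values("baseline_sales").reset_index(drop=True)
df["quad"] = df.index // 4

# Randomly assign 2 of the 4 units in each quadruplet to treatment
df["treat"] = 0
for _, idx in df.groupby("quad").groups.items():
    chosen = rng.choice(idx, size=len(idx) // 2, replace=False)
    df.loc[chosen, "treat"] = 1
# In the analysis, the quadruplet identifiers enter as strata fixed effects.
```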

 

2. At the Implementation Stage

· Boost take-up and reduce non-compliance: screen on interest, use reminders and other approaches that make the treatment easy to receive, and use a waitlist to reduce the chance of the control group getting treated. Low take-up dilutes the intent-to-treat effect and sharply inflates the sample size needed (see the first sketch after this list).

· Reduce attrition: collect multiple ways of contacting participants, make multiple survey attempts, and look for other data sources like administrative or web data.

· Reduce the variance through better measurement: incorporate data checks and triangulation approaches, combine multiple measures (see the second sketch after this list), and be careful with measurement issues such as outcomes with many zeros.
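
To see why take-up matters so much for power, a back-of-the-envelope sketch: with one-sided non-compliance the intent-to-treat effect is roughly the take-up rate times the effect on those who get treated, so the sample needed to detect it scales with one over the take-up rate squared. The numbers below are purely illustrative.

```python
# Why take-up matters: with one-sided non-compliance the ITT effect is roughly
# (take-up rate) x (effect on the treated), so the sample needed to detect it
# scales with 1 / take-up^2. Illustrative numbers only.
effect_on_treated = 0.20    # assumed effect in SD units for those who take up
for take_up in (1.0, 0.75, 0.5, 0.25):
    itt = take_up * effect_on_treated
    relative_n = (1 / take_up) ** 2     # sample size relative to full take-up
    print(f"take-up {take_up:.2f}: ITT = {itt:.3f} SD, need {relative_n:.1f}x the sample")
```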
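
On better measurement, a small sketch of why combining multiple measures helps: averaging k noisy measures of the same underlying outcome shrinks the measurement-error part of the variance by a factor of k, and the MDE falls with the outcome's standard deviation. The signal and noise variances assumed here are arbitrary.

```python
# Sketch: averaging k noisy measures of the same underlying outcome shrinks the
# measurement-error component of the variance by 1/k. All parameters assumed.
import numpy as np

signal_var, noise_var = 1.0, 1.0   # variance of the true outcome vs. one measure's error
for k in (1, 2, 4):
    total_var = signal_var + noise_var / k
    print(f"{k} measure(s): outcome SD = {np.sqrt(total_var):.2f}")
# The MDE is proportional to this SD, so better measurement buys power directly.
```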

 

3. At the Analysis Stage

Having designed and implemented an experiment, there are still further choices researchers can make at the analysis stage to improve statistical power and learn as much as possible from the experiment. While these choices are implemented at the analysis stage, they should be pre-specified in advance where possible; otherwise they can be used as additional exploratory analysis.

· Consider alternative test statistics: There may be more power to detect impacts in the subgroup of units more likely to take up treatment. For example, Coussens and Spiess provide a method that estimates compliance with treatment as a function of baseline covariates, and then constructs a weighted IV estimator that puts more weight on the units that are more likely to comply. It is also common to use indices of outcomes to deal with multiple testing and to aggregate different proxies for an overall concept; standard measures typically put equal weight on each component or weight by the inverse variance. Anderson and Magruder (2023) provide an alternative index, constructed via a data-driven sample-splitting approach, which maximizes power for detecting mean treatment effects on a subset of indicators by putting more weight on indicators with larger mean differences. This can be more powerful for detecting whether the treatment had some effect in a particular domain, although interpreting the index and its magnitude is more difficult. Another test with multiple variables is the omnibus test of Young (2019), a test of overall experimental significance: the null hypothesis is that no treatment had any effect on any outcome for any unit (a simplified permutation sketch of this idea follows this list). There may then be occasions where we can learn that the treatment had some effect, even if the experiment is underpowered for detecting impacts on any specific outcome.

· Choose control variables to further reduce variance: the lagged dependent variable will typically be the most important, and trusty ANCOVA with strata controls is my default (an ANCOVA regression sketch follows this list). PDS Lasso is unlikely to improve power much over ANCOVA in most settings, but there may sometimes be other controls that strongly predict the outcome of interest, in which case adding them can also improve power.

· Incorporate outside information through Bayesian analysis: use informative priors from the literature or elicited from experts or participants. I blogged about my work with Leo Iacovone and Rachael Meager, which illustrates how to do this. That paper is now forthcoming in Econometrica, and our replication materials include code for doing this in R/Stan, as well as a basic version in Stata (I’ll try to blog about the Stata code after the summer). A toy illustration of the idea follows this list.
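
As a rough illustration of the omnibus idea, here is a simplified permutation-style test of the sharp null that the treatment had no effect on any outcome for any unit: it re-randomizes the treatment indicator and compares a joint statistic (the sum of squared t-statistics across outcomes) with its permutation distribution. This is only a sketch in the spirit of such tests, not the exact procedure in Young (2019), and the data are simulated.

```python
# Simplified permutation-style omnibus test: under the sharp null of no effect
# on any outcome for any unit, re-randomize the treatment indicator and compare
# a joint statistic across outcomes with its permutation distribution.
# A sketch in the spirit of an omnibus test, not Young's exact procedure.
import numpy as np

rng = np.random.default_rng(1)
n, n_outcomes = 500, 6
treat = rng.permutation(np.repeat([0, 1], n // 2))
Y = rng.normal(size=(n, n_outcomes))
Y[treat == 1] += 0.08                      # small assumed effect on every outcome

def joint_stat(t, Y):
    diff = Y[t == 1].mean(axis=0) - Y[t == 0].mean(axis=0)
    se = np.sqrt(Y[t == 1].var(axis=0, ddof=1) / (t == 1).sum()
                 + Y[t == 0].var(axis=0, ddof=1) / (t == 0).sum())
    return np.sum((diff / se) ** 2)        # sum of squared t-statistics

observed = joint_stat(treat, Y)
draws = [joint_stat(rng.permutation(treat), Y) for _ in range(2000)]
p_value = np.mean(np.array(draws) >= observed)
print(p_value)
```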
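
For the ANCOVA point, a minimal sketch of the regression I have in mind: the follow-up outcome on treatment, the baseline (lagged) outcome, and strata dummies, with robust standard errors. The dataset and column names (y1, y0, treat, stratum) are placeholders, not from any real study.

```python
# ANCOVA-style estimation: follow-up outcome on treatment, the lagged baseline
# outcome, and randomization-strata dummies. Column names are placeholders.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("experiment_data.csv")      # hypothetical dataset
model = smf.ols("y1 ~ treat + y0 + C(stratum)", data=df)
result = model.fit(cov_type="HC2")           # heteroskedasticity-robust SEs
print(result.summary().tables[1])
```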
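
Finally, a toy illustration of what an informative prior buys you: a conjugate normal-normal update that precision-weights a prior belief about the treatment effect against the experimental estimate. This is only meant to convey the idea; the full Bayesian analysis in the paper with Iacovone and Meager uses hierarchical models estimated in Stan, and the numbers below are made up.

```python
# Toy conjugate normal-normal update: precision-weight an informative prior on
# the treatment effect against the experimental estimate. Numbers are made up.
import numpy as np

prior_mean, prior_sd = 0.10, 0.10   # e.g. elicited from experts (assumed)
estimate, se = 0.25, 0.15           # treatment effect and SE from the experiment (assumed)

w = (1 / prior_sd**2) / (1 / prior_sd**2 + 1 / se**2)   # weight on the prior
post_mean = w * prior_mean + (1 - w) * estimate
post_sd = np.sqrt(1 / (1 / prior_sd**2 + 1 / se**2))
print(f"posterior: {post_mean:.3f} (sd {post_sd:.3f})")
```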

 

Hopefully the paper is useful as you design and analyze your next experiment. There is also a subsection of the paper that offers tips for doing power calculations in grant proposals, pre-analysis plans, and registered reports, pointing out several common areas where I find many proposals fall short.


David McKenzie

Lead Economist, Development Research Group, World Bank
