One of the more frequent sets of questions I receive about designing impact evaluations concerns what can be done to improve power in a relatively small-n experiment (with similar questions also arising for prospective RDD, DiD, and other studies). The number of experimental units n is often limited by budget constraints, by the capacity of the implementing organization, or naturally by the number of villages eligible for a program or the number of people or firms that apply. Here are some of my main thoughts/tips on approaches to try in order to improve statistical power.
Recall that the statistical power of a test is the probability that the test will reject the null when the null is false – essentially, the probability that you correctly detect an effect in your experiment when one is there. This is basically a question of whether you can separate the signal of a treatment from the noise in the data. Improving power then comes from strengthening the signal and/or reducing the noise. We can do this through the following approaches:
1. Increase the signal by increasing the intensity of the treatment, and maximizing(*) take-up of the treatment.
With a small sample, it can be incredibly difficult to detect small, subtle effects of treatments. This is one of the problems with the replicability of many underpowered psychology and behavioral experiments, where a tiny change in wording, color, or some other barely perceptible feature is used to try to change behavior. In economics I often see people struggling to decide between a bundled program that includes multiple components (an intensive intervention, but perhaps a costly one) and a single-item intervention that is easier to interpret and cheaper, but may not work as well on its own. With a small sample, I would err on the side of the bundle and a more intensive intervention that provides a stronger treatment to detect the impacts of.
Secondly, the treatment is unlikely to have much impact if not many people take it up. With a homogeneous treatment effect, remember the inverse square rule: the sample size you need to detect a given effect size is inversely proportional to the square of the take-up rate – so if only half the sample takes up your treatment, you need 4 times as many units as with 100% take-up. If there are heterogeneous treatment effects and people sort into take-up based on their treatment effect, then power does not fall as drastically with the take-up rate, and (*) if some units have negative treatment effects it is possible for power to be greater with less than 100% take-up than with 100%, but this seems rare in practice.
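To make the inverse square rule concrete, here is a quick sketch (my own illustration, with made-up numbers, assuming homogeneous treatment effects) of how the required sample size blows up as take-up falls:

```python
def required_n(n_full_takeup, takeup_rate):
    """Sample size needed at a given take-up rate, relative to the n
    needed with 100% take-up (homogeneous treatment effects assumed):
    n scales with 1 / takeup_rate**2."""
    return n_full_takeup / takeup_rate ** 2

# Halving take-up quadruples the required sample size.
print(required_n(200, 1.0))   # 200.0
print(required_n(200, 0.5))   # 800.0
print(required_n(200, 0.25))  # 3200.0
```

So a take-up problem can quickly turn a feasible experiment into an infeasible one, which is why maximizing take-up is worth real effort with a small n.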
However, you may have little ability to change what the treatment is or how it is implemented, and be interested in measuring impacts of a policy under real world conditions. Most of my advice is then about different approaches to reducing noise. Think about running a treatment regression like:
Y = a + b*Treat + e
Then the precision we have in estimating b depends on the variance of e. The methods below all try to reduce this variance.
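As a quick illustration of why the variance of e matters (a Monte Carlo sketch of my own, not from any paper): with half the sample treated, the standard error of b is roughly 2·sd(e)/√n, so halving the noise buys as much precision as quadrupling the sample.

```python
import random
import statistics

random.seed(0)

def se_of_b(n, sd_e, reps=2000, b=1.0):
    """Monte Carlo standard error of the OLS estimate of b in
    Y = a + b*Treat + e, with half the sample treated each time.
    The OLS slope on a binary regressor is the difference in means."""
    ests = []
    for _ in range(reps):
        treat_mean = statistics.fmean(2.0 + b + random.gauss(0, sd_e)
                                      for _ in range(n // 2))
        ctrl_mean = statistics.fmean(2.0 + random.gauss(0, sd_e)
                                     for _ in range(n // 2))
        ests.append(treat_mean - ctrl_mean)
    return statistics.pstdev(ests)

print(round(se_of_b(100, sd_e=1.0), 2))  # ~0.20, i.e. 2*sd(e)/sqrt(n)
print(round(se_of_b(100, sd_e=0.5), 2))  # ~0.10: halving noise halves the s.e.
```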
2. Reduce noise by paying careful attention to measurement
One reason the variance of e can be large is measurement error in Y. A whole set of survey design approaches apply here for trying to make sure you measure the concept you wish to measure with the smallest error possible. See our curated survey links for posts on how to improve measurement for many standard outcomes like consumption, income, profits, empowerment, labor supply, etc. Some general things to think about here are (i) embedding consistency checks in the survey so that unusual values get checked; (ii) using techniques such as anchoring and triangulation; (iii) using administrative data where available (although noting this can also have its own measurement error and may not always be an improvement); (iv) being extra careful when dealing with currencies with lots of zeroes; and (v) considering asking multiple questions about the same concept and averaging them together to average out noise (this can make more sense for questions intended to measure knowledge, behaviors, attitudes, practices etc than financial outcomes).
3. Reduce noise by averaging out observations over time (collecting more T)
I have a paper called “Beyond Baseline and Follow-up: The Case for More T in Experiments” which discusses this idea in depth. The idea is that by measuring the outcome of interest at multiple points in time, and then estimating the average treatment effect averaged over the time period, you can average out seasonality, idiosyncratic shocks, and some measurement error, making it easier to detect the signal. This approach works best for outcomes that are not strongly autocorrelated (the more strongly autocorrelated an outcome is, the less new information comes from collecting another time period of data). So if you are relying on surveys and can only ask about the last week or month’s income, consumption, profit, or mental health status, going back a month or 3 months later and collecting another round of data can boost power.
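A back-of-the-envelope way to see this (my own sketch, using the standard formula for the variance of an average of T equicorrelated measurements, with rho as the assumed autocorrelation):

```python
def variance_ratio(T, rho):
    """Variance of the average of T equicorrelated rounds, relative
    to a single round: (1 + (T - 1)*rho) / T."""
    return (1 + (T - 1) * rho) / T

# Four rounds instead of one:
print(variance_ratio(4, 0.1))  # 0.325 -- big gain for noisy outcomes
print(variance_ratio(4, 0.9))  # 0.925 -- little gain for persistent ones
```

So a noisy, weakly autocorrelated outcome like small-firm profits gains a lot from extra rounds, while a persistent outcome like test scores gains little.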
4. Reduce noise by making the sample as homogeneous as possible to start with
The more similar all the units in your study are to start with, the easier it will be to detect a treatment that comes along and makes some of them different. Conversely, if they vary a lot from one another and over time, it can be very hard to separate a signal from all the natural variation among them. So where you have the choice, making the sample as homogeneous as possible will reduce the variance of e. A few strategies in particular can help here:
(a) Reduce the influence of outliers by screening out very different units and not having them be part of the experiment. E.g. if you are doing a study on firms, and most have 1 or 2 workers, but a 100-person firm is also included, whichever group (T or C) gets assigned this large firm will have a much bigger mean employment, and it can then be difficult to detect a change of, say, 0.1 workers. So even with a small sample, you might improve power by reducing the sample further.
As a concrete example, suppose 300 youth apply to a vocational training program, and we are interested in measuring the impact on their income 2 years later. Suppose for the full sample, without an intervention, we have a mean income of $1200 per month, with a S.D. of $800. Then the power to detect a 20% increase in income is 0.738. [In Stata: sampsi 1200 1440, n1(150) n2(150) sd(800) ]. Imagine we instead screened out the 20 highest earning and 10 lowest earning individuals from our program before randomizing, based on baseline and predicted future income. This might reduce the mean to $1100 per month, but the S.D. to $500. Then even though the sample size is now 270 rather than 300, power has increased from 0.738 to 0.951 [sampsi 1100 1320, n1(135) n2(135) sd(500) ].
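For readers without Stata, the same two power calculations can be replicated with the standard normal-approximation formula (my own sketch of what sampsi computes, using only the Python standard library):

```python
import math
from statistics import NormalDist

def two_sample_power(mean1, mean2, sd, n_per_arm, alpha=0.05):
    """Power of a two-sided two-sample test with equal sds and equal
    arm sizes, under a normal approximation (as sampsi uses)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    noncentrality = abs(mean2 - mean1) / (sd * math.sqrt(2 / n_per_arm))
    return NormalDist().cdf(noncentrality - z_crit)

# Full sample of 300 applicants:
print(round(two_sample_power(1200, 1440, 800, 150), 3))  # 0.738
# Screened sample of 270 with a lower s.d.:
print(round(two_sample_power(1100, 1320, 500, 135), 3))  # 0.951
```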
(b) Likewise, in a clustered experiment, you might want to screen out a cluster that is really different in size from the other clusters and have more similarly sized clusters.
(c) If budget constraints are the main issue for your choice of sample size, you may also want to improve power by screening out the observations most likely to attrit, so that your follow-up sample size is as large as possible.
(d) You may then want to focus on just a single industry and size of firm for a firm study, or workers trained in a single vocation for a jobs study, or farmers growing a single type of crop for an agricultural study, etc. – the idea being that more of the shocks they experience will be similar to one another and get captured by the control variables and time effects, making it easier to detect changes in the outcome coming from treatment.
Note that this screening changes the estimand. You are no longer estimating the average treatment effect for the full sample of applicants, but for the applicants whose baseline characteristics are in a certain range. While this reduces external validity, you may not be able to accurately estimate the former anyway (e.g. if you only have one 100-worker firm in your study, you are never going to be able to get a good counterfactual for that one firm, and it is better to say something accurate about the majority of firms that have 1-2 workers).
Note also that this need not mean that the screened-out firms can’t take part in a program, just that you don’t consider them part of the experimental sample. So that single 100-person firm could still be included in the government program, just not in the impact evaluation of that program.
5. Think carefully about the outcomes of interest: power is likely to be higher for outcomes closer in the causal chain, and for less skewed outcomes that are less driven by outliers
I blogged previously about the “funnel of attribution” and the idea that it is easier to detect impacts the closer they are in the causal chain to the intervention. For example, suppose there is a government program designed to increase exports of innovative products by matching researchers with innovative ideas to firms that can help them commercialize these ideas. Then I have in mind a process where the intervention first helps firms and researchers match and start collaborating, then they develop new products, then they start selling these new products and trying to export them abroad. Many things need to go right and much time needs to pass for exports to occur – and I am likely to have a lot more power for an outcome like “collaboration formed in the past year” than “total exports”.
Secondly, if my outcome of interest is a very skewed outcome like total exports, then the variance of this outcome can be large, making power low. I may have more power for a binary outcome (e.g. for the extensive margin of exports at all, or defining a binary outcome for reaching a certain threshold of export sales such as exporting more than US$1,000).
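To illustrate (with hypothetical numbers of my own, not from any actual program), compare power for a skewed level outcome against a binary extensive-margin outcome at the same sample size:

```python
import math
from statistics import NormalDist

def z_power(effect, sd, n_per_arm, alpha=0.05):
    """Two-sided two-sample power under a normal approximation."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(effect / (sd * math.sqrt(2 / n_per_arm)) - z_crit)

# Skewed level outcome (hypothetical): treatment adds $2,500 to mean
# exports, but the sd is $20,000 -- low power despite a sizeable effect.
print(round(z_power(2500, 20000, 100), 2))  # 0.14
# Binary extensive margin (hypothetical): exporting at all rises from
# 20% to 35%; the sd of a Bernoulli is sqrt(p*(1-p)), roughly 0.45 here.
print(round(z_power(0.15, 0.45, 100), 2))  # 0.65
```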
The basic point here is that for a given n, a study may be very well-powered for detecting impacts on one outcome, and have very low power for detecting impacts on another. So thinking carefully about what you can realistically measure impacts on given the sample size is important.
6. Make the treatment and control groups more similar by ex-ante improving balance
I think this is often where many of the people I talk to first turn when thinking about improving power, but the steps above can be as, or more, important, depending on the outcomes you are interested in. The idea here is to improve on simple random assignment and make the treatment and control groups more comparable to one another through stratification, pairwise or quadruplet matching, or some other approach. Miriam Bruhn and I have a paper that discusses many of these methods, and shows the increases in power that can be obtained. These methods offer more improvement in power the more persistent outcomes are. They tend to work well for things like test scores, where a kid who scores highly on a test this year is also likely to perform well in six months, and offer less improvement in power for more volatile outcomes like sales, profits, or incomes in developing countries.
I get asked whether there are newer methods that use machine learning or establish better optimality results. A while back Thomas Barrios blogged about his job market paper on an optimal way to form matched pairs, which involves sorting observations by the predicted outcome of interest and then forming pairs on this basis. Yuehao Bai has multiple papers, including a 2022 AER paper, on matched pair designs and their optimality, and there has even been work on whether to avoid randomizing at all and just choose the single best optimized allocation (I discussed my thoughts here). These methods work well if you have a single outcome of interest and you can predict it well/it is reasonably persistent. However, in practice we are often interested in multiple outcomes, and different users of the research may put more or less weight on each outcome, so there is not a single set of weights you can use to form an index variable to act as a single outcome. There can then sometimes be a trade-off where pushing hard to get the best possible match for detecting impacts on one outcome worsens balance on some other outcome of interest. I also discuss some other potential issues with matched pairs and an approach using matched quadruplets here.
My bottom line on this is that it is usually worth trying to stratify or match to improve balance on some key variables, but that in most applications I do it can be hard to predict the end outcomes, there are several of them, and that therefore there is not so much to be gained from going from stratification/matched quadruplets to the optimal matched pair.
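For anyone who wants to try the sorted matched-pair idea, here is a minimal sketch (my own illustration of the general approach, not Barrios’ actual code): sort units on a baseline or predicted outcome, pair adjacent units, and randomize treatment within each pair.

```python
import random

def matched_pair_assignment(predicted, seed=0):
    """Return a 0/1 treatment indicator per unit: sort units by the
    predicted (or baseline) outcome, pair adjacent units in that
    ranking, and randomize treatment within each pair."""
    rng = random.Random(seed)
    order = sorted(range(len(predicted)), key=lambda i: predicted[i])
    treat = [0] * len(predicted)
    for j in range(0, len(order) - 1, 2):
        a, b = order[j], order[j + 1]
        t = rng.randint(0, 1)       # coin flip within the pair
        treat[a], treat[b] = t, 1 - t
    return treat

# Hypothetical baseline incomes for six applicants:
baseline_income = [900, 1500, 1100, 1000, 1600, 1450]
print(matched_pair_assignment(baseline_income))
```

Each pair contributes exactly one treated and one control unit, so treatment and control are balanced on the sorting variable by construction.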
7. Incorporate prior information to help increase learning from small samples
I blogged about recent work that Rachael Meager, Leonardo Iacovone, and I have on incorporating informative priors into your impact evaluation. With small samples, this offers the potential to learn more from a given experiment by bringing more information to the analysis. For example, I show how in cases where the priors coincide with the data, the posterior credible intervals can be narrower than traditional frequentist confidence intervals. We are currently developing Stata tools to make these methods easier for others to employ.
None of these are panaceas, and ideally you want to also increase n, but they can help. As an extreme example, in our experiment on improving management in Indian firms, my co-authors and I work with only 28 plants across 17 firms. We employ many of the ideas here: (1) we have an incredibly intensive treatment (each treated firm gets 781 hours of top-quality consultant time) and 100% take-up; (2) we have administrative data from the plants on many of our key outcomes, collected directly from machine logs, and spent a lot of time with the implementers fine-tuning surveys and talking through the story of every data point weekly; (3) we had high-frequency weekly data, with around 100 weeks of data, allowing us to also use large-T asymptotics; (4) the sample was screened to be all plants in one industry (woven cotton fabric) in one city, using the same production process as one another, and with the number of workers in a specified range, to make the sample as homogeneous as possible.
What other tips for improving power with small-n am I forgetting from this advice?
Additional points noted in online discussion:
As noted explicitly under point 4, some of the above tips involve changing the estimand: both in terms of what the outcome is, and who it applies to. This takes the pragmatic approach of thinking about what questions your study is able to answer, with the same experiment potentially able to have enough power to measure impacts well for one outcome on one type of population, but not for another outcome or for a broader population. You can then decide as a user whether the questions the study can answer are useful or not.
8. With clustered trials, you want to sample units within clusters that are dissimilar to one another, to minimize the intra-cluster correlation – and if you have a choice over the number of clusters vs the number of units per cluster, see this paper. Of course, for maximizing power you want to avoid clustering the intervention as much as possible.
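The standard design-effect formula gives a feel for how costly clustering is (a sketch of my own, with illustrative numbers):

```python
def effective_n(n_total, m_per_cluster, icc):
    """Effective sample size after the clustering design effect:
    variance is inflated by roughly 1 + (m - 1)*icc for m units per
    cluster, so effective n = n / (1 + (m - 1)*icc)."""
    return n_total / (1 + (m_per_cluster - 1) * icc)

# 1,000 units sampled in clusters of 20:
print(round(effective_n(1000, 20, 0.05), 1))  # 512.8 -- half the n is "lost"
print(round(effective_n(1000, 20, 0.20), 1))  # 208.3
```

This is why, with a fixed budget, more clusters with fewer units each usually beats fewer clusters with many units each.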
9. As well as ex-ante improvements in power through balancing, think about collecting good data on covariates that can help predict the outcome of interest and can be used as controls in the regression, perhaps via a method such as pdslasso (although a paper I am working on will show the gains from this are typically pretty modest). The main thing is getting good baseline data on the outcome of interest itself (which relates to the ANCOVA method discussed in the paper in point 3).
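A rough way to see the value of good baseline data on the outcome (my own back-of-envelope, using the standard ANCOVA variance-reduction result):

```python
def ancova_n_multiplier(rho):
    """Controlling for a baseline covariate with correlation rho to
    the follow-up outcome cuts residual variance by (1 - rho**2);
    this returns the equivalent sample-size multiplier 1/(1 - rho**2)."""
    return 1 / (1 - rho ** 2)

print(round(ancova_n_multiplier(0.3), 2))  # 1.1 -- modest gain, volatile outcome
print(round(ancova_n_multiplier(0.8), 2))  # 2.78 -- large gain, persistent outcome
```

So the baseline outcome is worth far more as a control when the outcome is persistent, which is the same logic as in points 3 and 6.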
10. If treatment is going to change the variance of the outcome, then you can improve power by allocating relatively more units to treatment. Of course you often don't know this in advance, making this risky. If outcomes can be obtained quickly, then a pilot or batch-adaptive approach can be used, as discussed in this paper. My experience is that in most small samples it has been hard enough to detect an effect on levels, and I would be skeptical of any apparent change in variances in a small pilot – so I would only use this approach in settings where there is a strong theoretical reason to expect variances to change dramatically with treatment.
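The classic Neyman allocation result (a standard statistics result, not specific to this post, with illustrative numbers of my own) shows how you would split the sample if you did know the arm variances:

```python
def neyman_shares(sd_treat, sd_control):
    """Neyman allocation: to minimize the variance of the difference
    in means, allocate each arm in proportion to its outcome sd.
    Returns (treated share, control share)."""
    share_t = sd_treat / (sd_treat + sd_control)
    return share_t, 1 - share_t

# If treatment doubles the outcome sd, treat two-thirds of the sample:
print(neyman_shares(2.0, 1.0))  # roughly (0.667, 0.333)
# With equal sds, the usual 50/50 split is optimal:
print(neyman_shares(1.0, 1.0))  # (0.5, 0.5)
```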