
Power Calculations for Regression Discontinuity Evaluations: Part 1

By David McKenzie

I haven’t done a lot of RD evaluations before, but I have recently been involved in two studies that use regression discontinuity designs. One issue that then comes up is how to do power calculations for these studies. I thought I’d share some of what I have learned, and if anyone has more experience or additional helpful content, please let me know in the comments. I thank, without implication, Matias Cattaneo for sharing a lot of helpful advice.

One headline piece of information that I’ve learned is that RD designs have way less power than RCTs for a given sample size, and I was surprised by how much larger a sample you need for an RD.

How to do power calculations will vary depending on the set-up and data availability. I’ll do three posts on this to cover different scenarios:

Scenario 1 (NO DATA AVAILABLE): the context here is of a prospective RD study. For example, a project is considering scoring business plans, and those above a cutoff will get a grant; or a project will be targeting on poverty, and those below some poverty index measure will get the program; or a school test is being used, with those who pass the test then being able to proceed to some next stage.
The key features here are that, since it is being planned in advance, you do not have data on either the score (running variable), or the outcome of interest. The objective of the power calculation is then to see what size sample you would need to have in the project and survey, and whether it is worth you going ahead with the study. Typically your goal here is to get some sense of order of magnitude – do I need 500 units or 5000?

Two useful references are these papers by Schochet (2008) and Deke and Dragoset (2012). Schochet shows that when linear RD estimation is used with all the data available, the ratio of the variance of the RD estimator to the variance of the random assignment (RCT) estimator is:
RD Design Effect = 1/[1-rho(treatment, score)^2]
where rho(treatment, score) is the correlation between assignment to treatment and the score (or running variable). Note that for an RCT this correlation would be zero, whereas in an RD design treatment is determined by the score exceeding some threshold. This is for a sharp RD. With a fuzzy RD, the sample size required increases further, in the same way that incomplete take-up increases the sample needed for an RCT.

This RD design effect depends on i) the distribution of the score variable; ii) the location of the cutoff in this distribution; and iii) the treatment-control split in the sample. Now, when you don’t have any data, what can you do? Helpfully Schochet investigates how big this effect is for different score distributions and cutpoints.

  • If the scores are normally distributed around the cutoff, the design effect is 2.75
  • If the scores are uniformly distributed around the cutoff, the design effect is 4
  • With a bimodal distribution with less mass around the cutoff than two modes a little away, the design effect can be larger, up to 5 or so.
This design effect is effectively how many times larger a sample you need than in an RCT to get the same power. So to get the same power as an RCT with 250 treatment and 250 control, you need 1000 treatment and 1000 control for an RD if scores are uniformly distributed.
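
As a rough check on the design effects quoted above, here is a minimal Python sketch (Python rather than Stata, purely for illustration, and not taken from Schochet's paper). It simulates scores, assigns treatment to everyone at or above the cutoff, and plugs the resulting correlation into the formula 1/[1-rho(treatment, score)^2]. The function name design_effect, the sample size, and the seed are my own assumptions.

```python
import random

def design_effect(scores, cutoff):
    """RD design effect 1 / (1 - rho^2), where rho = corr(treatment, score).

    Treatment is assigned sharply: 1 if score >= cutoff, else 0.
    """
    treat = [1.0 if s >= cutoff else 0.0 for s in scores]
    n = len(scores)
    mean_s = sum(scores) / n
    mean_t = sum(treat) / n
    cov = sum((s - mean_s) * (t - mean_t) for s, t in zip(scores, treat)) / n
    var_s = sum((s - mean_s) ** 2 for s in scores) / n
    var_t = sum((t - mean_t) ** 2 for t in treat) / n
    rho_sq = cov ** 2 / (var_s * var_t)
    return 1.0 / (1.0 - rho_sq)

rng = random.Random(12345)  # fixed seed so the run is reproducible
n = 200_000                 # illustrative sample size for the simulation only

normal_scores = [rng.gauss(0.0, 1.0) for _ in range(n)]
uniform_scores = [rng.uniform(-1.0, 1.0) for _ in range(n)]

# Cutoff in the middle of each distribution (50/50 treatment-control split)
de_normal = design_effect(normal_scores, cutoff=0.0)
de_uniform = design_effect(uniform_scores, cutoff=0.0)

print(f"normal scores:  design effect ~ {de_normal:.2f}")
print(f"uniform scores: design effect ~ {de_uniform:.2f}")
```

With the cutoff in the middle of the distribution, the simulated values should land close to the 2.75 and 4 reported above. Changing the cutoff argument shows how the design effect moves with the location of the cutoff.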

This makes strong assumptions: it assumes that the data generating process is linear, y = a + bD + c*score + e. That is, the relationship between the outcome and the score is linear with the same slope on either side of the cutoff, and the treatment effect is constant. If these assumptions don't hold, you are likely to be overestimating your power, or inflating power artificially at the cost of bias, which gives your tests incorrect size.

It gets worse: this design effect assumes that you are using the entire sample to estimate the regression discontinuity. However, as Deke and Dragoset note, when you use an optimal bandwidth selection process, this involves discarding observations with scores too far away from the cutoff. They examine a number of education applications and show that, with education test scores, the optimal bandwidth in a typical application would discard about half the observations, giving design effects that require 9 to 17 times as many schools or students as would be the case in an RCT!

How to do this in Stata?
You can just use the sampsi command, but inflate the standard deviation by the square root of the design effect, and scale down the treatment effect if the RD is fuzzy. For example, if you want to figure out the sample size needed to detect a 20% increase in profits, when mean profits are 100 and the standard deviation is 50, then an RCT would require 132 T and 132 C (at sampsi's defaults of 90% power and a 5% two-sided test):
  • Assuming a uniform distribution of scores and a sharp RD, use a sd of 100 instead of 50 to account for the design effect: sampsi 100 120, sd1(100) tells you that you need 526 in treatment and 526 in control with the sharp RD (i.e. 4 times the RCT sizes).
  • If we think it will be a fuzzy RD, where the jump will only be 0.6 instead of 1 in the chance of being treated at the cutoff, we then reduce the treatment effect to 20*0.6 = 12 and calculate: sampsi 100 112, sd1(100) which tells you that 1460 T and 1460 C are needed with the fuzzy RD.
  • If we also want to allow for optimal bandwidth selection, so decide to use a design effect of 9 instead of 4, then:
    • sampsi 100 120, sd1(150) gives 1183 in each group for the sharp RD
    • sampsi 100 112, sd1(150) gives 3284 in each group for the fuzzy RD with 0.6 jump.
So in my toy example, an RCT that has enough power with a sample of 264 is equivalent to a fuzzy RD with a sample of 6,568!
You should consider this a very rough approximation, but it may be good enough to give you a sense of orders of magnitude and of the potential feasibility of an RD design for prospective studies.
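
For readers without Stata, the arithmetic behind the toy example can be sketched in a few lines of Python. This is only my approximation of what sampsi does for a two-sample comparison of means with equal standard deviations and equal group sizes (normal approximation, 90% power, 5% two-sided test, rounded up). The function n_per_arm and its arguments are hypothetical names for illustration, not part of any Stata or Python library.

```python
import math
from statistics import NormalDist

def n_per_arm(mean_c, mean_t, sd, design_effect=1.0, jump=1.0,
              alpha=0.05, power=0.90):
    """Per-arm sample size for a two-sample comparison of means.

    sd is inflated by sqrt(design_effect) to mimic the RD adjustment,
    and the effect is scaled by `jump` (the discontinuity in treatment
    probability at the cutoff) to mimic a fuzzy RD.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    effect = (mean_t - mean_c) * jump
    sd_adj = sd * math.sqrt(design_effect)
    # Equal sd and equal group sizes: n = 2 * (z * sd / effect)^2, rounded up
    return math.ceil(2 * (z * sd_adj / effect) ** 2)

rct = n_per_arm(100, 120, sd=50)
sharp_uniform = n_per_arm(100, 120, sd=50, design_effect=4)
fuzzy_uniform = n_per_arm(100, 120, sd=50, design_effect=4, jump=0.6)
sharp_bw = n_per_arm(100, 120, sd=50, design_effect=9)
fuzzy_bw = n_per_arm(100, 120, sd=50, design_effect=9, jump=0.6)

for label, n in [("RCT", rct),
                 ("sharp RD, design effect 4", sharp_uniform),
                 ("fuzzy RD, design effect 4", fuzzy_uniform),
                 ("sharp RD, design effect 9", sharp_bw),
                 ("fuzzy RD, design effect 9", fuzzy_bw)]:
    print(f"{label}: {n} per arm, {2 * n} in total")
```

Writing it as a function makes it easy to try other design effects (for example the upper end of the 9 to 17 range) or other assumed jumps in take-up.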

An alternative approach
An alternative to the plug-and-play approach described above is to simulate some data and do power calculations based on that. You can simulate several different possible distributions of the score variable, and several different functional forms relating the outcome of interest to the score variable, and then use this approach to calculate power as if you had actual data – I will discuss how to do this in the third part of this series.
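
To give a flavor of the simulation idea, here is a minimal Python sketch. It is my own illustration under strong assumptions, not the method from the later post in this series: scores are uniform on [-1, 1] with the cutoff at 0, the outcome follows the linear model y = a + bD + c*score + e with a constant effect, all of the data are used, and a 1.96 critical value stands in for the exact t value. The function names (residualize, rd_t_stat, simulated_power) and the parameter values are hypothetical.

```python
import math
import random

def residualize(x, s):
    """Residuals from an OLS regression of x on a constant and s."""
    n = len(x)
    mx, ms = sum(x) / n, sum(s) / n
    slope = (sum((si - ms) * (xi - mx) for si, xi in zip(s, x))
             / sum((si - ms) ** 2 for si in s))
    return [(xi - mx) - slope * (si - ms) for si, xi in zip(s, x)]

def rd_t_stat(score, y, cutoff=0.0):
    """t-statistic on D in y = a + b*D + c*score + e, via Frisch-Waugh."""
    d = [1.0 if si >= cutoff else 0.0 for si in score]
    d_res = residualize(d, score)
    y_res = residualize(y, score)
    ssd = sum(v * v for v in d_res)
    b = sum(dv * yv for dv, yv in zip(d_res, y_res)) / ssd
    rss = sum((yv - b * dv) ** 2 for dv, yv in zip(d_res, y_res))
    se = math.sqrt(rss / (len(y) - 3) / ssd)
    return b / se

def simulated_power(n, effect, sd, slope=10.0, reps=300, seed=1):
    """Share of simulated samples in which the RD effect is significant."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        score = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        y = [100.0 + effect * (s >= 0.0) + slope * s + rng.gauss(0.0, sd)
             for s in score]
        if abs(rd_t_stat(score, y)) > 1.96:
            rejections += 1
    return rejections / reps

# Total sample sizes from the toy example: the RCT size and 4 times that
p_small = simulated_power(n=264, effect=20.0, sd=50.0)
p_large = simulated_power(n=1052, effect=20.0, sd=50.0)
print(f"power with n=264:  {p_small:.2f}")
print(f"power with n=1052: {p_large:.2f}")
```

Swapping in a different score distribution or a nonlinear relationship between the outcome and the score only requires changing the two lines that generate score and y.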


Submitted by Rafa on

Hi David, very useful blog, thanks. Here is an example in education using test scores as the forcing variable where RD had not enough statistical power.
