Tools of the Trade

Power Calculations for Regression Discontinuity Evaluations: Part 1

David McKenzie's picture

I haven’t done a lot of RD evaluations before, but recently have been involved in two studies which use regression discontinuity designs. One issue which comes up is then how to do power calculations for these studies. I thought I’d share some of what I have learned, and if anyone has more experience or additional helpful content, please let me know in the comments. I thank, without implication, Matias Cattaneo for sharing a lot of helpful advice.

One headline lesson is that RD designs have way less power than RCTs for a given sample, and I was surprised by how much larger a sample you need for an RD.

How to do power calculations will vary depending on the set-up and data availability. I’ll do three posts on this to cover different scenarios:

Scenario 1 (NO DATA AVAILABLE): the context here is a prospective RD study. For example, a project is considering scoring business plans, and those above a cutoff will get a grant; or a project will be targeted on poverty, and those below some cutoff on a poverty index will get the program; or a school test is being used, with those who pass the test then being able to proceed to some next stage.
The key feature here is that, since the study is being planned in advance, you do not have data on either the score (running variable) or the outcome of interest. The objective of the power calculation is then to see what size sample you would need to have in the project and survey, and whether it is worth going ahead with the study. Typically your goal here is to get some sense of order of magnitude – do I need 500 units or 5000?
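
As a rough sketch of this kind of order-of-magnitude calculation, the Python below takes the textbook sample size for an individually randomized RCT and scales it by a design effect for a parametric RD. Everything here is an assumption for illustration rather than the method from this series: a normally distributed score, a global linear control for the score, a constant treatment effect, outcomes measured in standard deviation units, and the hypothetical function names `rct_total_n` and `rd_design_effect_normal`. Local polynomial RD estimators will generally call for a different calculation.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def rct_total_n(effect_sd, alpha=0.05, power=0.80):
    """Total sample for a two-arm RCT with equal arms.

    effect_sd is the minimum detectable effect in standard deviations
    of the outcome. Uses the usual normal-approximation formula.
    """
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = STD_NORMAL.inv_cdf(power)
    return 4 * (z_alpha + z_beta) ** 2 / effect_sd ** 2


def rd_design_effect_normal(share_below_cutoff=0.5):
    """Variance inflation of an RD relative to an RCT of the same size.

    Assumes a standard normal score X, treatment T = 1 if X is above the
    cutoff, and a regression of the outcome on T and a linear term in X.
    The inflation factor is 1 / (1 - rho^2), where rho is the
    correlation between T and X.
    """
    cutoff = STD_NORMAL.inv_cdf(share_below_cutoff)
    p_treated = 1 - share_below_cutoff
    cov_t_x = STD_NORMAL.pdf(cutoff)       # E[X * 1(X > c)] for standard normal X
    var_t = p_treated * (1 - p_treated)    # variance of the treatment dummy
    rho_squared = cov_t_x ** 2 / var_t     # Var(X) = 1
    return 1 / (1 - rho_squared)


effect = 0.2  # hypothetical minimum detectable effect, in SD units
n_rct = rct_total_n(effect)
n_rd = n_rct * rd_design_effect_normal(0.5)
print(f"RCT total sample: {n_rct:.0f}")
print(f"RD total sample under these assumptions: {n_rd:.0f}")
```

With the cutoff at the median, the design effect under these assumptions is about 2.75, so the RD needs roughly 2.75 times the RCT sample to detect the same effect. Moving the cutoff, changing the score distribution, or using a narrow bandwidth will change that multiple.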

Tools of the Trade: The Regression Kink Design

David McKenzie's picture

Regression discontinuity designs have become a popular addition to the impact evaluation toolkit, and offer a visually appealing way of demonstrating the impact of a program around a cutoff. An extension of this approach which is growing in usage is the regression kink design (RKD). I’ve never estimated one of these, and am not an expert, but thought it might be useful to try to provide an introduction to this approach along with some links that people can then follow up on if they want to implement it.

From my mailbox: should I work with only a subsample of my control group if I have big take-up problems?

David McKenzie's picture
Over the past month I’ve received several versions of the same question, so thought it might be useful to post about it.
Here’s one version:
I have a question about an experiment in which we had a very big problem getting the individuals in the treatment group to take-up the treatment. Therefore we now have a treatment much smaller than the control. For efficiency reasons does it still make sense to survey all the control group, or should we take a random draw in order to have an equal number of treated and control?
And another version

Allocating Treatment and Control with Multiple Applications per Applicant and Ranked Choices

David McKenzie's picture
This came up in the context of work with Ganesh Seshan designing an evaluation for a computer training program for migrants. The training program was to be taught in one 3-hour class per week for several months. Classes were taught on Sunday, Tuesday and Thursday evenings from 5-8 pm, and then there were four separate slots on Friday, the first day of the weekend. So in total there were 7 possible sessions people could attend. However, most migrants would prefer to go on the weekend, and many would not be able to attend on particular days of the week.

Endogenous stratification: the surprisingly easy way to bias your heterogeneous treatment effect results and what you should do instead

David McKenzie's picture

A common question of interest in evaluations is “which groups does the treatment work for best?” A standard way to address this is to look at heterogeneity in treatment effects with respect to baseline characteristics. However, there are often many such possible baseline characteristics to look at, and really the heterogeneity of interest may be with respect to outcomes in the absence of treatment. Consider two examples:
A: A vocational training program for the unemployed: we might want to know if the treatment helps more those who were likely to stay unemployed in the absence of an intervention compared to those who would have been likely to find a job anyway.
B: Smaller class sizes: we might want to know if the treatment helps more those students whose test scores would have been low in the absence of smaller classes, compared to those students who were likely to get high test scores anyway.

Why is Difference-in-Difference Estimation Still so Popular in Experimental Analysis?

Berk Ozler's picture
David McKenzie pops out from under many empirical questions that come up in my research projects, which has not yet ceased to be surprising every time it happens, despite his prolific production. The last time it happened was a teachable moment for me, so I thought I’d share it in a short post that fits nicely under our “Tools of the Trade” tag.

Curves in all the wrong places: Gelman and Imbens on why not to use higher-order polynomials in RD

David McKenzie's picture
A good regression-discontinuity can be a beautiful thing, as Dave Evans illustrates in a previous post. The typical RD consists of controlling for a smooth function of the forcing variable (i.e. the score that has a cut-off where people on one side of the cut-off get the treatment, and those on the other side do not), and then looking for a discontinuity in the outcome of interest at this cut-off. A key practical problem is then how exactly to control for the forcing variable.
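
To make the idea concrete, here is a minimal sketch of one common way to control for the forcing variable: fit a separate straight line on each side of the cut-off using only observations within a bandwidth, then take the gap between the two fitted lines at the cut-off. This is a local linear estimate with a uniform kernel, offered as an illustration and not as the specific recommendation of the post. The function names `fit_line` and `rd_local_linear`, the bandwidth, and the noise-free toy data are all hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rd_local_linear(score, outcome, cutoff, bandwidth):
    """Estimated jump in the outcome at the cutoff.

    The score is re-centered at the cutoff, so each fitted intercept is
    the predicted outcome exactly at the cutoff from that side.
    Units with score >= cutoff are treated as being above the cutoff.
    """
    left = [(s - cutoff, y) for s, y in zip(score, outcome)
            if cutoff - bandwidth <= s < cutoff]
    right = [(s - cutoff, y) for s, y in zip(score, outcome)
             if cutoff <= s <= cutoff + bandwidth]
    intercept_left, _ = fit_line(*zip(*left))
    intercept_right, _ = fit_line(*zip(*right))
    return intercept_right - intercept_left


# Toy data with no noise: outcome rises linearly in the score and
# jumps by 3 for units at or above a cutoff of 0.
scores = [i / 100 for i in range(-100, 101)]
outcomes = [1 + 2 * s + (3 if s >= 0 else 0) for s in scores]

jump = rd_local_linear(scores, outcomes, cutoff=0.0, bandwidth=0.5)
print(f"Estimated discontinuity: {jump:.3f}")
```

In real data the answer will depend on the bandwidth and the functional form chosen, which is exactly the practical problem the paragraph above raises.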

Tools of the trade: recent tests of matching estimators through the evaluation of job-training programs

Jed Friedman's picture
Of all the impact evaluation methods, the one that consistently (and justifiably) comes last in the methods courses we teach is matching. We de-emphasize this method because it requires the strongest assumptions to yield a valid estimate of causal impact. Most importantly this concerns the assumption of unconfoundedness, namely that selection into treatment can be accurately captured solely as a function of observable covariates in the data.