
Power Calculations for Regression Discontinuity Evaluations: Part 2


Part 1 covered the case where you have no data. Today’s post considers another common setting where you might need to do RD power calculations.
Scenario 2 (SCORE DATA AVAILABLE, NO OUTCOME DATA AVAILABLE): the context here is that assignment to treatment has already occurred via a scoring threshold rule, and you are deciding whether to try and collect follow-up data. For example, referees may have scored grant applications, with proposals scoring above a certain level getting funded, and now you are deciding whether to collect outcomes several years later to see whether the grants had impacts; or kids may have sat a test to get into a gifted and talented program, and now you want to decide whether to collect data on how these kids have done in the labor market.

Here you have the score data, so you don't need to make assumptions about the correlation between treatment assignment and the score, but can use the actual correlation in your data. However, since the optimal bandwidth will differ for each outcome examined, and you don't have the outcome data, you don't know what the optimal bandwidth will be.
In this context you can use the design effect discussed in my first blog post with the actual correlation. You can then check with the full sample to see if you would have sufficient power if you surveyed everyone, and make an adjustment for choosing an optimal bandwidth within this sample using an additional multiple of the design effect as discussed previously. Or you can simulate outcomes and use the simulated outcomes along with the actual score data (see next post).

But the question that often arises is a variant of the following: “Suppose 10,000 children sat a test. I only have budget to survey 2,000 of them. How should I select these? Should I take a random sample? Or choose only the 2,000 with scores closest to the RD cutoff? Or something else?”
Cattaneo et al. (2016) provide an unequivocal answer to this question: “From the point of view of internal validity in RD designs, it is always best to sample first observations that are the closest to the cutoff c in terms of their running variable. This recommendation is justified by the same assumptions underlying identification, estimation and inference methods based on continuity/smoothness of the unknown conditional expectations. In this framework, the parameter of interest is defined at the cutoff c, and hence having observations as close as possible to Xi = c is the most useful.”
In contrast, Schochet (2008) agrees with this recommendation in terms of reducing bias, but notes several disadvantages of a narrower bandwidth relative to a wider one:

  1. For a given sample size, a narrower bandwidth could yield less precise estimates if the outcome-score relationship can be correctly modelled over a wider range of scores. For example, if you knew the correct underlying relationship was a second-degree polynomial, you could fit it better by having data further from the cutoff. However, since part of the appeal of RD is not relying on functional form assumptions, in most cases you will not want to choose a larger bandwidth for this reason.
  2. External validity: extrapolating results to units further away from the threshold using the estimated parametric regression lines may be more defensible if you have a wider range of scores to fit these lines over.
Here are my thoughts on this:
  • In general, it seems preferable to choose to sample within a relatively narrow range. For example, in one setting where I have been working, I chose to sample only units within 5 points (out of 100) on either side of the threshold; in another we took all units within 8 points on either side of the threshold.
  • For falsification testing it is good to have a little bit of a range around the cutoff. I wouldn’t want to take this to the extreme of having all the data at two mass points, one on either side of the cutoff. For example, suppose scores are whole numbers from 1 to 100, and the cutoff is a score of 50. It might be possible to exhaust your budget by surveying only those with scores of 49 and 50 (out of 100). The RD assumption would be that assignment to treatment status is as good as random within this narrow range, and so then one could just compare means of the two groups. But such a sample would allow no possibility for all the usual checks one likes to see with RD. For example, using time-invariant baseline variables, we could only examine treatment versus control differences in means, but could not show whether or not there is a jump at the threshold (i.e. we can’t distinguish a linear relationship between score and baseline variable, which is ok, from a discontinuous jump at the threshold, which is not ok). So if there is still some selection into who gets 49 versus who gets 50, you will be in trouble. Better to have scores of 45 to 49 and 50 to 54, and show that there are no discontinuities or jumps at the threshold.
Something to watch out for:
If you use a data-driven method to choose the optimal bandwidth, it will typically choose a bandwidth proportional to n^(-1/5), with a constant of proportionality that is itself estimated from the data. What this means is that the optimal bandwidth chosen will itself change when you restrict the sample, even if you are just throwing away units that were outside the initial optimal bandwidth.
For example, in the demonstration data on U.S. House elections, rdrobust with the default settings chooses a bandwidth of 12.45 vote-share points on either side of the standardized cutoff of zero, where the full range of the data is [-100, 100]. Suppose I had instead surveyed only the units with scores in the range [-20, 20] and ran rdrobust on this sample: it would then choose an optimal bandwidth of only 3.63 vote-share points. So I do not think you necessarily want to choose a narrow range, and then choose an optimal bandwidth within that range.
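A stylized illustration of why this happens (this is a made-up rule-of-thumb selector, not rdrobust's actual algorithm): data-driven bandwidths look like a data-dependent constant times n^(-1/5), and restricting the sample shrinks both the spread of the scores and n:

```python
import random
import statistics

def rule_of_thumb_bandwidth(scores, c=3.0):
    """Stylized selector h = c * sd(score) * n^(-1/5); the constant in real
    MSE-optimal selectors is likewise estimated from the sample at hand."""
    return c * statistics.stdev(scores) * len(scores) ** (-1 / 5)

random.seed(1)
full = [random.uniform(-100, 100) for _ in range(10_000)]  # scores on [-100, 100]
restricted = [s for s in full if abs(s) <= 20]             # survey only [-20, 20]

h_full = rule_of_thumb_bandwidth(full)
h_restricted = rule_of_thumb_bandwidth(restricted)  # much smaller than h_full
```

Both the standard deviation of the scores and the sample size fall when you restrict to [-20, 20], so the selector picks a much narrower bandwidth the second time around, mirroring the rdrobust behavior above.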
 

Comments

Submitted by Jonathan Lain on

Really interesting to see how this works for RDD - the evaluation team at Oxfam have been grappling with similar issues for matching models (including PSM and CEM) for our impact evaluations. Very much looking forward to Part 3, as we've also experimented with simulation to try and calculate power for PSM. We're launching an informal blog series next week exploring Oxfam's evaluation and research methods (apologies for the plug), and one of the posts will be on this topic.

Submitted by Jonathan Lain on

Thanks David - yes I remember reading your blog post *very* carefully when we started to think about this problem. I'll be sure to post the link to the Oxfam blog here when it goes live!

Submitted by Jonathan Lain on

Hi David - as promised, here's our blogpost on experimenting with power calculations by simulation for PSM models: http://policy-practice.oxfam.org.uk/blog/2016/09/real-geek-ive-got-the-power-calculating-statistical-power-for-matching-models-by-simulation. The main details are in the technical note. Oxfam's impact evaluations are reasonably similar each time and we've conducted over 60 of them now, but we're still trying to work out how best to capitalise on all our existing data.

Submitted by Jason Kerwin on

This series of posts is really useful - definitely a valuable public good. I'm wondering if you've come across any useful references on power calculations for clustered RDs. I'm thinking about situations like the policies in India where villages get some government benefit if their population exceeds a threshold, and where we are interested in impacts on individual residents or households. My instinct is to just multiply the RD design effect by the typical clustering design effect, but I'm not totally confident that would give the correct answer.

Hi Jason,
I haven't come across any reference which discusses this. If any of our readers know, hopefully they can let us know in the comments.
