I received a question this week from Kristen Himelein, a bank colleague who is working on an impact evaluation that will use propensity score matching. She wanted to know how to do power calculations for this case, saying that “Usually, whenever anyone asks me about sampling for matching, I tell them to do a regular sample size calculation to determine the size of the treatment, adjust for expected take-up rates, and then take 3-4 times more than the treatment for the control to get as good a match as possible. I found one paper in the Journal of Biopharmaceutical Statistics that deals with the calculations, but that's about it. Do you have any suggestions?”
I thought I’d share my thoughts in the blog, in case others are also facing this issue.
First, I looked at the paper she had found. It is likely not that helpful for economists/social scientists planning power calculations, as it is based on assuming you know the proportions of the treatment and control groups that will be in different propensity score ranges within the common support. However, one interesting remark it makes is that for non-randomized trials in medicine, researchers often estimate sample size in the same way as they do for randomized trials, but then as a result, the U.S. FDA usually gives a warning and requests sample size justification or an increase in sample size based on consideration of the degree of overlap in the propensity score distribution.
What do I think we should do?
First, it is important to recognize that the extent to which your control group overlaps with the treatment group depends on how comparable the two survey populations are in the first place. For example, in work on estimating the impacts of international emigration from Tonga, John Gibson, Steven Stillman and I use two control samples: a specialty sample we did ourselves of 183 individuals aged 18-45 who lived in the same villages as the migrants, as well as a national labor force survey which had 3,979 individuals aged 18-45 throughout Tonga. We find that after trimming propensity scores below 0.05 or above 0.95, the large national sample shrinks from 4,043 treatment + control observations to only 354, whereas with the specialty sample taken in the same villages, trimming on the propensity score only reduces the sample from 230 observations to 200. This echoes the conclusions of Smith and Todd, who find that the vast majority of individuals in the CPS and PSID are quite dissimilar to the participants in the supported work program they study. A smaller, better targeted specialized survey may therefore offer more power than a larger random sample of a broader population.
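To make the comparison concrete, here is a minimal Python sketch of the arithmetic behind those two trimming results. The before and after counts are the ones reported above; the dictionary labels and the function name `retention` are just illustrative names, not anything from the paper.

```python
# Pooled (treatment + control) sample sizes before and after trimming
# on the propensity score, as reported in the Tonga example above.
samples = {
    "national labor force survey": (4043, 354),
    "same-village specialty sample": (230, 200),
}


def retention(before, after):
    """Share of the pooled sample that survives trimming."""
    return after / before


# Retention share for each control-sample design.
rates = {name: retention(before, after)
         for name, (before, after) in samples.items()}

for name, share in rates.items():
    # 1 / share is how much a raw sample must be inflated
    # to end up with a given number of usable observations.
    print(f"{name}: {share:.1%} retained, inflate by {1 / share:.2f}x")
```

The national survey keeps under a tenth of its observations while the targeted sample keeps most of them, which is why the smaller survey can end up with comparable usable sample.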
Given this, I think the steps in calculating the sample sizes needed to achieve a given power in a propensity score matching design should be as follows:
a) Figure out how much you know about the characteristics of the treatment group. For example, are individuals all drawn from particular geographic areas, do they all have income below a certain level, etc. The more you know on this, the better you can target your control sample to make it comparable to the treatment group.
b) Next, check for the possibility of panel data for part of the sample at least. We know propensity score matching is more convincing when the same survey instrument is used, where multiple pre-period values of the outcome variable are used to match individuals on, and where individuals come from the same local labor markets. So if panel data is possible for some individuals but not others, this should in part determine who makes your sample.
c) Then use sampsi in Stata, Optimal Design, or your favorite other power calculation program to calculate the size of the treatment group and control group required under a balanced experimental design.
d) Inflate the numbers from c) by dividing by the proportion of the sample that, given what you learned in step a), you expect to have left after trimming to the common support in propensity scores. So if you don’t know much about the treatment group’s characteristics in advance, or can’t target your control group sample very finely, you may need to allow for a control sample up to 10 times the experimental treatment sample (as in my example above), whereas if you know a lot about the treatment group’s characteristics and can sample accordingly, you may only need a control group sample that is, say, 20-200% larger than in the pure experiment case. Note that if the control group is very different, you may also have to increase the treatment group size, since some of those treated may not overlap in the common support with any of the controls.
e) If you have more data, either from a pilot or another survey, which tells you something about the likely distribution of the propensity score for both the treatment and control groups, you can then do some stratified sampling to try to get balanced treatment and control groups within different propensity score ranges.
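Steps c) and d) can be sketched in a few lines of Python. This is only a rough illustration under stated assumptions, not a replacement for sampsi or Optimal Design: it uses the textbook normal-approximation formula for a two-sided comparison of two means with equal arms, and the standardized effect size (0.25) and the retention shares (0.9 for the treatment group, 0.5 for the control group) are hypothetical numbers chosen for the example. The function names `n_per_arm` and `inflate` are likewise made up here.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(effect_size, alpha=0.05, power=0.80):
    """Step c): per-arm sample size for a two-sided comparison of two
    means with equal arms, using the normal approximation.
    effect_size is the difference in means in standard-deviation units."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_power = std_normal.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


def inflate(n, share_retained):
    """Step d): divide by the share of the group expected to remain
    after trimming to the common support."""
    if not 0 < share_retained <= 1:
        raise ValueError("share_retained must be in (0, 1]")
    return ceil(n / share_retained)


# Hypothetical inputs, for illustration only.
experimental_n = n_per_arm(effect_size=0.25)
treatment_n = inflate(experimental_n, share_retained=0.9)
control_n = inflate(experimental_n, share_retained=0.5)

print("per-arm n under a balanced experiment:", experimental_n)
print("treatment sample to draw:", treatment_n)
print("control sample to draw:", control_n)
```

Allowing a separate retention share for each group reflects the point in step d) that a poorly targeted control sample can also push some treated units off the common support.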
Has anyone else had experience doing this in practice that they would like to share?