I’ve been travelling the past week, and had several people contact me with questions about impact evaluation while away. I figured these might come up again, and so I’d put up the questions and answers here in case they are useful for others.
Question 1: Winsorizing – “do we do this on the whole sample, or do we do it within treatment and control, baseline and follow-up?”
Winsorizing is commonly used to deal with outliers: for example, you might set all data points above the 99th percentile equal to the 99th percentile. It is key here that you don’t use different cut-offs for treatment and control. For example, suppose you have a treatment for businesses that makes 4 percent of the treatment group grow their sales massively. If you winsorize separately at the 95th percentile of the treatment distribution for the treatment group and at the 95th percentile of the control distribution for the control group, you might end up completely missing the treatment effect. I do think it makes sense, however, to use separate cutoffs by survey round, both to allow for seasonal effects and so that you aren’t winsorizing more points from one round than another (which could be the case if you used the same global cutoffs for all rounds).
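To make the mechanics concrete, here is a minimal sketch in Python of winsorizing with one cutoff per survey round, computed on the pooled treatment and control sample. The field names (`round`, `arm`, `sales`), the nearest-rank percentile rule, and the toy data are all assumptions for illustration only; your statistical package may use a slightly different percentile definition.

```python
from collections import defaultdict

def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling of pct*n/100
    return ordered[rank - 1]

def winsorize_by_round(rows, pct=99):
    """Cap 'sales' at the pct-th percentile of each survey round.

    The cutoff pools treatment and control within a round, so both arms
    face the same cap; only the survey round changes the cutoff.
    """
    by_round = defaultdict(list)
    for r in rows:
        by_round[r["round"]].append(r["sales"])
    cutoffs = {rnd: percentile(vals, pct) for rnd, vals in by_round.items()}
    capped = [dict(r, sales=min(r["sales"], cutoffs[r["round"]])) for r in rows]
    return capped, cutoffs

# Hypothetical toy data: 50 firms per arm per round, with two treated
# firms whose follow-up sales grow massively.
rows = []
for i in range(1, 51):
    rows.append({"round": "baseline", "arm": "control", "sales": i})
    rows.append({"round": "baseline", "arm": "treatment", "sales": i})
    rows.append({"round": "followup", "arm": "control", "sales": 2 * i})
    rows.append({"round": "followup", "arm": "treatment",
                 "sales": 5000 if i <= 2 else 2 * i})

capped, cutoffs = winsorize_by_round(rows, pct=95)
```

The point of the design is simply that the treatment indicator never enters the cutoff calculation, so any difference between arms after capping cannot come from the arms being capped at different values.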
Question 2: Testing balance when the probability of assignment varies by strata. “Suppose that we are evaluating a scholarship program involving 2 schools of the same type, so that the treatment is getting a scholarship to go to this type of school. For each school there is a pool of candidate students that we are able to randomly assign, within each school, to either treatment (getting a scholarship) or not (no scholarship). Suppose that School 1 has 100 candidates and spaces for 80 scholarship recipients, but all 100 are just barely above the cutoff point for admission to this type of school (the cutoff is based on some test, and it is the same for both schools). Meanwhile School 2 also has 100 candidates but has space for only 20, and these candidates are on average better (higher score on the admissions test) than those in School 1. Then for our total sample the "treatment group" of 100 students will have worse skills than the control group because 80 kids in the treatment group are from School 1 (which had worse candidates, on average, compared to School 2) while the "control group" of 100 students will have generally better skills because 80 of them are from School 2. Then this overall randomization could "fail" a balance test in terms of initial test scores (and perhaps some other variables) because the 100 kids in the treatment group will have lower "pre-test" scores than the 100 kids in the control group. Intuitively, I don't think that this is a problem because we randomize within both schools. Perhaps the balance test should control for which schools the kids are from, so that it is a balance test within schools rather than across schools as well.”
The intuition is correct: all you need to do is control for the randomization strata, and then assignment is random conditional on that. This is an example of the more general point that you should always control for randomization strata. I faced this issue in a recent internship program evaluation, where the probability of treatment varied from 37 to 96 percent across different strata. As a result, a simple comparison of means shows significant differences, but after controlling for strata the groups are balanced (see Table 1 and the discussion at the bottom of page 6). The only further complication here is to think about which average treatment effect you are interested in if there is heterogeneity in effects by strata: in the above example, are you interested in the average school-level effect, or the average student-level effect? If the former, you would need to re-weight.
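As a rough illustration of the two-school example, the sketch below compares a raw difference in baseline means with a difference that conditions on school. Everything here is hypothetical: the scores, the seed, and the function names are made up for illustration. The strata-adjusted figure weights each school's within-school difference by n·p·(1−p), which is the weighting that a regression of the baseline variable on treatment plus strata dummies applies implicitly.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the toy example is reproducible

def make_school(name, base_score, n, n_treated):
    """Hypothetical candidate pool with random assignment within the school."""
    scores = [base_score + (i % 5) for i in range(n)]
    treated = [True] * n_treated + [False] * (n - n_treated)
    rng.shuffle(treated)
    return [{"school": name, "score": s, "treated": t}
            for s, t in zip(scores, treated)]

# School 1: weaker candidates, 80 of 100 treated.
# School 2: stronger candidates, 20 of 100 treated.
students = make_school("School 1", 50, 100, 80) + make_school("School 2", 70, 100, 20)

def diff_in_means(rows):
    """Treatment mean minus control mean of the baseline score."""
    treated = [r["score"] for r in rows if r["treated"]]
    control = [r["score"] for r in rows if not r["treated"]]
    return mean(treated) - mean(control)

def strata_adjusted_diff(rows, stratum_key):
    """Within-stratum differences, weighted by n * p * (1 - p)."""
    strata = {}
    for r in rows:
        strata.setdefault(r[stratum_key], []).append(r)
    num = den = 0.0
    for members in strata.values():
        n = len(members)
        p = sum(r["treated"] for r in members) / n  # share treated in stratum
        weight = n * p * (1 - p)
        num += weight * diff_in_means(members)
        den += weight
    return num / den

raw_diff = diff_in_means(students)                      # large and negative
adj_diff = strata_adjusted_diff(students, "school")     # close to zero
```

In this toy setup the raw comparison "fails" balance by construction, because treatment status is mostly a proxy for school, while the within-school comparison shows only chance differences.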
Question 3: Dealing with low take-up of a program: “During the pilot around 50% of the treated did not accept the treatment and this became a problem for the organization that provides the training. For the current round the team expects to get 150 eligible individuals to include in the randomization. However, we are really concerned because using the same method from the pilot for the randomization, this would lead us to 75 in each group and around 37 of them will probably reject the treatment. Therefore, we would end up with only 38 taking up the treatment. Do you think it is possible to use a different randomization method that allows us to replace those individuals that do not accept the treatment? Or should we have a bigger treatment group?”
Here was my advice:
1) Power is maximized when you have the same number in T and C, so your power is higher having 75 in T and 75 in C, regardless of the take-up rate of treatment. That is, you are better off having 75 treatment with 37 taking it up and 75 control, than having 100 treatment with 50 taking it up and only 50 control left.
2) That said, power falls dramatically as the take-up rate falls. Is there any way to further screen the firms so that you randomize only among those who are likely to take it up? I’m not sure what “eligible” means here. I assume it means they applied at some point. Do you know why they drop out? One possible solution would be to tell them “we have had an issue where some of those we accept into the program don’t take it up. This isn’t fair for other firms, and is a waste of our resources. Therefore we want to only consider those who are eager and definitely want to participate. We invite you to show up at this event at 1pm on date X, and then we will randomly choose from those who show up to fill the slots in the program.” This should screen out some of the non-interested.
3) If the implementor is still worried about having empty slots, and not having a large enough cohort, I suggest you add a third group, which is a waitlist. You could potentially have this group randomly ordered, and then you can still use them as controls if they are not chosen.
So perhaps you ask your 150 to show up, and 100 do. You randomly assign 50 to treatment, 45 to control, and 5 to the waitlist. Hopefully 70-80% of the treated then take it up, which gives 35-40 participants for the round, with a further 5 the implementor can try to work with if the number drops below that. Then you pool together several recruitment rounds to get enough power.
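The trade-offs in this advice can be sketched with a back-of-the-envelope minimum detectable effect (MDE) calculation. This is only a rough illustration under strong assumptions that are mine, not part of the original question: a normal approximation, equal outcome variance in both groups, no baseline covariates, nobody in the control group receiving the program, and no effect on those who decline it. The function name and defaults are hypothetical.

```python
from statistics import NormalDist

def mde_on_takers(n_treat, n_control, take_up, alpha=0.05, power=0.8, sd=1.0):
    """Approximate minimum detectable effect on those who take up the program.

    The intent-to-treat MDE is (z_{1-alpha/2} + z_{power}) * sd *
    sqrt(1/n_treat + 1/n_control). Dividing by the take-up rate converts it
    to the effect needed among takers, under the assumptions stated above.
    """
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    mde_itt = multiplier * sd * (1 / n_treat + 1 / n_control) ** 0.5
    return mde_itt / take_up

# 150 individuals, 50% take-up: equal split versus a larger treatment group.
equal_split = mde_on_takers(75, 75, take_up=0.5)
bigger_treatment = mde_on_takers(100, 50, take_up=0.5)

# 100 screened individuals, 50 treatment and 45 control, 75% take-up.
screened = mde_on_takers(50, 45, take_up=0.75)
```

Under these assumptions the equal split needs a smaller effect than the 100/50 split (roughly 0.91 versus 0.97 standard deviations among takers), and the smaller screened sample with 75% take-up does better than either (roughly 0.77 standard deviations), which is why raising take-up matters more than enlarging the treatment group.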