This is a joint post with Miriam Bruhn.
Consider an experiment in which you are randomizing individuals into a control group and 5 different treatment groups, and you have several variables you would like to stratify the treatment on. The textbook description makes it seem very easy: simply create strata, and then within strata, randomly allocate 1/6th of the sample to each group. The difficulty that arises in practice is that often you end up with strata with cell sizes not divisible by the number of treatment and control groups. This causes both some minor conceptual problems and some slightly tricky programming issues. Since this is something that seems to come up in discussions on several projects, we thought it might be worth explaining what needs to be done, and giving an example in Stata.
To make it concrete, here are some data from a project we are working on, in which we have 4 variables we want to stratify on, labeled A-D. Variable A has 4 values (e.g. 4 geographic regions), variable B is binary (e.g. gender), and variables C and D have 3 values each (e.g. education groupings, or household asset groupings). All told this gives 4 × 2 × 3 × 3 = 72 distinct strata. The total sample is 1751 individuals. Some combinations of these variables are more common than others, so stratum cell sizes range from 1 to 99.
If we start by allocating in multiples of 6 within each cell, we randomly allocate 1590 of the 1751 individuals to a treatment or control group; the first part of this Stata code does that. The key question is then what to do with the remaining 161 observations. One approach would be to group them all into a new stratum (“the misfits”) and randomly divide this into 6 groups. However, the downside is that this doesn’t preserve balance within the original strata: we might, for example, end up allocating all of the women in this “misfits” stratum to treatment.
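To illustrate the first pass, here is a minimal Python sketch (not our Stata code). The function name `allocate_full_blocks`, the group labels 0-5, and the toy data are assumptions made up for the example. Within each stratum it shuffles the units, spreads the largest multiple of 6 evenly across the groups, and leaves the rest unassigned.

```python
import random
from collections import defaultdict

N_GROUPS = 6  # assumed labels: group 0 = control, groups 1-5 = treatments

def allocate_full_blocks(stratum_of, n_groups=N_GROUPS, seed=None):
    """First pass: in each stratum, assign the largest multiple of n_groups
    units evenly across groups; the remaining units are left as None."""
    rng = random.Random(seed)
    units_by_stratum = defaultdict(list)
    for unit, stratum in stratum_of.items():
        units_by_stratum[stratum].append(unit)

    assignment = {}
    for units in units_by_stratum.values():
        rng.shuffle(units)  # random order within the stratum
        n_full = (len(units) // n_groups) * n_groups
        for position, unit in enumerate(units):
            assignment[unit] = position % n_groups if position < n_full else None
    return assignment

# Hypothetical data: stratum "a" has 13 units, stratum "b" has 4.
stratum_of = {f"a{i}": "a" for i in range(13)}
stratum_of.update({f"b{i}": "b" for i in range(4)})
first_pass = allocate_full_blocks(stratum_of, seed=1)
leftovers = [unit for unit, group in first_pass.items() if group is None]
print(len(leftovers))  # 5: one left over in "a", four in "b"
```

In this toy example 12 of the 17 units are placed in the first pass, two per group, and the 5 leftovers still need to be handled within their own strata.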
It is clear then that we want to randomly allocate the leftovers within each stratum, taking care that if we have X units (where X ranges from 1 to 5) left in a stratum, we allocate them in such a way that:
a) No treatment or control group gets allocated more than one of these units within the stratum, and
b) We randomly choose which treatment groups get the extra units.
This is clearest when there is only 1 leftover unit: we just draw a random number, and if it is less than 1/6 (about 0.167), allocate to treatment 1; if it is between 1/6 and 2/6 (about 0.333), allocate to treatment 2; and so on.
With 2 leftover units, it gets more complex. There are 6 choose 2 = 15 possible ways of allocating the 2 leftover units among the 6 treatment groups. So we draw a random number at the stratum level, and if this is less than 1/15 (about 0.0667), the 2 units go one each in treatment 1 and treatment 2; if it is between 0.0667 and 0.1333, the 2 units go one each in treatment 1 and treatment 3; and so on.
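The mapping from a single stratum-level random draw to a set of groups can be sketched in Python as follows. This is only an illustration, not our Stata code: the function name `pick_groups_by_draw` and the numbering of the groups from 1 to 6 are assumptions for the example. It lists every equally likely combination in a fixed order and uses the draw to select one of them.

```python
from itertools import combinations

def pick_groups_by_draw(u, n_left, n_groups=6):
    """Map a uniform draw u in [0, 1) to one of the equally likely ways of
    giving n_left leftover units to distinct groups numbered 1..n_groups."""
    options = list(combinations(range(1, n_groups + 1), n_left))
    # Each option owns an interval of width 1 / len(options).
    index = min(int(u * len(options)), len(options) - 1)
    return options[index]

print(len(list(combinations(range(1, 7), 2))))  # 15 ways for 2 leftovers
print(pick_groups_by_draw(0.05, 2))  # (1, 2): draw below 1/15
print(pick_groups_by_draw(0.10, 2))  # (1, 3): draw between 1/15 and 2/15
```

With `n_left=1` the same function reproduces the single-leftover rule: a draw below 1/6 returns group 1, a draw between 1/6 and 2/6 returns group 2, and so on.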
With 3 leftover units, there are 6 choose 3 = 20 different ways to allocate the extra units. You can see our code does this the brute force way, going through all 20 options. In most cases the number of distinct treatments is likely to be no more than what we have here, so this should be fine, but more elegant programming will be needed if you have heaps of different treatment conditions.
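One way to avoid listing every combination is to sample the groups directly without replacement, which makes every set of distinct groups equally likely. The Python sketch below is an illustration under that assumption, not our Stata code; the function name `allocate_leftovers`, the group labels 0-5, and the toy input are hypothetical.

```python
import random

def allocate_leftovers(leftovers_by_stratum, n_groups=6, seed=None):
    """Give each leftover unit in a stratum a distinct, randomly chosen group.

    Sampling without replacement means no group receives more than one
    leftover unit per stratum, and every set of groups is equally likely.
    """
    rng = random.Random(seed)
    assignment = {}
    for stratum, units in leftovers_by_stratum.items():
        if len(units) >= n_groups:
            raise ValueError(f"stratum {stratum!r} still holds a full block")
        groups = rng.sample(range(n_groups), len(units))
        for unit, group in zip(units, groups):
            assignment[unit] = group
    return assignment

# Hypothetical leftovers after the first pass: 3, 1 and 5 units per stratum.
leftovers = {
    "s1": ["u1", "u2", "u3"],
    "s2": ["u4"],
    "s3": ["u5", "u6", "u7", "u8", "u9"],
}
result = allocate_leftovers(leftovers, seed=42)
print(sorted(result))  # all nine leftover units receive a group
```

The same function works unchanged for a different number of groups, since nothing in it depends on there being exactly 6.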
The Stata code goes through all these cases, and you see at the end that all 1751 cases have been allocated to a treatment or control group in a way which gives good balance within strata. Then, following our earlier paper, we recommend you control for all these strata by including strata dummies in your estimation.
Of course, if you have a different number of treatment groups, or a different maximum number of individuals in any given stratum, you will need to modify this code before using it, but hopefully it is at least helpful to some people who are trying to implement this.
[edit 3/22/2016: there is now a user-developed Stata command randtreat (available through SSC) to deal with this issue.]