We randomized! But did we really, though?



Disclaimer: In this blog, we will not address the question “So, should I re-randomize?” See our conclusion for a few useful resources; we are going to stay out of this for fear of the gods of randomization …


Imagine you are running a Randomized Control Trial, and you randomized an intervention across treatment and control groups. You find balance in your outcome variables across groups, so you are happy with the random assignment. Habemus randomization! You conclude that the assignment process was random and hand over the assignment result to your implementation team. Does this sound like a process you do all the time? What could potentially be wrong with this?  


A true story [Motivation]

The idea of writing this post came to us a couple of weeks ago, as we were performing a randomization for a multi-arm trial under multiple policy constraints. The project had a target number of treated individuals, for a combined group of two treatment arms, for each district. While the random assignment had to be done at the community level, the project target numbers were counted at the individual level (and multiple individuals for each community were identified as eligible for the program). 

Our first step was to manually allocate the district quotas of treated individuals across arms, and randomly assign communities, stratified by district. The random assignment generated balance in a key variable between arms. 

We were still a bit skeptical, so as a second step we performed the random assignment 100 times to see how frequently the selected covariate was balanced among the 100 random draws. We discovered that we had gotten lucky with our first draw. [For more on what exactly led to this issue, see our answer to Jason's great question in the comments!]

This exercise allowed us to fix our randomization. We thought we would share this experience with our readers and give an overview of different methods for checking that your random assignment is truly random. For an in-depth discussion of whether balance tests should be required and when they are useful, we suggest reading David’s blog post.


Is some data consistent with a random assignment process? 

First, we consider the case when you’re not the one running a random assignment; instead, you are presented with a dataset where units are supposedly randomly assigned to groups. One example of this is when the randomization for an experiment is done by a government or NGO. See this paper, which discusses how a government-randomized farm input subsidy program in Malawi shows some level of targeting.

In such cases, a question that naturally arises is whether the data are consistent with a process that randomly assigned units to groups.

To tackle this question there are many tools one can use: 

  1. The simplest test in experiments with treatment and control groups is a series of t-tests comparing the means of the treatment and control groups for different predetermined covariates, run by regressing each covariate on treatment. DIME Analytics provides a convenient tool for doing this here.
  2. Another check one could do is a joint orthogonality test proposed by David.
  3. For cases where you have many groups, you can use resampling techniques analogous to those conducted in Carrell and West (2010). These tests involve drawing multiple equally sized random samples for each observed group, computing the mean of a given covariate in each simulated group, then computing an empirical p‐value for each observed group, equal to the proportion of simulated groups with means less than that of the observed group. If the observed groups were generated by a random process, the empirical p‐values should be uniformly distributed (this can be tested using a Kolmogorov‐Smirnov or Chi-Squared goodness of fit test). Example code for doing this can be found here.
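
The resampling check in item 3 can be sketched roughly as follows. This is a minimal illustration on simulated data, independent of the linked example code and not the procedure from Carrell and West (2010): the function names, the group sizes, and the choice to resample from the pooled covariate values are all our assumptions. The Kolmogorov‐Smirnov distance is compared with the usual large-sample 5% critical value of about 1.36/sqrt(n), which is only an approximation.

```python
import math
import random
import statistics

def empirical_p_values(groups, pool, n_sims=500, seed=0):
    """For each observed group, draw n_sims random samples of the same size
    from the pooled covariate values and return the share of simulated
    means that fall below the observed group mean."""
    rng = random.Random(seed)
    p_values = []
    for values in groups:
        observed_mean = statistics.fmean(values)
        below = 0
        for _ in range(n_sims):
            simulated = rng.sample(pool, len(values))
            if statistics.fmean(simulated) < observed_mean:
                below += 1
        p_values.append(below / n_sims)
    return p_values

def ks_uniform_statistic(p_values):
    """Largest gap between the empirical CDF of p_values and the
    Uniform(0, 1) CDF (the Kolmogorov-Smirnov distance)."""
    xs = sorted(p_values)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

# Simulated covariate for 2,000 units (made-up data, for illustration only).
rng = random.Random(1)
pool = [rng.gauss(0, 1) for _ in range(2000)]

# Case 1: 40 groups of 50 formed by a genuinely random shuffle.
shuffled = rng.sample(pool, len(pool))
random_groups = [shuffled[i:i + 50] for i in range(0, 2000, 50)]

# Case 2: 40 groups of 50 formed by sorting on the covariate,
# a stylized stand-in for targeted (non-random) assignment.
ordered = sorted(pool)
sorted_groups = [ordered[i:i + 50] for i in range(0, 2000, 50)]

ks_random = ks_uniform_statistic(empirical_p_values(random_groups, pool))
ks_sorted = ks_uniform_statistic(empirical_p_values(sorted_groups, pool))

# Approximate 5% critical value for the KS distance with n p-values.
critical_5pct = 1.36 / math.sqrt(len(random_groups))

print(f"KS distance, random groups: {ks_random:.3f}")
print(f"KS distance, sorted groups: {ks_sorted:.3f}")
print(f"Approximate 5% critical value: {critical_5pct:.3f}")
```

In the sorted case the empirical p‐values pile up near 0 and 1, so the distance from the uniform distribution is large; with random groups it should usually stay below the critical value.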


Habemus randomization?

We next consider the case where you are the experimenter randomly assigning units to treatment or control. 

In cases where the randomization is complex, it is prudent to check whether the treatment assignment mechanism generates balanced groups over multiple draws. As we mentioned above, one scary possibility is that a mechanism that does not generate balanced groups on average can still produce individual draws where groups appear balanced on observable characteristics, so checking for balance on one realization is not a sufficient test!

To test whether a randomization procedure is working as intended, we turn to randomization inference methods. We use the procedure to draw a large number of possible treatment assignment vectors, then for each draw:


  • Conduct a t-test comparing the means of the treatment and control group for a given covariate(s) by regressing the covariate on treatment, and save the resulting coefficient and p-value.
  • Save the treatment assignment vector.


Once this is done, the checks one can conduct are:


  • Check if the distribution of the saved p-values is uniform. If assignment was truly random, they should be uniformly distributed between 0 and 1. This can be checked using the previously mentioned tests.
  • Check if the distribution of the saved coefficients has mean zero. If assignment was truly random, then the treatment should on average have no effect on a predetermined covariate.
  • Check if there exists a randomization unit that never gets assigned to treatment (or conversely one that never gets assigned to control). This can be done by checking whether the average of a given component of the treatment assignment vector is 0 or 1 across the many draws.


We provide some example code for running these checks here.
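
Separately from the linked code, here is a toy sketch of the second and third checks. Everything in it is hypothetical: the five communities, their numbers of eligible individuals, the district quota, and the assignment rule (flip a fair coin for each community and redraw until exactly the quota of individuals is treated) are made up to mimic the kind of constraint described in the motivation. With only five units a t-test is not meaningful, so the p-value check is left out here; the coefficient saved for each draw is the difference in means, which is what a regression of the covariate on a treatment dummy would return.

```python
import random
import statistics

# Hypothetical district: eligible individuals per community (made-up numbers).
ELIGIBLE = {"A": 5, "B": 2, "C": 2, "D": 1, "E": 1}
QUOTA = 4  # the district must treat exactly this many individuals

def draw_assignment(rng):
    """Flip a fair coin for each community; redraw until the number of
    treated individuals equals the quota exactly."""
    while True:
        treated = {c: rng.random() < 0.5 for c in ELIGIBLE}
        if sum(n for c, n in ELIGIBLE.items() if treated[c]) == QUOTA:
            return treated

def difference_in_means(treated, covariate):
    """Treated-minus-control difference in the community-level covariate."""
    t = [covariate[c] for c in covariate if treated[c]]
    k = [covariate[c] for c in covariate if not treated[c]]
    return statistics.fmean(t) - statistics.fmean(k)

rng = random.Random(2022)

# Draw many assignment vectors and save each one.
draws = [draw_assignment(rng) for _ in range(2000)]

# Check: are the saved coefficients centered on zero? The covariate here
# is the number of eligible individuals in the community.
coefficients = [difference_in_means(d, ELIGIBLE) for d in draws]
mean_coefficient = statistics.fmean(coefficients)

# Check: does any community have a treatment share of exactly 0 or 1?
treated_share = {
    c: statistics.fmean([int(d[c]) for d in draws]) for c in ELIGIBLE
}
degenerate = [c for c, share in treated_share.items() if share in (0.0, 1.0)]

print(f"Mean coefficient across draws: {mean_coefficient:.3f}")
print(f"Share of draws treated, by community: {treated_share}")
print(f"Communities never or always treated: {degenerate}")
```

In this toy setup community A has more eligible individuals than the quota allows, so it can never be treated, and the average coefficient is negative rather than zero. Any single draw could still look acceptable on other covariates, which is why the checks are run across many draws.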



We started by discussing how to check whether the treatment and control groups are balanced for a given draw of treatment assignments. Next, we discussed how to check whether a treatment assignment mechanism on average generates balanced groups. Naturally, this raises the question of how one should proceed when one or both sets of checks fail.  

Consider the case described in the introduction. If one obtains results suggesting a draw is balanced but that the treatment assignment mechanism does not on average create comparable groups, then one should find out why the treatment assignment mechanism is failing, address that issue, and then draw a random vector of treatment assignments again.

What should one do if instead the treatment assignment mechanism generates comparable groups on average, but the particular draw obtained shows some imbalance? Bruhn and McKenzie (2009) provide a good discussion of why checking the balance of a draw is problematic when we know that the assignment of treatment was random.

However, if one is still concerned about balance in such cases, recent advances in re-randomization techniques allow for choosing a draw that enforces covariate balance while taking that into account when conducting inference. For an example of such a method being applied see Beaman et al. (2020). For an example of the theory behind such methods see Li et al. (2018).
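
To make the idea concrete, here is a minimal sketch on simulated data. It is not the procedure used in Beaman et al. (2020) or analyzed in Li et al. (2018): the balance rule (accept a draw only if the gap in covariate means is at most a threshold), the threshold value, the sample size, and the function names are all our assumptions. The point it illustrates is that the reference distribution for inference is built by re-running the same restricted assignment procedure, so the balance rule is taken into account.

```python
import random
import statistics

def diff_in_means(values, assignment):
    """Treated-minus-control difference in means."""
    treated = [v for v, t in zip(values, assignment) if t]
    control = [v for v, t in zip(values, assignment) if not t]
    return statistics.fmean(treated) - statistics.fmean(control)

def draw_balanced_assignment(covariate, threshold, rng):
    """Redraw a half-treated assignment until the absolute gap in
    covariate means is no larger than the threshold."""
    n = len(covariate)
    base = [1] * (n // 2) + [0] * (n - n // 2)
    while True:
        assignment = rng.sample(base, n)  # a random permutation of base
        if abs(diff_in_means(covariate, assignment)) <= threshold:
            return assignment

rng = random.Random(7)
THRESHOLD = 0.1  # hypothetical balance criterion

# Simulated data: one baseline covariate and an outcome with no true effect.
covariate = [rng.gauss(0, 1) for _ in range(40)]
outcome = [0.5 * x + rng.gauss(0, 1) for x in covariate]

# The assignment actually used, and the observed test statistic.
assignment = draw_balanced_assignment(covariate, THRESHOLD, rng)
observed = diff_in_means(outcome, assignment)

# Randomization inference under the sharp null of no effect: outcomes stay
# fixed, and placebo assignments come from the same balance-restricted
# procedure rather than from unrestricted randomization.
placebo = [
    diff_in_means(outcome, draw_balanced_assignment(covariate, THRESHOLD, rng))
    for _ in range(1000)
]
p_value = sum(abs(s) >= abs(observed) for s in placebo) / len(placebo)

print(f"Observed difference in outcome means: {observed:.3f}")
print(f"Randomization p-value: {p_value:.3f}")
```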

Now, habemus randomization!

Join the Conversation

Jason Kerwin
September 19, 2022

>We were still a bit skeptical, so as a second step we performed the random assignment 100 times to see how frequently the selected covariate was balanced among the 100 random draws. We discovered that we had gotten lucky with our first draw.

This is interesting - were these cluster-adjusted balance tests? That's the most obvious issue given the description of the design.

If not, was there just a bug in the code?

September 20, 2022

Thanks Jason for your question! In our example, we did control for strata and adjust for clusters. The issue we encountered was not due to a bug in the code exactly, but from our failure to realize ex ante how the strict criteria from our partners on quotas for communities to be assigned to treatment per district were overdetermining assignment. Satisfying all the quotas and constraints on randomization meant some communities were always assigned to treatment or control.

More specifically, the process was not random for ALL communities because of a combination of a small number of communities per district, a varying number of beneficiaries per community, and quotas for each district. This led to cases where communities were forced to be assigned to treatment or control to satisfy the district quotas. For example, suppose a community has two eligible individuals and all eligible individuals in the same community must have the same treatment status. If the quota for the district requires exactly one person to be assigned to treatment, then this community can’t be assigned to T. To ensure that the assignments would be random, we convinced our implementing partner to treat only one individual per community, which helped us to restore the randomization process, and had the added benefit of ensuring that more communities had at least one person selected. Of course, it would have been possible to review the eligibility lists before randomization to identify cases where quotas were not compatible with random assignment, but when the constraints are complex it can be hard to spot them all. Rerandomization and re-checking balance give an easy way to spot these issues. We hope this clarifies!

Jason Kerwin
September 23, 2022

Very interesting, that does make sense. Thanks for sharing the details.

(Aside: it used to be the case that I would get emails telling me about responses to my comments on these blog posts, but that is no longer true—I only saw this because I decided to manually navigate back here.)

Theodor Kouro
February 21, 2023

Hi! What would be the best randomization method for this experiment: 4 arms, randomization at the class level (148 classes in 4 schools), and the number of classes in the last arm is set to 25 because of budget constraints. The observation units are students of these classes. I have a rich set of 20 covariates at the individual level and 7 at the class level. The classes are not of equal size unfortunately. My total sample is around 5000 students. Should I stratify by cluster size and school so that I can have treated and untreated classes in each school? Also the number of classrooms varies (slightly) by school. If you could help me on this aspect, could you also share any link to implement the randomization in Stata? I would really appreciate it if you'd find some time and help me out. This is a large natural field experiment and I only have 1 shot! Thanks