Should we require balance t-tests of baseline observables in randomized experiments?
I received an email recently from a major funder of impact evaluations who wanted my advice on the following question regarding testing baseline balance in randomized experiments:
Should we continue to ask our grantees to do t-tests and f-tests to assess the differences in the variables in the balance tables during the baseline?
Many argue yes: how else would you know if the control and different treatment arms are comparable ex ante? Economists do these tests as a matter of course… Others (mostly non-economists) argue that you should not do these tests. They cite Hayes & Moulton (Cluster Randomised Trials, p. 161): "It is not appropriate to carry out or report the results of significance tests comparing treatment arms… since the allocation of clusters between arms is carried out randomly, it is known that any differences that do occur must have occurred by chance… The point of displaying between-arm comparisons is not to carry out a significance test, but to describe in quantitative terms how large any differences were."
This is of course a question that many of us face when writing our own papers, and when refereeing those by others as well. I therefore thought I’d share my thoughts on this more generally.
The problems with doing these tests
Miriam Bruhn and I discussed this issue (p.225-227) in our paper (ungated version) on how randomized experiments are done in practice in development economics. There we noted several issues that had been raised about these tests in the medical literature:
- They are conceptually problematic: Altman (1985) notes that such tests amount to assessing the probability of something having occurred by chance when you already know that it did occur by chance: "Such a procedure is clearly absurd."
- Statistical significance is not what matters for imbalances: what matters is how highly correlated these baseline values are with the outcomes of interest. Altman notes “a small imbalance in a variable highly correlated with the outcome of interest can be far more important than a large and significant imbalance for a variable uncorrelated with the variable of interest.”
- Researchers might use these tests improperly: Two issues that come up are i) deciding whether or not to control for a covariate in treatment regressions depending on whether or not the difference is significant – my paper with Miriam discusses work showing that this messes up the size of your tests in subsequent analysis; and ii) Schulz and Grimes (2002) report that in the clinical trials literature, researchers who use hypothesis tests to compare baseline characteristics report fewer significant results than expected by chance. They suggest one plausible explanation is that some investigators may not report some variables with significant differences, believing that doing so would reduce the credibility of their reports.
All of this suggests that doing such tests is a bad idea. So why do we use them?
Situations when these tests are informative
I think the main cases when these tests are useful are when threats to the randomization design arise. Two of these are particularly common:
- When there is reason to worry about whether the randomization was actually done correctly: One example is when random assignment occurs in the field and is not completely under the researcher's control. For example, you might have respondents roll a die or draw from a bag to select their treatment status at the time of a baseline interview – this has the advantage that respondents can see it really is a fair process that determines whether or not they get a particular treatment. But the concern might be that enumerators allow some respondents to roll or draw again, or otherwise do not implement the procedure as intended. A second example is when random assignment is done by a government or NGO rather than the researcher – you might worry that they have incentives to ensure particular people end up in the treatment group. A third example is when the randomization procedure is very complicated (perhaps involving multiple phases, stratification, etc.) and there is a chance that programming errors accidentally cause imbalances. And of course this is not just about convincing yourself that it was done correctly, but about convincing suspicious reviewers and readers – it may not be enough to say you have no reason to think the NGO manipulated assignment; the tests serve as further evidence.
- When looking at baseline characteristics for a sample with attrition: Many experiments in development lose participants to attrition by the time of follow-up surveys. A common use of these tests is then to show that the sample you manage to follow over time also looks balanced on observable baseline characteristics.
So should they be required?
I don’t think they should be required, but neither do I think that these tests are never useful. If you are going to use them, then a few ideas for using them better:
- Consider omnibus tests of joint orthogonality rather than a series of variable-by-variable t-tests. If we test many variables one at a time, we expect a few of them to differ significantly between treatment and control groups just by chance – so it is better to test whether the baseline variables are jointly unrelated to treatment status.
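As an illustration of such an omnibus test (using simulated data, since none appears in the post), one common approach is to regress treatment status on all baseline covariates at once and F-test that their coefficients are jointly zero – a sketch, not the only way to do it:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, k = 500, 4
X = rng.normal(size=(n, k))          # four simulated baseline covariates
treat = rng.integers(0, 2, size=n)   # purely random assignment

# Regress treatment status on a constant plus all baseline covariates
Z = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(Z, treat, rcond=None)
resid = treat - Z @ beta
ss_res = resid @ resid
ss_tot = ((treat - treat.mean()) ** 2).sum()
r2 = 1 - ss_res / ss_tot

# Omnibus F-test: H0 = the k covariate coefficients are jointly zero
F = (r2 / k) / ((1 - r2) / (n - k - 1))
pval = stats.f.sf(F, k, n - k - 1)   # large p-value => no joint imbalance
```

Since assignment here is truly random, this p-value will be large in most draws; a single joint test also avoids the multiple-testing problem of reporting twenty separate t-tests.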
- Focus on the size of the differences rather than their statistical significance. The normalized differences approach of Imbens and Rubin (2015) is useful here. This is defined as the difference in means between the treatment and control groups, divided by the square root of half the sum of the treatment and control group variances. This provides a scale-invariant measure of the size of the difference. They use this with propensity-score matching to show that differences of 1 or more are problematic in terms of giving results similar to the experiment, while differences of 0.25 or less seem to indicate good balance. Additional work is still needed on how to assess whether a given difference is big or small – where of course these differences only matter to the extent they predict future outcomes of interest.
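The definition above is simple enough to sketch in a few lines – here with simulated two-arm data (the function name and sample values are illustrative, not from the post):

```python
import numpy as np

def normalized_difference(x_treat, x_control):
    """Imbens-Rubin normalized difference:
    (mean_T - mean_C) / sqrt((var_T + var_C) / 2)."""
    x_treat = np.asarray(x_treat, dtype=float)
    x_control = np.asarray(x_control, dtype=float)
    pooled_sd = np.sqrt((x_treat.var(ddof=1) + x_control.var(ddof=1)) / 2)
    return (x_treat.mean() - x_control.mean()) / pooled_sd

rng = np.random.default_rng(1)
# Small simulated imbalance: treatment mean 0.1, control mean 0.0, sd 1
nd = normalized_difference(rng.normal(0.1, 1, 300), rng.normal(0.0, 1, 300))
```

Because both the numerator and the denominator scale with the variable's units, the measure is unchanged if, say, income is reported in dollars rather than thousands of dollars – unlike a raw difference in means or a t-statistic, which also grows mechanically with sample size.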
- Clearly identify which variables you used as randomization strata or otherwise explicitly sought balance on – it makes even less sense to stratify the random assignment by gender and then test whether gender is balanced across groups.
- Test using the same specifications that you will use to test for differences in outcomes. This gets back to the point that we only care about differences that matter for outcomes. So if your main specification is going to control for baseline stratifying variables and then cluster the standard errors, it doesn’t matter if you find that a variable looks unbalanced when you don’t condition on strata.
This has come up so many times. Thanks a lot for a post on this.
Two other useful resources on this from some of our political science readers on twitter: https://pdfs.semanticscholar.org/d374/3abac0697e53cf7faf206c316da522aba…
Readers may also revisit Winston Lin's two-post series on regression adjustments to think about this issue (here and here). I always loved this section:
I also want to clarify the meaning of unbiasedness in Neyman's and Freedman's randomization inference framework. Here, an unbiased estimator is one that gets the right answer on average, over all possible randomizations. From this unconditional or ex ante perspective, the unadjusted difference in means is unbiased. But ex post, you're stuck with the randomization that actually occurred. Going back to our hypothetical education experiment, suppose the treatment group had a significantly higher average baseline (9th-grade) reading score than the control group. (Let's say the difference is both substantively and statistically significant.) Knowing what we know about the baseline difference, can we credibly attribute all of the unadjusted difference in mean outcomes (10th-grade reading scores) to the treatment? If your statistical consultant says, "That's OK, the difference in means is unbiased over all possible randomizations," you might find that a bit Panglossian.
Sorry to revive such an old post, but there's an additional dimension here. When doing experiments with very large samples (think program administrative data) and with many treatment groups even very slight differences between groups will be statistically significant...