# Regression-based joint orthogonality tests of balance can over-reject: so what should you do?

One of the shortest posts I wrote for the blog was on a joint test of orthogonality when testing for balance between treatment and control groups. Given a set of k covariates X1, X2, X3, …, Xk, this involves running the regression:

Treatment = a + b1X1 + b2X2 + b3X3 + … + bkXk + u

And then testing the joint hypothesis b1=b2=b3=…=bk=0. This could be done by running the equation as a linear regression and using an F-test, or running it as a probit and using a chi-squared test. If the experiment is stratified, you might want to do this conditioning on randomization strata, especially if the probability of assignment to treatment varies across strata. If the experiment is clustered, then the standard errors should be clustered. There are questions about whether it is desirable at all to do such tests when you know for sure the experiment was correctly randomized, but let’s assume you want to do such a test, perhaps to show the sample is still balanced after attrition, or that a randomization done in the field was done correctly.
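To make the linear-regression version concrete, here is a minimal sketch of the test statistic with heteroskedasticity-robust (HC1) standard errors, using only numpy. The function name `hc1_joint_f` and the simulated data are my own illustration, and the sketch assumes individual-level randomization with no strata or clusters. A p-value would then come from comparing the statistic to an F(k, n-k-1) distribution, for example with `scipy.stats.f.sf`.

```python
import numpy as np

def hc1_joint_f(treat, X):
    """HC1-robust Wald statistic, divided by k, for H0: b1 = ... = bk = 0."""
    treat = np.asarray(treat, dtype=float)
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    Z = np.column_stack([np.ones(n), X])          # intercept a plus the k covariates
    ZtZ_inv = np.linalg.inv(Z.T @ Z)
    beta = ZtZ_inv @ Z.T @ treat                  # OLS estimates (a, b1, ..., bk)
    resid = treat - Z @ beta
    meat = (Z * resid[:, None] ** 2).T @ Z        # sum over i of e_i^2 * z_i z_i'
    V = ZtZ_inv @ meat @ ZtZ_inv * n / (n - k - 1)  # HC1 degrees-of-freedom scaling
    b, Vb = beta[1:], V[1:, 1:]                   # slopes only; the intercept is not tested
    wald = b @ np.linalg.solve(Vb, b)
    return wald / k

# Hypothetical example: 500 units, 5 covariates, half assigned to treatment at random.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
treat = rng.permutation(np.repeat([0, 1], 250))
F = hc1_joint_f(treat, X)
```

This is the kind of regression-based statistic whose size is at issue in the rest of the post.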

A piece of folk wisdom is that researchers are sometimes surprised to find this test rejecting the null hypothesis of joint orthogonality, especially when they have a lot of variables in their balance table, or when they have multiple treatments and estimate a multinomial logit. A new paper by Jason Kerwin, Nada Rostom and Olivier Sterck shows this via simulations, and offers a solution.

**Joint orthogonality tests based on standard robust standard errors over-reject the null, especially when k is large relative to n**

Kerwin et al. look at both joint orthogonality tests and the practice of doing pairwise t-tests (or group F-tests with multiple treatments) followed by some sort of “vote counting”, where e.g. researchers look to see whether more than 10 percent of the tests reject the null at the 10% level. They run simulations for two data generating processes they specify (one using individual level randomization, and one clustered), and with data from two published experiments (one with k=33 and n=698 and individual level randomization, and one with k=10 and clustered randomization with 1016 units in 148 clusters).

They find that standard joint orthogonality tests with “robust” standard errors (HC1, HC2, or HC3) over-reject the null in their simulations:

- When n=500 and k=50, in one data generating process the test rejects the null at the 10% level approximately 50% of the time! That is, in half the cases researchers would conclude that a truly randomized experiment resulted in imbalance between treatment and control.
- Things look a lot better if n is large relative to k. With n=5000, size is around the correct 10% even for k=50 or 60; when k=10, size looks pretty good for n=500 or more.
- Not surprisingly, the issue is worse in clustered experiments, where the effective degrees of freedom are lower.

**What is the problem?**

The problem is that standard Eicker-White robust standard error asymptotics do not hold when the number of covariates is large relative to the sample size. Cattaneo et al. (2018) provide discussion and proofs, and suggest that the HC3 estimator can be conservative and can be used for inference – although Kerwin et al. still find overrejection using HC3 in their simulations. In addition to the number of covariates, leverage matters a lot – and having a lot of covariates and a small sample can increase leverage.

**So what are the solutions?**

The solution Kerwin et al. propose is to use omnibus tests with randomization inference instead of regression standard errors. They show this gives the correct size in their simulations, works with clustering, and also works with multiple treatments. They also show that it makes a difference in practice for the published papers they revisit: in one, the F-test p-value from HC1 clustered standard errors is 0.088, whereas it would be 0.278 using randomization inference; similarly, a regression clustered standard error p-value of 0.068 becomes 0.186 using randomization inference. So using randomization inference makes the published papers’ claims of balanced randomization more credible (for once, a methods paper that strengthens existing results!).
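The general idea of an omnibus test with randomization inference can be sketched in a few lines of numpy. This is my own illustration and not Kerwin et al.'s code: the function name `ri_joint_pvalue`, the choice of the classical regression F-statistic as the test statistic, and the assumption that treatment was assigned by simple permutation (of individuals, or of whole clusters when `clusters` is given) are all assumptions. A stratified design would need the permutations done within strata instead.

```python
import numpy as np

def ri_joint_pvalue(treat, X, clusters=None, n_draws=999, seed=0):
    """Randomization-inference p-value for the joint test that all slopes are zero."""
    treat = np.asarray(treat, dtype=float)
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    # Orthonormal basis for the demeaned covariates, computed once and reused.
    Q, _ = np.linalg.qr(X - X.mean(axis=0))

    def f_stat(t):
        tc = t - t.mean()
        r2 = np.sum((Q.T @ tc) ** 2) / np.sum(tc ** 2)   # R^2 of t on [1, X]
        return (r2 / k) / ((1 - r2) / (n - k - 1))

    if clusters is None:
        units, inverse = treat, np.arange(n)             # re-assign individuals
    else:
        _, first, inverse = np.unique(np.asarray(clusters),
                                      return_index=True, return_inverse=True)
        units, inverse = treat[first], inverse.ravel()   # re-assign whole clusters

    rng = np.random.default_rng(seed)
    observed = f_stat(treat)
    draws = np.array([f_stat(rng.permutation(units)[inverse])
                      for _ in range(n_draws)])
    # Share of re-randomizations with a statistic at least as large as the observed one.
    return (1 + np.sum(draws >= observed)) / (n_draws + 1)

# Hypothetical example: 200 units, 5 covariates, simple individual-level assignment.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
treat = rng.permutation(np.repeat([0, 1], 100))
p_value = ri_joint_pvalue(treat, X, n_draws=499)
```

Because the reference distribution is built by re-running the assignment itself, the p-value does not lean on the large-sample approximations behind the robust standard errors.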

My other suggestion is for researchers to also think carefully about how many variables they are putting in their balance tables in the first place. We are most concerned about imbalances in variables that will be highly correlated with outcomes of interest – but also often like to use this balance table/Table 1 to provide some summary statistics that help provide context and details of the sample. The latter is a reason for more controls, but keeping to 10-20 controls rather than 30-50 seems plenty to me in most cases – and also will help with journals having restrictions on how many rows your tables can have. Pre-registering which variables will go into this test then helps guard against selective reporting. There are also some parallels to the use of methods such as pdslasso to choose controls – I have a new working paper coming out soon on using this method with field experiments, and one of the lessons there is putting in too many variables can result in a higher chance of not selecting the ones that matter.

**Another practical note**

With these tests, it is common to have a few missing values for some baseline covariates – e.g. age might be missing for 3 cases, gender for one, education for a few others, etc. This does not present such a problem for pairwise t-tests (where you are then testing that treatment and control are balanced for the subsample that has data on a particular variable). But for a joint orthogonality F-test, the regression would then only be estimated for the subsample with no missing data, which could be a lot smaller than n. Researchers then need to think about dummying out the missing values before running this test – but this can result in a whole lot more (often highly correlated) covariates in the form of dummy variables for these missing values. This is another reason to be judicious about which variables go into the omnibus test, and to focus on a subset of variables without many missing values.
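For readers unfamiliar with dummying out, here is a minimal sketch of one common way to do it. The function name `dummy_out_missing` is hypothetical, and filling missing entries with the column mean is my assumption (any constant would do once the indicator is included). It also shows how the number of covariates grows: one extra column for every variable with at least one missing value.

```python
import numpy as np

def dummy_out_missing(X):
    """Fill missing covariate values and append one missing-value dummy per affected column."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    has_missing = missing.any(axis=0)            # which covariates have any gaps
    col_means = np.nanmean(X, axis=0)            # means over the observed values only
    filled = np.where(missing, col_means, X)     # replace each NaN with its column mean
    indicators = missing[:, has_missing].astype(float)
    return np.column_stack([filled, indicators])

# Hypothetical example: the second covariate is missing for the first unit.
X = np.array([[1.0, np.nan],
              [2.0, 3.0],
              [3.0, 5.0]])
X_full = dummy_out_missing(X)   # 3 columns: two covariates plus one missing dummy
```

All n observations can then enter the joint test, at the cost of the extra dummy columns.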
