# Be careful with inference from 2x2 experiments and other cross-cutting designs

Back in 2012, I wrote a post on how to get more than one paper out of an evaluation. I noted that one of the most popular approaches was to cross-randomize a second (or third) intervention on top of an existing one, with this being particularly useful as a cheaper way for graduate students to piggy-back an experiment on top of a larger, more expensive project, and for comparing the effectiveness of different interventions. For example, suppose an experiment is designed which allocates subjects to *Treatment 1 (T1)* or the control group. Then a second experiment might cross-randomize, and allocate half the subjects to *Treatment 2 (T2)* and the other half to a control group. This 2x2 design then ends up having 4 groups: *control, T1 only, T2 only, both T1 and T2.*

The outcome can then be written as:

Y = a + b1\*T1 + b2\*T2 + b3\*T1\*T2 + e (long model)

Where *b3* measures whether or not there are complementarities between the two treatments, so that the effect of getting both treatments is different from what we would predict from adding the effect of just getting treatment 1 (b1) and the effect of just getting treatment 2 (b2). This was the case in a financial literacy experiment I blogged about here. However, I also noted in that post that, with regard to these complementarities, in many papers “*power appears to have been low to detect them – I think the common approach here is a footnote in one paper acknowledging a second intervention was done, noting there was no significant interaction, and thereby justifying saving the other treatment for a second paper*”. That is, paper 1 runs the regression:

Y = c + d*T1 + w

And paper 2 runs the regression:

Y = f + g*T2 + v

Or alternatively, even when just one paper is getting written, researchers may test for an interaction, and if there is no significant interaction, boost power by running the following regression:

Y = k + k1\*T1 + k2\*T2 + e1 (short model)
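To make the difference between the two specifications concrete, here is a minimal sketch that fits both by ordinary least squares on a noise-free, perfectly balanced 2x2 dataset. The parameter values, the cell size, and the small `ols` helper are all assumptions made up for illustration; nothing here comes from the paper's data.

```python
from itertools import product

def ols(X, y):
    """Solve the normal equations (X'X) beta = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    for col in range(k):
        # Partial pivoting for numerical stability
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        piv = A[col][col]
        A[col] = [v / piv for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]

# Hypothetical "true" parameters of the long model (illustrative only)
a, b1, b2, b3 = 1.0, 0.20, 0.10, -0.15
n_per_cell = 50  # equal cell sizes, so Prob(T2=1) = 0.5

# Noise-free outcomes, so OLS recovers the population coefficients exactly
rows = []
for t1, t2 in product([0, 1], repeat=2):
    outcome = a + b1 * t1 + b2 * t2 + b3 * t1 * t2
    rows += [(t1, t2, outcome)] * n_per_cell

y = [r[2] for r in rows]
X_long = [[1, t1, t2, t1 * t2] for t1, t2, _ in rows]
X_short = [[1, t1, t2] for t1, t2, _ in rows]

long_coefs = ols(X_long, y)    # [a, b1, b2, b3]
short_coefs = ols(X_short, y)  # [k, k1, k2]

print("long model  b1 =", round(long_coefs[1], 4))
print("short model k1 =", round(short_coefs[1], 4))
```

In this sketch the long model returns b1 itself, while the short model's coefficient on T1 absorbs half of the interaction, so the two differ whenever b3 is non-zero.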

**Is this a problem?**

A new paper by Muralidharan, Romero and Wüthrich explores the use of these factorial designs in economics experiments, shows they are common, and that current practices of analysis can lead to misleading inference.

First, in terms of **prevalence**, they find that 27/124 field experiments published in top-5 journals during 2006-2017 use cross-cutting designs. Moreover, many of them go beyond 2x2s to have multiple treatments and multiple interactions, with the most extreme example being an experiment that randomized different fields on CVs, which Muralidharan et al. report as having 34 treatments and 71,680 interactions. But even that paper aside, 3 to 6 interactions in the design is not uncommon.

**How these interactions are treated matters for analysis and conclusions:** Out of these 27 studies, they find 16 (59%) do not include all interaction terms in the main specifications. They re-analyze 14 of these 16 experiments (2 didn’t have public data) by estimating the fully-saturated model (long model) above instead of the short model, and find this matters for the *sign, magnitudes* and *significance* of the main treatment effects in many cases. The median change in the point estimates of the main treatment effects is 103%, 26% change sign, and 54% of the estimates reported to be significant at the 5% level are no longer so after including interactions. This is despite only 3.7% of interactions being significant at the 5% level.

One (perhaps cop-out) solution to this problem is just to change what you say is the parameter of interest. If we redefine the parameter of interest to be not the main treatment effect, but a composite treatment effect that averages the effect of the treatment over the presence and absence of the other treatments, weighted by their assignment shares, then the short model will be ok. For example, Duflo et al. write in their randomization toolkit, about a 2x2 experiment with computerized learning and remedial education, that “In this case, the effect of remedial education we obtain is that of remedial education conditional on half the school getting computer assisted learning as well.” That is, instead of saying we are interested in b1 in the long model, we now say we are reporting b1 + b3*Prob(T2=1). However, most papers are not this careful in making clear that this should be the interpretation. Moreover, such a parameter is often of less interest for policy, because when a policymaker is considering treatment 1 it is unlikely that treatment 2 will be running in the background for half the sample. In contrast, for mechanism experiments (such as resume audit studies), we may be less concerned about taking the precise magnitudes as a basis for policy implementation, and this weighted composite treatment effect may be fine to focus on.
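A short derivation shows where the composite parameter comes from. It is a sketch that assumes T1 and T2 are randomized independently of each other and that the error in the long model has mean zero in every cell. Under those assumptions the short-model coefficient on T1 estimates the simple difference in mean outcomes by T1 status:

$$
\begin{aligned}
E[Y \mid T_1 = 1] &= a + b_1 + (b_2 + b_3)\,\Pr(T_2 = 1) \\
E[Y \mid T_1 = 0] &= a + b_2\,\Pr(T_2 = 1) \\
E[Y \mid T_1 = 1] - E[Y \mid T_1 = 0] &= b_1 + b_3\,\Pr(T_2 = 1)
\end{aligned}
$$

With half the sample assigned to T2, this is b1 + b3/2, so the short model equals b1 only when the interaction b3 is exactly zero.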

**What not to do**

A common practice is to proceed as in my initial quote above, using a two-step procedure in which researchers first estimate the long model with all interactions, test whether the interactions are significant, and then focus on the short model if they cannot reject that the interactions are zero. However, there are two problems with this approach:

1. In practice most studies are not adequately powered to detect interactions. Interactions may therefore be non-trivial in magnitude, even if not statistically significant. Muralidharan et al. report that the median absolute magnitude of the interactions in their re-analysis is 0.065 s.d., or 37% of the size of the main treatment effects. This is why adjusting for interactions can affect the sign and magnitude of main effects, even if the interactions are not statistically significant.

2. The other problem is the familiar one of pre-testing affecting the distribution of test statistics: the estimators obtained from this two-step procedure are highly non-normal, which makes the standard t-statistics misleading and leads to incorrect test size.
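Both problems can be seen together in a small Monte Carlo sketch of the two-step procedure. Everything in it is an assumption chosen for illustration rather than the paper's simulation design: the true main effect b1 is zero, the interaction is 0.2 s.d., there are 100 units per cell, and the outcome variance is treated as known so that z-tests can be applied directly to simulated cell means.

```python
import random
from statistics import NormalDist

def rejection_rates(b3, n_per_cell=100, sigma=1.0, reps=4000, seed=0):
    """Share of draws rejecting H0: b1 = 0 at the 5% level when the true b1 is 0.

    Returns (rate using the long model always, rate using the two-step rule).
    """
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(0.975)
    se_cell = sigma / n_per_cell ** 0.5  # standard error of one cell mean
    reject_long = 0
    reject_two_step = 0
    for _ in range(reps):
        # Draw the four cell means directly (a = b1 = b2 = 0 by assumption)
        y00 = rng.gauss(0.0, se_cell)  # control
        y10 = rng.gauss(0.0, se_cell)  # T1 only
        y01 = rng.gauss(0.0, se_cell)  # T2 only
        y11 = rng.gauss(b3, se_cell)   # both treatments
        # Long model: b1 is T1-only minus control; b3 is the double difference
        z_long = (y10 - y00) / (se_cell * 2 ** 0.5)
        z_inter = (y11 - y10 - y01 + y00) / (se_cell * 2.0)
        # Short model in a balanced design: all T1 units vs all non-T1 units
        z_short = ((y10 + y11) / 2 - (y00 + y01) / 2) / se_cell
        long_rejects = abs(z_long) > crit
        reject_long += long_rejects
        if abs(z_inter) > crit:
            # Interaction significant: stay with the long model
            reject_two_step += long_rejects
        else:
            # Interaction not significant: switch to the short model
            reject_two_step += abs(z_short) > crit
    return reject_long / reps, reject_two_step / reps

long_rate, two_step_rate = rejection_rates(b3=0.2)
print("long model rejection rate:", long_rate)
print("two-step rejection rate:  ", two_step_rate)
```

Under these assumptions the interaction test has low power, so the procedure usually falls back to the short model, which is centred on b3/2 rather than on zero. The long model should reject close to 5% of the time, while the two-step rule rejects a true null well above the nominal rate.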

**What can you do**

The first recommendation is to report the results from the long model above (i.e. with all interactions). This will give correct inference. This is fine for 2x2 experiments, but it does lower power, and it becomes more costly and more unwieldy as the number of interactions grows. The authors then use simulations to explore some recent econometric approaches: using a Bonferroni correction in model selection, which seems conservative and not better than the long model; and using tests or bounds which incorporate prior knowledge about the likely size of the interaction and which can improve power if you are correct, but worsen things if you are wrong.

But their most practical advice is to perhaps **rethink running factorial designs/cross-cutting randomizations altogether**. Instead, if you aren’t interested in the interaction, or aren’t powered to detect it, just have cells with the main treatment effects. So instead of the 2x2, just allocate subjects equally to three groups (T1, T2, and control). This places more units in the cells you care about, so improves power. But it does make very painful and transparent the costs of adding another treatment: now when I have an experiment with T1 and control, and I want to think about also looking at T2, I need to re-allocate subjects from my main experiment. Doing this will make it much less likely that a second experiment will get piggy-backed on top of a main one. Otherwise, they say just to be much more transparent and clear in both pre-analysis plans and reporting that what you will report is a composite parameter.
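The power trade-off can be illustrated with a back-of-the-envelope calculation. The sketch below uses a normal approximation with a known outcome variance; the total sample size, the effect size, and the function names are hypothetical choices for illustration only.

```python
from statistics import NormalDist

def power(effect, se, alpha=0.05):
    """Two-sided power of a z-test for a true effect with standard error se."""
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)
    shift = effect / se
    return (1 - z.cdf(crit - shift)) + z.cdf(-crit - shift)

def se_diff(n_a, n_b, sigma=1.0):
    """Standard error of a difference in means between groups of size n_a and n_b."""
    return sigma * (1 / n_a + 1 / n_b) ** 0.5

N = 1200       # hypothetical total sample size
effect = 0.2   # hypothetical true effect of T1, in s.d. units

# 2x2 with the long model: b1 compares the T1-only cell with pure control
power_long = power(effect, se_diff(N / 4, N / 4))
# Three equal arms: T1 vs control, each with a third of the sample
power_three_arm = power(effect, se_diff(N / 3, N / 3))
# 2x2 with the short model: half vs half, but this targets the composite
# parameter b1 + b3*Prob(T2=1), which equals b1 only if b3 = 0
power_short = power(effect, se_diff(N / 2, N / 2))

print("2x2, long model :", round(power_long, 3))
print("three equal arms:", round(power_three_arm, 3))
print("2x2, short model:", round(power_short, 3))
```

The three-arm design beats the long model on power for the same total sample because no units sit in the interaction cell. The short model looks best only because it estimates a different parameter.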

## Join the Conversation

Actually, when you have T1 and T2 between-subjects and you do want to compare T1 to control and T2 to control, then to maximise power you should NOT create equally-sized groups. Rather, you should put more subjects in the control group (since you are gonna use them in two statistical tests).

Hi Ben,

You are right. I'm assuming you are referring to the case when the interaction cell is left empty. To maximize power, you should place $\frac{N}{2} (2 - \sqrt{2})$ units in each of the T1 and T2 cells, and the rest in the pure control group.

Best

Mauricio

Thank you for this comment, Ben. I'm assuming you are referring to the case where the interaction cell is left empty. You are 100% correct. To maximize power, the pure control should have around 41% of the total sample size, which leaves $\frac{N}{2} \left(2 - \sqrt{2} \right)$ units (about 29%) for each of the T1 and T2 cells, where N is the total sample size. This is assuming you care equally about power for T1 and T2. We will add a note about this in the paper.
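For readers who want to see where these shares come from, here is a brief sketch of the derivation. It assumes the same outcome variance σ² in every arm, equal weight on the two comparisons, and no interaction cell, with n_T units in each treatment arm and n_C in the control arm:

$$
\begin{aligned}
&\min_{n_T,\, n_C}\ \sigma^2\left(\frac{1}{n_T} + \frac{1}{n_C}\right)
\quad \text{subject to} \quad 2 n_T + n_C = N \\
&\text{First-order conditions: } \frac{1}{n_T^2} = \frac{2}{n_C^2}
\ \Rightarrow\ n_C = \sqrt{2}\, n_T \\
&n_T = \frac{N}{2 + \sqrt{2}} = \frac{N}{2}\left(2 - \sqrt{2}\right) \approx 0.293\,N,
\qquad
n_C = \left(\sqrt{2} - 1\right) N \approx 0.414\,N
\end{aligned}
$$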