# (What) Should you do (with) experiments with factorial designs?

It is not uncommon for researchers in development economics to design experiments with more than one treatment arm. If you cross these treatments, i.e., have some units assigned to more than one, these are called factorial designs (2x2 designs in the case of exactly two treatments). This paper by Muralidharan, Romero, and Wüthrich (2021) on the shortcomings of such designs got some new life on social media recently, which surprised me because I remembered it from a few years ago. Reading the new version carefully, I don’t think that there is a lot that has changed in this version. However, three years have passed, and some of the paper’s important messages may reach audiences better now. There may also be some new messages to send or emphasize. So, below is my updated take on this important work…

**[Notation (borrowing from David’s blog below):** *For example, suppose an experiment is designed which allocates subjects to Treatment 1 (T1) or the control group. Then a second experiment might cross-randomize and allocate half the subjects to Treatment 2 (T2) and the other half to a control group. This 2x2 design then ends up having 4 groups: control, T1 only, T2 only, both T1 and T2.*

*Y = a + b1·T1 + b2·T2 + b3·T1·T2 + e (long model)*

*Y = k + k1·T1 + k2·T2 + e1 (short model)***]**

*What the paper said …*

When you have David McKenzie as your co-blogger, it is likely that he has beaten you to blogging about a new paper. Sure enough, here is David’s post on this paper from July 2019. There is nothing in it that is wrong, so no reason for me to rehash the main messages, so here is a very brief summary in bullet points:

· Factorial designs are generally underpowered to detect complementarities/interaction effects.

· The estimand on either of your treatments in the **short model** is a strange object – a weighted average of the pure effect of treatment (in the absence of any other treatments) and the effect of it in the presence of the other one. This is usually of no relevance – academic or policy…

· The **long model** is the correct way to analyze such an experiment, as it provides the correct inference, even though it is often underpowered.
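
To make the second bullet concrete, here is a minimal sketch of what the short model picks up. The coefficient values and cell sizes are made up for illustration, and the outcome is generated without an error term so that OLS recovers the estimands exactly. It assumes T2 is assigned to the same share of units (40%) in both T1 arms, as cross-randomization would deliver in expectation.

```python
import numpy as np

# Hypothetical values for the long-model coefficients (assumptions, not from the paper).
a, b1, b2, b3 = 1.0, 0.2, 0.5, 0.3

# Hypothetical cell sizes keyed by (T1, T2); T2 share is 40% within both T1 arms.
cells = {(0, 0): 60, (1, 0): 60, (0, 1): 40, (1, 1): 40}

t1 = np.concatenate([np.full(n, c[0], dtype=float) for c, n in cells.items()])
t2 = np.concatenate([np.full(n, c[1], dtype=float) for c, n in cells.items()])

# Noise-free outcome, so the fitted coefficients equal the estimands exactly.
y = a + b1 * t1 + b2 * t2 + b3 * t1 * t2


def ols(columns, outcome):
    """OLS coefficients with an intercept prepended to the given columns."""
    X = np.column_stack([np.ones_like(outcome)] + columns)
    coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    return coef


long_coef = ols([t1, t2, t1 * t2], y)   # a, b1, b2, b3
short_coef = ols([t1, t2], y)           # k, k1, k2
share_t2 = t2.mean()

print("long  b1:", round(long_coef[1], 4))
print("short k1:", round(short_coef[1], 4))
print("(1 - p) * b1 + p * (b1 + b3):", round((1 - share_t2) * b1 + share_t2 * (b1 + b3), 4))
```

In this setup the long model returns b1 = 0.2, while the short model returns k1 = 0.32, i.e., b1 plus 40% of b3: the weight on the "effect in the presence of T2" is simply the share of the sample that got T2.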

*What we should not do:*

The **first, and biggest, message** here is what the authors call the data-dependent model selection procedure. It goes something like this: you have a 2x2 design and you say to yourself: “Well, the *long model* is producing effects on T1 that are promising but not significant. I’ll check to see if there is an interaction effect between T1 and T2 and proceed with lumping the observations in the interaction cell with T1, giving me a lot more observations to estimate the effect of T1.”

You might think that this is a straw man. I assure you that it is not: many a researcher you deeply respect has gone down this road (partially or fully). It is not even that this thinking is wrong: if the interaction effect is truly zero, then the *short* model and the *long* model are identical. The problem is in implementation: what people will do is to run the *long model* and if the interaction effect (*b3* above) is insignificant, they will proceed with the short model. This is deeply wrong.

To see why, please see Figure 1, panel C (and Figure 2) in the paper: even for modest values of **b3**, false rejection rates jump to the range of 0.12 to 0.36 (instead of 0.05 when alpha=0.05). Reasonably higher sample sizes generally won’t protect you from this problem and combined with the fact that many studies are underpowered, the conclusion is this:

*Don’t test whether the interaction effect is significant and proceed to use the short model if not.*
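
If you want to see the problem for yourself, a small Monte Carlo along these lines will do it. This is only a sketch under my own assumptions, not the authors' simulation design: 100 units per cell, standard normal errors, true b1 = b2 = 0, classical (homoskedastic) standard errors, and a normal critical value of 1.96. The function name and its parameters are hypothetical.

```python
import numpy as np


def ols_t(X, y):
    """Return OLS coefficients and classical (homoskedastic) t-statistics."""
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (X.shape[0] - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(xtx_inv))
    return beta, beta / se


def pretest_rejection_rate(b3, n_per_cell=100, reps=2000, crit=1.96, seed=0):
    """Share of simulations rejecting H0: 'T1 has no effect' when the true b1 is zero,
    using the 'test b3, then pick the model' procedure."""
    rng = np.random.default_rng(seed)
    t1 = np.repeat([0.0, 1.0, 0.0, 1.0], n_per_cell)
    t2 = np.repeat([0.0, 0.0, 1.0, 1.0], n_per_cell)
    ones = np.ones_like(t1)
    x_long = np.column_stack([ones, t1, t2, t1 * t2])
    x_short = np.column_stack([ones, t1, t2])

    rejections = 0
    for _ in range(reps):
        # True model: b1 = b2 = 0, only the interaction is non-zero.
        y = b3 * t1 * t2 + rng.standard_normal(t1.size)
        _, t_long = ols_t(x_long, y)
        if abs(t_long[3]) < crit:
            # Interaction "insignificant": switch to the short model and test k1.
            _, t_short = ols_t(x_short, y)
            reject = abs(t_short[1]) > crit
        else:
            # Interaction significant: stay with the long model and test b1.
            reject = abs(t_long[1]) > crit
        rejections += int(reject)
    return rejections / reps


if __name__ == "__main__":
    for b3 in (0.0, 0.15, 0.3):
        print(f"b3 = {b3:.2f}: rejection rate = {pretest_rejection_rate(b3):.3f}")
```

With an interaction of 0.3 standard deviations, the test on b3 is badly underpowered at this sample size, so most draws end up in the short model, where the T1 coefficient absorbs half of b3. The rejection rate for a true zero effect of T1 ends up well above the nominal 5%.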

There is also a more general point here. It’s not just in factorial designs that we should follow this advice for interpreting p-values: depending on the power of your study, a certain non-significant p-value can be evidence **for** an alternative hypothesis, compared with the null. See this brilliant blog post by Daniel Lakens on interpreting p-values, where they show that a p-value of 0.168 can be evidence *for* **or** *against* an alternative hypothesis depending on the statistical power of the test. In fact, in very high-powered tests (like 99%), a p-value of 0.04 is more likely under the null, i.e., evidence *for* the null – known as the Jeffreys-Lindley paradox.

*Options on what to do:*

Now, **the second thing** to consider is to have two treatments but to leave the interaction cell empty. So, you would have three groups: a pure control group (status quo), T1, and T2. To maximize power, you need to allocate approximately 41% of the sample to C, and then split the rest equally (about 29% each) between T1 & T2.

*[The variance of b1 or b2 is minimized when N1=N2=(N/2) (2-sqrt(2)). See p. 27 in the paper for more.]*
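
The allocation formula is easy to check numerically. The sketch below assumes individual-level randomization and equal outcome variance in every arm, so that the variance of the T1-versus-control difference is proportional to 1/N1 + 1/N0, with N0 = N - 2·N1. The function names and the N = 1,000 example are mine, not the paper's.

```python
import math


def var_treatment_effect(n_treat, n_total, sigma2=1.0):
    """Variance of a treatment-vs-control difference in means when both
    treatment arms have n_treat units and the control arm gets the rest."""
    n_control = n_total - 2 * n_treat
    return sigma2 * (1.0 / n_treat + 1.0 / n_control)


def optimal_shares():
    """Closed-form shares implied by N1 = N2 = (N/2) * (2 - sqrt(2))."""
    share_treat = (2.0 - math.sqrt(2.0)) / 2.0
    return 1.0 - 2.0 * share_treat, share_treat


# Brute-force check on a hypothetical sample of N = 1,000 units.
n_total = 1000
best_n_treat = min(range(1, n_total // 2), key=lambda n: var_treatment_effect(n, n_total))

share_control, share_treat = optimal_shares()
print(f"closed form: control = {share_control:.3f}, each treatment = {share_treat:.3f}")
print(f"grid search: each treatment arm = {best_n_treat} of {n_total}")
```

The closed form gives roughly 41.4% to control and 29.3% to each treatment arm, and the grid search lands on the same split up to integer rounding.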

Incidentally, this is similar to the result on how to allocate clusters to treatment or pure control in a partial population experiment (PPE), where the researcher would like to estimate spillover effects: the optimal share of clusters to allocate to the control group is between 0.41 and 0.5, depending on the intracluster correlation (ICC) and splitting units within treatment clusters equally between treated and untreated (see Baird et al. 2018 for details). In both cases, the control group is being used twice (for T1 and T2 in 2x2 designs; for T and S in a PPE).

*So, it is not always optimal to divide control and multiple treatments equally: depending on your design and the ICC of the outcome of interest, the control group can be substantially larger. Note, however, that the power-maximizing design for detecting interactions is four equal-sized cells.*

Related to this are experiments where there is a control group and two treatment groups, but T2 subsumes T1: i.e., there is no T2 alone. This is perhaps most common in situations where an organization or policymaker is committed to a treatment but thinks that this treatment can substantially be improved by adding a small complement. For example, studies that examine whether treatment X works better if it comes with a small cash transfer attached to it (perhaps, by increasing the take-up of X OR by allowing the person to invest rather than spend OR by giving the person the mental space/bandwidth to accomplish behavior change, and so on). Or “cash plus” programs that add complementary activities to long-established cash transfer programs. In such situations, the organization or government may have no realistic interest (at least in the short term) in implementing the complementary activity alone (T2), so the experiment consists of C, T1, and T1+T2.

In such cases, I sometimes advocate for getting rid of the control group – if the organization is committed to T1 anyway or T1 is already an established (in effectiveness) intervention, then just design a study that seeks to see whether T1 can be improved. In other words, T1 is status quo (C) and adding T2 to T1 is the new intervention. I have never succeeded in convincing my counterparts to go for this: people somehow always want a control group. They worry that the audience, referees, donors will question whether T1 was having any effect in the first place. I’d like to see more of these…

There is perhaps one novel avenue, much more widely used now than it was even a few years ago, through which to have factorial designs: adaptive or iterative experiments. If you are lucky enough to be able to have either a continuous stream of units to be treated OR you can divide your susceptible population into several chunks to join the experiment in batches over time, then you can include T1 + T2 along with C, T1, and T2 in your experiment (even contextually, meaning you’re running multiple adaptive experiments for different types of subjects – to tailor treatments to contexts). If T1 + T2 seems promising in earlier batches, the algorithms will allocate a larger sample to it in subsequent ones until you decide whether it is a worthy (say, cost-effective) intervention to be evaluated. Then, you can implement a powerful enough evaluation phase to follow the adaptive phase to establish the effect size with confidence (see, for example, the pre-analysis plan for this adaptive experiment).

Finally, it is true that while the estimand (**b1**) from the long model is more transparent and often the right object of interest, the average effect of a treatment averaged over other treatments (*k1*) from the short model can still be meaningful – even desired. For example, in the adaptive experiment cited immediately above, we are trying to figure out optimal counseling strategies and price discounts for contraceptive methods for different types of people: some people might benefit from a more direct recommendation while others are better off left alone. Alternatively, some are more credit constrained than others. If we were to run the short model to estimate the effect of a counseling strategy, that estimand averaged over all price discounts is as good an estimate as another just using the pre-experiment (status quo) price and pools many more observations from the experiment. Given that we don’t know the optimal price, it’s fine to present the average counseling effect across all prices, as long as the reader clearly understands that this is the estimand.

In fact, there is a way to do this while running the long model, which borrows from the literature on covariate adjustments in experiments. As the readers of this blog well know, the safest way to have covariate adjustments in RCTs is to have them introduced in a fully interacted way (Lin 2013). If the covariates are de-meaned (or centered, i.e., a transformation of (Xi-X_bar) for each observation), you will get the average treatment effect (evaluated at the means of the X vector). If, as the authors discuss in the paper, you think of T2 like a covariate, over reasonable values of which you would like to evaluate T1, you can run the long model with T2 demeaned. This would give you a more robust version of the T1 estimate from the short model, making the estimand transparent (ATE over T2) and also showing you the heterogeneity (underpowered as it might be) of T1 over values of T2.
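
Here is a minimal sketch of that demeaning trick on simulated data. The data-generating values, the sample size, and the seed are all assumptions for illustration. Centering T2 is just a reparametrization of the long model, so the coefficient on T1 becomes b1 + b3 × mean(T2) while the interaction coefficient is unchanged.

```python
import numpy as np

# Hypothetical simulated experiment: values below are illustrative assumptions.
rng = np.random.default_rng(1)
n = 2000
t1 = rng.integers(0, 2, n).astype(float)
t2 = rng.integers(0, 2, n).astype(float)
y = 1.0 + 0.2 * t1 + 0.5 * t2 + 0.3 * t1 * t2 + rng.standard_normal(n)


def ols(columns, outcome):
    """OLS coefficients with an intercept prepended to the given columns."""
    X = np.column_stack([np.ones_like(outcome)] + columns)
    coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    return coef


# Long model with raw T2: the T1 coefficient is the effect of T1 when T2 = 0.
long_raw = ols([t1, t2, t1 * t2], y)

# Long model with T2 demeaned: the T1 coefficient is the effect of T1
# averaged over the sample distribution of T2.
t2_centered = t2 - t2.mean()
long_centered = ols([t1, t2_centered, t1 * t2_centered], y)

implied_ate = long_raw[1] + long_raw[3] * t2.mean()
print("T1 coefficient, raw long model:     ", round(long_raw[1], 4))
print("T1 coefficient, centered long model:", round(long_centered[1], 4))
print("b1_hat + b3_hat * mean(T2):         ", round(implied_ate, 4))
print("interaction, raw vs centered:       ", round(long_raw[3], 4), round(long_centered[3], 4))
```

The centered regression reports the averaged T1 effect and the interaction in one table, so the reader can see both the estimand and the heterogeneity without you having to choose a model after looking at b3.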

In the concluding section, the authors mention a recent paper by Banerjee and many co-authors (2021, gated NBER WP), which uses a “smart pooling and pruning procedure” to avoid the criticisms made by Muralidharan and co-authors, who make the point that this might be attractive for high-dimensional treatment spaces but requires strong assumptions about effect sizes of treatments and interactions. That paper is next on my methods reading list and I might blog about it in the near future…
