
You ran a field experiment. Should you then run a regression?

Berk Ozler
Recently, a colleague came over for dinner and made the following statement: “Person X told me that Imbens is now saying that we should not be running regressions to estimate average treatment effects in experiments.” When I showed some sympathy for this statement while focusing more on making tortillas, she was resistant: it was clear she did not want to give up on regression models…

It’s clear that development economists are not going to stop running regressions anytime soon – even those who only analyze data from field experiments. I suspect that many respected researchers would agree with that approach, likely including Guido Imbens himself, but it is safe to say that regression analysis is a different animal than causal inference using the potential outcomes framework. Part of the problem is that many economists have not studied the latter in graduate school and are more familiar with what is taught in standard econometrics courses: but this is a generational thing and it will pass.

"In particular there is a disconnect between the way the conventional assumptions in regression analyses are formulated and the implications of randomization. As a result it is easy for the researcher using regression methods to go beyond analyses that are justified by randomization, and end up with analyses that rely on a difficult-to-assess mix of randomization assumptions, modeling assumptions, and large sample approximations."

The quote is from a paper titled The Econometrics of Randomized Experiments, by Athey and Imbens (2016), AI from here on. It is presumably this paper, all of which is very helpful, that caused my colleague’s colleague to speculate about Guido’s opposition to regression analysis for estimating treatment effects. Section 5, titled “Randomization Inference and Regression Estimators,” from which the quote above is lifted, is particularly relevant. What follows is a brief summary of some of the points made in the paper through Section 5 that I think are relevant to the question at hand (and stuff that I currently happen to be working on)…

First, for researchers who have not thought through the differences between the potential outcomes and regression frameworks, Section 2.4 is helpful in distinguishing between uncertainty arising in finite populations in the former vs. that from drawing random samples from “a large, essentially infinite super-population” in the latter. With potential outcomes, even when you observe the entire population, there is uncertainty because you only observe one potential outcome for each individual: randomization inference considers the repeated random reassignment of this finite population, with its fixed potential outcomes, to treatment. Regression, by contrast, treats realized outcomes and assignments as fixed, with the uncertainty coming from different individuals, with varying residuals (error terms), being sampled from a much larger population.
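To make the finite-population view concrete, here is a minimal sketch in Python. The six units and their potential outcomes are made up purely for illustration; in real data you would never observe both columns. The sketch holds the potential outcomes fixed, enumerates every way of assigning three of the six units to treatment, and computes the difference in means each time. The only source of variation across estimates is the assignment itself.

```python
from itertools import combinations
from statistics import mean

# Hypothetical potential outcomes for a finite population of six units.
# In practice only one of the two values is ever observed per unit.
y0 = [3.0, 5.0, 4.0, 6.0, 2.0, 7.0]  # outcome if untreated
y1 = [4.0, 5.0, 7.0, 6.0, 5.0, 8.0]  # outcome if treated

n, n_treated = len(y0), 3
true_ate = mean(a - b for a, b in zip(y1, y0))

def diff_in_means(treated):
    """Difference in observed means for one particular assignment."""
    treated = set(treated)
    t = [y1[i] for i in range(n) if i in treated]
    c = [y0[i] for i in range(n) if i not in treated]
    return mean(t) - mean(c)

# Every possible complete randomization of 3 units out of 6 to treatment.
estimates = [diff_in_means(tr) for tr in combinations(range(n), n_treated)]

print("number of possible assignments:", len(estimates))
print("true ATE:", true_ate)
print("average estimate over all assignments:", mean(estimates))
print("range of estimates:", min(estimates), max(estimates))
```

Averaged over all possible assignments, the difference in means recovers the true average effect in this population, while any single assignment can land some distance away from it. That spread is the uncertainty that randomization inference describes.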

This nicely segues into the discussion of Fisher’s sharp null (Section 4.1) – something that many of you have probably heard mentioned in a seminar or blog post, but may not be quite sure what it is exactly. Jed blogged about exact p-values without mentioning Fisher some years back and the discussion thread under that post is very rich, even if it might need some updating given recent developments in the field: under a sharp null, we can infer all the missing potential outcomes from the observed outcomes (e.g. in the main case of no treatment effect on any individual), which allows us to construct the exact distribution of the statistic of interest (mean, rank, etc.) under that null hypothesis. This is followed by Section 4.2, where randomization inference is discussed for average treatment effects in the tradition of Neyman, who, as Winston Lin mentions here, was interested in asymptotically valid inference for a sample.
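A minimal sketch of an exact test under the sharp null of no effect for any unit may help. The eight outcomes and the assignment below are hypothetical, and the absolute difference in means is just one possible choice of test statistic (a rank statistic would work the same way). Because the sharp null implies each unit's outcome would have been the same under any assignment, we can hold the observed outcomes fixed, re-assign treatment in every possible way, and recompute the statistic.

```python
from itertools import combinations
from statistics import mean

# Hypothetical observed data: outcomes and realized assignment for 8 units.
y = [12.0, 9.0, 15.0, 11.0, 8.0, 7.0, 10.0, 6.0]
d = [1, 1, 1, 1, 0, 0, 0, 0]

def diff_in_means(outcomes, assignment):
    t = [v for v, a in zip(outcomes, assignment) if a == 1]
    c = [v for v, a in zip(outcomes, assignment) if a == 0]
    return mean(t) - mean(c)

observed = diff_in_means(y, d)

# Under the sharp null, y is what each unit would show under ANY assignment,
# so the null distribution comes from re-assigning and recomputing.
n, n_treated = len(y), sum(d)
null_stats = []
for treated in combinations(range(n), n_treated):
    a = [1 if i in treated else 0 for i in range(n)]
    null_stats.append(diff_in_means(y, a))

# Two-sided exact p-value: share of assignments at least as extreme as observed.
# The small tolerance guards against floating-point ties.
extreme = sum(1 for s in null_stats if abs(s) >= abs(observed) - 1e-12)
p_value = extreme / len(null_stats)

print("observed difference in means:", observed)
print("assignments enumerated:", len(null_stats))
print("exact p-value:", p_value)
```

With larger samples full enumeration becomes infeasible, and the usual workaround is to draw a large random sample of re-assignments instead, which gives an approximation to the exact p-value.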

Perhaps the nicest part of the paper for me (and likely most relevant for the readers who just want to get on with their regressions) is Section 5. Here, AI bridge the potential outcomes framework with the regression one by defining the individual error term as the difference between the potential outcome for a person and the population expectations of those outcomes (by placing the potential outcomes framework inside a random sampling from a super-population setting). Now the error term is defined in terms of potential outcomes, it has a clear interpretation, and its conditional (on treatment status) means are equal to zero – by design (see page 26 here). If we use a simple correction for robust standard errors we can obtain valid confidence intervals most of the time.
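As a rough illustration of that bridge, the sketch below regresses a hypothetical outcome on a constant and a treatment dummy and compares two standard errors. It assumes that the "simple correction" is the HC2 variant of heteroskedasticity-robust standard errors; that identification is an assumption of this sketch, not something stated above. With a single binary regressor, the HC2 standard error coincides with the Neyman estimator, the square root of s1²/n1 + s0²/n0. The data are made up, and the 2x2 algebra is written out by hand so nothing beyond the standard library is needed.

```python
from statistics import mean, variance

# Hypothetical data: outcome y and treatment indicator d for 10 units.
y = [12.0, 9.0, 15.0, 11.0, 8.0, 7.0, 10.0, 6.0, 9.0, 5.0]
d = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
n, n1 = len(y), sum(d)
n0 = n - n1

# OLS of y on [1, d]: X'X = [[n, n1], [n1, n1]], inverted in closed form.
det = n * n1 - n1 * n1
xtx_inv = [[n1 / det, -n1 / det], [-n1 / det, n / det]]
xty = [sum(y), sum(yi for yi, di in zip(y, d) if di == 1)]
alpha = xtx_inv[0][0] * xty[0] + xtx_inv[0][1] * xty[1]  # intercept
tau = xtx_inv[1][0] * xty[0] + xtx_inv[1][1] * xty[1]    # coefficient on d

# HC2 sandwich: each squared residual is scaled by 1 / (1 - leverage).
meat = [[0.0, 0.0], [0.0, 0.0]]
for yi, di in zip(y, d):
    x = [1.0, float(di)]
    resid = yi - alpha - tau * di
    h = sum(x[a] * xtx_inv[a][b] * x[b] for a in range(2) for b in range(2))
    w = resid ** 2 / (1.0 - h)
    for a in range(2):
        for b in range(2):
            meat[a][b] += w * x[a] * x[b]

def matmul2(p, q):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

vcov = matmul2(matmul2(xtx_inv, meat), xtx_inv)
hc2_se = vcov[1][1] ** 0.5

# Neyman's estimator from the two sample variances.
y1 = [yi for yi, di in zip(y, d) if di == 1]
y0 = [yi for yi, di in zip(y, d) if di == 0]
neyman_se = (variance(y1) / n1 + variance(y0) / n0) ** 0.5

print("OLS coefficient on d:", tau)
print("difference in means: ", mean(y1) - mean(y0))
print("HC2 standard error:   ", hc2_se)
print("Neyman standard error:", neyman_se)
```

The point of the comparison is that the regression coefficient is just the difference in means, and the robust standard error has a design-based counterpart. Once covariates enter the regression, this exact equivalence no longer holds in general.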

Could you have run your linear regressions to estimate program impacts without worrying about all this stuff that sounds nitpicky? The answer is probably “yes,” for a good majority of the time. But the authors warn you about the circumstances under which you could be wrong. Perhaps more importantly, a gradually increasing understanding of the assumptions underlying randomization inference for causality vs. regression analysis will eventually make you a more careful and “hands above the table” econometrician.

And, what about those potentially harmful and fishy regression adjustments that your journal reviewer is chiding you for? AI address this in Section 5.2. Their advice is not really all that different from what follows from Winston Lin (2013), which was summarized in a two-part post here and here. I think that Winston’s recommendations are still my best guide when I have to make these decisions: including centered (or demeaned) covariates and their interactions with treatment indicators in a fully interacted regression (treatment-by-covariate regression) model cannot hurt asymptotic precision (with caveats on the number of pre-treatment covariates compared with the sample size of the smallest group). AI have one suggested wrinkle on this: if all your covariates are binary indicators partitioning your population, then not only does the average treatment effect estimate have a clear interpretation (as the weighted average of subgroup effects), but it is also unbiased in finite populations. They urge researchers with continuous (or multivalued) pre-treatment covariates to “discretize” them before including them as covariates: for example, you might want to control for baseline values of height-for-age z-score (HAZ), but instead you might simply control for a dummy variable for being stunted (HAZ<-2) and its interaction with the treatment indicator. Hopefully, the gain in clarity and the protection against finite sample bias will outweigh the loss in goodness of fit from the transformation. They also recommend that, rather than adjusting for covariates afterwards, you stratify your randomization on them beforehand (David McKenzie would further suggest that you control for these “balanced by design covariates” in your regression analysis).
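The "weighted average of subgroup effects" interpretation can be checked with a minimal sketch. The data are hypothetical: an outcome, a treatment indicator, and a baseline stunting dummy. The small `ols` helper is written here only so the example is self-contained and is not taken from any package. The sketch fits the fully interacted regression with the centered dummy and compares the coefficient on treatment with the subgroup differences in means weighted by subgroup shares. It speaks only to the point estimate, not to standard errors.

```python
from statistics import mean

# Hypothetical data: outcome, treatment indicator, baseline stunting dummy.
y = [4.0, 5.0, 6.0, 3.0, 4.0, 7.0, 9.0, 8.0, 10.0, 6.0, 8.0, 7.0]
d = [1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0]
s = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # 1 if HAZ < -2 at baseline

# Fully interacted design: constant, d, centered s, d * centered s.
s_bar = mean(s)
X = [[1.0, di, si - s_bar, di * (si - s_bar)] for di, si in zip(d, s)]

def ols(X, y):
    """Solve the normal equations X'X b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))]
         for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

beta = ols(X, y)
ate_regression = beta[1]  # coefficient on the treatment indicator

def cell_mean(dv, sv):
    """Mean outcome in the cell with treatment dv and stunting status sv."""
    return mean(yi for yi, di, si in zip(y, d, s) if di == dv and si == sv)

# Subgroup differences in means, weighted by each subgroup's sample share.
ate_weighted = sum(
    (sum(1 for si in s if si == sv) / len(s))
    * (cell_mean(1, sv) - cell_mean(0, sv))
    for sv in (0, 1)
)

print("coefficient on d (interacted regression):", ate_regression)
print("share-weighted subgroup effects:         ", ate_weighted)
```

The two numbers agree because, with a single binary covariate, the interacted model is saturated: it fits the four cell means exactly, and centering the dummy makes the treatment coefficient the effect evaluated at the sample share of stunted children.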

So, what is the bottom line here? As AI themselves mention, it would be easy to see these discussions as academic or, even worse, as semantics. But I do think that, while giving these issues careful consideration might cost you a few days or weeks, doing so is bound to make your study a better one – even if you don’t end up doing anything differently than you originally would have. In the end, it’s hard to disagree with two recommendations made by Lin:
  1. Report unadjusted estimates – even if your primary analysis is adjusted [I might add that researchers should also report p-values using permutation tests in addition to regression-based ones]
  2. Pre-specify your primary analysis, which then allows you to deviate from it in a clear and transparent way. [I’d add that researchers who have not done pre-analysis plans (PAPs) or have PAPs that did not cover certain scenarios should fall back on the standard operating procedures being developed by Don Green’s lab: see here by Lin and Green, and here for a more detailed and hands-on piece by Lin, Green, and Coppock (that we used and cited in a recent paper). It’s very handy to be able to fall back on standard advice when you can’t turn to your PAP, but it should not be abused as justification for not creating careful PAPs at the design stage of your study].

Regression models give economists a tool that they’re familiar with and allow them to estimate treatment effects and standard errors using standard packages while adjusting for covariates to improve precision (even when they conducted stratified randomization by certain covariates ex ante). Used cautiously, responsibly, and along with simpler reporting of unadjusted differences between treatment groups in the statistics of interest (rank sum, mean, etc.), they can be helpful rather than harmful.

However, there is at least one other important reason why the marriage of the potential outcomes framework with regression models is useful, and that is for power calculations and the optimal design of experiments. I will discuss that issue in detail in my next blog – stay tuned…


Submitted by Jacobus Cilliers on

Fascinating post, thank you.

One question:
It is standard these days to control for the baseline outcome indicator, which is often a continuous variable. When there is no obvious cut-off criterion, creating a discrete variable smells like data mining (not to mention the loss in statistical power, as mentioned in your post). Is there any way to formalize the trade-off between "small sample bias" and "loss in power" when deciding whether to discretize or not? I suspect that, for now at least, reviewers will be more comfortable with the continuous measure.

Submitted by Berk Ozler on

Good question - thanks. At a seminar recently, I received a recommendation to conduct post-stratification, which involves creating subgroup cells using baseline covariates. When I asked how to do this in a way that does not look ad hoc, simple machine learning algorithms were suggested, which made sense. Yours is a much simpler version of this and I don't think that there is a standard answer. If I were you, I would analyze the baseline distribution of this covariate (before you have the follow-up data) to propose a defensible cutoff and put it in your PAP. Then, when the follow-up data are there, you can report one but try both, so that you can speak to the power loss issue...

Sounds like something for the SOP by Lin, Green, and Coppock...
