# Guest Post by Winston Lin: Regression adjustment in randomized experiments: Is the cure really worse than the disease? (Part II)

**1. Putting the bias issue in perspective**

Yesterday’s post addressed two of the three problems David Freedman raised about regression adjustment. Let’s turn to problem #3, the small-sample bias of adjustment. (More technical details and references can be found in my paper.)

Lots of popular methods in statistics and econometrics have a small-sample bias: e.g., instrumental variables, maximum likelihood estimators of logit and other nonlinear models, and ratio estimators of population means and totals in survey sampling. Ratio estimators are a special case of regression estimators. They have a bias of order 1 / N, and Freedman himself did not oppose them: "Ratio estimators are biased, because their denominators are random: but the bias can be estimated from the data, and is usually offset by a reduction in sampling variability. Ratio estimators are widely used." In the paper, I estimate the bias of OLS adjustment in an empirical example with N = 157 and it appears to be minuscule.

I also want to clarify the meaning of unbiasedness in Neyman's and Freedman's randomization inference framework. Here, an unbiased estimator is one that gets the right answer on average, over all possible randomizations. From this **unconditional** or **ex ante** perspective, the unadjusted difference in means is unbiased. But **ex post**, you're stuck with the randomization that actually occurred. Going back to our hypothetical education experiment, suppose the treatment group had a significantly higher average baseline (9th-grade) reading score than the control group. (Let's say the difference is both substantively and statistically significant.) Knowing what we know about the baseline difference, can we credibly attribute all of the unadjusted difference in mean outcomes (10th-grade reading scores) to the treatment? If your statistical consultant says, "That's OK, the difference in means is unbiased over all possible randomizations," you might find that a bit Panglossian.

Alternatively, we can ask for **conditional** unbiasedness, averaging only over randomizations that yield a similar baseline imbalance. From this **ex post** perspective, the difference in means is biased. So if N is small enough to make you worry about the unconditional bias of OLS adjustment, that's not a good reason to be satisfied with the difference in means. It may be a good reason to try an adjustment method that is unconditionally unbiased and also has reasonable conditional properties. Middleton & Aronow and Miratrix, Sekhon, & Yu have promising approaches.
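To make the ex ante versus ex post distinction concrete, here is a minimal sketch that enumerates every possible randomization in a tiny hypothetical experiment. The data, the constant treatment effect `TAU`, and the imbalance threshold of 2 points are all made up for illustration; none of them come from the paper.

```python
from itertools import combinations
from statistics import mean

# Hypothetical experiment: 10 students, 5 assigned to treatment.
baseline = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]       # e.g., 9th-grade scores
noise = [0.3, -0.2, 0.1, -0.4, 0.5, -0.1, 0.2, -0.3, 0.4, -0.5]
TAU = 2.0                                         # assumed constant effect
y0 = [b + e for b, e in zip(baseline, noise)]     # outcome if control
y1 = [v + TAU for v in y0]                        # outcome if treated


def gap(values, treated):
    """Treated-minus-control difference in means of `values`."""
    t = [values[i] for i in treated]
    c = [values[i] for i in range(len(values)) if i not in treated]
    return mean(t) - mean(c)


def diff_in_means(treated):
    """Unadjusted estimate from the outcomes we would actually observe."""
    observed = [y1[i] if i in treated else y0[i] for i in range(len(y0))]
    return gap(observed, treated)


# Every way of choosing 5 treated students out of 10.
assignments = [set(c) for c in combinations(range(10), 5)]

# Ex ante: average over all possible randomizations.
unconditional = mean(diff_in_means(a) for a in assignments)

# Ex post: average only over randomizations in which the treatment group
# starts out at least 2 points ahead at baseline (an arbitrary threshold).
imbalanced = [a for a in assignments if gap(baseline, a) >= 2]
conditional = mean(diff_in_means(a) for a in imbalanced)

print(f"true effect:        {TAU:.2f}")
print(f"unconditional mean: {unconditional:.2f}")
print(f"conditional mean:   {conditional:.2f}")
```

In this toy setup the unconditional average matches the true effect up to rounding, while the average over the imbalanced randomizations overstates it, because the outcome tracks the baseline score closely.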

**2. Tukey's perspective**

I mentioned yesterday that the classic precision-improvement argument isn't the only possible rationale for adjustment. The conditional bias of the difference in means is another. John Tukey gave a related but broader perspective: In a clinical trial, some doctors and some statisticians will wonder if your unadjusted estimate is a fluke due to baseline imbalances. Adjustment (as either the primary analysis or a robustness check) is part of doing your homework to check for that.

Tukey writes: "The main purpose of allowing [adjusting] for covariates in a *randomized* trial is defensive: to make it clear that analysis has met its scientific obligations. (If we are very lucky ... we may also gain accuracy [precision], possibly substantially. If we must lose some sensitivity [power], so be it.)"

My translations in square brackets should be taken with a grain of salt; Tukey believed in "the role of vague concepts". David Brillinger tells a hilarious story about the difference between Tukey and Freedman (p. 8).

Tukey and Freedman might disagree about whether the primary analysis in a randomized trial should be adjusted or unadjusted. But they both argued that statistics should not be used to sweep scientific issues under the rug. Our job is to expose uncertainty, not to hide it.

Although I don't feel qualified to be a go-to person for all practical questions about adjustment, I'll endorse three suggestions that overlap with those of Tukey, Freedman, and many other researchers.

**3. Suggestions for researchers, referees, and journal editors**

1. Editors and referees should ask authors to report unadjusted estimates, even if the primary analysis is adjusted. (Page limits are an issue, but Web appendices can be used, and readers should at least be given access to a summary table of unadjusted estimates.)

2. When possible, researchers should pre-specify their primary analyses (e.g., outcome measures and subgroups, as well as regression models if adjustment is to be used for the primary analysis). It's OK to add unplanned secondary analyses or even to change the primary analysis when there are especially compelling reasons, but departures from the pre-specified plans should be clearly labeled as such. One example to follow is Amy Finkelstein et al.'s analysis of the Oregon health insurance experiment. Finkelstein et al. publicly archived a detailed analysis plan before seeing the outcome data for the treatment group. (But see also Larry Katz’s comments on natural experiments and secondary analyses.)

3. Researchers, referees, and editors should study Moher et al.’s (2010) detailed recommendations for improved transparency and consider adapting some of them as appropriate to their fields.

A few comments:

Freedman gave recommendations for practice at the end of the second paper and expanded on them in these two excellent pieces. Jas Sekhon (who collaborated with Freedman and co-edited his selected papers) tells me, “David said that he didn't reject adjustment per se, but rejected it being done without taste and judgment. But both taste and judgment are in short supply, so as a general rule, he feared making adjustment common practice, especially if the unadjusted estimates were not presented. A pre-analysis plan would help to alleviate some of his concerns.”

Moher et al. (2010) is the latest version of the CONSORT Explanation and Elaboration document, which Freedman also recommended. The suggestion here is not to simply follow the CONSORT checklist, but to study Moher et al.’s detailed explanations. While searching for an empirical example for my paper, I read a number of experimental studies in leading economics journals and started replicating a few of them. Common problems I encountered were: (1) it was unclear whether randomization was stratified and if so, how; (2) researchers adjusted for post-treatment covariates, but I wouldn’t have known this if I hadn’t started replicating their analyses; (3) sample exclusions were not clearly explained, and the decision to adjust resulted in sample exclusions; and (4) the description of the treatment didn’t give enough information about services received by the control group. And these were papers I liked.

In some experiments, the probability of treatment varies across strata. In that case, what I mean by an “unadjusted estimate” is not the overall difference in means, but a weighted average of stratum-specific differences in means. Ashenfelter & Plant is a model of clarity on this. Space doesn’t allow me to discuss important questions about how this kind of unadjusted analysis compares with common practice in economics, so we’ll have to save that for another time.
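As a rough illustration of that kind of "unadjusted estimate", the sketch below compares the overall difference in means with a weighted average of stratum-specific differences. The data are hypothetical, and weighting each stratum by its share of the study sample is just one possible choice of weights, assumed here for simplicity.

```python
from statistics import mean

# Hypothetical data: (treated outcomes, control outcomes) per stratum.
# The treatment probability is 1/3 in "low" and 2/3 in "high".
strata = {
    "low": ([10, 12], [8, 9, 10, 9]),
    "high": ([20, 22, 21, 21], [15, 17]),
}


def naive_difference(strata):
    """Overall difference in means, ignoring the strata."""
    treated = [y for t, _ in strata.values() for y in t]
    control = [y for _, c in strata.values() for y in c]
    return mean(treated) - mean(control)


def stratified_difference(strata):
    """Weighted average of stratum-specific differences in means,
    weighting each stratum by its share of the study sample."""
    total = sum(len(t) + len(c) for t, c in strata.values())
    return sum(
        (len(t) + len(c)) / total * (mean(t) - mean(c))
        for t, c in strata.values()
    )


print("overall difference in means:", naive_difference(strata))
print("stratified estimate:        ", stratified_difference(strata))
```

Because the high-outcome stratum has the higher treatment probability in this example, the overall difference in means mixes the treatment effect with the difference between strata; the stratified estimate does not.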

**Takeaways**

My paper isn’t meant to provide guidance on all practical issues (that’s part of the reason for “Agnostic” in the title). Its goal is to add to the research conversation, not to bring it to resolution, and its first priority is to put Freedman’s critique in perspective. Freedman’s papers give a theoretical analysis of a completely randomized experiment with no attrition—not for realism, but because, as Paul Krugman writes, “Always try to express your ideas in the simplest possible model. The act of stripping down to this minimalist model will force you to get to the essence of what you are trying to say.” I used the same framework to show just how little needs to be changed to get more benign results. Parts of my paper are technical, but readers can skip around without reading all the details. It recommends helpful references and includes an empirical example with discussion of why adjustment often hardly makes a difference to precision in social experiments. This blog post is more expansive in some ways, but it’s not a substitute for the paper.

In the real world, we often have to deal with issues like stratification, clustering, survey nonresponse, and attrition. Also, linear regression isn’t the only possible adjustment method: some people prefer matching, and others are using machine learning algorithms. Other people have much more expertise than I do in most of these areas. So the practical takeaways I end with are necessarily oversimplified:

· In moderate to large samples with covariates that are helpful in predicting the outcome (e.g., if the outcome is 10th-grade reading scores and the covariate is 9th-grade reading scores for the same students), careful regression adjustment is a reasonable approach.

· For careful adjustment, the covariates should be pre-specified and unaffected by the treatment, and the number of covariates should be much smaller than the sample size.

· Adjustment with treatment-by-covariate interactions (as described earlier) will often be a reasonable method in moderate to large samples, and may perform better than adjustment without interactions when the design is imbalanced (i.e., the treatment and control groups are of unequal size).

· Adjustment opens the door to fishing and loss of transparency unless strict protocols are followed. If researchers choose to adjust, they should also report unadjusted estimates.
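For readers who want to see what adjustment with a treatment-by-covariate interaction computes, here is a minimal single-covariate sketch. It relies on a standard equivalence: the coefficient on treatment in an OLS regression on treatment, the mean-centered covariate, and their interaction equals the gap between the two arm-specific fitted lines, evaluated at the overall covariate mean. The function names and the noise-free data are hypothetical, chosen so the answer can be checked by hand.

```python
from statistics import mean


def fitted_at(x, y, x0):
    """Fit a least-squares line of y on x and return its prediction at x0."""
    mx, my = mean(x), mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    return my + slope * (x0 - mx)


def interacted_adjustment(x_t, y_t, x_c, y_c):
    """Treatment-effect estimate from OLS with a treatment-by-covariate
    interaction, computed as the difference between the arm-specific
    fitted lines at the covariate mean of the full sample."""
    x_bar = mean(list(x_t) + list(x_c))
    return fitted_at(x_t, y_t, x_bar) - fitted_at(x_c, y_c, x_bar)


# Hypothetical data with a deliberately imbalanced covariate and
# unequal group sizes (4 treated, 6 control).
x_t = [1, 2, 3, 4]
y_t = [5 + 2.5 * x for x in x_t]   # treated line: 5 + 2.5x
x_c = [3, 4, 5, 6, 7, 8]
y_c = [3 + 2.0 * x for x in x_c]   # control line: 3 + 2x

unadjusted = mean(y_t) - mean(y_c)
adjusted = interacted_adjustment(x_t, y_t, x_c, y_c)
print("unadjusted difference in means:", unadjusted)
print("interacted adjustment:         ", adjusted)
```

With several covariates the same idea applies, but one would normally use a regression routine rather than these hand-rolled formulas.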

**Acknowledgments**

David Freedman passed away in 2008. I only met him once, but he was very generous to me with unsolicited help and advice after I sent him comments on three papers. He encouraged me to study at Berkeley even though (or perhaps because) he knew my thoughts on adjustment were not the same as his. (As always, he was also a realist: "You have to understand that the Ph.D. program is a genteel version of Marine boot camp. Some useful training, some things very interesting, but a lot of drill and hazing.") I remain a big fan of his oeuvre, and I hope it's clear from my "Further remarks" and final footnote that the paper is meant as not only a dissent, but also a tribute.

Larry Orr, Steve Bell, Howard Bloom, Steve Freedman, Daniel Friedlander, and Steve Kennedy taught me much about experiments that I wouldn’t have learned otherwise.

Many thanks to Berk Özler for helpful discussions and all his work to shape and edit this piece, and to Jed Friedman, David McKenzie, Jas Sekhon, Dylan Small, and Terry Speed for helpful comments. Any errors are my own.

Winston Lin is a Ph.D. candidate in statistics at UC Berkeley. He used to make regression adjustments (always pre-specified) for a living at Abt Associates and MDRC.

My paper with Miriam Bruhn (http://siteresources.worldbank.org/DEC/Resources/In_Pursuit_of_Balance…) offers a couple of complementary perspectives on this issue. Notably,

- Pages 225-227 note that one thing you definitely don't want to do is to choose whether or not to adjust on the basis of a t-test of equality of means, and discuss the meaning of the standard Table 1, which tests for balance.

- Pages 228-229 offer a development economist's take on key things to report about an experiment, which is a complement to the CONSORT guidelines approach/Moher et al. paper.

Finally, the paper also notes the importance of controlling for variables you have stratified the randomization on - i.e. "adjusting" for variables you have forced already to be balanced, which is a bit of a different category than those discussed above.

Thanks very much for these comments, David. I agree with most of those points in your paper and it gives an excellent, helpful discussion.

Another complementary resource, which I found via Green and Gerber's new Field Experiments textbook that you reviewed recently, is Boutron, John, and Torgerson, "Reporting methodological items in randomized experiments in political science", Annals of the American Academy of Political and Social Science, 2010:

http://ann.sagepub.com/content/628/1/112.short

About the analysis when stratified randomization ensures balance, there are a number of defensible methods, including the one you mentioned. I think it's useful to separate two questions:

1) Are we estimating the simple average treatment effect for the entire study sample, or a weighted ATE that may weight the strata differently from their shares of the study sample? (And have we made clear to ourselves and our readers which one we're estimating?)

2) Do our standard errors, confidence intervals, and significance tests take into account (i.e., give us credit for) the blocking that ensures balance on certain baseline variables?

On #1, if the probability of treatment varies across strata, then OLS regression of the outcome on the treatment group dummy and strata dummies (without interactions) yields a consistent estimate of a weighted ATE, where the strata are weighted not by their shares of the study sample, but in a variance-minimizing way. (This is the Angrist 1998 Econometrica result, which is also discussed in Mostly Harmless Econometrics, section 3.3.1, and I believe in Morgan and Winship's causal inference book, although I don't have that book.)
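The weighting can be checked numerically. The sketch below uses hypothetical data and computes the coefficient on the treatment dummy via the Frisch-Waugh-Lovell theorem (demeaning treatment within strata) rather than by calling a regression package; it then compares that coefficient with the stratum-specific differences in means weighted by n_s · p_s · (1 − p_s), where n_s is the stratum size and p_s is the treated fraction in stratum s.

```python
from statistics import mean

# Hypothetical strata: (treated outcomes, control outcomes).
# The treated fraction is 0.25 in stratum A and 0.5 in stratum B.
strata = {
    "A": ([5, 7], [4, 5, 6, 4, 5, 6]),
    "B": ([9, 10, 11, 10], [6, 7, 8, 7]),
}


def ols_with_strata_dummies(strata):
    """Coefficient on treatment from OLS of the outcome on a treatment dummy
    and strata dummies (no interactions). By Frisch-Waugh-Lovell this is the
    slope of the outcome on the within-stratum demeaned treatment dummy."""
    num = den = 0.0
    for treated, control in strata.values():
        y = treated + control
        d = [1] * len(treated) + [0] * len(control)
        d_bar = mean(d)
        for di, yi in zip(d, y):
            num += (di - d_bar) * yi
            den += (di - d_bar) ** 2
    return num / den


def variance_weighted_average(strata):
    """Stratum differences in means, weighted by n_s * p_s * (1 - p_s)."""
    num = den = 0.0
    for treated, control in strata.values():
        n = len(treated) + len(control)
        p = len(treated) / n
        weight = n * p * (1 - p)
        num += weight * (mean(treated) - mean(control))
        den += weight
    return num / den


ols = ols_with_strata_dummies(strata)
weighted = variance_weighted_average(strata)
print("OLS coefficient on treatment:", ols)
print("variance-weighted average:   ", weighted)
```

In this example the two strata are the same size and their differences in means are 1 and 3, so a sample-share-weighted average would be 2; the regression coefficient is larger because it gives more weight to stratum B, whose treated fraction is closer to 50%.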

Now if the study sample is a convenience sample, then this weighted ATE may not be any less policy-relevant than the simple ATE, so I wouldn't argue that there's one right estimand. But I'm not sure whether all researchers who use this method are aware of the Angrist result, and if they are, many of them are not making clear to their readers what the estimand is.

One of the many reasons I like the Ashenfelter & Plant paper is that they're very explicit about their estimand. (The "hands above the table" expression in my paper is an allusion to "hands-above-the-table econometrics". The other week I asked a reliable source if he remembered who coined the phrase. He said, "I believe that was Orley.")

Instead of using regression, they take the difference in means within each stratum and then form a weighted average of the stratum-specific treatment effect estimates. (This can equivalently be done by regression with treatment-by-stratum interactions if you recenter the stratum indicators appropriately.) For efficiency reasons, they also do not estimate the simple ATE, but they make this very clear. Regression often hides these issues if researchers and readers aren't careful. I'm not saying there's one right analysis, I just want to see more transparency.

On #2, you're making a very important point that we need to take into account the precision gains from blocking, and regressing on strata dummies is an attempt to do that. I think it'll work well in many cases, especially when the treatment probability is close to 50%. However, I haven't seen a discussion that justifies it without assuming a parametric model. Again, Ashenfelter & Plant are more transparent: the SE for their weighted average of stratum-specific treatment effect estimates can be easily derived from the stratum-specific SEs, and it takes into account the gains from blocking without any need for parametric assumptions.
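In symbols, a sketch of that calculation (the notation is assumed here for illustration and is not taken from Ashenfelter & Plant): let $\hat\tau_s$ be the difference in means in stratum $s$, let $w_s$ be fixed weights that sum to one, let $n_{1s}$ and $n_{0s}$ be the numbers of treated and control units in stratum $s$, and let $s_{1s}^2$ and $s_{0s}^2$ be the corresponding sample variances of the outcome. If randomization is carried out independently within each stratum, then

$$
\hat\tau = \sum_s w_s \hat\tau_s,
\qquad
\widehat{SE}(\hat\tau) = \sqrt{\sum_s w_s^2 \, \widehat{SE}(\hat\tau_s)^2},
\qquad
\widehat{SE}(\hat\tau_s)^2 = \frac{s_{1s}^2}{n_{1s}} + \frac{s_{0s}^2}{n_{0s}}.
$$

The within-stratum variance estimator shown is the usual conservative one from Neyman's randomization framework; only within-stratum variation enters, which is how the gains from blocking are credited.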

Cyrus Samii has been thinking about these issues too, and he may already have notes that explain these issues more clearly than I have. I think it would be useful for all of us to keep discussing this; I don't claim to have all the right answers. Thanks again for your helpful comments.