Published on Development Impact

# Guest Post by Winston Lin: Regression adjustment in randomized experiments: Is the cure really worse than the disease? (Part II)

1. Putting the bias issue in perspective

Yesterday’s post addressed two of the three problems David Freedman raised about regression adjustment. Let’s turn to problem #3, the small-sample bias of adjustment. (More technical details and references can be found in my paper.)

Lots of popular methods in statistics and econometrics have a small-sample bias: e.g., instrumental variables, maximum likelihood estimators of logit and other nonlinear models, and ratio estimators of population means and totals in survey sampling. Ratio estimators are a special case of regression estimators. They have a bias of order 1 / N, and Freedman himself did not oppose them: "Ratio estimators are biased, because their denominators are random: but the bias can be estimated from the data, and is usually offset by a reduction in sampling variability. Ratio estimators are widely used." In the paper, I estimate the bias of OLS adjustment in an empirical example with N = 157, and it appears to be minuscule.
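A quick simulation can make the order-1/N behavior concrete. The sketch below uses a toy finite population of my own construction (not the paper's example): it draws repeated simple random samples and estimates the bias of the ratio estimator of the population mean at two sample sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy finite population (invented numbers): y is roughly linear in x with a
# nonzero intercept; the intercept is what gives the ratio estimator its
# small-sample bias.
pop_x = rng.uniform(1.0, 5.0, size=4_000)
pop_y = 5.0 + 2.0 * pop_x + rng.normal(0.0, 1.0, size=4_000)
true_mean = pop_y.mean()

def ratio_estimate(idx):
    # Ratio estimator of the mean of y: (ybar / xbar) * population mean of x.
    return pop_y[idx].mean() / pop_x[idx].mean() * pop_x.mean()

def bias(n, reps=10_000):
    # Average the estimator over many simple random samples of size n.
    ests = [ratio_estimate(rng.choice(pop_x.size, size=n, replace=False))
            for _ in range(reps)]
    return np.mean(ests) - true_mean

bias_small, bias_large = bias(10), bias(100)
print(bias_small, bias_large)  # the bias at N = 100 is much closer to zero
```

With the bias shrinking like 1/N, tenfold growth in the sample size should shrink it roughly tenfold, which is what the two printed values illustrate.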

I also want to clarify the meaning of unbiasedness in Neyman's and Freedman's randomization inference framework. Here, an unbiased estimator is one that gets the right answer on average, over all possible randomizations. From this unconditional or ex ante perspective, the unadjusted difference in means is unbiased. But ex post, you're stuck with the randomization that actually occurred. Going back to our hypothetical education experiment, suppose the treatment group had a significantly higher average baseline (9th-grade) reading score than the control group. (Let's say the difference is both substantively and statistically significant.) Knowing what we know about the baseline difference, can we credibly attribute all of the unadjusted difference in mean outcomes (10th-grade reading scores) to the treatment? If your statistical consultant says, "That's OK, the difference in means is unbiased over all possible randomizations," you might find that a bit Panglossian.

Alternatively, we can ask for conditional unbiasedness, averaging only over randomizations that yield a similar baseline imbalance. From this ex post perspective, the difference in means is biased. So if N is small enough to make you worry about the unconditional bias of OLS adjustment, that's not a good reason to be satisfied with the difference in means. It may be a good reason to try an adjustment method that is unconditionally unbiased and also has reasonable conditional properties. Middleton & Aronow and Miratrix, Sekhon, & Yu have promising approaches.
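To see the unconditional/conditional distinction in a toy simulation (hypothetical numbers, not data from any actual experiment): generate outcomes under a strict null hypothesis with a predictive baseline covariate, then compare the difference in means averaged over all randomizations with its average over only the badly imbalanced ones.

```python
import numpy as np

rng = np.random.default_rng(1)

# Strict null: treatment has no effect, so y is the outcome under either arm.
# The outcome is strongly correlated with the baseline score x.
n = 50
x = rng.normal(50.0, 10.0, size=n)
y = x + rng.normal(0.0, 5.0, size=n)

all_diffs, imbalanced_diffs = [], []
for _ in range(20_000):
    treat = rng.permutation(n) < n // 2            # complete randomization, 25 vs. 25
    diff = y[treat].mean() - y[~treat].mean()      # unadjusted difference in means
    all_diffs.append(diff)
    if x[treat].mean() - x[~treat].mean() > 5.0:   # keep only draws with a sizable
        imbalanced_diffs.append(diff)              # baseline imbalance favoring treatment

print(np.mean(all_diffs))         # near 0: unconditionally unbiased
print(np.mean(imbalanced_diffs))  # clearly positive: conditionally biased
```

The first average is essentially zero, as Neyman's theory promises; the second is far from zero, which is the ex post problem the statistical consultant's reassurance glosses over.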

2. Tukey's perspective

I mentioned yesterday that the classic precision-improvement argument isn't the only possible rationale for adjustment. The conditional bias of the difference in means is another. John Tukey gave a related but broader perspective: In a clinical trial, some doctors and some statisticians will wonder if your unadjusted estimate is a fluke due to baseline imbalances. Adjustment (as either the primary analysis or a robustness check) is part of doing your homework to check for that.

Tukey writes: "The main purpose of allowing [adjusting] for covariates in a randomized trial is defensive: to make it clear that analysis has met its scientific obligations. (If we are very lucky ... we may also gain accuracy [precision], possibly substantially. If we must lose some sensitivity [power], so be it.)"

My translations in square brackets should be taken with a grain of salt; Tukey believed in "the role of vague concepts". David Brillinger tells a hilarious story about the difference between Tukey and Freedman (p. 8).

Tukey and Freedman might disagree about whether the primary analysis in a randomized trial should be adjusted or unadjusted. But they both argued that statistics should not be used to sweep scientific issues under the rug. Our job is to expose uncertainty, not to hide it.

Although I don't feel qualified to be a go-to person for all practical questions about adjustment, I'll endorse three suggestions that overlap with those of Tukey, Freedman, and many other researchers.

3. Suggestions for researchers, referees, and journal editors

1. Editors and referees should ask authors to report unadjusted estimates, even if the primary analysis is adjusted. (Page limits are an issue, but Web appendices can be used, and readers should at least be given access to a summary table of unadjusted estimates.)

2. When possible, researchers should pre-specify their primary analyses (e.g., outcome measures and subgroups, as well as regression models if adjustment is to be used for the primary analysis). It's OK to add unplanned secondary analyses or even to change the primary analysis when there are especially compelling reasons, but departures from the pre-specified plans should be clearly labeled as such. One example to follow is Amy Finkelstein et al.'s analysis of the Oregon health insurance experiment. Finkelstein et al. publicly archived a detailed analysis plan before seeing the outcome data for the treatment group. (But see also Larry Katz’s comments on natural experiments and secondary analyses.)

3. Researchers, referees, and editors should study Moher et al.’s (2010) detailed recommendations for improved transparency and consider adapting some of them as appropriate to their fields.

Freedman gave recommendations for practice at the end of the second paper and expanded on them in these two excellent pieces. Jas Sekhon (who collaborated with Freedman and co-edited his selected papers) tells me, “David said that he didn't reject adjustment per se, but rejected it being done without taste and judgment. But both taste and judgment are in short supply, so as a general rule, he feared making adjustment common practice, especially if the unadjusted estimates were not presented. A pre-analysis plan would help to alleviate some of his concerns.”

Moher et al. (2010) is the latest version of the CONSORT Explanation and Elaboration document, which Freedman also recommended. The suggestion here is not to simply follow the CONSORT checklist, but to study Moher et al.’s detailed explanations. While searching for an empirical example for my paper, I read a number of experimental studies in leading economics journals and started replicating a few of them. Common problems I encountered were: (1) it was unclear whether randomization was stratified and if so, how; (2) researchers adjusted for post-treatment covariates, but I wouldn’t have known this if I hadn’t started replicating their analyses; (3) sample exclusions were not clearly explained, and the decision to adjust resulted in sample exclusions; and (4) the description of the treatment didn’t give enough information about services received by the control group. And these were papers I liked.

In some experiments, the probability of treatment varies across strata. In that case, what I mean by an “unadjusted estimate” is not the overall difference in means, but a weighted average of stratum-specific differences in means. Ashenfelter & Plant is a model of clarity on this. Space doesn’t allow me to discuss important questions about how this kind of unadjusted analysis compares with common practice in economics, so we’ll have to save that for another time.
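As a sketch of what this weighting looks like (with made-up strata, treatment probabilities, and effect sizes), the code below compares the overall difference in means, which is confounded when treatment probabilities differ across strata, with the stratum-weighted version.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two strata with different treatment probabilities (1/2 vs. 1/4) and
# different outcome levels; the true treatment effect is 2 in both strata.
stratum = np.repeat([0, 1], [200, 200])
p_treat = np.where(stratum == 0, 0.5, 0.25)
treat = rng.random(400) < p_treat
y = 5.0 * stratum + 2.0 * treat + rng.normal(0.0, 1.0, size=400)

# The overall difference in means mixes the treatment effect with the
# stratum composition of the two groups...
naive = y[treat].mean() - y[~treat].mean()

# ...while the stratified "unadjusted" estimate weights each stratum's
# difference in means by the stratum's share of the sample.
est = 0.0
for s in np.unique(stratum):
    in_s = stratum == s
    diff_s = y[in_s & treat].mean() - y[in_s & ~treat].mean()
    est += in_s.mean() * diff_s

print(naive, est)  # est is close to the true effect of 2; naive is not
```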

Takeaways

My paper isn’t meant to provide guidance on all practical issues (that’s part of the reason for “Agnostic” in the title). Its goal is to add to the research conversation, not to bring it to resolution, and its first priority is to put Freedman’s critique in perspective. Freedman’s papers give a theoretical analysis of a completely randomized experiment with no attrition—not for realism, but because, as Paul Krugman writes, “Always try to express your ideas in the simplest possible model. The act of stripping down to this minimalist model will force you to get to the essence of what you are trying to say.” I used the same framework to show just how little needs to be changed to get more benign results. Parts of my paper are technical, but readers can skip around without reading all the details. It recommends helpful references and includes an empirical example with discussion of why adjustment often hardly makes a difference to precision in social experiments. This blog post is more expansive in some ways, but it’s not a substitute for the paper.

In the real world, we often have to deal with issues like stratification, clustering, survey nonresponse, and attrition. Also, linear regression isn't the only possible adjustment method: some people prefer matching, for example, and others are using machine learning algorithms. Other people have much more expertise than I do in most of these areas. So the practical takeaways I end with are necessarily oversimplified:

- In moderate to large samples with covariates that are helpful in predicting the outcome (e.g., if the outcome is 10th-grade reading scores and the covariate is 9th-grade reading scores for the same students), careful regression adjustment is a reasonable approach.

- For careful adjustment, the covariates should be pre-specified and unaffected by the treatment, and the number of covariates should be much smaller than the sample size.

- Adjustment with treatment-by-covariate interactions (as described earlier) will often be a reasonable method in moderate to large samples, and may perform better than adjustment without interactions when the design is imbalanced (i.e., the treatment and control groups are of unequal size).

- Adjustment opens the door to fishing and loss of transparency unless strict protocols are followed. If researchers choose to adjust, they should also report unadjusted estimates.
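To make the interacted-adjustment suggestion concrete, here is a minimal sketch on invented data: center the covariate, regress the outcome on treatment, the centered covariate, and their interaction, and read off the coefficient on treatment as the adjusted estimate of the average effect.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy experiment (all numbers invented): a baseline score x strongly
# predicts the outcome y, and the true treatment effect is 2.
n = 200
x = rng.normal(0.0, 1.0, size=n)
treat = rng.permutation(n) < n // 2          # complete randomization
y = 3.0 * x + 2.0 * treat + rng.normal(0.0, 1.0, size=n)

# Interacted adjustment: with the covariate centered at its sample mean,
# the OLS coefficient on treatment estimates the average treatment effect.
xc = x - x.mean()
X = np.column_stack([np.ones(n), treat, xc, treat * xc])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

unadjusted = y[treat].mean() - y[~treat].mean()
adjusted = beta[1]
print(unadjusted, adjusted)  # both near 2; adjustment exploits x's predictive power
```

Centering the covariate before interacting is what makes the treatment coefficient interpretable as an average effect rather than an effect at x = 0.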

Acknowledgments

David Freedman passed away in 2008. I only met him once, but he was very generous to me with unsolicited help and advice after I sent him comments on three papers. He encouraged me to study at Berkeley even though (or perhaps because) he knew my thoughts on adjustment were not the same as his. (As always, he was also a realist: "You have to understand that the Ph.D. program is a genteel version of Marine boot camp. Some useful training, some things very interesting, but a lot of drill and hazing.") I remain a big fan of his oeuvre, and I hope it's clear from my "Further remarks" and final footnote that the paper is meant as not only a dissent, but also a tribute.

Larry Orr, Steve Bell, Howard Bloom, Steve Freedman, Daniel Friedlander, and Steve Kennedy taught me much about experiments that I wouldn’t have learned otherwise.

Many thanks to Berk Özler for helpful discussions and all his work to shape and edit this piece, and to Jed Friedman, David McKenzie, Jas Sekhon, Dylan Small, and Terry Speed for helpful comments. Any errors are my own.

Winston Lin is a Ph.D. candidate in statistics at UC Berkeley. He used to make regression adjustments (always pre-specified) for a living at Abt Associates and MDRC.
