# Guest Post by Winston Lin - Regression adjustment in randomized experiments: Is the cure really worse than the disease? (Part I)


Random assignment is intended to create comparable treatment and control groups, reducing the need for dubious statistical models. Nevertheless, researchers often use linear regression models to "adjust" for random treatment-control differences in baseline characteristics. In 2008, David Freedman published two papers critiquing this common practice. Freedman's critique is often interpreted as implying that adjustment is a bad idea, and it's had a strong influence on methodology discussions in political science and criminology. Berk and Jed tell me it's also mentioned in referee reports and seminars in development economics.

My paper "Agnostic Notes on Regression Adjustments to Experimental Data: Reexamining Freedman's Critique" (Annals of Applied Statistics, forthcoming) tries to clarify the issues Freedman raised. (Slides from an earlier talk are also available.) Development Impact kindly invited me to give an overview and asked me to go further in discussing practical implications. This is the first post of a two-part series; the second part (tomorrow) will discuss the small-sample bias of adjustment, along with practical suggestions for researchers, referees, and journal editors in the social sciences.

1. Conventional wisdom before Freedman

The classic rationale for OLS adjustment (which assumes the regression model is true) is that it tends to improve precision, under these conditions:

• The covariates are pre-specified, correlated with the outcome, and unaffected by treatment.
• The number of covariates (K) is much smaller than the sample size (N).

E.g., suppose we conduct an experiment to estimate the effect of an education program on 10th-grade reading scores (Y), and we also have data on the same students' 9th-grade reading scores (X), measured before random assignment. Common practice is to regress Y on T (the treatment group dummy) and X, regardless of whether the treatment and control groups have different mean values of X. The covariate "soaks up" variation in Y and thus reduces the standard error of the estimated treatment effect.
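
To make the textbook rationale concrete, here is a minimal numpy simulation (the data-generating process and variable names are invented for illustration): when the baseline score X is strongly correlated with Y, the conventional SE on the treatment dummy shrinks noticeably once X is added to the regression.

```python
# Hypothetical illustration: adjusting for a baseline covariate X
# (e.g., 9th-grade scores) shrinks the SE of the estimated effect on Y.
import numpy as np

rng = np.random.default_rng(0)
n = 400
x = rng.normal(size=n)                           # baseline scores, standardized
t = rng.permutation(np.repeat([0, 1], n // 2))   # balanced random assignment
y = 2.0 * x + 1.0 * t + rng.normal(size=n)       # true treatment effect = 1.0

def ols_se(design, y):
    """OLS coefficients and conventional SEs for a given design matrix."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    df = len(y) - design.shape[1]
    cov = (resid @ resid / df) * np.linalg.inv(design.T @ design)
    return beta, np.sqrt(np.diag(cov))

ones = np.ones(n)
_, se_unadj = ols_se(np.column_stack([ones, t]), y)     # Y ~ T
_, se_adj = ols_se(np.column_stack([ones, t, x]), y)    # Y ~ T + X
print(se_unadj[1], se_adj[1])  # the adjusted SE on T is smaller
```

Here X "soaks up" most of the residual variance in Y, so the SE on T drops by roughly the factor the R-squared rule of thumb would predict.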

Adjustment can hurt precision if K is too big relative to N or if the covariates have very little correlation with the outcome. Cox & McCullagh (p. 547) give a rule-of-thumb formula that can suggest both when to worry and how high the R-squared has to be for adjustment to be of much help. Duflo, Glennerster, & Kremer (sec. 4.4) give a useful summary of conventional wisdom.

There are other possible rationales for adjustment (as we’ll discuss tomorrow). But precision improvement is the textbook rationale and the one challenged by Freedman's critique.

2. What did Freedman show?

Freedman analyzed OLS adjustment without assuming a regression model. He used Jerzy Neyman's framework for randomization inference, which avoids dubious assumptions about functional forms and error terms. Random assignment of the finite study population is the source of randomness, treatment effects can vary across study participants, and the goal is to estimate the average treatment effect. (Jed blogged earlier about R.A. Fisher's version of randomization inference, which tests the sharp null hypothesis that treatment had no effect on anyone.)

Freedman found 3 problems with adjustment:

1. Contrary to the classic rationale, adjustment can help or hurt precision asymptotically (even as N grows large with the number of covariates fixed, and even when the covariates are correlated with the outcome).
2. The conventional OLS standard error estimator is inconsistent.
3. The adjusted treatment effect estimator has a small-sample bias. The bias is of order 1 / N, so it diminishes rapidly as N grows.

To explain these results, Freedman shows that a randomized experiment cannot satisfy the assumptions of a classical linear model. He writes (beginning in perfect iambs), "The reason for the breakdown is not hard to find: randomization does not justify the assumptions behind the OLS model."

My paper shows that problems #1 and #2 can be easily fixed with simple tools of modern OLS regression. On #3, I briefly try to put the issue in perspective and point to new developments from Middleton & Aronow and Miratrix, Sekhon, & Yu.

A further problem is that in the absence of a strict protocol, regression adjustment can open the door to fishing (ad hoc specification searching). My abstract argues, "The strongest reasons to support [Freedman's] preference for unadjusted estimates are transparency and the dangers of specification search." We’ll come back to transparency and fishing tomorrow, but they are not the issues raised in Freedman's critique.

3. Does adjustment invalidate the standard errors?

Under the same assumptions used by Freedman, the Huber-White sandwich ("robust") SE estimator is consistent or asymptotically conservative. Non-economists who are unfamiliar with the sandwich may want to check out the "agnostic regression" discussion in Mostly Harmless Econometrics (pp. 40-48). (In large samples, the sandwich and nonparametric bootstrap SE estimators are very similar.)

Moreover, in a two-arm trial with a balanced design (i.e., the treatment and control groups have equal size), even the conventional OLS SE estimator is consistent or asymptotically conservative. This result was shown by Freedman but isn't mentioned in his own or many others' summaries of his critique.

As Berk discussed, the sandwich estimator can be unreliable in small samples. But small-sample inference is fragile regardless of whether you adjust. In fact, Freedman's preferred SE estimator for unadjusted estimates is equivalent to the HC2 sandwich estimator, as also noted by Samii & Aronow.
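
The Samii & Aronow observation can be checked numerically. The sketch below (with made-up data) computes the HC2 sandwich SE by hand for the unadjusted regression of Y on T and compares it to the textbook Neyman SE, sqrt(s1^2/n1 + s0^2/n0):

```python
# Numerical check: for the regression of Y on a constant and T, the HC2
# sandwich SE on T equals the Neyman SE for a difference in means.
import numpy as np

rng = np.random.default_rng(1)
n = 100
t = rng.permutation(np.repeat([0, 1], n // 2)).astype(float)
y = 1.0 + 0.5 * t + rng.normal(size=n)

# Neyman SE for the difference in means (sample variances with ddof=1)
s1 = y[t == 1].var(ddof=1)
s0 = y[t == 0].var(ddof=1)
neyman_se = np.sqrt(s1 / (t == 1).sum() + s0 / (t == 0).sum())

# HC2 sandwich SE computed from scratch: weight squared residuals by 1/(1-h)
X = np.column_stack([np.ones(n), t])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
bread = np.linalg.inv(X.T @ X)
h = np.einsum('ij,jk,ik->i', X, bread, X)          # leverage values
meat = X.T @ (X * (resid**2 / (1 - h))[:, None])
hc2_se = np.sqrt(np.diag(bread @ meat @ bread))[1]
print(abs(hc2_se - neyman_se))  # agrees up to floating-point error
```

The agreement is exact (not just asymptotic), which is why "unadjusted plus Neyman SE" and "unadjusted OLS plus HC2" are the same procedure.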

4. Precision improvement: Is the conventional wisdom wrong?

Going back to problem #1, I'll say more about practical implications here than I do in the paper.

First, when does OLS adjustment hurt precision? We saw in section 1 above that the conventional wisdom already included some caveats. Freedman's analysis suggests we need more caveats. In the first paper's abstract, he writes, "Since randomization does not justify the models, almost anything can happen." I think that's too nihilistic, and here's why:

• Freedman himself shows that in a two-arm trial with a balanced design, adjustment cannot hurt asymptotic precision. This is only an asymptotic result, but it suggests that the conventional caveats suffice here.
• In remark (v), p. 10 of my paper, I briefly note an immediate consequence of Freedman's analysis: In a two-arm trial, in order for adjustment to hurt asymptotic precision, either the design must be so imbalanced that over 75% of the subjects are assigned to one group, or the covariate must covary more with the treatment effect than with the expected outcome. (These are necessary but not sufficient conditions.)

Second, my paper shows that a more refined OLS adjustment (using treatment-by-covariate interactions as described in the next paragraph) cannot hurt asymptotic precision, even when the regression model is horribly misspecified. (For intuition, I use analogies with survey sampling and Cochran's wonderful book. As I say on my 18th slide, this result is "surprising, but not completely new". The paper's intro cites precedents that assume random sampling from a superpopulation.)

Conceptually, we can think of this adjustment as running separate regressions of Y on X in the treatment and control groups, and using these to predict the average outcome for the entire sample (1) if everyone were assigned to treatment and (2) if everyone were assigned to control. A computational shortcut is to regress Y on T, X, and T * (X - xbar), or equivalently to regress Y on T, X - xbar, and T * (X - xbar), where xbar is the mean covariate value for the entire sample. Then the coefficient on T estimates the average treatment effect (ATE) for the entire sample.
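
The equivalence between the two descriptions above can be verified in a few lines of numpy (the data-generating process here is invented for illustration): the coefficient on T in the interacted regression exactly reproduces the "predict everyone under each arm" estimator built from separate within-group fits.

```python
# Sketch: the interacted-regression shortcut equals the imputation
# estimator from separate regressions of Y on X in each arm.
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=n)
t = rng.permutation(np.repeat([0, 1], n // 2)).astype(float)
y = 1.0 + x + 0.5 * t + 0.3 * t * x + rng.normal(size=n)  # heterogeneous effects

def fit(design, y):
    return np.linalg.lstsq(design, y, rcond=None)[0]

# (1) Separate regressions of Y on X within each arm, then predict the
#     whole sample's mean outcome under treatment and under control.
b1 = fit(np.column_stack([np.ones((t == 1).sum()), x[t == 1]]), y[t == 1])
b0 = fit(np.column_stack([np.ones((t == 0).sum()), x[t == 0]]), y[t == 0])
ate_imputed = (b1[0] + b1[1] * x.mean()) - (b0[0] + b0[1] * x.mean())

# (2) One regression on T, X - xbar, and T * (X - xbar); read off
#     the coefficient on T.
xc = x - x.mean()
design = np.column_stack([np.ones(n), t, xc, t * xc])
ate_interacted = fit(design, y)[1]

print(ate_imputed, ate_interacted)  # identical up to rounding
```

Because the interacted regression is fully saturated in T, it fits the two groups' lines separately; centering X at xbar makes the coefficient on T the difference in predictions at the full-sample covariate mean, which is exactly the imputation estimator.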

The summary of asymptotic results (Section 4.2 of the paper) suggests that among three consistent estimators of ATE (the unadjusted difference in means, the usual adjustment, and the interacted adjustment), the interacted adjustment is always either most precise or tied for most precise. Stepping out of asymptopia, I think the practical implications are:

• In a two-arm trial with a balanced or nearly balanced design, the conventional wisdom is reasonable for the usual adjustment. I.e., adjustment for pre-specified baseline covariates tends to improve precision, if the covariates are correlated with the outcome and the number of covariates is much smaller than the sample size.
• The conventional wisdom is reasonable more generally for the interacted adjustment (if the number of covariates is much smaller than the sample size of the smallest group).
• If the design is very imbalanced and there are heterogeneous treatment effects that are strongly related to the covariates, then the interacted adjustment tends to be more precise than the usual one--possibly much more precise. Otherwise, the difference is likely to be small; the interacted adjustment won't necessarily do better, but it won't do much worse. (All of this assumes the number of covariates is much smaller than the sample size of the smallest group.)
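
The last bullet can be illustrated with a small randomization-based Monte Carlo (all numbers below are invented for illustration): fix potential outcomes with a very imbalanced design and treatment effects strongly related to the covariate, redraw the assignment many times, and compare the spread of the usual and interacted estimators.

```python
# Hypothetical Monte Carlo: 85% of units treated, treatment effect 1 + 2x.
# Potential outcomes are fixed; only the random assignment is redrawn.
import numpy as np

rng = np.random.default_rng(3)
n, n1, reps = 1000, 850, 1000
x = rng.normal(size=n)
y0 = x + rng.normal(scale=0.5, size=n)   # control potential outcomes
y1 = y0 + 1.0 + 2.0 * x                  # heterogeneous, covariate-related effect
xc = x - x.mean()

def coef_on_t(design, y):
    return np.linalg.lstsq(design, y, rcond=None)[0][1]

usual, inter = [], []
for _ in range(reps):
    t = np.zeros(n)
    t[rng.choice(n, n1, replace=False)] = 1.0
    y = np.where(t == 1, y1, y0)
    usual.append(coef_on_t(np.column_stack([np.ones(n), t, x]), y))
    inter.append(coef_on_t(np.column_stack([np.ones(n), t, xc, t * xc]), y))

# Empirical SDs over repeated randomizations of the same study population
print(np.std(usual), np.std(inter))  # the interacted SD is markedly smaller
```

With a balanced design or homogeneous effects in the same setup, the two SDs come out nearly identical, matching the "won't do much worse" part of the claim.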

Tomorrow we’ll discuss the bias of adjustment, Tukey vs. Freedman, and suggestions for improving transparency.

Winston Lin is a Ph.D. candidate in statistics at UC Berkeley. He used to make regression adjustments (always pre-specified) for a living at Abt Associates and MDRC.

Jed Friedman
July 19, 2012

... I think our audience will find it to be very useful and thought provoking. As you write, I and others have indeed received informal comments and referee reports claiming that adjusting for observables leads to biased inference (without supplemental caveats on small sample bias)... you will discuss this in particular tomorrow, but one interesting takeaway is that the precision arguments of Freedman don't seem to have settled in the minds of practitioners as much as bias. Thanks again and looking forward to more! Jed.

Winston
July 19, 2012

I really don't know whether people find the small-sample bias to be most troubling of the 3 issues (to me it was the least troubling, but I have to admit I was prejudiced because the experiments I've worked on had large samples), or whether they're not remembering the other issues.

Most of us are busy with multiple projects, so it's hard to find time to read and digest papers, and I have friends who told me they just didn't remember exactly what Freedman said. On the other hand, I've seen very good methodologists writing about Freedman's critique as if the key or only issue is the small-sample bias. Since I admire their other work, I'd like to assume they had a sophisticated reason for this, e.g. they think the validity of inference is the first priority (which I agree with) and they know the SE issue is really about conventional vs. sandwich SEs, so we're left with the bias issue. But I don't know if that's what they're thinking.

I've seen other very good methodologists emphasizing the inconsistent SE issue as the most important of Freedman's issues. And I agree with them, but it's solved by the sandwich SE.

David Judkins
September 07, 2012

Winston,

I found your paper very interesting. I wonder how the interact estimator of ATE is related to the approach of Gary Koch, which, as I recall, is to fit a working model on the entire dataset without the treatment variable, and then to analyze the relationship of the residuals from the working model to treatment status. If the sample size is small, one uses a permutation test. Otherwise, one can use Z-tests. LeSaffre and Senn really panned it in 2003 but they only simulated the large-sample version.

--Dave

Winston Lin
September 08, 2012

Thanks, Dave. I haven't read Koch's papers in detail, but Daniel Rubin and Mark van der Laan discuss several estimators that have equal asymptotic variance, including their own targeted ANCOVA as well as OLS with interactions and Koch et al.'s (1998) estimator: