If you want to know the average impact of being assigned the __option__ of some “treatment” (the so-called “intent-to-treat” parameter), then you will get a good (unbiased) estimate by comparing the mean outcome for an experimental group that is randomly assigned the treatment with that for another group randomly denied that option.

However, it is often the case that we also want to know the impact of actually receiving the treatment. Participation in social experiments is voluntary and selective take-up is to be expected. This is a well-known source of bias. The experimenter’s standard fix is to use the randomized assignment as an instrumental variable (IV) for treatment status. Advocates of this approach argue that randomization is an indisputable source of exogenous and independent variation in treatment status—the “gold standard” for identifying causal impact.

One requirement for a valid IV can be readily granted: the randomized assignment to treatment will naturally be correlated with receiving the treatment. But let’s take a closer look at the other, no less important, requirement, namely that the IV only affects outcomes via treatment—the so-called “exclusion restriction.”
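The two requirements can be written compactly. The notation here is an assumption introduced for illustration, not taken from the post: $Z_i$ is the randomized assignment, $D_i$ indicates actually receiving the treatment, $Y_i$ is the outcome, and $u_i$ is the error term of the outcome regression. With a binary assignment, the IV estimator reduces to the familiar ratio of the intent-to-treat difference in outcomes to the difference in take-up rates.

```latex
% Assumed notation: Z_i = randomized assignment, D_i = treatment received,
% Y_i = outcome, u_i = error term of the outcome regression.
\begin{gather*}
Y_i = \alpha + \beta D_i + u_i \\
\text{Relevance: } \operatorname{Cov}(Z_i, D_i) \neq 0
\qquad
\text{Exclusion: } \operatorname{Cov}(Z_i, u_i) = 0 \\
\hat{\beta}_{IV} = \frac{\bar{Y}_{Z=1} - \bar{Y}_{Z=0}}{\bar{D}_{Z=1} - \bar{D}_{Z=0}}
\end{gather*}
```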

Any imaginable intervention will have diverse impacts, possibly with losers as well as gainers. Such impact heterogeneity can be safely ignored if the differences are uncorrelated with the actual placement of the intervention. But that is hardly plausible. People make rational choices about whether to participate in experiments. Some of the reasons for their choices can be observed as data and so we can control for them by adding interaction effects with the treatment status to the regression for outcomes. That is good practice for understanding impact heterogeneity.

However, people almost certainly base their choices on things they know but the analyst does not know. Take-up will depend on the latent gains from take-up. This gives rise to what Jim Heckman and his coauthors call “essential heterogeneity.” (The idea goes back a long way in Heckman’s writings, but for a good recent discussion see the 2006 paper by Heckman, Urzua and Vytlacil.) This is such an intuitively plausible idea that the onus should be on analysts to establish on *a priori* grounds why it does __not__ exist. Yet it is still rare for experimenters to consider the implications of essential heterogeneity. (This is not, of course, the only problem faced in practice; I discuss a much wider range of issues here.)

It is not hard to see that essential heterogeneity invalidates randomized assignment as an instrumental variable for identifying mean impact. Amongst the units assigned the option for treatment, those with higher expected gains are more likely to participate. Thus the behavioral responses will entail that the assignment is correlated with the error term—specifically with the interaction between treatment and the latent gains from treatment. The randomized assignment is not then excludable from the main regression, and so it does not provide a valid IV—hardly the gold standard!
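One way to see the argument is to write the outcome regression with a person-specific impact. The symbols are assumptions made for this sketch: $Z_i$ is the randomized assignment, $D_i$ is treatment received, $\beta_i$ is the impact for person $i$, $\bar{\beta}$ is the population mean impact, and $\varepsilon_i$ is an error term unrelated to the assignment. The regression that aims at $\bar{\beta}$ then carries the interaction between treatment and the latent gain in its error term.

```latex
% Assumed notation: beta_i = person-specific impact, \bar{\beta} = E[beta_i].
\begin{gather*}
Y_i = \alpha + \beta_i D_i + \varepsilon_i
    = \alpha + \bar{\beta} D_i
      + \underbrace{\varepsilon_i + (\beta_i - \bar{\beta}) D_i}_{u_i} \\
\operatorname{Cov}(Z_i, u_i)
    = \operatorname{Cov}\bigl(Z_i, (\beta_i - \bar{\beta}) D_i\bigr) \neq 0
\quad \text{when take-up rises with } \beta_i
\end{gather*}
```

Randomization makes $Z_i$ independent of $\varepsilon_i$ and of $\beta_i$, but not of the product $(\beta_i - \bar{\beta}) D_i$, because $D_i$ can only equal one among those assigned the option.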

Just how much of a problem this is depends in part on what you want to learn from the impact evaluation. If you only want the mean impact for those actually attracted to the treatment in the randomized trial then the IV estimator will give you that number in sufficiently large samples. And this holds even though the randomized assignment is not a valid IV. In some applications this might be all you need, though it is rather limited information. You will not know how much the mean impact for those treated differs from the overall mean impact in the population. If you ignore the (likely) presence of essential heterogeneity, and assume that you have figured out the overall mean impact, then you could easily get it wrong in drawing inferences for scaling up the program based on the trial—which is after all the trial’s purpose.

A simple example will illustrate. First, let me describe the reality of a (highly) stylized world comprising 100 people. A policy intervention is introduced—access to an important new source of credit for financing investment. Counterfactual income (in the absence of the program) can take two possible values, namely an income of either $1 or $2 a day, with equal numbers of people having each income. For half those with $1, the causal impact of the credit scheme over some period is $1 (bringing their post-intervention income to $2), while it is zero for the other half. Similarly for those with the $2 income: half see their income rise to $3 while the rest see no gain. Then the mean impact is $0.50 and the total benefit when the credit scheme is available to all is $50. It can be assumed that only those who gain will participate, implying a take-up rate of 50% for the population as a whole when scaled up, or for any random sample.
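The arithmetic of this stylized world can be checked by enumerating the 100 people. This is a minimal sketch; the variable names are illustrative, and the take-up rule (only those who gain participate) is the assumption stated in the text.

```python
from fractions import Fraction

# Stylized population from the text: 100 people in four equal-sized types.
# Each tuple is (counterfactual income, causal impact of the credit scheme).
population = [(1, 1)] * 25 + [(1, 0)] * 25 + [(2, 1)] * 25 + [(2, 0)] * 25

# Total benefit if the scheme is available to everyone.
total_benefit = sum(gain for _, gain in population)

# Mean impact across the whole population.
mean_impact = Fraction(total_benefit, len(population))

# Assumption from the text: only those who gain take up the scheme.
take_up_rate = Fraction(
    sum(1 for _, gain in population if gain > 0), len(population)
)

print(total_benefit, mean_impact, take_up_rate)  # 50 1/2 1/2
```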

Now suppose that the policy maker does not know any of this, and decides to enlist an evaluator to do a randomized controlled trial to assess the likely benefits from introducing the credit scheme as an option for everyone. Following common practice, the evaluator mistakenly assumes that the scheme has the same impact for everyone (or that the heterogeneity is ignorable). As usual, a random sample gets access to the extra credit, with another random sample retained as controls. It is readily verified that the IV estimate of mean impact will be $1 in a sufficiently large sample, which is also the mean impact on those treated. Ignoring the heterogeneity, the policy maker will infer that the aggregate benefit from scaling up is $100, twice the true value.
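The verification takes only a few lines. The sketch below works in the large-sample limit by treating each trial arm as an exact copy of the population, which is a simplifying assumption, and it assumes the controls cannot obtain the credit. The IV estimate is computed as the intent-to-treat difference divided by the difference in take-up rates.

```python
from fractions import Fraction

# Stylized population from the text: (counterfactual income, impact of credit).
population = [(1, 1)] * 25 + [(1, 0)] * 25 + [(2, 1)] * 25 + [(2, 0)] * 25
n = len(population)

# Assigned arm: only those who gain take up the credit (assumption in the text).
outcomes_assigned = [y0 + (gain if gain > 0 else 0) for y0, gain in population]
mean_y_assigned = Fraction(sum(outcomes_assigned), n)
take_up_assigned = Fraction(sum(1 for _, gain in population if gain > 0), n)

# Control arm: nobody has access, so everyone keeps the counterfactual income.
mean_y_control = Fraction(sum(y0 for y0, _ in population), n)

# Intent-to-treat effect, then the IV estimate (take-up among controls is zero).
itt = mean_y_assigned - mean_y_control
iv_estimate = itt / take_up_assigned

# Scaling up under the mistaken assumption of a common impact for everyone.
naive_scaled_up_benefit = iv_estimate * n
true_total_benefit = sum(gain for _, gain in population)

print(itt, iv_estimate, naive_scaled_up_benefit, true_total_benefit)  # 1/2 1 100 50
```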

You might still feel confident that using randomization as an IV does at least do better than simply ignoring the problem of endogenous take-up—by using the naïve ordinary least squares (OLS) method of simply comparing the mean outcome for those treated with that for the control group. But such confidence would be misplaced.

Indeed, if essential heterogeneity is the only econometric problem to worry about then the naïve OLS estimator also delivers the mean treatment effect on the treated; the IV and OLS estimates converge in large samples. I show this in a new paper, found here. (Notice that in the above numerical example, the OLS estimate is also $1.) There is no gain from using the IV method! Indeed, OLS requires less data since one does not need to know the randomized assignment and the control group need only represent those for whom treatment is __not__ an option.
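The OLS figure for the numerical example can be checked the same way. With a single binary regressor, the OLS slope is just the difference in mean outcomes between those treated and the controls. As before, this sketch assumes the large-sample limit and that only gainers participate.

```python
from fractions import Fraction

# Stylized population from the text: (counterfactual income, impact of credit).
population = [(1, 1)] * 25 + [(1, 0)] * 25 + [(2, 1)] * 25 + [(2, 0)] * 25

# Those actually treated: people offered the credit who gain from it.
treated = [(y0, gain) for y0, gain in population if gain > 0]
mean_y_treated = Fraction(sum(y0 + gain for y0, gain in treated), len(treated))

# Controls: a group with no access, mirroring the population.
mean_y_control = Fraction(sum(y0 for y0, _ in population), len(population))

# Naive OLS estimate: difference in means, treated versus controls.
ols_estimate = mean_y_treated - mean_y_control

# Mean impact on the treated, for comparison.
mean_impact_on_treated = Fraction(sum(gain for _, gain in treated), len(treated))

print(ols_estimate, mean_impact_on_treated)  # 1 1
```

The two numbers coincide here because those who take up the credit have the same mean counterfactual income ($1.50) as the population, so there is no second source of bias.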

The two estimators only differ in large samples when there is some other source of bias. An extension to the standard formulation of the essential heterogeneity problem is to allow the same factors creating the heterogeneity to also matter to counterfactual outcomes. (I develop this extension in the new paper.) If the higher counterfactual outcomes due to these latent factors come hand-in-hand with higher returns to treatment then the IV estimator can still be trusted to reduce the OLS bias in mean impact. A training program providing complementary skills to latent ability is probably a good example.
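A hypothetical variant of the stylized example illustrates the same-direction case. The numbers are invented for illustration and are not taken from the paper: here everyone who gains also has the higher counterfactual income. The helper `large_sample_estimates` is an assumed name, and it relies on the same simplifications as before (arms mirror the population, only gainers participate, controls have no access).

```python
from fractions import Fraction

def large_sample_estimates(population):
    """Return (true mean impact, IV estimate, naive OLS estimate).

    population: list of (counterfactual outcome, gain from treatment) pairs.
    """
    n = len(population)
    takers = [(y0, gain) for y0, gain in population if gain > 0]
    mean_y0 = Fraction(sum(y0 for y0, _ in population), n)
    mean_impact = Fraction(sum(gain for _, gain in population), n)
    # IV: intent-to-treat effect divided by the take-up rate.
    itt = Fraction(sum(gain for _, gain in takers), n)
    iv = itt / Fraction(len(takers), n)
    # Naive OLS: mean outcome of the treated minus mean outcome of controls.
    mean_y_treated = Fraction(sum(y0 + gain for y0, gain in takers), len(takers))
    ols = mean_y_treated - mean_y0
    return mean_impact, iv, ols

# Hypothetical: gainers start at $2, non-gainers at $1.
same_direction = [(2, 1)] * 50 + [(1, 0)] * 50
mean_impact, iv, ols = large_sample_estimates(same_direction)
print(mean_impact, iv, ols)  # 1/2 1 3/2
```

In this invented population the true mean impact is $0.50, OLS gives $1.50 and IV gives $1: both overstate the mean impact, but IV halves the OLS bias.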

But here’s the rub. There is no *a priori* reason to expect the two sources of bias to work in the same direction. That depends on the type of program and behavioral responses to the program. If the latent factors leading to higher returns to treatment are associated with __lower__ outcomes in the absence of the intervention then the “IV cure” can be worse than the disease. The following are examples (which are described more fully here):

- A training program that provides skills that substitute for latent ability differences, so that the program attenuates the gains from higher ability.
- A public insurance scheme that compensates participants for losses stemming from some unobserved risky behavior on their part.
- A microfinance scheme that provides extra credit for some target group, such that participation attenuates the gains enjoyed by those with greater access to credit from other sources.
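A hypothetical population shows how the two sources of bias can work against each other. The numbers are invented for illustration: those who gain from the program start from the lower counterfactual income. The helper name and the simplifying assumptions (arms mirror the population, only gainers participate, controls have no access) belong to this sketch, not to the paper.

```python
from fractions import Fraction

def large_sample_estimates(population):
    """Return (true mean impact, IV estimate, naive OLS estimate).

    population: list of (counterfactual outcome, gain from treatment) pairs.
    """
    n = len(population)
    takers = [(y0, gain) for y0, gain in population if gain > 0]
    mean_y0 = Fraction(sum(y0 for y0, _ in population), n)
    mean_impact = Fraction(sum(gain for _, gain in population), n)
    # IV: intent-to-treat effect divided by the take-up rate.
    itt = Fraction(sum(gain for _, gain in takers), n)
    iv = itt / Fraction(len(takers), n)
    # Naive OLS: mean outcome of the treated minus mean outcome of controls.
    mean_y_treated = Fraction(sum(y0 + gain for y0, gain in takers), len(takers))
    ols = mean_y_treated - mean_y0
    return mean_impact, iv, ols

# Hypothetical: gainers start at $1.00, non-gainers at $1.50.
opposite_direction = [(1, 1)] * 50 + [(Fraction(3, 2), 0)] * 50
mean_impact, iv, ols = large_sample_estimates(opposite_direction)
print(mean_impact, iv, ols)  # 1/2 1 3/4
```

Here the naive OLS estimate ($0.75) is closer to the true mean impact ($0.50) than the IV estimate ($1), because the selection on low counterfactual incomes partly offsets the selection on gains.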

In such cases there is no reason to presume that using randomized assignment as the IV will reduce the bias implied by the naïve OLS estimate of aggregate impact. Indeed, there is even one special case in which the OLS estimator (unlike the IV one) is unbiased for mean impact (as described here): the essential heterogeneity can be ignored, but so too can the randomized assignment! Granted, this is a “knife-edge” result. But even when both estimators are biased, it can be shown that averaging the two can reduce the bias under certain conditions.
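Both points can be illustrated with invented numbers. The two populations below are hypothetical constructions, not the cases analyzed in the paper, and the helper rests on the same simplifying assumptions as the stylized example (arms mirror the population, only gainers participate, controls have no access).

```python
from fractions import Fraction

def large_sample_estimates(population):
    """Return (true mean impact, IV estimate, naive OLS estimate).

    population: list of (counterfactual outcome, gain from treatment) pairs.
    """
    n = len(population)
    takers = [(y0, gain) for y0, gain in population if gain > 0]
    mean_y0 = Fraction(sum(y0 for y0, _ in population), n)
    mean_impact = Fraction(sum(gain for _, gain in population), n)
    itt = Fraction(sum(gain for _, gain in takers), n)
    iv = itt / Fraction(len(takers), n)
    mean_y_treated = Fraction(sum(y0 + gain for y0, gain in takers), len(takers))
    ols = mean_y_treated - mean_y0
    return mean_impact, iv, ols

# Hypothetical knife-edge: gainers start at $1, non-gainers at $2.
knife_edge = large_sample_estimates([(1, 1)] * 50 + [(2, 0)] * 50)
print(knife_edge)  # mean impact 1/2, IV 1, OLS 1/2

# Hypothetical case with biases of opposite sign: non-gainers start at $2.50.
mean_impact, iv, ols = large_sample_estimates(
    [(1, 1)] * 50 + [(Fraction(5, 2), 0)] * 50
)
average = (iv + ols) / 2
print(mean_impact, iv, ols, average)  # 1/2 1 1/4 5/8
```

In the second population IV overstates the mean impact by $0.50 and OLS understates it by $0.25, while their simple average is off by only $0.125. Whether averaging helps in practice depends on the biases having opposite signs, which is exactly what cannot be known without thinking about behavioral responses.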

I draw two main lessons from all this. First, to learn about development impact, whether or not you use randomization, there is no substitute for thinking about the likely behavioral responses amongst those offered the specific intervention in the specific context. Second, once one considers plausible (rational) behavioral responses, past claims that randomized assignment is the “gold standard” for identifying causal impact are seen to be greatly overstated; valuable maybe, but certainly not gold!

## Comments

## This relates to absolute, not relative, desirability of RCTs

relative merits of randomized treatment in the settings where it is feasible. This valuable post does not address the question of relative merits. This is because every caveat discussed above applies to every evaluation method we have: treatment effects can be heterogeneous conditional on unobservables in regression discontinuity designs, instrumental variables, propensity score matching --- always and everywhere. And concerns about treatment effect heterogeneity would be much stronger in other less rigorous research designs often employed in development work, such as ordinary least squares regressions and qualitative interviews. The last paragraph's conclusion that randomized treatment is only maybe valuable at all --- and certainly not a "gold standard" --- is therefore unwarranted. If "gold standard" means unconditional flawlessness, then no one I know ever claimed that, so the point is not useful. If "gold standard" means relatively desirable when possible, then the conclusion is unjustified because this post does not address the question of relative merits by pointing out problems with all methods. That said, this post makes a strong and helpful case that theory and related nonexperimental results are critical in properly interpreting the results of randomized trials. That is most certainly true and the helpful examples you offer should be studied by everyone in this field.
