
A rant on the external validity double double-standard

By David McKenzie

Concerns about external validity are a common critique of micro work in development, especially experimental work. While not denying that it is useful to learn what works in a variety of different settings, there seem to be two forms of double-standard (or a double double-standard) going on: first, economic journals and economists in general seem to apply it to work on developing countries more than to other forms of research; and second, this concern seems to be expressed about experiments more than about other micro work in development.

Double-standard 1: A common refrain from top journals is the following (which I received on a recent paper): “Both referees feel that the results, while strong, are unlikely to generalize to other settings”. So let’s look at the April 2011 AER. It contains, among other papers, (i) a lab experiment in which University of Bonn students were asked to count the number of zeros in tables consisting of 150 randomly ordered zeros and ones; (ii) a paper on contracts as reference points using students of the University of Zurich or the Swiss Federal Institute of Technology Zurich; (iii) an eye-tracking experiment on consumer choices done with 41 Caltech undergraduates; (iv) a paper in which 62 students from an unnamed university were presented prospects for three sources of uncertainty with unknown probabilities; and (v) a paper on backward induction among world-class chess players. It would seem hard to argue that external validity is greater or more obvious in such studies than in the average development paper.

But this is not the point of these papers, of course. The point of these papers is to provide evidence from some particular context of a particular phenomenon or economic behavior. An important role for lab experiments in particular is proof of concept – to show under controlled settings that a particular behavior emerges. There are so many key development questions for which we have no rigorous evidence that getting strong evidence from one setting, be it rural India, several villages in Malawi, or even small Pacific Islands, is surely informative. The test here should be how much we already know on the topic – getting returns to education cleanly for Chad might not be so interesting given the plethora of existing studies (unless it is vastly different from what we observe elsewhere) – whereas providing some first evidence on a key topic on which we know much less, like (insert favorite topic here), should be appreciated, provided it is from somewhere that is useful for looking at this phenomenon.

Double-standard 2: “Oh, but what about external validity?” is also one of the most common critiques of experimental work in development. Indeed President Zoellick’s speech on development research asked “has development economics lost its way?” and suggested that research was giving something “more like a map of the world being filled in by careful study of non-randomly chosen villages, one at a time”.   Dani Rodrik, among others, argues that “Randomized evaluations are strong on internal validity, but produce results that can be contested on external validity grounds….By contrast, the standard econometric and qualitative approaches I described above are weaker on internal validity—but conditional on credible identification, they have fewer problems of external validity.”

Is it really the case that external validity is so much more of a concern for experiments than for other micro-studies? Consider some of the most cited and well-known non-experimental empirical development research papers: Robert Townsend’s Econometrica paper on risk and insurance in India has over 1200 cites in Google Scholar, and is based on 120 households in 3 villages in rural India; Mark Rosenzweig and Oded Stark’s JPE paper on migration and marriage is based on the same Indian ICRISAT sample; Tim Conley and Chris Udry’s AER paper on technology adoption and pineapples is based on 180 households from 3 villages in southern Ghana; on a somewhat larger scale, Shankar Subramanian and Angus Deaton’s work on the demand for calories comes from 5630 households from one state in India in 1983.

The point is that we have learnt a lot from such studies. Clearly none of them by themselves can prove that their results would apply to other settings, and indeed some of our understanding of how things work has come from attempts to build on these initial studies by considering them in other settings and seeing whether the results are similar or different.

Clearly then the fact that a study is small in scale and limited to one location does not per se prevent it from being able to teach us a lot. So complaining that we are learning very little from randomized experiments on these grounds strikes me as ideology.

So how, and when, should we use an external validity standard in judging the merit of impact evaluations and other micro work in development? I’m sure we can have active debate on this issue, but here are some thoughts:

- I do want to see a researcher set their study in context, and explain why this is a useful place to look at the issue being studied. So for example, studies which focus on clients of a particular microfinance organization or participants in a particular NGO program need to tell me why I should care about this group, and why looking at this subpopulation is interesting. Often the context will be useful for understanding one question, but not another – e.g. a sample from Tonga I would argue is useful for looking at migration issues given the importance of migration to small countries, but might be less useful for looking at the impacts of SME policies given the small market size and small number of firms in this country. A sample of college graduates might not be the right population for testing whether basic financial literacy is a key constraint to savings, but could well be a useful sample for other studies, etc.

- I agree with Rodrik, Deaton, and others who stress the importance of trying to understand why a particular intervention works or not. This is a tough issue to address, as discussed in an earlier post, but a combination of theory, process evaluation, qualitative interviews, and analysis of the causal chain can allow progress.

- I’m excited that so many experiments are being conducted to teach us about particular theories of economic behavior (e.g. how do adverse selection and moral hazard work in credit markets, how does time inconsistency affect decision-making, how does intra-household decision-making work, etc.) – but I think the double double-standard leads to undervaluation of simple treatment effects for important policy programs. Policymakers care if we can say “such and such a program to create jobs led to 2000 jobs being created in country X” – then all the issues of generalizability apply when deciding whether the same could be expected to hold in country Y, but for a new or untested program for which we have no rigorous evidence, proof that it worked somewhere is an important, and undervalued, starting point.

My bottom line is that I think experimental studies should be judged by the same standard we should be applying to other work, namely: “does this credibly teach us something about either how people behave in an economic context, or about how a particular policy performs in an interesting context?” Whether or not the result then generalizes to other settings is part of the ongoing research agenda, and should not, by itself, be the criterion on which we judge a paper.

Here ends the rant… I look forward to hearing what readers think on this issue.

Comments

Submitted by Anonymous on
Great post. Nothing to add, you said it all so well, but it gave me good firepower for the next time I am subjected to these arguments.

Do you have any thoughts about the relationship of validity with inference? In discussions such as these, I often find these terms used interchangeably. If we say that experiments have strong internal validity, are we saying that the statistical inferences are sound, or are we referring to the causal implications of the statistical inferences (or both)? Conversely, can one have external validity if the statistical inferences of the study are not justified? For most observational research I encounter, there is rarely a well defined population from which the data were sampled. Rather, if pressed, I think researchers would either claim a "super population" (e.g. of all possible countries on all possible Earths) or posit a data generating process (that for some unknown reason is additive with Normal errors). I find these justifications thin. What then can we say about the external validity of the study if we do not have a population to which we can target our inferences?

Submitted by Berk Ozler on
I'll be posting a follow-up to David's piece today, in which you may find related arguments and references...

I made this comment over at Aid Thoughts, from which I learned about this post (have yet to update my priors on which blogs to check regularly). Aid Thoughts notes that the studies that are not routinely critiqued over lack of external validity rarely lead to stories in the NYT about the end of poverty. My response: With no snark intended, where are the links to those NYT stories heralding the results of a micro-experiment as the end of poverty? I recognize, and appreciate, the need for caution in interpreting the results of experiments. I am personally prone to the intellectual sin of changing some of my priors too quickly. But in all of my experience with the randomistas, including David McKenzie, they have been the most cautious in limiting the interpretation of their results. In fact, I am befuddled by the dual critique that seems to be going on: on the one hand, that the conclusion of every experiment is "more experiments are necessary", and on the other, that the studies lack external validity. Both cannot be truly valid in a meaningful way.

Thanks Tim. The Aid Thoughts link is: http://aidthoughts.org/?p=2466 I think the general point is right that rhetoric can get carried away, leading people to treat opposing viewpoints as straw men - but just because some people or news organizations get carried away is no reason for the rest of us to disregard nuance.

Rant it may be called, but I feel the case has been brilliantly argued. However (there is always one, is there not?), as a practitioner with very, very shallow pockets, there is always the temptation to stretch the learning to other settings. For instance, I am always looking for means of extending findings from a study on promotion of economic leadership amongst women in Ethiopia to a setting in Tanzania. The frustration comes when either the methods chosen or the assumptions in the theory of change are very specific to micro scenarios. These necessitate additional study, which implementers can rarely afford. Also, very often the learning that a practitioner is looking for is more about the strategies and processes that worked (or did not), and tight micro studies do not lend themselves easily to delivering that.

Thanks Makarand, I agree we always want more. This is where the other elements of a successful evaluation play a role too - the process evaluation and clear description of what was done; the insights into the context in which it was done and what in that context was thought to be important; and the stuttering steps we all try to take towards understanding why something worked. One of the insights from behavioral experiments (and of course from decades of marketing research) is that the strategies and processes for how an intervention gets sold can matter as much as, or more than, the intervention or product itself - so I look forward to more discussions on this blog about these issues as well.

It's striking that little of this discussion references the actual negatives of RCT-based approaches; critics seem to favor nebulous appeals to a lack of external validity over anything concrete. But there are definitely downsides. Most of the ones that come to mind are in the extended family of placebo effects. While the placebo effect is often used to describe an actual physical response in the brain (producing natural opiates when a fake pain medication is administered), most of its close cousins are purely psychological or sociological. Hawthorne effects and John Henry effects seem the most relevant here (R. Barker Bausell has a great book detailing a bunch more in the context of medical research). It's hard to eliminate them in development experiments because a credible placebo is usually inconceivable. David, would you consider those to be threats to external validity? To me they seem like they'd mainly impact internal validity, but then it's hard to nail down exactly what is meant by experiments lacking external validity. There's no reason to think they'd be any less generalizable than an observational study with a comparable sample, as you point out.

Thanks Jason for this comment. Placebo effects arise in some, but not all, interventions. And we don't always worry about them - e.g. if we are investigating whether a training program helps get women into employment, we may not mind if the effect is through skills, or through some sort of placebo effect in which simply taking part in useless training gives you the confidence to approach employers. Of course we'd like to know so we can make the training better, but this type of placebo effect can in some sense be considered part of the intervention, so be ok for both internal and external validity. In many economic experiments, people don't know they are in an experiment, which also helps with not having to worry as much about these effects. I think this topic though deserves more discussion in a future post.

Submitted by Margarita Rayzberg on
This is a really interesting post. In the case of the first double-standard - that economic journals and economists seem to apply it to work on developing countries more than they do to other forms of research - and if external validity is not more of a concern for experiments than for other micro-studies, what is your sense for why this might be the case? Why more scrutiny (vis-a-vis external validity) for research in developing countries than for research in Bonn laboratories?