Published on Development Impact

A rant on the external validity double double-standard

Concerns about external validity are a common critique of micro work in development, especially experimental work. While not denying that it is useful to learn what works in a variety of different settings, there seem to be two forms of double-standard (or a double double-standard) going on: first, economics journals and economists in general seem to apply it to work on developing countries more than they do to other forms of research; and second, this concern seems to be expressed about experiments more than other micro work in development.

Double-standard 1: A common refrain from top journals is the following (which I received on a recent paper): “Both referees feel that the results, while strong, are unlikely to generalize to other settings”. So let’s look at the April 2011 AER. It contains, among other papers, (i) a lab experiment in which University of Bonn students were asked to count the number of zeros in tables that consisted of 150 randomly ordered zeros and ones; (ii) a paper on contracts as reference points using students of the University of Zurich or the Swiss Federal Institute of Technology Zurich; (iii) an eye-tracking experiment on consumer choices conducted with 41 Caltech undergraduates; (iv) a paper in which 62 students from an unnamed university were presented prospects for three sources of uncertainty with unknown probabilities; and (v) a paper on backward induction among world-class chess players. It would seem hard to argue that external validity is greater or more obvious in such studies than in the average development paper.

But this is not the point of these papers, of course. The point of these papers is to provide evidence from some particular context of a particular phenomenon or economic behavior. An important role for lab experiments in particular is proof of concept – to show under controlled settings that a particular behavior emerges. There are so many key development questions for which we have no rigorous evidence that getting strong evidence from one setting, be it rural India, several villages in Malawi, or even small Pacific Islands, is surely informative. The test here should be how much we already know on the topic – getting returns to education cleanly for Chad might not be so interesting given the plethora of existing studies (unless it is vastly different from what we observe elsewhere) – whereas providing some first evidence on a key topic on which we know much less, like (insert favorite topic here), should be appreciated, provided it comes from somewhere that is useful for looking at this phenomenon.

Double-standard 2: “Oh, but what about external validity?” is also one of the most common critiques of experimental work in development. Indeed President Zoellick’s speech on development research asked “has development economics lost its way?” and suggested that research was giving something “more like a map of the world being filled in by careful study of non-randomly chosen villages, one at a time”.   Dani Rodrik, among others, argues that “Randomized evaluations are strong on internal validity, but produce results that can be contested on external validity grounds….By contrast, the standard econometric and qualitative approaches I described above are weaker on internal validity—but conditional on credible identification, they have fewer problems of external validity.”

Is it really the case that external validity is so much more of a concern for experiments than for other micro-studies? Consider some of the most cited and well-known non-experimental empirical development research papers: Robert Townsend’s Econometrica paper on risk and insurance in India has over 1200 cites in Google Scholar, and is based on 120 households in 3 villages in rural India; Mark Rosenzweig and Oded Stark’s JPE paper on migration and marriage is based on the same Indian ICRISAT sample; Tim Conley and Chris Udry’s AER paper on technology adoption and pineapples is based on 180 households from 3 villages in southern Ghana; on a somewhat larger scale, Shankar Subramanian and Angus Deaton’s work on the demand for calories comes from 5630 households from one state in India in 1983.

The point is that we have learnt a lot from such studies. Clearly none of them by itself can prove that its results would apply to other settings, and indeed some of our understanding of how things work has come from attempts to build on these initial studies by considering them in other settings and seeing whether the results are similar or different.

Clearly then the fact that a study is small in scale and limited to one location does not per se prevent it from being able to teach us a lot. So complaining that we are learning very little from randomized experiments on these grounds strikes me as ideology.

So how, and when, should we use an external validity standard in judging the merit of impact evaluations and other micro work in development? I’m sure we can have active debate on this issue, but here are some thoughts:

· I do want to see a researcher set their study in context, and explain why this is a useful place to look at the issue being studied. So for example, studies which focus on clients of a particular microfinance organization or participants in a particular NGO program need to tell me why I should care about this group, and why looking at this subpopulation is interesting. Often the context will be useful for understanding one question, but not another – e.g. a sample from Tonga I would argue is useful for looking at migration issues given the importance of migration to small countries, but might be less useful for looking at the impacts of SME policies given the small market size and small number of firms in this country. A sample of college graduates might not be the right population for testing whether basic financial literacy is a key constraint to savings, but could well be a useful sample for other studies, etc.

· I agree with Rodrik, Deaton, and others who stress the importance of trying to understand why a particular intervention works or not. This is a tough issue to address, as discussed in an earlier post, but a combination of theory, process evaluation, qualitative interviews, and analysis of the causal chain can allow progress.

· I’m excited that so many experiments are being conducted to teach us about particular theories of economic behavior (e.g. how do adverse selection and moral hazard work in credit markets, how does time inconsistency affect decision-making, how does intra-household decision-making work, etc.) – but I think the double double-standard leads to undervaluation of simple treatment effects for important policy programs. Policymakers care if we can say “such and such a program to create jobs led to 2000 jobs being created in country X” – then all the issues of generalizability apply when deciding whether the same could be expected to hold in country Y, but for a new or untested program on which we have no rigorous evidence, proof that it worked somewhere is an important, and undervalued, starting point.

My bottom line is that I think experimental studies should be judged by the same standards as we should be applying to other work, namely “does this credibly teach us something about either how people behave in an economic context, or about how a particular policy performs in an interesting context?”. Whether or not the result then generalizes to other settings is part of the ongoing research agenda, and should not be, by itself, the sole criterion on which we judge a paper.

Here ends the rant… I look forward to hearing what readers think on this issue.


David McKenzie

Lead Economist, Development Research Group, World Bank
