
WEIRD samples and external validity

Jed Friedman

A core concern for any impact evaluation is the degree to which its findings can be generalized to other settings and contexts, i.e., its “external validity”. But of course external validity concerns are not unique to economic policy evaluation; in fact they are present (implicitly or explicitly) in any empirical research with prescriptive implications. This point is driven home in a very entertaining paper in the journal Behavioral and Brain Sciences by Henrich, Heine, and Norenzayan, which comprehensively reviews the behavioral sciences literature and concludes that behavioral studies are overwhelmingly based on samples from Western, Educated, Industrialized, Rich, and Democratic (WEIRD) societies. Has behavioral science (including the behavioral sub-fields in economics) focused too heavily on WEIRD subjects, to the detriment of a broader understanding of human behavior?

Their review of recent articles published in the top psychology journals found that 68% of subjects came from the US, many of them easily recruited undergraduate students, and a whopping 96% of subjects were from developed Western countries. So the tendency to study WEIRD samples is undeniable. Yet while the study samples have been consistently particular, the inferences made from these studies typically strive for universality.

The Henrich, Heine, and Norenzayan paper is not just a finger-wagging “aha-gotcha” review – it takes pains not only to indicate which received wisdom in the behavioral sciences is challenged by more diverse samples but also to discuss findings that do indeed appear to be universal. The authors look across a variety of psychological and behavioral domains of study, including visual perception and spatial cognition (which seem to vary systematically around the world). I’ll mention their comparative findings from behavioral game theoretic studies such as the Ultimatum Game (where one player proposes a division of resources and the second player chooses to accept or reject the offer; if the offer is rejected, neither player receives a payoff) and the Dictator Game (where the second player does not have the agency to accept or reject).
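
For readers who have not come across these games, the payoff rules just described can be written down in a few lines. This is only a minimal sketch of the rules as stated above, not code from the paper or from any of the studies it reviews; the names `Payoffs`, `ultimatum_game`, and `dictator_game` are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Payoffs:
    """What each player walks away with (hypothetical helper type)."""
    proposer: float
    responder: float


def ultimatum_game(stake: float, offer: float, accepted: bool) -> Payoffs:
    """The proposer offers `offer` out of `stake` to the second player.

    If the second player rejects, neither player receives anything.
    """
    if not 0 <= offer <= stake:
        raise ValueError("offer must lie between 0 and the stake")
    if not accepted:
        return Payoffs(proposer=0.0, responder=0.0)
    return Payoffs(proposer=stake - offer, responder=offer)


def dictator_game(stake: float, offer: float) -> Payoffs:
    """Same division, but the second player cannot reject the offer."""
    return ultimatum_game(stake, offer, accepted=True)
```

For example, `ultimatum_game(10, 4, accepted=False)` leaves both players with nothing, while `dictator_game(10, 0)` lets the first player keep the whole stake. The only difference between the two games is whether the second player can veto the split, which is why a costly rejection in the Ultimatum Game is read as punishment.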

In early experimental results, the tendency for the proposed division to come close to 50% (especially when reputational concerns were added by repeated play) was taken as evidence for human beings’ evolved capacity for fair and punishing behavior. However, it turns out that when the same games are played in many small-scale societies, it is the results from the WEIRD societies that are the outliers. The mean offer from participants in the U.S. is higher than that of 14 other groups and nearly double the amount offered by the Hadza, foragers in Tanzania, and the Tsimane, foragers in the Bolivian Amazon. Meta-analysis of a wide array of data indicates that both a population’s degree of market integration and its participation in a world religion independently predict higher offers. Rather than conclude that the tendency for cooperative exchange evolved early in human history, the authors look more favorably on the view that “norms and institutions for exchange in ephemeral interactions culturally co-evolved with markets and expanding large-scale sedentary populations.”

Interestingly, there is also some evidence that certain groups around the world exhibit a willingness to reject offers that are too high, and the tendency to reject increases as the offer goes from 60% to 100% of the stake. While US samples have never exhibited this tendency, it is found in populations in Russia and China, and among non-student adults in Germany, Sweden, and the Netherlands. A similar phenomenon has been observed in comparisons of Public Good Games administered in various settings. In WEIRD samples, adding the possibility of punishing free-riding players shifts the outcome of these games from an equilibrium of little cooperation to one of stable high levels of cooperation. In non-WEIRD samples there is the same likelihood to punish free riders, but also an additional tendency to punish the overly cooperative, i.e., those who contributed more than the punisher deemed fair. This tendency, not observed in northern and western European populations, was observed in samples from Oman, Greece, Saudi Arabia, and Russia.

Beyond game theory, there are clear divergences between WEIRD samples and much of the world when it comes to views of self, the basis of moral reasoning, and so on. The authors conclude their extensive review with the judgment:

“The sample of contemporary Western undergraduates that so overwhelms our database is not just an extraordinarily restricted sample of humanity; it is frequently a distinct outlier vis-à-vis other global samples. It may represent the worst population on which to base our understanding of Homo sapiens. Behavioral scientists face a choice – they can either acknowledge that their findings in many domains cannot be generalized beyond this unusual subpopulation (and leave it at that), or they can begin to take the difficult steps to building a broader, richer, and better-grounded understanding of our species.”

Of course, a meta-review of current impact evaluation work in development economics will reveal that WEIRD samples are far from the norm. Indeed, the impact evaluation coverage is fairly global, with influential work conducted on almost every continent. However, I wonder if future reviews will point to an analogous bias of some sort. Perhaps, as often mentioned, methodological preferences focus efforts on an overly restrictive set of research questions. Or perhaps the increased attention that an evaluated intervention receives from the implementer biases the results, so that they won’t reflect the outcomes achieved under more typical circumstances. There is some merit to both these concerns, although I also feel the field continues to broaden and improve. We’ve engaged these concerns before on the DI blog and will keep discussing the weird and the regular in impact evaluation.