External validity as seen from other quantitative social sciences - and the gaps in our practice
First, a bit of conceptual framework. Let’s think about the assessment of external validity as an assessment of how well the impact estimate (based on the trial sample) predicts the impact of the same intervention implemented at scale in the target population. In short, we care about Δ, which we define as the difference between the average treatment effect estimated in the trial and the average treatment effect that would be obtained in the population. If there is no difference, that is, if the impact evaluation is fully externally valid, then Δ = 0. Yet Δ can be non-zero due to differences in characteristics between the sample and the population, if such characteristics also mediate the treatment impact. Δ may also be non-zero due to differences in program implementation between the small-scale and at-scale versions. More formally, we can decompose Δ into its constituent components:
Δ = Δxo + Δxu + Δio + Δiu + interaction terms
Here Δxo and Δxu are the components of Δ due to differences in observed and unobserved characteristics between the trial sample and the target population, and Δio and Δiu are the components due to differences in observed and unobserved implementation factors between the trial and the at-scale program. For any one impact evaluation to be readily generalizable, we would hope for all of these delta terms to be zero or close to it. But what if they aren’t?
A paper by Olsen, Orr, Bell, and Stuart doesn’t attempt to answer this question directly, but does seek to clarify the conditions under which Δ will be non-zero [3]. They model external validity bias as a product of three factors:
- the degree of variance in impact across sites in the population of interest
- the coefficient of variation in the inclusion probability of sites across the population of interest
- the correlation between site-specific impacts and site inclusion probabilities in the population
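To see why these three factors multiply, here is a minimal sketch in Python. It rests on assumptions of mine rather than on the paper's exact setup: I take the expected estimate from the sampled sites to be the inclusion-probability-weighted average of site impacts, and I read "degree of variance" as the standard deviation of impacts across sites. The function names and the four-site numbers are hypothetical.

```python
from statistics import fmean, pstdev


def bias_direct(impacts, inclusion_probs):
    """Expected sample impact minus the population average impact.

    Assumption: the expected estimate from the sampled sites is the
    inclusion-probability-weighted mean of the site impacts.
    """
    weighted = sum(p * d for p, d in zip(inclusion_probs, impacts)) / sum(inclusion_probs)
    return weighted - fmean(impacts)


def bias_three_factor(impacts, inclusion_probs):
    """The same bias written as a product of three factors."""
    sd_impact = pstdev(impacts)        # spread of impacts across sites
    sd_prob = pstdev(inclusion_probs)  # spread of inclusion probabilities
    if sd_impact == 0 or sd_prob == 0:
        return 0.0  # no variation in either quantity means no bias
    cv_prob = sd_prob / fmean(inclusion_probs)  # coefficient of variation
    mean_d, mean_p = fmean(impacts), fmean(inclusion_probs)
    cov = fmean([(d - mean_d) * (p - mean_p) for d, p in zip(impacts, inclusion_probs)])
    corr = cov / (sd_impact * sd_prob)  # impact/inclusion correlation
    return sd_impact * cv_prob * corr


# Hypothetical population of four sites: sites with larger impacts
# are also more likely to end up in the trial.
site_impacts = [0.10, 0.20, 0.30, 0.40]
site_inclusion_probs = [0.1, 0.2, 0.3, 0.4]

print(bias_direct(site_impacts, site_inclusion_probs))
print(bias_three_factor(site_impacts, site_inclusion_probs))
```

Under these assumptions both calculations come to about 0.05, and the bias disappears if any one factor is zero: identical impacts everywhere, equal inclusion probabilities, or no correlation between the two.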
A recent paper by Cole and Stuart explores techniques to extrapolate trial estimates to a larger population [4]. Specifically, the two researchers standardize the observed results of one of the first ARV trials in the U.S. to a broader specified target population: all HIV positive U.S. residents. The original trial found clear mortality reductions from ARV therapy, but the 1150 patients in this RCT were on the whole older, whiter, and better educated than the overall population of HIV positive individuals.
To adjust the trial results so that the impact reflects the broader population, the researchers require values in this target population for the key characteristics that mediate the treatment effect and that also vary between the study and target population. In this study, the characteristics considered are sex, race, and age. The conditional probabilities of selection into the trial sample are then estimated as a function of these characteristics, and used to reweight the estimated trial effect in order to reflect the target population. This process still finds overall mortality reductions, but the effect is 12% lower than that found in the trial.
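The reweighting logic can be illustrated with a toy example. This is only a sketch under strong simplifying assumptions, not the authors' procedure: it uses a single hypothetical characteristic (an age stratum) instead of sex, race, and age, it estimates the selection probability in each stratum as trial count divided by population count, and every number is made up for illustration.

```python
from collections import Counter


def selection_weights(trial_records, population_counts):
    """Inverse probability of selection for each stratum.

    The selection probability is estimated as n_trial / N_population,
    so the weight is N_population / n_trial.
    """
    trial_counts = Counter(stratum for stratum, _, _ in trial_records)
    return {s: population_counts[s] / n for s, n in trial_counts.items()}


def weighted_risk_difference(trial_records, weights):
    """Weighted mortality risk in the treated arm minus the control arm."""
    totals = {True: [0.0, 0.0], False: [0.0, 0.0]}  # [weighted deaths, weight sum]
    for stratum, treated, died in trial_records:
        w = weights[stratum]
        totals[treated][0] += w * died
        totals[treated][1] += w
    return totals[True][0] / totals[True][1] - totals[False][0] / totals[False][1]


def make_records(stratum, treated, n, deaths):
    """Build n hypothetical patient records, the first `deaths` of whom died."""
    return [(stratum, treated, 1 if i < deaths else 0) for i in range(n)]


# Made-up trial in which older patients are over-represented.
trial = (
    make_records("old", True, 10, 2) + make_records("old", False, 10, 6)
    + make_records("young", True, 5, 1) + make_records("young", False, 5, 2)
)
population_counts = {"old": 1000, "young": 2000}  # hypothetical target population

naive = weighted_risk_difference(trial, {"old": 1.0, "young": 1.0})
reweighted = weighted_risk_difference(trial, selection_weights(trial, population_counts))
print(naive, reweighted)
```

In this toy data the treatment helps older patients more, and older patients are over-represented in the trial, so the reweighted risk reduction is smaller than the unweighted one. With several characteristics, the selection probabilities would come from a model rather than from simple stratum counts.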
Similarly, a paper by Stuart, Cole, Bradshaw, and Leaf uses propensity scores to quantify the difference between trial participants and the target population [5]. Once this is done, the scores can be used to either match or weight control group outcomes to the population, in order to assess how well these control group outcomes track the outcomes actually observed in the population. Their example is a school-based intervention in the U.S.
To apply this approach there must be sufficient overlap in the propensity scores for the sample and target population: no method can help extrapolate trial results to a segment of the population if that segment is not observed at all in the trial. One measure of similarity between sample and population is simply the difference in the mean propensity score. While there is no magic threshold, differences of 0.25 standard deviations in mean propensity scores suggest that a large amount of extrapolation, perhaps unsubstantiated by the data, would be necessary.
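The similarity measure described above is easy to compute once propensity scores are in hand. The sketch below assumes the scores have already been estimated and, as a choice of mine that the discussion above does not specify, scales the difference in means by the standard deviation of the scores pooled across sample and population. The function name and the scores are hypothetical.

```python
from statistics import fmean, pstdev


def standardized_ps_difference(sample_scores, population_scores):
    """Difference in mean propensity score, in standard deviation units.

    Assumption: the standard deviation is taken over the pooled scores
    of the sample and the population.
    """
    pooled = list(sample_scores) + list(population_scores)
    return (fmean(sample_scores) - fmean(population_scores)) / pstdev(pooled)


# Made-up propensity scores (probability of being in the trial).
trial_scores = [0.30, 0.40, 0.50]
target_scores = [0.10, 0.20, 0.30, 0.40]

difference = standardized_ps_difference(trial_scores, target_scores)
print(difference)
print(abs(difference) > 0.25)  # compare with the rule of thumb
```

A value well above 0.25, as in this made-up example, would signal that the trial sample and the target population differ enough for any extrapolation to lean heavily on modeling assumptions.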
Note that all of these papers discuss the importance of the study sample being sufficiently similar to the target population in order to ensure externally valid estimates. If this is not the case, then perhaps these impact estimates can be corrected through re-weighting. As such, these are all attempts to extrapolate impact estimates to a population when Δxo is non-zero. This speaks to the importance for impact evaluations of comprehensively measuring key mediating characteristics that may also vary with respect to the target population. Doing so in a comprehensive fashion also reduces the potential importance of the Δxu term, as fewer factors will be unobserved.
But what about the implementation factors, i.e. the Δio term? I couldn’t yet find anything that deals with implementation differences in a quantitative fashion. This is not for lack of recognition of the problem. For example, the Stuart et al. paper, which evaluates a school program, mentions that relatively little is known about the school-level moderators of the program impact. Some possible key moderators they mention include the school’s organizational capacity to implement the program, the principal’s support for the program, and the institutional motivations for participating in the program. However, factors such as these were not assessed, and in fact it’s not clear how best to measure them.
It is fairly clear however that, with respect to implementation factors, the external validity work is severely underdeveloped – for most studies there is no discussion of which implementation aspects matter the most and, from among these, which of them can be accurately measured. This is a major gap in our evaluative practice and a ripe area for future work.
So, some takeaway messages from this review, as well as from the handful of papers in econ that explore this topic:
- To estimate possible bias from Δxo, follow an approach similar to explorations of internal validity bias by comparing characteristics of study sites to those of the target population
- Where divergences exist, perhaps extrapolations can be improved by re-weighting
- Ensure a sufficient number of sites whenever possible, and either select them randomly or, if you must select purposively, do so with the goal of broader representation
- Devote additional effort and resources to recruiting sites that initially resist inclusion
- Think hard about factors of implementation that vary between the trial and at-scale versions of the intervention, and develop suitable quantitative measures where possible