Well, I’m writing this on Election Day evening here in the U.S., and am rather consumed by the events at hand. Like many of our readers, I have been following various poll aggregators (here, here, and here) and am now waiting to see whether their predictions (all consistent with each other) will be borne out in the results. (Update: They certainly were – score one for survey-data-driven analysis.)
So, while waiting for the results to trickle in, I’ll discuss an analytic technique that (a) can help with the external validity of an impact estimate and (b) is very similar to what political polls must do if they wish to infer “likely” voting patterns from their sample of respondents. Pollsters rescale responses with weights that reflect their hypotheses on the characteristics of the electorate. In impact evaluations, we can sometimes do the same thing when we are worried about external validity. With some (possibly very strong) assumptions, we are able to transform results from an impact evaluation on a “selected sample” into more externally valid results concerning a wider population.
Why do we care about this topic? Unfortunately, quite often the characteristics of participants in a randomized trial diverge from those of the target population of study (often the general population). This can happen for a variety of reasons. Perhaps the study is confined to a region that differs in certain ways from the country as a whole. Or perhaps individuals with particular characteristics disproportionately take up the evaluated program.
Imai, King, and Stuart, as well as many other researchers, decompose possible bias in impact estimates into two types of selection bias: treatment assignment bias, also known as selection on observables, and sample selection bias, or selection on unobservables. Selection on unobservables is always the tricky thorn in the side of causal estimates. One reason we like properly designed and implemented RCTs is that bias from selection on unobservables is often minimized or even completely eradicated. We have more analytic options to deal with selection on observables.
So how can results estimated off of an observably selected sample be generalized to a wider population? Well, if we are willing to accept certain assumptions, then we can use the magic of re-weighting. Stuart, Cole, Bradshaw, and Leaf discuss an extension of this long-standing method in a recent paper. The authors propose the use of propensity score methods first to quantify the differences between study subjects and target populations in a flexible manner, and then to weight the observed outcomes to the population in order to assess the generalizability of the results to the broader population.
For the methods of Stuart et al. to be valid, we have to assume that there is no selection into program participation on the basis of unobservables. This is an assumption that may or may not be a strong one, depending on the setting and the inclination of the researcher.
If we are comfortable with this assumption, there are essentially two ways of generalizing the average treatment effect, estimated on the sample, to a Population Average Treatment Effect (what we want to know).
- Post-stratification, which re-weights impact estimates based on population distributions. For example, consider the case of a randomized trial that has 20% males and 80% females but the population distribution is 50/50. Post-stratification first estimates effects separately by gender and then averages the two effects with population weights. When there are only a small number of variables this can be quite effective, but it clearly becomes exceedingly complex if we are concerned with a large number of characteristics.
- Inverse probability of treatment weighting (IPTW), which utilizes the propensity score. How is this done? First, a propensity score is estimated that models membership in the study sample in relation to the wider population. By comparing the propensity scores of the sample vs. the population, we can identify when a sample diverges too much from the population of interest to yield a reliable generalization. The difference in propensity score means is one metric for suitability of comparison. How do we know when a difference is large enough to prohibit generalization? There is no iron-clad criterion, although Ho et al. recommend a threshold of 0.25 standard deviations in the propensity score. If the difference is greater than this threshold, then results will depend on a large amount of extrapolation.
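To make the first bullet concrete, here is a minimal Python sketch of post-stratification using the 20/80 versus 50/50 gender example. The stratum-level effects (2.0 for males, 4.0 for females) are made-up numbers for illustration only, and the function name `post_stratify` is my own, not something from the papers discussed here.

```python
def post_stratify(stratum_effects, shares):
    """Average stratum-level effects, weighting each stratum by its share."""
    if set(stratum_effects) != set(shares):
        raise ValueError("effects and shares must cover the same strata")
    total = sum(shares.values())  # normalize in case shares do not sum to 1
    return sum(stratum_effects[s] * shares[s] / total for s in stratum_effects)


# Hypothetical treatment effects estimated separately by gender.
effects = {"male": 2.0, "female": 4.0}

# Composition of the trial sample vs. the target population (from the example).
sample_shares = {"male": 0.2, "female": 0.8}
population_shares = {"male": 0.5, "female": 0.5}

# The sample average effect leans toward the over-represented group.
sample_ate = post_stratify(effects, sample_shares)

# The post-stratified estimate re-weights the same effects to the population.
population_ate = post_stratify(effects, population_shares)

print(round(sample_ate, 3), round(population_ate, 3))
```

With these invented effects the sample average sits near the female effect, while the population-weighted average falls halfway between the two. Each additional characteristic multiplies the number of strata, which is why the approach gets unwieldy quickly.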
Once estimated, the propensity score can then be used to weight each individual observation to transform the study sample into a “pseudo-population” with characteristics similar to those of the target population.
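The two propensity score steps above, checking how far the sample sits from the population and then re-weighting, can be sketched roughly as follows. This is only an illustration under stated assumptions: the propensity scores are taken as already estimated (for instance from a logistic regression of sample membership on covariates), the difference in means is standardized by the standard deviation of the scores pooled across sample and population, and each sample unit is weighted by the inverse of its estimated probability of being in the sample. All numbers and function names are hypothetical.

```python
from statistics import mean, pstdev


def standardized_ps_difference(ps_sample, ps_population):
    """Difference in mean propensity scores, in pooled standard deviation units."""
    pooled = list(ps_sample) + list(ps_population)
    return (mean(ps_sample) - mean(ps_population)) / pstdev(pooled)


def membership_weights(ps_sample):
    """Inverse-probability weights: units unlikely to be sampled count for more."""
    if any(p <= 0 or p > 1 for p in ps_sample):
        raise ValueError("propensity scores must lie in (0, 1]")
    return [1.0 / p for p in ps_sample]


def weighted_mean(values, weights):
    """Mean outcome in the re-weighted 'pseudo-population'."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical estimated probabilities of being in the study sample.
ps_sample = [0.5, 0.25, 0.2]
ps_population = [0.1, 0.2, 0.25, 0.15]

diff = standardized_ps_difference(ps_sample, ps_population)
needs_caution = abs(diff) > 0.25  # rule-of-thumb threshold mentioned above

# Hypothetical outcomes for the three sample units.
outcomes = [10.0, 20.0, 30.0]
weights = membership_weights(ps_sample)
reweighted_outcome = weighted_mean(outcomes, weights)

print(round(diff, 2), needs_caution, round(reweighted_outcome, 2))
```

In an actual evaluation the weighted mean would be computed separately for treatment and control units, and the difference between the two would serve as the population-level effect estimate.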
The authors explore these concepts with an evaluation of a school-based intervention in the U.S. state of Maryland that aims to improve school climate and promote positive change in staff and student behaviors. The evaluation of this program involved 37 schools randomly assigned to treatment or control. The authors match these study schools to data on all schools in Maryland and see that the study schools tend to be somewhat different – they have slightly lower test scores, more poor students, and higher rates of student suspension. The average difference in propensity scores between the study sample and the population is 0.73 SDs, a substantial difference that exceeds the threshold suggested above.
In spite of this large difference, the reweighted group outcomes of study controls track the population values over the three-year study period across a number of dimensions such as test scores and suspensions. This suggests that, when weighted appropriately, the schools in the trial can help policy makers learn about expected population effects across the whole state. Thus we should have more confidence that the estimated gains from the intervention would likely continue if the program were to be scaled up statewide.
Getting back to political polling of voting intentions – one topic I’d like to learn a lot more about is the potential promise of internet-based polling vis-à-vis the more standard phone-based polling. I imagine that reweighting is used extensively with internet-based polling and I am curious if the “no selection on unobservables” assumption is valid for the selected group of internet users. If anybody knows of good papers out there please let us know. Oh wait. Our commenting functions are down due to inundations by spammers, so hold that thought. We’ll return to it before the next major election.