Following on David’s rant on external validity yesterday, which turned out to be quite popular, I decided to keep the thread going. Despite the fact that the debate is painted in ‘either/or’ terms, my feeling is that there are things that careful researchers/evaluators can do to improve the external validity of their studies.
Duflo, Glennerster, and Kremer say in this 2006 toolkit on randomization that “internal validity is necessary for external validity, but it is not sufficient.” Nancy Cartwright (not the one who is the voice of Bart Simpson), in this paper questioning whether RCTs are the gold standard, sees the two more as a trade-off. She says:
“Other methods, less rigorous at the front end, on internal validity, can have far better warrant at the back end, on external validity. We must be careful about the trade-offs. There is no a priori reason to favour a method that is rigorous part of the way and very iffy thereafter over one that reverses the order or one that is less rigorous but fairly well reasoned throughout.”
So, which is it? While the following statements are likely not entirely fair to either side (of the debate on RCTs and validity), one gets the feeling that RCT proponents care much more about internal validity and relegate external validity to a slightly lower plane (or, to ‘later,’ as in “we can do follow-up studies ‘later’”). Conversely, the other side seems to be saying that it can live with a reasonable amount of doubt about internal validity if external validity is much better. I find myself agreeing with both statements above, which, then, at least to me, makes it strange that there is such a chasm between randomistas and their critics.
I think it is a fair statement that internal validity is necessary but not sufficient for external validity. The problem is with the definition of internal validity. I guess what happens here is that the anti-randomistas see the RCT camp as having ‘vanity of rigor’ (Cartwright’s term), implying that the randomistas will not accept much else as rigorous work with internal validity. I am not sure this kind of generalization is warranted – surely there are good and bad practitioners on either side.
However, suppose that we allow for a continuum of internal validity (and consider studies with non-random assignment) and of external validity. Then, I can see myself preferring a study that is highly externally valid for a development problem or policy in a large number of settings with solid, but not ironclad, internal validity, to one that is perfectly identified but less relevant to circumstances outside the study sample. When seen in this way, the trade-off Cartwright refers to is obvious: what separates us into the artificial camps is the difference in our preferences along this trade-off.
(There is no requirement that all development practitioners and researchers be obsessed with both types of validity. Having studied math as an undergrad, I know that I had professors who spent day and night for months and years on some problem, say on differential geometry, never caring whether anyone else could make practical use of this work: the thrill of solving that obscure problem, proving that theorem, was enough. Of course, many years later, someone may use that piece of work to explain the DNA spirals, which may lead to many practical applications. So, it’s OK – in fact it’s great – to have people obsessed with internal validity only: if we have economic theory that produces testable hypotheses, some parameters of which are unknown, it is great to design experiments (lab or field) to estimate these. It hardly matters whether the intervention itself can be immediately put to use or scaled up by the World Bank or the Gates Foundation: we need this type of work. On the other hand, as a development economist working at the World Bank, I, just like many other policy-oriented academics, have a duty to carry out studies that have high external validity right now, including on many important topics for which random assignment is impossible or simply out of the question.)
However, there are many things we can do to have internal and external validity, even though there will always be trade-offs. But, they require more thought, more work, more resources, and much more attention to the design of studies. Most importantly, they require that the researcher take an explicit stand on what choices she made when faced with these trade-offs and why. This is akin to David’s call yesterday for researchers to set their programs in context and explain to us in great detail why crucial design decisions pertinent for external validity were made one way vs. another.
In Section 8 of their toolkit on randomization, Duflo, Glennerster, and Kremer (2006) go through a useful discussion of five issues related to external validity of RCTs and suggest some ways to ameliorate these concerns. Here are a few simple suggestions I add that might improve your study design:
1. Start with a relevant question: Not much to say here: make sure it passes the ‘who cares?’ test.
2. List and randomly select your study sample: One of the main criticisms leveled against small RCTs is that they use ‘convenience samples.’ Retrospective studies that use nationally representative data are then preferred for external validity compared with these small samples that readers cannot make heads or tails of. It does not always have to be this way: even if you are going to set your study in one region or district in one small country, you can randomly sample your locations from the universe in that area, randomly select your study sample within those areas, and randomly assign treatment and control across these areas/schools/individuals. This allows you to generalize your findings to the entire target population in that region. This may not be as good as nationally representative, but it is much better than one clinic, two schools, or three villages that were selected non-randomly.
This will inevitably require more work and more resources (many times you will not have a proper and recent sampling frame, meaning you may have to list your study population first – perhaps door-to-door!). So, maybe this is not the way to go for your dissertation work (small grant, not enough time, etc.), but (a) you won’t be grad students forever, and (b) you can increasingly piggy-back on large projects that your professors have if you work hard and are patient.
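For readers who like to see the mechanics, the three steps in suggestion 2 (sample locations, sample units within them, assign treatment) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure from any particular study: the function name `design_study`, the village and household labels, the sample sizes, and the choice to assign treatment at the area level (rather than the school or individual level) are all hypothetical.

```python
import random


def design_study(frame, n_areas, n_per_area, seed=0):
    """Two-stage random sample followed by random assignment.

    frame: dict mapping an area name to the list of units listed in that
           area (the sampling frame, e.g. from a door-to-door listing).
    Returns one record per sampled unit with its area and treatment status.
    """
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced

    # Stage 1: randomly sample areas from the universe of areas in the region.
    areas = rng.sample(sorted(frame), n_areas)

    # Stage 2: randomly sample units within each selected area.
    sample = {
        a: rng.sample(frame[a], min(n_per_area, len(frame[a]))) for a in areas
    }

    # Stage 3: randomly assign half of the sampled areas to treatment.
    # (Assumption: assignment is clustered at the area level.)
    shuffled = list(areas)
    rng.shuffle(shuffled)
    treated = set(shuffled[: len(shuffled) // 2])

    return [
        {"area": a, "unit": u, "treated": a in treated}
        for a in areas
        for u in sample[a]
    ]


if __name__ == "__main__":
    # Hypothetical frame: 10 villages with 20 listed households each.
    frame = {
        f"village_{i}": [f"hh_{i}_{j}" for j in range(20)] for i in range(10)
    }
    for row in design_study(frame, n_areas=4, n_per_area=5, seed=1):
        print(row)
```

Because every stage is a random draw from a known list, the resulting estimates can be generalized to the listed population of the region, which is the point of the suggestion. If areas or units are sampled with unequal probabilities, sampling weights would be needed as well; the sketch leaves that out.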
3. WWtGD? Excuse the bad pun, but, really, try to keep this in mind during the study design phase: What Would the Government Do? Usually, you will start with a question that will imply an ideal study design: i.e. the design that would answer exactly that question of interest – as cleanly as possible. But, inevitably, that design will get amended as you move forward. If you constantly keep in mind how a program feature is likely to be treated in a scaled-up version of your program at the national level, you will ensure more external validity – even though it will likely come with the trade-offs for internal validity mentioned above. An example might help. With apologies, it comes from my own work – only because I know it best.
In our study comparing conditional and unconditional cash transfers in Malawi, we faced several difficult decisions during the design phase. One of these was on measurement: we chose not to have frequent random visits to classrooms to measure school participation. We were worried that frequent visits (even by independent data collectors not linked to the intervention) to check attendance would give the students in the unconditional treatment arm the impression that they were supposed to be attending school to receive their monthly payments. The idea was that a ‘truly unconditional’ cash transfer program at scale would not ‘suggest’ schooling by frequently monitoring attendance, so we should do our best to not do so either. However, there was a potentially large cost to this: we did not have direct observation on enrollment or attendance and had to rely on self-reports, teacher-reports, and school ledgers to measure these outcomes. Worrying about external validity came at the cost of some internal validity. We collected independent data on achievement tests in math and language to ameliorate this concern, but the fact remains that we had less than ideal measurement of our main outcome variable because we did not want to contaminate the interpretation of the findings in a larger context.
The important point here is not whether we made the right call or not: reasonable researchers can disagree with these types of decisions (of which we had at least a few more): someone else might have made the opposite call as the principal investigator. Rather, it is about weighing the pros and cons of these decisions carefully for the question at hand, which can be very pertinent for external validity, making explicit choices, and then being prepared to defend them. Depending on how you resolved these design issues, you will then be justifying those decisions against critics on either side of the aisle.
Let me finish with this from Cartwright (2007), which is a good guide to keep in mind: “Gold standard methods are whatever methods will provide (a) the information you need, (b) reliably, (c) from what you can do and from what you can know on the occasion.” It is easy to see why RCTs don’t always fit these criteria…