Getting beyond the mirage of external validity

Markus Goldstein
This post is coauthored with Eliana Carranza
 
No thoughtful technocrat would copy a program in every detail for a given context in her or his country. That's because they know (among other things) that economics is not a science but a social (or even dismal) science, and so replication in the fashion of chemistry isn't an option. For economics, external validity in the strict scientific sense is a mirage.
 
What the technocrats and politicians are interested in, though, is what has worked elsewhere, so they can draw on that when designing their program. So here we posit three ways to think about how this knowledge can be organized that give us the closest we can get to external validity.
 
  1. Adaptive replications:  This is the case when the sense and spirit of the program design is kept the same (and yes, that is somewhat vague) but, far from adopting a cookie-cutter approach, it is replicated in a different context with significant local variation.  This type of replication is best illustrated by the nice recent multi-country study of graduation from poverty in Science that Berk blogged about on Monday.  Banerjee et al. look at a set of programs characterized by a core set of interventions (give people livestock, training, consumption support, etc.), but their local implementation differs in who actually gets the asset, how the program is targeted, what the consumption support looks like and other factors.  The variation in program design isn't mind-blowingly huge, but some of it is large enough that evaluations could ask (or have asked) questions that focus just on that particular variation (e.g. how long and when to give consumption support to poor households).  But to us, wearing our social science hats (and not our lab coats), this kind of variation is exactly what we want, because the next policymaker to adopt this kind of program is probably going to tweak things along the same dimensions.  In the end, given a sufficiently diverse set of cases, this approach lets us see whether fairly similar (not identical) programs have similar effects in a number of contexts, and infer the probability that this program will work in a new context.  And that's really useful.
      
  2. Evolutionary learning:  In this case, the next iteration of a program draws on previous implementation and impact evaluation results to inform and try a significant design variation in the same or a different context.  When the program results are null or negative, thoughtful evaluation tries to figure out why this was the case (was some other constraint missed? was there an implementation failure? was the "dosage" too weak? were impacts heterogeneous?), giving the next version of the program something to build on.  The same holds true for positive program impacts, where a tweak will be tried or a complementary intervention will be added to the next iteration of the program.  And the learning occurs when this next version of the program is evaluated.  The literature on business training is a good example of this.  Most early evaluations of training were not so promising.  So folks tried other interventions in the same vein (e.g. providing management consulting as in Bloom et al.).  They tried, or are trying, training combined with capital, and training that takes a very different approach to what it is trying to teach.  Evolution - the natural order of things (even in economics).
     
  3. Learning at scale:  Here one or more initial positive results lead to scaling up the implementation and evaluation of a program design in the same or a different context.  This type of experimentation seeks to answer how small-scale programs (and their evaluation results) turn out when a lot of folks participate.  General equilibrium effects, less tailored implementation and a host of other considerations might mean that program impacts are different under this scenario, so this is obviously an important way to learn.
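 
To make the inference in point 1 a bit more concrete, here is a minimal sketch of one way to go from a handful of site-level impact estimates to a rough probability that a similar program has a positive effect in a new context. It uses a standard DerSimonian-Laird random-effects pooling with a normal approximation; this is our illustration, not the method used in any of the studies mentioned, and the function names and the numbers in the usage example are hypothetical.

```python
import math


def pool_random_effects(effects, std_errors):
    """DerSimonian-Laird random-effects pooling of site-level estimates.

    Returns (pooled mean, its standard error, between-site variance tau2).
    """
    k = len(effects)
    if k < 2 or k != len(std_errors):
        raise ValueError("need at least two sites with matching standard errors")

    # Inverse-variance (fixed-effect) weights and mean.
    w = [1.0 / se ** 2 for se in std_errors]
    fixed_mean = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)

    # Cochran's Q measures how much sites disagree beyond sampling noise.
    q = sum(wi * (yi - fixed_mean) ** 2 for wi, yi in zip(w, effects))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)

    # Re-weight each site, now allowing for true variation across contexts.
    w_star = [1.0 / (se ** 2 + tau2) for se in std_errors]
    mean = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se_mean = math.sqrt(1.0 / sum(w_star))
    return mean, se_mean, tau2


def prob_positive_in_new_context(mean, se_mean, tau2):
    """Approximate P(true effect > 0) at a new site.

    Assumes the new site's effect is drawn from a normal distribution with
    the pooled mean and variance tau2 + se_mean**2.
    """
    sd = math.sqrt(tau2 + se_mean ** 2)
    z = mean / sd
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


if __name__ == "__main__":
    # Hypothetical standardized effects and standard errors for six sites;
    # these are made-up numbers for illustration only.
    effects = [0.25, 0.10, 0.30, -0.02, 0.18, 0.12]
    std_errors = [0.08, 0.07, 0.10, 0.09, 0.06, 0.08]
    mean, se_mean, tau2 = pool_random_effects(effects, std_errors)
    p = prob_positive_in_new_context(mean, se_mean, tau2)
    print(f"pooled effect={mean:.3f}, tau2={tau2:.4f}, P(effect>0 at new site)={p:.2f}")
```

The more the sites disagree with each other (a larger tau2), the wider the spread of plausible effects in the next context, which is exactly why diversity across the replications matters for this kind of inference.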
 
Taking stock of these three approaches, what we clearly lack are adaptive replication studies and learning at scale.
 
There are a couple of reasons why this is the case.  First, for learning at scale, convincing policymakers to do an evaluation that large and complex is orders of magnitude harder than doing so for a pilot program with a few thousand beneficiaries.
 
Second (and this applies to both) is how publication and policymaking incentives play out.  Once the initial program is implemented and/or its evaluation results are published, the visibility of each additional iteration of program X (whether an adaptive replication or a scaled-up version) drops significantly.  In the publication space this may mean moving down the ladder of journal rankings or not being able to publish results at all.  In the policy and evaluation funding spaces, it may become more difficult to justify and secure funding for another evaluation of program X when there is already evidence from the same or a similar context.
 
Banerjee et al. show us a way out of the challenge faced by adaptive replications: they can be made more appealing by setting up all the evaluations to take place at the same time, although this requires a massive amount of coordination (and no small amount of funding).  For learning at scale, though, it's still an uphill battle.  Two things that might help get more of this are capitalizing on cases where groups within the government or implementing organization have viable, competing visions of a project, and finding evidence-minded policymakers who are senior enough to push for this.
 
Here's hoping we get more of both in the future.
 

Comments

Submitted by Gareth on

Thanks Markus and Eliana, you make a very good point about adaptive replication. Are there any prestigious journals that accept replication studies? Also, are there any studies that use observational data for interventions at scale to test if the effects from a pilot persist?

Submitted by Gabriel on

Great article, thank you! I will piggyback to ask a related question: are there any evaluations (randomized or otherwise) of *modern* at-scale expansions of deworming?

Submitted by Katrin Verclas on

Gabriel - you might be interested in a recent blog post on why we are not measuring *impact* at scale (though we measure plenty else), using the example of national deworming programs that we support: http://www.evidenceaction.org/blog-full/why-we-do-not-measure-impact-at-scale-and-are-unapologetic-about-it.