Is in danger of being messed up. Here is why: There are two fundamental reasons for doing impact evaluation: learning and judgment. Judgment is simple – thumbs up, thumbs down: the program continues or it doesn’t. Learning is more amorphous – we do impact evaluation to see if a project works, but we try to build in as many ways to understand the results as possible, maybe with a couple of treatment arms so we can see what works better than what. In learning evaluations, the real failure is a lack of statistical power, more so than whether the program works or not. One could easily argue that judgment and learning are two sides of the same coin, but it’s really more yin and yang – bear with me for a minute.
Now there are two big impact evaluation instigator camps – the policy sphere and the academic sphere. The academic sphere is all about the learning. The policy sphere is much more conflicted – and that’s where we need to be more thoughtful and explicit as institutions as to why we are doing impact evaluation.
Here’s why: while learning and judgment look like two sides of the same coin, they have very different implications for how an impact evaluation gets done. And that’s going to affect both the quality of the results and the long-term prospects for an evaluation culture.
Let me explain the differences. First: a judgment evaluation really requires the evaluator to be outside the project team. Impartiality trumps everything else, and the credibility of that impartiality comes in part from being an outsider. If learning is the objective, the evaluator has to be inside the team – working with the project team to develop the questions the evaluation will address, the different treatment arms, maybe even aspects of the intervention. And my take is that impact evaluation done at a distance is almost always worse than that done in conjunction with the team – not least because the evaluator has less of a sense as to what the program actually did (including the treatment assignment, which can be a moving target). However, with close engagement, impartiality becomes much cloudier, especially in terms of external credibility.
Second, there is the question of how interventions get selected for an impact evaluation. In the world of judgment evaluations, they are selected by the folks in charge. Indeed, in this world it might even be a good idea to select them randomly – with a sufficiently large budget, random selection would, in fact, give you a pretty good sense of overall performance (leaving aside the rather major problem that a significant fraction of what some agencies/governments do isn’t amenable to impact evaluation). For the learners, the selection of impact evaluations is driven by knowledge gaps: What don’t we know? What’s important to know? Can we find interventions to help us learn about this?
Third, as an extension of the points above, in judgment evaluations the program itself has no say in whether it is evaluated (in fact some non-trivial fraction of programs will not want it – which from a judgment point of view is precisely the point). For a learning evaluation, the program has to want it, or at least allow the evaluators to proceed unhindered. While the judgment evaluator can work really hard on collecting the right data, the program’s lack of enthusiasm for this activity is likely to lower the overall quality of the data available in the end.
Fourth, there is an issue of what outcomes get evaluated. In a judgment evaluation, the project objective(s) as specified in the program document are all that matter. Learning evaluations should look at these, but are going to cast a wider net – and this will teach us things like the fact that conditional cash transfers might not only be good for keeping girls in school, but also for preventing HIV.
Fifth, the accountability of the evaluators is different, and this affects where they put their effort. For judgment evaluations, the evaluator serves some incarnation of management, and will answer the questions management wants answered. For learning, the evaluator is responsible to a more diffuse bunch – management, the program team, the journal referees… This difference in emphasis has the potential to lead to different results. For example, the learning evaluator is likely to spend more time and effort looking at heterogeneity when the average treatment effect is zero.
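To make the heterogeneity point concrete, here is a minimal sketch of how an average treatment effect of zero can hide offsetting effects in two sub-populations. Everything in it is an assumption for illustration: the outcome numbers are made up, the subgroup names `group_a` and `group_b` are hypothetical, and a simple difference in means stands in for whatever estimator a real evaluation would use.

```python
from statistics import mean

# Made-up records for illustration only: (subgroup, treated, outcome).
# These are not results from any real evaluation.
records = [
    ("group_a", True, 11.0), ("group_a", True, 13.0),
    ("group_a", False, 9.0), ("group_a", False, 11.0),
    ("group_b", True, 7.0), ("group_b", True, 9.0),
    ("group_b", False, 9.0), ("group_b", False, 11.0),
]


def effect(rows):
    """Difference in mean outcomes between treated and control rows."""
    treated = [y for _, t, y in rows if t]
    control = [y for _, t, y in rows if not t]
    return mean(treated) - mean(control)


def effects_by_subgroup(rows):
    """Same difference in means, computed separately for each subgroup."""
    groups = sorted({g for g, _, _ in rows})
    return {g: effect([r for r in rows if r[0] == g]) for g in groups}


overall = effect(records)
by_group = effects_by_subgroup(records)

print("pooled effect:", overall)
for name, value in by_group.items():
    print(name, "effect:", value)
```

With these invented numbers the pooled effect is zero, while one subgroup gains two points and the other loses two. A judgment evaluation that stops at the pooled number would report "no impact"; a learning evaluation that splits the sample would see two quite different stories. Real subgroup analysis would of course also need standard errors and enough power within each subgroup.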
Sixth, the evaluation design itself is likely to be different. For a judgment evaluation, what matters is the aggregate effect. The learning evaluator is going to take a more organic approach – and while the aggregate effect matters, she or he is much more likely to look separately at different parts of the program or the impacts on different sub-populations.
At the end of the day, for the policy sphere, this isn’t a dichotomy – impact evaluations in the policy sphere aren’t – for now – either judgment or learning, but have elements of both. However, I’ve been in a couple of discussions at more than one agency lately where I see a tilt towards the judgment side. I haven’t looked incredibly hard, but it strikes me that no (?) policy institution clearly articulates its view on this balance, and then indicates how its actions/commitments/resources line up behind this view. Should they make this explicit? Heck, should they tilt to one side? If not, how do they make the Tao stay in balance?