The Tao of Impact Evaluation



Is in danger of being messed up.   Here is why:   There are two fundamental reasons for doing impact evaluation: learning and judgment.   Judgment is simple – thumbs up, thumbs down: program continues or not.   Learning is more amorphous – we do impact evaluation to see if a project works, but we try and build in as many ways to understand the results as possible, maybe do a couple of treatment arms so we see what works better than what. In learning evaluations, real failure is a lack of statistical power, more so than the program working or not.   One could argue easily that judgment and learning are two sides of the same coin, but it’s really more yin and yang – but bear with me for a minute.  
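To make the point about statistical power concrete, here is a rough sketch using the standard normal approximation for a two-arm comparison of means. The function name `approx_power` and the numbers (a 0.2 standard deviation effect, 5% significance level, the three sample sizes) are assumptions chosen purely for illustration, not figures from any actual evaluation.

```python
from statistics import NormalDist


def approx_power(effect, sd, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two equal-sized arms.

    Illustrative only: assumes a known, common outcome standard deviation
    and a simple difference in means between treatment and control.
    """
    z = NormalDist()
    se = sd * (2 / n_per_arm) ** 0.5      # standard error of the difference
    z_crit = z.inv_cdf(1 - alpha / 2)     # two-sided critical value
    shift = abs(effect) / se              # true effect in standard-error units
    # Probability of landing in either rejection region
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Hypothetical sample sizes per arm, for the same assumed true effect
for n in (50, 200, 800):
    print(n, round(approx_power(effect=0.2, sd=1.0, n_per_arm=n), 2))
```

With the same true effect, the small sample will usually fail to detect it, which is the sense in which low power, rather than a null result as such, is the real failure of a learning evaluation.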

Now there are two big impact evaluation instigator camps – the policy sphere and the academic sphere. The academic sphere is all about the learning.   The policy sphere is much more conflicted – and that’s where we need to be more thoughtful and explicit as institutions as to why we are doing impact evaluation.  

Here’s why: while learning and judgment look like two sides of the same coin, they have very different implications for how an impact evaluation gets done. And that’s going to affect the quality of the results and the long term prospects for the shape of an evaluation culture.  

Let me explain the differences. First: a judgment evaluation really requires the evaluator to be outside the project team. Impartiality trumps, and the credibility of that impartiality comes in part from being an outsider. If learning is the objective, the evaluator has to be inside the team – working with the project team to develop the questions the evaluation will address, the different treatment arms, maybe even aspects of the intervention. And my take is that impact evaluation done at a distance is almost always worse than that done in conjunction with the team – not least because the evaluator has less of a sense as to what the program actually did (including the treatment assignment, which can be a moving target). However, with close engagement, impartiality becomes much more cloudy, especially in terms of external credibility.

Second, there is the question of how interventions get selected for an impact evaluation.   In the world of judgment evaluations, they are selected by the folks in charge.   Indeed, in this world it might even be a good idea to select them randomly – a sufficiently large budget would, in fact, give you a pretty good sense of performance (leaving aside the rather major problem that a significant fraction of what some agencies/governments do isn’t amenable to impact evaluation).   For the learners, the selection of impact evaluations is driven by knowledge gaps: What don’t we know? What’s important to know? Can we find interventions to help us learn about this?  

Third, as an extension of the points above, in judgment evaluations the program itself has no say in whether it is evaluated (in fact some non-trivial fraction will not want it – which from a judgment point of view is precisely the point). For a learning evaluation, the program has to want it, or at least allow the evaluators to proceed unhindered. While the judgment evaluator can work really hard on collecting the right data, the lack of program enthusiasm for this activity is likely to lower the overall quality of the data available in the end.

Fourth, there is an issue of what outcomes get evaluated. In a judgment evaluation, the project objective(s) as specified in the program document are all that matter. Learning evaluations should look at these, but are going to cast a wider net – and this will teach us things like the fact that conditional cash transfers might be good not only for keeping girls in school, but also for preventing HIV.

Fifth, the accountability of the evaluators is different, and this affects where they put their effort.   For judgment evaluations, the evaluator serves some incarnation of management, and will answer the questions they want.   For learning, the evaluator is responsible to a more diffuse bunch – management, the program team, the journal referees… This different emphasis has the potential to lead to different results.   For example, the learning evaluator is more likely to spend more time and effort looking at heterogeneity when the average treatment effect is zero.  

Sixth, the evaluation design itself is likely to be different.   For a judgment evaluation, what matters is the aggregate effect.   The learning evaluator is going to take a more organic approach – and while the aggregate effect matters, she or he is much more likely to look separately at different parts of the program or the impacts on different sub-populations.  
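A toy sketch of why looking at sub-populations matters when the aggregate effect is zero. The records below are invented for illustration (two hypothetical subgroups, "A" and "B", with made-up outcomes), and the simple difference in means stands in for whatever estimator a real evaluation would use.

```python
from statistics import mean

# Hypothetical records: (subgroup, treated flag, outcome). Invented data.
records = [
    ("A", 1, 12), ("A", 1, 14), ("A", 0, 10), ("A", 0, 10),
    ("B", 1, 8),  ("B", 1, 6),  ("B", 0, 10), ("B", 0, 10),
]


def diff_in_means(rows):
    """Mean outcome of treated units minus mean outcome of control units."""
    treated = [y for _, t, y in rows if t == 1]
    control = [y for _, t, y in rows if t == 0]
    return mean(treated) - mean(control)


# The aggregate effect, which is what a judgment evaluation would report
aggregate = diff_in_means(records)

# The same comparison within each subgroup
by_group = {
    g: diff_in_means([r for r in records if r[0] == g]) for g in ("A", "B")
}

print("aggregate:", aggregate)
print("by subgroup:", by_group)
```

In this made-up data the aggregate difference is zero, while subgroup A gains 3 and subgroup B loses 3. A thumbs-down on the aggregate would miss both stories.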

At the end of the day, for the policy sphere, this isn’t a dichotomy – impact evaluations in the policy sphere aren’t – for now – either judgment or learning, but have elements of both. However, I’ve been in a couple of discussions at more than one agency lately where I see a tilt towards the judgment side. I haven’t looked incredibly hard, but it strikes me that no (?) policy institution clearly articulates its view on this balance, and then indicates how its actions/commitments/resources line up behind this view. Should they make this explicit? Heck, should they tilt to one side? If not, how do they make the Tao stay in balance?


Markus Goldstein

Lead Economist, Africa Gender Innovation Lab and Chief Economists Office

September 25, 2012

Perhaps the real dichotomy is not judgment vs. learning, but about what is being learned. Academics and researchers typically look to extract global knowledge from an evaluation with the notion that a universal theory is being tested, not a specific program. The "policy sphere" is more interested in local learning - what will work here and now, with what impact at which cost?

Markus Goldstein
September 25, 2012

Derin, I agree. Two quick thoughts: 1) I think the judgment v learning tension matters because it shapes how we *do* the evaluation -- sure, learning will always happen, but if the central motivation is up or down on a project, for some kind of organizational decision, it will be different than if the motivation is "let's see if this works and what aspects are important, etc". 2) Your point also leads to the question: who is going to evaluate yet another conditional cash transfer? They might not work everywhere, but someone interested in global knowledge will see that area as one perhaps not deserving her/his attention.

September 25, 2012

Agreed that there is a tendency to lean towards judgment evaluations in the policy sphere. There's also a tendency for learning evaluations to be misinterpreted as judgment evaluations, sometimes to the detriment of the programs they’ve evaluated. Since we know that policymakers make funding decisions on the basis of impact evaluations (whether they are judgment or learning exercises), researchers have a role to play in ensuring the findings of their evaluations are interpreted accurately. It may tweak an evaluator’s pride to say “The evaluation’s low statistical power reduced our ability to measure the program’s impact” instead of “The evaluation was unable to measure the program’s impact”, but the reason researchers found no impact is clearer in the first statement than in the second. And ultimately, providing clarity is the goal of a learning evaluation.

Bill Savedoff
October 02, 2012

I like the distinction Markus is making between studies aimed at making a judgment and those oriented toward learning. It is similar to the distinction that I've seen between evaluations meant to address accountability and those meant to generate knowledge. In particular, there's a concern that people will be less willing to collaborate on evaluations aimed at accountability, because these directly assess whether they are doing their jobs well, than on evaluations aimed at building knowledge, which can treat mistakes and failures as part of a normal learning process. This parallels the outsider/insider point that Markus was making, too.

I think the balance that you're looking for can come from looking at an organization's overall portfolio of evaluations. The work we've done at CGD has regularly argued that rigorous impact evaluations should be done strategically and on a small set of projects, chosen based on the value of the knowledge that can be learned from them. Organizations could tilt in favor of learning for that particular subset of evaluations, and lean in favor of judgment in other studies. This is the approach that USAID took in its new evaluation policy - focusing on learning in a subset of evaluations and setting high standards for those, while conducting a range of studies for other purposes.