On Tuesday, Markus wrote about the potential conflict between two competing reasons for conducting project evaluations: judgment vs. learning. He defined the former as “thumbs up or down: program continues or stops,” while learning evaluations tell us why a project worked and how it could be better designed. Today, I want to challenge that notion: there is neither a clear line between judgment and learning, nor should there be one. In fact, it’s fine to have only the latter…
We should not be doing evaluations because there is some sort of institutional demand from the top for everyone to be evaluating their projects. Only a culture in which project managers (we call them task team leaders, or TTLs, over here) are encouraged (and given the time and resources) to identify a problem, gather multi-disciplinary evidence, think of innovative designs to address the key policy issues, design a pilot, and assess the effectiveness of alternative approaches will help policymakers make good decisions and save money at the same time. Institutions that design big projects first, put them into place, and then bring in an independent evaluator at the final stage to see whether it is working are missing the point. We shouldn’t be conducting impact evaluations because we need to look responsible and be accountable to funders and taxpayers. We should be doing them because they allow us to design better programs to begin with and to improve them as we go (by employing ongoing evaluations). Independent evaluations of big, expensive projects are fine, but they should form the minority of what we do, and there should still be a good reason for the evaluation: (a) the project should have passed some plausibility test before being scaled up in the first place (preferably via well-designed earlier pilots), and (b) there should be a reason to doubt the observed impacts after scale-up (due to implementation differences, general equilibrium effects, etc.). If there is not much to learn, or the knowledge will be useful only locally, scarce evaluation funds should preferably not flow there.
Take the Haiti poverty reduction project recently covered in the Washington Post. While the idea is interesting and complex (although others, notably Chile’s Solidario program a decade ago, have also tried this approach), the evaluation that was designed seems like it will answer a yes/no question: does a “package” of interventions work or not? The findings may tell the government and the donors whether to scale the program up. But they will not tell them how many agents to hire per 100 beneficiaries, how much to pay them (“Help, my teacher quit her job to become my family’s social worker!”), how long to train them, whether the results-based bonuses are necessary, what would have happened if there were also some demand-side incentives for the households, etc. All of these questions are directly relevant to policymakers and are needed immediately after the pilot phase to scale the program up if the pilot is successful. But, way too often, institutions (and sometimes even researchers) fall prey to evaluating a "package" without paying any attention to which parts of the package are causing the effect, which is absolutely key to cost-effectiveness. So, I'd say judgment requires learning, especially if there are subsequent design decisions to be taken.
There is not yet a culture in which the people in charge of designing projects think about the evaluation from the get-go. The projects with the highest potential to produce relevant information for policymakers as well as academics are those where the impact evaluation and the project have been designed hand in hand, each accommodating the other. That does not mean project design has to be compromised to allow a rigorous evaluation – but it is hard work (although I also find it very enjoyable): it requires the TTL, who is already very busy, to stay on top of all sorts of details to make sure that the most promising and feasible treatments are being put into place while also ensuring that they are evaluated against interesting counterfactuals.
Unfortunately, large donor institutions like the World Bank are the least likely to be able to pull such pilot evaluations off with client governments. Coordinating these things with a multitude of stakeholders is hard enough without having to worry about the evaluation. There is also the pressure for the project to succeed and do good, something the evaluator does not have to worry about. We need a culture in which taking risks and failing is perfectly fine: dust yourself off and try again with what you’ve learned. That will be so much easier if the risks are not high-stakes, meaning that the projects to be evaluated are not dictated from the top; instead, small pilots are designed by TTLs and policymakers in the field working closely with researchers, evaluated properly, and then put forward for prime time if and when they’re ready.
I think that everyone should lean more towards learning, because learning is judgment. But such evaluation work should be reserved for things we don't know, i.e., for pilot projects. We should not have cases like One Laptop Per Child (OLPC), where the evidence comes in after the government has spent millions of dollars on something completely unproven at the time of design. For large projects that are somewhat proven, we should be monitoring them or evaluating small tweaks, not the whole thing.
Comments are welcome – I am sure many may disagree with parts (or all) of the above. Can the readers give examples from their own work? What are successful examples of learning evaluations that have been scaled-up? What are the challenges for TTLs in designing good pilots/evaluations? What can our senior management do to improve things? Please let us and them know…