Guest Post by Martin Ravallion: Are we really assessing development impact?
When people ask about “development impact” they are often referring to the impact of some set of development policies and projects; let’s call it a “development portfolio.” The portfolio of interest may be various things that are (ostensibly) financed by the domestic resources of developing countries. Or it might be a set of externally financed projects spanning multiple countries: a portfolio held by a donor country or an international organization, such as the World Bank. Yet the bulk of the new evaluative work going on today is assessing the impact of specific projects, one at a time. Each such project is only one component of the portfolio. (And each project may itself have multiple elements.) As evaluators we worry mainly about whether we are drawing valid conclusions about the impact of that project in its specific setting, including its policy environment. The fact that each project happens to sit in some development portfolio gets surprisingly little attention. So an important question goes begging: How useful will all this evaluative effort be for assessing development portfolios?

The (possibly big) biases we don’t much talk about

When you think about it, assessing a development portfolio by evaluating its components one-by-one and adding up the results requires some assumptions that are pretty hard to accept. For one thing, it assumes that there are negligible interaction effects amongst the components. Yet the success of an education or health project (say) may depend crucially on whether infrastructure or public-sector reform projects (say) within the same portfolio have also worked. Indeed, the bundling of (often multi-sectoral) components in one portfolio is often justified by claimed interaction effects. But evaluating each bit separately and adding up the results will not (in general) give us an unbiased estimate of the portfolio’s impact. If the components interact positively (more of one yields higher impact of the other) then we will overestimate the portfolio’s impact; negative interactions yield the opposite bias (see Box; a sketch of the two-component algebra is given below).

For another thing, it assumes that we can either evaluate everything in the portfolio or we can draw a representative sample, such that there is no selection bias in which components we choose to assess. If the components are in fact roughly independent (as implicitly assumed) then ideally we should be evaluating a random sample of all the things in the portfolio, so as to make a valid overall assessment. (When interaction effects are also present one might need to sample differently.) But I have never heard of any development project being randomly sampled for an evaluation.

I can suggest a number of reasons why our present approach to impact evaluation risks giving us a biased assessment of the impact of a development portfolio. There is both a demand side and a supply side to evaluations. On the demand side, the choices about what gets put up for an evaluation appear to be heavily decentralized and largely uncoordinated. Interaction effects are sometimes studied, but for the most part the components are looked at in isolation. Then we will have little option but to add up the results across multiple evaluations and hope that interaction effects are minimal.
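The Box itself is not reproduced here, so the following is a minimal sketch of the two-component case, using notation of my own rather than the author’s. Let Y(d1, d2) denote the outcome when component 1 is in place (d1 = 1) or not (d1 = 0), and similarly for component 2. The portfolio’s impact is

$$Y(1,1) - Y(0,0).$$

Evaluating each component on its own, with the other component already in place, and adding up the results gives

$$\bigl[Y(1,1) - Y(0,1)\bigr] + \bigl[Y(1,1) - Y(1,0)\bigr] = \bigl[Y(1,1) - Y(0,0)\bigr] + \underbrace{\bigl[Y(1,1) - Y(1,0) - Y(0,1) + Y(0,0)\bigr]}_{\text{interaction effect}}.$$

So the added-up project impacts overstate the portfolio’s impact when the interaction effect is positive, and understate it when the interaction effect is negative, which is the direction of bias described above.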
On the supply side, the currently favored toolkit for evaluation relies heavily on methods that are tailored to assigned programs, meaning that some units get the intervention and some do not, and it is assumed that there are negligible spillover effects to mess up the evaluation. Within this class of tools, the current fashion is for randomized controlled trials (RCTs) and social experiments more generally. These are thought to be very good at removing one of the potential biases, namely that stemming from purposive assignment to the treatment and comparison groups. (Although the claim that even this particular bias is removed by an RCT requires some rather strong behavioral assumptions, as Jim Heckman, Michael Keane and others have emphasized, to say nothing of all the other, generic, concerns, including external validity.)

The problem is that the interventions for which currently favored methods are feasible constitute a non-random subset of the things that are done by donors and governments in the name of development. It is unlikely that we will be able to randomize road building at any reasonable scale, or dam construction, or poor-area development programs, or public-sector reforms, or trade and industrial policies, all of which are commonly found in development portfolios. One often hears these days that opportunities for evaluation are being turned down by analysts when they find that randomization is not feasible. Similarly, we appear to invest too little in evaluating projects that are likely to have longer-term impacts; standard methods are not well suited to such projects, and (just like everyone else) evaluators discount the future.

Another source of bias is the heterogeneity in the political acceptability of evaluations. The (potentially serious) ethical and political concerns about social experiments will have greater salience in some settings than in others. The menu of feasible experiments will then look very different in different places, and the menu will probably change over time.

Our knowledge needs to improve in areas where there are both significant knowledge gaps and an a priori case for government intervention. However, it is not clear that we are focusing enough on evaluating the sorts of things that governments need to be doing. As Jishnu Das, Shantayanan Devarajan and Jeffrey Hammer have pointed out, it is not clear that RCTs are well suited to programs involving the types of commodities for which market failures are a concern, such as those with large spillover effects in production or consumption. (See their 2009 article, “Lost in Translation,” in the Boston Review.)

For these reasons, I would contend that our evaluative efforts are progressively switching toward things for which certain methods are feasible, whether or not those things are important sources of the knowledge gaps relevant to assessing development impact at the portfolio level. There is a positive “output effect” of the current enthusiasm for impact evaluation, but there is also likely to be a negative “substitution effect.” We appear to be building up our knowledge in a worryingly selective way, one that is unlikely to compensate for the distortions generating existing, policy-relevant knowledge gaps, which stem from the combination of decentralized decision making about project evaluation and the inevitable externalities in knowledge generation. (You can read more about this in my article “Evaluation in the Practice of Development.”)
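To make the “non-random subset” concern concrete, here is a minimal simulation sketch. It is not from the post, and every number in it is hypothetical; it only illustrates that if the projects we can evaluate with favored methods perform systematically differently from those we cannot, then adding up what we happen to measure misstates impact at the portfolio level.

```python
# Illustrative sketch only: a hypothetical portfolio in which "evaluable" projects
# (e.g., pilotable transfer programs) have systematically different impacts from
# "hard-to-evaluate" ones (e.g., roads, dams, sector-wide reforms).
import random

random.seed(1)

portfolio = (
    [{"evaluable": True, "impact": random.gauss(0.20, 0.05)} for _ in range(60)]
    + [{"evaluable": False, "impact": random.gauss(0.45, 0.15)} for _ in range(40)]
)

# True average impact across the whole portfolio (never observed in practice).
true_mean = sum(p["impact"] for p in portfolio) / len(portfolio)

# What the evaluation effort actually delivers: impacts of the evaluable subset.
evaluated = [p["impact"] for p in portfolio if p["evaluable"]]
subset_mean = sum(evaluated) / len(evaluated)

print(f"Mean impact, whole portfolio : {true_mean:.3f}")
print(f"Mean impact, evaluated subset: {subset_mean:.3f}")
# The gap between the two numbers is the selection bias: the evaluated projects
# are not a random sample of the portfolio.
```

Whether the evaluated subset overstates or understates portfolio impact depends entirely on how the hard-to-evaluate components actually perform; the point is only that, without something like random sampling of what gets evaluated, there is no reason to expect the two numbers to agree.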
What do we need to do?

There are two sorts of things we need to do to address this problem, though both are likely to meet resistance.

The first requires some “central planning” in terms of what gets evaluated. Nobody much likes central planning, but we often need some form of it in public goods provision, and knowledge is a public good. A degree of planning, with judicious use of incentives, could create a compensating mechanism to ensure that decentralized decision making about evaluation better addresses policy-relevant knowledge gaps, and that we really do get at development impact. For example, in the early years of the Bank’s Development Impact Evaluation (DIME) initiative, a successful effort was made to promote evaluations of conditional cash transfer programs, which led to the (very good) 2009 Policy Research Report on this topic. Maybe this was just the low-hanging fruit, since CCTs are often amenable to currently favored methods, at least at the pilot stage. Nonetheless, this example demonstrates that a degree of planning is possible to ensure that pre-identified, policy-relevant knowledge gaps are addressed. The challenge is to apply this idea to development portfolios, which typically involve multiple sectors and diverse projects.

Second, we need to think creatively about how best to go about evaluating the portfolio as a whole, allowing for interaction effects amongst its components, as well as amongst economic agents. This is not going to be easy. Standard tools of impact evaluation will have to be complemented by other tools, such as structural modeling and cross-country and cross-jurisdictional comparative work. We will need to look at public finance issues such as fungibility and flypaper effects. And we will almost certainly be looking at general equilibrium effects (big time in some cases!), which can readily overturn the partial equilibrium picture that emerges from standard impact evaluations. There is scope for more eclectic approaches, combining multiple methods and spanning both “macro” and “micro” tools of economic analysis. The tools needed may not be the ones favored by today’s evaluators. But the principle of evaluation is the same, including the key idea of assessing impact against an explicit counterfactual.

If we are serious about assessing “development impact” then we will have to be more interventionist about what gets evaluated, and more pragmatic and eclectic in how that is done.

Martin Ravallion is Director of the World Bank’s Development Research Group.
Well said, Martin. It would be nice if you noted the ongoing efforts in the HD network precisely to "coordinate and orient" our IE work towards policy-relevant knowledge gaps. The "research pillars" established under the Spanish Impact Evaluation Fund do exactly what you called for: they use funding incentives and technical support from leading researchers to guide our collective focus on priority knowledge gaps, and to ensure the consistent technical quality and complementarity (i.e., through the use of common instruments) of evaluation efforts.
In addition to the CCT book you mentioned, another example of this approach is the work Deon, Harry and I have done over the past five years to build the evidence base on education policies aimed at making schools more accountable for results. We recently produced a synthesis book called Making Schools Work: New Evidence on Accountability Reforms. If you read it (!) you will note that we draw on evaluations using a range of different methods, as the research studies were aimed at answering big questions (for example, how does the introduction of school performance targets and bonus pay for teachers affect system performance?) that are not always amenable to RCTs.
I think your discussion is interesting. But I wondered if you are looking in the wrong place for interaction effects. If you are looking for interaction effects in an average agency's portfolio of investments then I am tempted to say "...in your dreams!" Most portfolios are like shotgun blasts rather than fishing nets, in terms of their degree of interconnectedness (an example: http://mande.co.uk/blog/wp-content/uploads/2011/05/Shotgun-fishing-net…)
I think the place to expect interaction effects is in geographic locations, where multiple aid agency and government programs are in operation.
What you say is true, but a key question here is: how do you sell this to the donors? It seems to me that:
(a) donors are the principal buyers of evaluations.
(b) because they themselves have non-technical owners, including ultimately the citizenry, they want evaluations that are readily presentable and defensible.
(c) the proponents of RCTs have been very successful in promoting their narrow product to the donors as "the gold standard", on the basis that (apparently) development assistance is just like administering drugs.
(d) practitioners realise that the RCT product produces results that are often only marginally useful, and that if its use became widespread it would tend to eliminate much that was useful: a major retrograde step.
Therefore, it seems to me that what's needed here is a simplified but compelling marketing message that says what you've just said, but in a way that is as rhetorically effective as the "gold standard" spiel.
What is that message?
Indeed, there needs to be a lot more effort to incentivize evaluations of interventions that are important (especially those we spend billions of dollars on, with a justifiable or unjustifiable degree of faith) but that may not have cute methodologies or a quick turnaround. Most likely these topics are not attractive for assistant professors waiting for tenure. I'm not sure whether many tenured professors are so keen to evaluate development interventions either. Perhaps DEC can create different incentives for its researchers?
I agree with you that there needs to be a lot more central planning in directing the research budget and brainpower. But who should do the planning?
My view is that the anchor needs to set directions and control the quality of IE work on its portfolio, in partnership with DEC. HD seems to be the most active in this regard. My understanding is that each Sector Board is supposed to have a DEC representative, though I'm not aware of exactly how the sector board actually influences the research direction of DEC. Perhaps the sector boards should be involved in strategic staffing decisions of DEC? This approach will only work if the sector boards' knowledge management role is strong.
The question is: who will finance the important research identified through this "central planning"? Our current approach of largely relying on government projects to pay for evaluations of themselves does not work. Even the best evaluations may not benefit the project implementation authority, simply as a result of the time lag. We need to think of each evaluation exercise as an element of a long-term effort to understand a topic, so that the beneficiary is the development community as a whole rather than the project authority. It follows that evaluations should not be paid for by the project. We need to set aside a percentage of lending volume as an evaluation budget. Each sector board can allocate it, with DEC either conducting the research or doing quality control of contracted researchers. What's your view?