Evaluating the Millennium Villages: Reply to the MVP + Upcoming Seminar with Comments from Jeff Sachs
The following post was co-authored by Michael and Gabriel.
The Millennium Village Project (MVP) is an important, experimental package of interventions that the United Nations and Columbia University are testing in 14 villages across Africa. The MVP offers a tremendous opportunity to learn whether such interventions can catalyze self-sustaining growth and escape from extreme poverty. But the evaluation approaches currently being used cannot generate convincing evidence of the Project’s impacts. Without such evidence, it will be impossible to generate the billions of dollars needed to scale up the Project approach across Africa, as its proponents hope to do.
We have written a new research paper (summarized here and here) that proposes small and inexpensive modifications to the MVP evaluation approach that would make it possible to evaluate the Project’s impacts.
That paper has generated much discussion, including reports in the Financial Times and in a major newspaper in Kenya. The Project itself has issued a lengthy official response by Pronyk, McArthur, Singh, and Sachs. We welcome this public debate as a way to improve learning about what works in development. We answer below the main questions posed in the Project’s response, much of which rests on a basic misunderstanding.
1) Does it make sense to compare trends at the Project site to trends outside the site?
Our paper shows that many of the improvements seen at the Project sites—improvements that the Project describes as its “impacts”—are also happening across large areas where the Project is not active. Pronyk et al. argue that such a comparison is misleading, stating that there are many interventions going on outside the Project site that resemble components of the Project. It’s true that many other interventions are taking place, but this point is irrelevant to estimating the impact of the Project in the target villages. Much of the Pronyk et al. response rests on this fundamental misunderstanding.
Our paper is exclusively about evaluating the impact of the Project. Here is how the World Bank’s Development Impact Evaluation Initiative explains project impact evaluation:
Impact evaluations assess the specific outcomes attributable to a particular intervention or program. They do so by comparing outcomes where the intervention is applied against outcomes where the intervention does not exist. An appropriate comparison group represents what would have happened in the absence of the intervention. By establishing a good comparison of outcomes for these two groups, an impact evaluation seeks to provide direct evidence of the extent to which an intervention changes outcomes. [The US government’s Millennium Challenge Corporation uses a similar definition.]
Measuring the impact of the Project means asking this question: What happened at sites that received the Project’s package intervention, relative to what would have happened at those sites in the absence of the Project? “In the absence of the Project” does not mean in the absence of any interventions whatsoever—it means what would have happened without that specific project.
In our paper, we compare trends at MVP sites to changes in the surrounding broad rural areas because they provide a plausible estimate of what would have happened at the sites in the absence of the Millennium Village Project. Crucially, this counterfactual "what would have happened" scenario includes those changes driven by governmental and NGO interventions which would have been active had the Project never existed. Consequently, the fact that various other interventions have been taking place in Kenya, Ghana, and Nigeria is not in any way an obstacle to evaluating the impacts of the Project in isolation. It is incorrect to measure the impact of a project by comparing outcomes under the Project to outcomes in a village "untouched" by any interventions whatsoever. This is because remaining "untouched" is not a realistic estimate of what would have happened in those villages if the Project did not exist.
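The logic of this comparison can be made concrete with a small difference-in-differences sketch. The numbers below are purely hypothetical, not actual MVP data; the function names are ours, chosen for illustration:

```python
# Illustrative difference-in-differences calculation with hypothetical
# numbers (NOT actual MVP data). The project's estimated impact is the
# change at the project site minus the change in the surrounding rural
# area, which stands in for the counterfactual trend.

def diff_in_diff(site_before, site_after, region_before, region_after):
    """Estimate project impact as the site's change net of the
    region-wide change (the 'what would have happened anyway')."""
    site_change = site_after - site_before
    region_change = region_after - region_before
    return site_change - region_change

# Hypothetical example: an outcome rises 30 points at the project site,
# but rises 25 points in the surrounding area over the same period.
impact = diff_in_diff(site_before=10, site_after=40,
                      region_before=12, region_after=37)
print(impact)  # 5 points attributable to the project, not 30
```

The point of the sketch is that the 25-point regional trend already includes whatever other government and NGO programs were operating; subtracting it does not require a region "untouched" by any intervention.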
This critical point can be understood through the example of Mexico’s PROGRESA (Education, Health, and Nutrition Program). Like the MVP, PROGRESA was a package intervention program. The main components of PROGRESA were cash transfers, nutritional supplements, and in-kind health benefits, all provided as a package. Many of the rural communities that received PROGRESA, as well as control villages, were receiving other government and NGO interventions, including a different, pre-existing cash transfer program. This was not an obstacle to the PROGRESA impact evaluation. This is because the changes and interventions in the control villages are a plausible estimate of what would have happened at the PROGRESA sites if PROGRESA itself had never existed. PROGRESA’s experience has been a model for how rigorous large-scale impact evaluations can be carried out and for how they can inform effective policy.
2) Is five years enough to assess whether the Project has the impacts it claims?
Pronyk et al. dismiss the need for any impact evaluation extending beyond five years (the evaluation period stated in the Project’s evaluation protocol) before scaling up the project. Longer-term impact evaluation “cannot be taken seriously,” they state, because the individual elements of the MVP package are already “proven.” However, the entire point of the MVP has been to demonstrate the value of the Project as an integrated package. In the words of the Project organizers, “the MVP was conceived as a proof of concept that the poverty trap can be overcome and the MDGs achieved by 2015 at the village-scale in rural Africa by applying the United Nations Millennium Project’s recommended interventions in multiple sectors” and the “interventions are undertaken as a single integrated project.”
The fact that an individual element of the package can do some good—for example, that fertilizer can raise crop yields—does not imply that the package is proven to achieve the Project’s own stated objectives for long-term change. These objectives include assertions that the Project’s package can “achieve self-sustaining economic growth,” break free of “poverty traps”, and “achieve the Millennium Development Goals (MDGs).” It is not yet proven that the “single integrated project” carried out by the MVP can achieve these outcomes, and the troubled history of similar programs suggests skepticism that village-level package interventions can spark sustained development. It is precisely for this reason that we argue the need for rigorous evaluation in the case of the MVP.
We highlight the need for long-term evaluation in the case of the Project because achieving self-sustaining economic growth and the MDGs are long-term goals that cannot be assessed with only a five-year evaluation. Researchers we cite in the paper note the “striking similarities between the MVP and past rural development initiatives, which, for various reasons, proved to be ineffective in sustaining rural development” (emphasis in the original). In the paper we describe one of those initiatives, a five-year project in China which showed significant gains relative to comparison villages after five years. A long-term evaluation showed that five years further down the line, living standards had improved just as much in the average village that had never been touched by the large, costly intervention.
We do not advocate, here or in our paper, doing nothing for 15 years until definitive proof arrives. Limited scale-up plans for the MVP are already being developed, and with a well-designed, rigorous evaluation, they can provide a golden opportunity to learn how well the Project is achieving its goals. Then, future decisions about scaling up the Project to the level of millions of people and tens of billions of dollars can be made taking into account the intermediate and long-term findings from the evaluation. As we say in the paper, “Taking 5–15 years to acquire scientific understanding of a large intervention’s effects before enormous scale-up is an appropriate investment.”
3) Does our paper exaggerate the weaknesses of the Project’s evaluation protocol?
Unfortunately the Project’s evaluation protocol does little to alleviate fears that its impact evaluation approach will miss the learning opportunity afforded by this important Project. The protocol itself recognizes and considers important the key weaknesses we highlight. Among the weaknesses are the following:
a) The Project intervention sites were chosen in a way that could introduce bias. The MVP’s own evaluation protocol agrees: “The non-random selection of intervention communities has the potential to introduce bias,” because “issues of feasibility, political buy-in, community ownership and ethics also featured prominently in village selection for participation.” The MVP protocol itself thus contradicts Pronyk et al.’s assertion that they “refute any insinuation that the Millennium Villages were somehow systematically advantaged at the outset of the project.”
b) The comparison sites were also chosen in a way that could introduce bias. As the MVP’s own evaluation protocol acknowledges, the comparison sites were chosen well after the project began. The protocol’s method of choosing comparison sites lists some ways in which the comparison sites should be similar to the intervention sites, but does not eliminate (and does not claim to eliminate) the possibility that they are different in other important ways that could affect development outcomes.
c) The project did not collect baseline data on the comparison villages. The MVP’s own evaluation protocol highlights “limited ability to make clear statements regarding baseline equivalence” as a limitation.
d) Because of the small number of comparison villages, it is likely that even substantial differences between the intervention sites and the comparison sites will be difficult to statistically distinguish from zero. This observation follows directly from the statistical power calculations presented in the MVP protocol.
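The power problem in point (d) can be illustrated with a back-of-the-envelope calculation. The numbers here are hypothetical, not those in the MVP protocol; the function is our own sketch of a standard two-sample minimum-detectable-effect formula under a normal approximation:

```python
# Hypothetical power sketch (NOT the MVP protocol's actual numbers).
# When the unit of comparison is the village, the effective sample size
# is the number of villages, so even large differences between
# intervention and comparison sites are hard to distinguish from zero.

from math import sqrt

def min_detectable_effect(n_treat, n_control, sd,
                          alpha_z=1.96, power_z=0.84):
    """Minimum detectable effect for a two-sample comparison of means,
    normal approximation: MDE = (z_alpha/2 + z_power) * SE, for a 5%
    two-sided test with 80% power."""
    se = sd * sqrt(1.0 / n_treat + 1.0 / n_control)
    return (alpha_z + power_z) * se

# With 14 intervention villages, 14 comparison villages, and a
# between-village standard deviation of 10 percentage points:
mde = min_detectable_effect(14, 14, 10.0)
print(round(mde, 1))  # 10.6 -- only differences of ~11 points detectable
```

Under these illustrative assumptions, a true project impact of, say, 8 percentage points would likely be statistically indistinguishable from zero.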
Our main concern in the end goes beyond these points about survey and data methods. It is that a low-cost, rigorous evaluation of the impact of this important project is still possible—but the first major report of the MVP does not point the way toward careful evaluation. Though Pronyk et al. complain that we “inaccurately describe the June 2010 Harvests of Development, as an MVP evaluation report,” the report describes itself as “part of the third-year evaluation,” and on 18 separate occasions lists what the authors consider to be the “Biggest Impacts” of the Project (pp. 65–67, 73–75, 81–83, 89–91, 97–99). One of the response note’s authors describes HOD as a “major scientific report.” In our view, an official publication that claims to state the “impacts” of a project is an evaluation report.
4) Can the impact of the Project be evaluated starting now with a randomized design?
Pronyk et al. claim that the randomized impact evaluation strategy we suggest requires sites untouched by any intervention whatsoever. This is not true. An impact evaluation of the Project requires comparison to sites where the Project has not intervened, and such sites are abundant. An impact evaluation for all external interventions of any kind would require a control site that had never received any intervention of any kind, but that would be a fundamentally different impact evaluation than an impact evaluation of the Project.
Again, Mexico’s PROGRESA evaluation provides a good example. For that program, both intervention and comparison communities were selected at random. Many of the communities had received earlier interventions. This was not an obstacle to scientifically evaluating the impact of PROGRESA, and there was no need to seek out “untouched” communities.
A randomized impact evaluation would be similar to what the MVP is already doing and would carry roughly the same cost per site. The Project already has numerous intervention sites and numerous comparison sites. The randomized design we propose only requires a particular way of choosing which sites are the intervention sites and which are the comparison sites in the next wave of the Project. It yields much better measurement of impacts, and no practical concerns prevent its use. All it requires is the will to measure real effects in a way that is clear and objective.
Pronyk et al. suggest that, given the acute needs, massive scaling up of the Project should not await careful impact evaluation. But the Project has made strong claims to its funders and to the public about its ability to achieve lasting, self-sustaining change in isolated local settings. In decades past, Africans have endured countless false promises from well-intentioned outsiders who arrived with purported solutions to their long-term challenges. Low-income people in Africa need and deserve interventions that are proven to be effective.
The seminar on "Evaluating the Millennium Villages" announced for October 27th has been postponed. It will be re-scheduled soon and available online. To request more information, please use the "contact" form on the top left corner of this blog.
Why get bogged down with the MVP evaluation template? The MVP is an attempt to help Africa grow, starting with a nursery-bed approach. I fear the continent will be left behind if a growth formula is not defined by 2015.
Olugbenga: The good news is that a great deal of economic growth has been happening across 17 countries of Africa, as a new book by USAID chief economist Steve Radelet (available free online) illustrates.
Two notable things about that growth:
1) None of it comes from a "growth formula". It arises from different sources in different places.
2) It does not arise from intensive village-level package interventions by outside agencies. Such interventions are bogged down in a long history of failure, as we discuss in our paper. That long history of failure places a heavy burden of proof on the Millennium Villages Project, which is why we consider evaluation of the Project's public claims to be so important.
How I wish you would understand that African nations would benefit more from rural-to-urban development programmes. There is abject poverty in the rural communities of Africa. As an adult and a postgraduate student in England, I used a telephone shower for the first time in Derbyshire in 1988. I give credit to Steve Radelet for his book. My point is that the template for measuring growth should reflect the impact on living standards; otherwise, the results would be mere statistics. Enough of grammar and statistics.