Jeff Sachs, the Millennium Villages Project, and Misconceptions about Impact Evaluation
News that another $72 million has been committed for a second stage of the Millennium Villages Project (MVP) has led to another round of critical discussion about what can be learned from this entire endeavor. The Guardian’s Poverty Matters blog and Lawrence Haddad at the Development Horizons blog offer some critiques. In response to the latter, Jeffrey Sachs and Prabhjot Singh offer a rather stunning reply, which seems worth discussing from the point of view of what is possible with impact evaluations. So let me dissect some of their statements:
“The simplistic idea, moreover, that one can randomize villages like one randomizes individuals, is extraordinarily misguided for countless reasons. The most obvious, but not even the most important, is cost. To do a controlled experiment (on a single intervention) with thousands of individuals is possible; to do a controlled experiment at the scale of 30,000-person communities is far beyond the project budget (or any budget for similar activities).”
Comments: 1) It is possible to randomize villages; there is just less power in doing so than in randomizing an intervention at the individual level. But a large transformative intervention like the MVP should presumably be aiming for a large effect – in which case it may be possible to detect it even with the 14 treatment villages the MVP already appears to have. See, for example, Jed’s post on evaluating changes in supply chain management for medical drugs in Zambia. So to make a credible argument that the village sample is too small for evaluation, we need to see some power calculations (a sketch of such a calculation follows below). 2) It is unclear what is meant by cost being the most obvious reason here. The cost of including control villages is only the cost of surveying them, which is cheap relative to the millions being spent on the program.
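To make this concrete, here is a minimal sketch of the kind of power calculation one would want to see. Every input below (households surveyed per village, intra-cluster correlation, significance level, power) is an illustrative assumption, not an actual MVP figure:

```python
# Minimum detectable effect (MDE) for a two-arm cluster-randomized design.
# All parameter values are illustrative assumptions, not MVP data.
from scipy.stats import norm

def mde_cluster_rct(k_per_arm, m, icc, alpha=0.05, power=0.80):
    """MDE in standard-deviation units for k clusters per arm,
    m individuals surveyed per cluster, intra-cluster correlation icc."""
    deff = 1 + (m - 1) * icc          # design effect from clustering
    n_eff = k_per_arm * m / deff      # effective sample size per arm
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return z * (2 / n_eff) ** 0.5

# 14 villages per arm, an assumed 300 households surveyed per village,
# and an assumed intra-cluster correlation of 0.05:
print(f"MDE: {mde_cluster_rct(14, 300, 0.05):.2f} SD")   # ~0.24 SD
```

Under these hypothetical assumptions, effects of roughly a quarter of a standard deviation would be detectable – comfortably within the range a “transformative” intervention should be expected to produce.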
“The logic is also flawed. In a single-intervention study at the individual level (e.g. for a new medicine) one can have true controls (one group gets the medicine, the other gets a placebo or some other medicine). With communities, there are no true controls. Life changes everywhere, in the MVs and outside of them.”
Comment: This is a baffling statement. The whole reason for having controls is that life changes everywhere – if it didn’t, before-after analysis would be just fine. The purpose of having similar control communities is precisely to control for all the other things going on in these countries that could be causing changes in the Millennium Villages regardless of the impacts of the MVP. The work by Clemens and Demombynes critiquing the earliest claims of the MVP’s impacts clearly showed some of the massive changes occurring across Africa in indicators such as cellphone ownership – changes that render before-after analysis misleading.
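A toy difference-in-differences calculation (all numbers invented) makes the point: when a secular trend is lifting an indicator everywhere, a before-after comparison attributes the entire trend to the project, while a control group nets it out:

```python
# Invented cellphone-ownership shares, before and after the intervention.
mv_before, mv_after = 0.10, 0.55        # Millennium Village
ctrl_before, ctrl_after = 0.12, 0.48    # comparison village

before_after = mv_after - mv_before     # 0.45: mostly the Africa-wide trend
diff_in_diff = (mv_after - mv_before) - (ctrl_after - ctrl_before)  # 0.09

print(f"before-after: {before_after:.2f}, diff-in-diff: {diff_in_diff:.2f}")
```

The naive before-after estimate (0.45) is five times the change actually attributable to the project under these invented numbers (0.09).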
“A third reason is even more important. Introducing community-based capital involves extensive local participation, design, and learning by doing. This includes the methods that communities improve over time with us to measure their own progress, leading to a sustainable monitoring and evaluation strategy that is part of the data-driven local management. There is no simple MV Project blueprint, though there is an overarching strategy. The logic of simple randomized trials does not apply in a context of design with extensive learning by doing, where the main goal is to develop new tools and systems that are replicable and scalable and used by the community itself”
Comment: This is a common misconception. Dean Karlan has a nice paper in the Journal of Development Effectiveness (ungated version here) which argues that randomized trials can be used to evaluate complex and dynamic processes, not just simple and static interventions. There is nothing fundamental to randomized trials (or to non-experimental methods of impact evaluation with control groups) that prevents analysis of designs that involve extensive learning by doing.
“The people in the Millennium Villages don’t toil to win intellectual points. They are creating improved systems of service delivery within their communities that will have a lasting impact. Those systems can be rigorously documented, captured in ICT tools, and rendered replicable and scalable.”
Comment: This confuses process evaluation with impact evaluation. Documentation and measurement of what is actually being done in the Millennium Villages is a crucial part of process evaluation, and no one is arguing that this shouldn’t be done. But it still tells us nothing about the impact of delivering those services – we might learn (I’m making up numbers here) that 50,000 malaria nets were given out, that 20,000 of them were used, and that only 100 cases of malaria were observed in the 14 villages last year, but this tells us nothing about impact without knowing what would have happened in the absence of the intervention.
“There has been much naïve talk about paired “comparison” villages. The Millennium Villages Project actually has them, though we introduced them in year 3 rather than year 1, because in year 1 the considerable work required to create a foundation of community-driven strategies in the context of a very complex project took precedent. We knew from the start that there would be many complexities in comparison sites and we began to introduce them only when the project was functioning in all sites. For anyone who has taken the time to understand the difference in pace of initiation, organizational culture and preexisting capacity between the varied settings of the Millennium Villages will know that a “Year 1” comparison would be meaningless.”
Comment: There seem to be two parts to this argument: 1) that they were too busy setting up the project to set up control sites – but this is precisely the time to think about control sites: as one narrows down the list of feasible locations, it should be relatively easy to choose comparison locations at the same time; and 2) that there were “many complexities” in the comparison sites – I have no idea what this means.
“They also don’t understand the deep limitations of the particular analytical tool of comparison sites. Yes, comparison sites are being monitored and used in the monitoring and evaluation, but they should not be overrated. They will add surprisingly little true insight into the project and its achievements. Here’s why. We already have a natural comparison, and that is what is happening to the MDGs in the district and country as a whole compared with the MVs. Spending a great deal of time and personnel on one particular “comparison site” is misplaced concreteness. The district and national data are basically free, collectible, and a good standard of comparison, while any other single comparison site is somewhat arbitrary and a noisy comparison. The comparison village is definitely not a “control” village in the sense of a real, unchanged control, nor could it be…No place is standing still to be a control site … If the comparison site happens to get a new road or an extension of the power grid, this gives an artificial “small sample” error to the comparison with the MV.”
Comments: This completely ignores the deliberate selection of the MV villages, which means they are not directly comparable to other villages in the district or the country as a whole. It also limits assessment of impacts to the rather crude set of indicators collected in national data, whereas the MVP aims to do a whole lot of transformation within communities whose effects one would surely like to see. The final points repeat the misconception that a control village needs to stand still – it does not – and raise the issue of small samples, which is exactly what power calculations can tell us about.
“Moreover, many of the key lessons derived from the MVs are already being taken on board in neighboring villages, and even at national scale. The community health worker programs, e-health, and community-based malaria control are good examples of rapid diffusion from the MVs. When the comparison villages make progress, some of that progress is a spillover from the MVP itself.”
Comments: This seems to be the most valid concern expressed in the article, and spillovers to other communities are a common issue to think about in impact evaluation. In principle one could try to measure these spillovers – there is no reason to have only one control village per treatment village, and one might be able to measure spillovers on nearby villages while still using similar, but not quite so near, villages as controls (see the sketch below). This would be a more complicated design and would have further implications for power. Nevertheless, the potential for spillovers is a reason to design an evaluation that measures them to the extent possible, not to abandon impact evaluation altogether.
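One hedged sketch of what such a design could look like – all village names and pairings here are invented placeholders, not actual MVP sites: each treatment village is matched to a nearby village, where spillovers are expected and can be measured, and to a similar but more distant village that serves as the cleaner counterfactual:

```python
# Three-arm structure for measuring spillovers rather than being
# contaminated by them. All names are hypothetical placeholders.
arms = {
    f"mv_{i:02d}": {
        "spillover_arm": f"near_{i:02d}",  # close enough to receive diffusion
        "control_arm": f"far_{i:02d}",     # similar, but beyond plausible spillover range
    }
    for i in range(1, 15)                  # the 14 MVP treatment villages
}

# Direct effect:    outcomes(mv)   vs. outcomes(far)
# Spillover effect: outcomes(near) vs. outcomes(far)
for mv, comparisons in arms.items():
    print(mv, comparisons)
```

The cost of this design is that power calculations must now cover two contrasts rather than one, which is exactly why the spillover question should be thought through before the evaluation starts.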
…“The progress towards achievement of the MDGs, within the MVs and by example beyond the Millennium Villages, is the true measure of success.”
Comments: Surely the measure of success has to be how much the MVP contributes to progress towards achieving the Millennium Development Goals, not whether these goals are ultimately achieved – they could be achieved, or missed, for reasons completely beyond the control of the project.
Now there are valid reasons to debate which methodology is best for evaluating the impact of the MVP, and serious discussion of the considerations that should factor into this decision seems worthwhile. But fallacious statements such as those made by Sachs and Singh neither further the debate nor encourage others considering large-scale interventions to seriously invest in rigorous impact evaluation.
Finally, one must also question what evidence donors like the Soros Foundation and the UN relied on when deciding to fund this second phase of the MVP. Either donors are happy to fund such a program based on factors other than empirical evidence, or arguments like those above are misleading decision-makers.
Just after posting this I see Michael Clemens and Gabriel Demombynes have again posted on the MVP’s need for evaluation on the Guardian’s Poverty Matters blog. The main new piece is that they dissect Sachs’s claim that impact is being shown by the number of peer-reviewed publications by the MVP team – they note that most of these are not measures of impact, and that those that are measure impact narrowly and rely on either before-after or with/without comparisons that do not control for selection:
http://www.guardian.co.uk/global-development/poverty-matters/2011/oct/1…
I fully agree, Michael. I have some comments on my blog at: http://www.blog.ellerman.org/2011/10/impact-evaluations-and-sachs-mille…
Thanks, David, for this thoughtful dissection. Prof. Justin Wolfers of U.Penn. calls the same Sachs/Singh post "indefensible", so you are not alone.
I wanted to highlight two additional facts that corroborate your good points:
First, the idea that the project was initially somehow too busy to have comparison villages is demonstrably untrue. As of 2007, years into the intervention at some sites, the project was still publicly saying that it did not have any comparison villages for "ethical" reasons. For example, that statement appears here:
http://www.pnas.org/content/104/43/16775.full
Sachs also said, years into the project, that having any comparison sites at all was "ethically not possible", here (p. 20 of transcript):
http://www.cgdev.org/content/article/detail/6660/
So it's clear why the project had no comparison sites. Sometime around 2008 they must have changed their minds, because now there are comparison villages; the unethical became ethical. They don't seem to like to talk about that now, which may be why their current explanation for having no baseline data on comparison sites sounds so bizarre.
Of course it's untrue that comparison villages are unethical; comparisons of this sort are critical to giving a basis for claims about a project's impact. Ethical scale-up of a huge intervention across the entire African continent requires careful establishment of its impacts before scale-up, just as the release of a new drug for U.S. consumers requires careful establishment of its effects before widespread distribution.
Second, in the published version of our paper (not in the working paper version), we execute some detailed power calculations:
http://www.tandfonline.com/doi/abs/10.1080/19439342.2011.587017?journal…
The result is that, as you suggest, 15-20 villages would very likely be sufficient to reliably detect the large impacts the project claims for itself. Since the goal of the project is to create thousands of such villages across the entire continent, carefully studying the first 15-20 is relatively easy and necessary. As we document, it would be cheap and easy to do such an evaluation. All that's missing is the willingness to learn what the true impacts of the project are.
David Ellerman: I agree that the question you highlight is important. There are two different research questions. Each is worth asking.
First, is the Millennium Villages intervention better than nothing at all? Similar projects in the past were documented by Uma Lele, Hans Binswanger, and others at the World Bank to have *no lasting impact whatsoever*, as we document in our paper. That is, those similar Integrated Rural Development interventions were not even better than nothing. That's important to know, and it is therefore a valid research question.
Second, a different question is whether the Millennium Villages intervention is better than the best available alternative use of the money. This is the question you raise and it is indeed critical. It is related to the first question: if it’s not better than *nothing*, then by definition it’s not better than any other alternative use of the money that does no harm. But it’s a separate question, and also important to know, if we are interested in making the best use of scarce resources.
Those of us who have worked in resource-poor 'integrated rural development' programs in Africa and Asia have known for over 25 years that by spending a lot of money, if at all reasonably used, measurable improvements in crop yields, health, and educational outcomes can be achieved at the population level. International and local NGOs have based their work throughout this time on principles very similar to the MVP’s, as well as promoting many of the same interventions.
The no-treatment pseudo-counterfactual, as Ellerman observes, does not bring us closer to selecting the programs and interventions most likely to stimulate positive change with limited resources.
This discussion nicely avoids a much more basic problem in real-existing impact evaluations, namely that the comparisons are to no-treatment cases rather than to best alternatives using comparable resources. If the MVs, where $XXXX is spent per person, passed all the impact evaluation tests compared to no-treatment comparison villages, that would still be of little significance, since it uses the no-treatment pseudo-counterfactual.
The real question is the best alternative use of scarce development aid, and that is best investigated by sponsoring large-scale parallel experimentation with different approaches and then comparing or benchmarking between them. Without parallel experiments where comparable resources were spent per person, even successful impact evaluations of the MVs would only clear the ultimate low hurdle of showing that spending a lot of money is better than doing nothing. Gee, I wonder if that ultimate-low-hurdle aspect of impact evaluations has anything to do with their popularity with World Bank project managers?
I really am at this point baffled by the MVP strategy, which has been presented all along as providing a proof of concept so that the model can be replicated and scaled up.
At this point there is so much criticism of the project and the way it is being documented and carried out that I can't believe even the people at MVP can bring themselves to believe that scale-up and replication will happen at the conclusion of their efforts.
Thus, at this point, the probability that the MVP will fail – not as judged by achieving MDGs or causality, but in terms of persuasion – seems already to be approaching its asymptote. Even with the best data and persuasive approaches it would have been difficult to convince funders, national or international, to commit the required capital to replicate and scale in any meaningful way. Now, having avoided public debate and substantive engagement with critics, it seems impossible. The MVP has done more than enough to convince decision makers that it can be ignored or placated when it comes knocking at some point in the future.
Thanks for the great post, David.
A few points:
1) When I debated the MVP evaluation with John McArthur at Oxford last spring, I argued that Mexico’s PROGRESA program is a perfect example of the value of rigorous evaluation. The PROGRESA evaluation provided convincing evidence of the program’s success, which generated political support and funding for the project (as well as informing modifications to the program in later iterations). The Sachs-Singh post suggests that Mexico’s PROGRESA evaluation is not a good model for an MVP evaluation because randomizing over villages (as opposed to households) is impossible. Not only is this incorrect as a general matter (as you point out); PROGRESA’s evaluation itself randomized over villages. This is described in any of the many papers written about PROGRESA, and we made this clear in an earlier exchange with the MVP:
http://blogs.worldbank.org/africacan/evaluating-the-millennium-villages…
2) We briefly discuss the spillover issue in the paper. While spillover effects in neighboring areas could have occurred, we think it is implausible that changes in indicators at the national level substantially reflect spillover effects of the MVP. In any case, as you say, a good research design can address spillover effects, particularly if the issue is thought through in advance.
3) The Sachs-Singh post makes the argument “In a single-intervention study at the individual level (e.g. for a new medicine) one can have true controls (one group gets the medicine, the other gets a placebo or some other medicine). With communities, there are no true controls.”
I believe the MVP’s misunderstanding on this point comes from the way medical trials are typically carried out. In a medical trial, each group is given a strictly defined treatment, and an attempt is made to limit each group to only its particular treatment. For example, in a cancer study, one group might get chemotherapy, another radiation therapy, and another a placebo. Individuals (I believe) are told explicitly not to receive any treatment apart from that defined for their respective groups. In the cancer example, they would be instructed not to seek out other treatments such as other medication, surgery, homeopathic treatment, etc. during the course of the trial. This is done to limit the variation in outcomes that is not related to the treatments being studied and thus improve the precision of the estimated treatment effects.
In a social science randomized controlled trial, it is impossible, and would probably be unethical, to exercise this degree of control over the experiences of participating individuals. In such a study, we expect that there will be changes in outcomes that have nothing to do with the study’s treatments, and, as you say, that’s exactly why we need a control group.
My guess is that the MVP evaluation team has this misunderstanding because they come to the question from a background in medical rather than social science trials.
The medical analogy to a test of the MVP concept is not the randomized controlled trial of a drug, but rather a test of the performance of a new medical school or a new HMO. Each village would be diagnosed and a treatment proposed according to the situation it presented. So too, there would be many outcome measures, with some a priori difficulty in determining their relative importance. In such a situation, one would need a very large number of villages/patients in the sample to expect statistically valid results. The cost of "treating" a very large number of villages would indeed be very large.
Now think about the ethics. First, in a randomized drug trial one would prohibit the controls from taking the drug being tested, but in the MVP how could one ethically prevent the control villages from adopting some or all of the interventions being used in the subject villages? I don't think it would be ethical to tell the government it could not introduce, in other villages, interventions it found appropriate in the subject villages.
A more profound ethical question arises when in the control villages the data takers begin to observe children dying who would be saved in the MVP villages, families going hungry who would be fed in the MVP villages, kids uneducated in the control villages who would be educated in the MVP villages. It is one thing to not intervene if you have no direct knowledge of human needs and quite a different thing not to intervene if you are observing them and have the means to help.
And who said that there cannot be two complementary reasons for a decision?
For a response from Jeffrey Sachs, Paul Pronyk and Prabhjot Singh of the Millennium Villages Project, please see: http://2mp.tw/6k