There has been a large increase in the number of rigorous impact evaluations of World Bank projects over the past decade, including increasing use of randomized experiments. But one complaint I still hear from a number of operational staff and government policymakers is that “randomized experiments take too much time”. To avoid repeating myself so often in responding to this, I thought I’d provide some responses on this point here.
Too much time compared to what?
Of course, being a good proponent of counterfactual thinking, the first question here is whether this is a statement about experiments simply taking too much time in general, or too much time compared to some alternative way of measuring impact that the speaker has in mind.
- Too much time in general: Here the complaint seems to be along the lines of “you came and talked to my team about setting up an impact evaluation two or three years ago, and we don’t have anything to show for it”, or a prospective version of the same idea. My response is that i) large government projects and Bank operations take a lot of time to get up and running – even though the project was planned two years ago, procurement, government sign-off, and implementation took this long to complete – so the problem is not the evaluation but the project; and ii) it takes time to see effects, so it is not possible to know whether a project has had its intended impact after one year if we expect the effects to take several years to materialize once the project gets underway. For the latter issue one can of course try to measure intermediate outcomes, but given uncertainty over the trajectory, timing, and sustainability of impacts, I am always very nervous about short-term results being used as a basis for long-term policy.
- Too much time compared to a prospective non-experimental evaluation: here the idea seems to be that setting the study up as a randomized experiment delays the project getting underway, and that doing an experimental evaluation takes longer than a non-experimental one. In practice this doesn’t seem like a big issue to me. It’s true that with an oversubscription design the project might in some cases have to work harder to ensure there are enough applicants, or that setting up random selection might take a few months. But compare two non-experimental alternatives: i) difference-in-differences – here one definitely needs to collect at least one baseline survey (which can be omitted in an experiment if time is very pressing), and ideally more than one round of baseline data to show parallel trends; and ii) propensity-score matching – here one needs to survey a much larger sample in order to get enough controls that look similar enough to the treatment units. Indeed, the good non-experimental prospective evaluations I’ve been involved in have taken at least as much time as the experimental ones.
- Too much time compared to a retrospective non-experimental evaluation: certainly if we start today and compare how long it will take to estimate the effect of a program that has already been implemented versus one that is about to be implemented, the former can take a lot less time. But i) one obviously needs good enough data and a decent identification strategy to be able to do the retrospective evaluation; ii) the retrospective evaluation can’t inform changes in policy before and during implementation; and iii) in several experiences I’ve had, governments are much less interested in learning about some of their past programs (which have often changed their rules, and which might be associated with a former government) than in learning about their ongoing programs. The bottom line is that these approaches are attempting to answer different questions – when a good ex-post evaluation of a program that hasn’t changed much is possible, it should certainly be strongly considered instead of, or as a precursor to, ex-ante evaluation, but this doesn’t describe many of the evaluation questions of interest.
- Too much time compared to business-as-usual quick-and-dirty evaluations: here the comparison seems to be to an alternative where the program is run, and at the end there is an IEG-type ex-post evaluation and collection of basic indicators. But i) these other methods are much less convincing in telling us about impacts and are often more about process evaluation; and ii) one should consider the evidence from DIME that impact evaluations appear to help projects deliver in a more timely fashion.