
Keeping it “real” in real-time evaluations: Guest post by Florence Kondylis and Maria Jones

Carrying out evaluations to affect policy is the big motivation of many development economists. Usually, grant proposals and such will ask researchers to document “How will your results affect policy?”. In this post, we address a corollary of that problem statement: “when and how should your results affect policy?”. All the work that goes into the evaluation design at the start drums up a lot of enthusiasm among policymakers, and may open windows of opportunity for policy influence long before the final results from the evaluation are available.
In the past few years, we have devoted a lot of time and effort to running agricultural RCTs in many countries across the world. That’s worth mentioning, because the seasonal nature of agricultural interventions adds a layer of complexity in linking IE results to implementers’ actions on the ground. Take a simple example: country X has two main seasons a year, season I and season II. You launched an intervention ahead of season I planting time that you expect will affect farmers’ input use throughout the year. Your implementing partner really wants your results to feed back into their planning for the next season I intervention. However, you are unlikely to have collected all the survey data needed to document the impact of the intervention over both seasons, much less fully analyzed that data, before the implementing partner starts planning for the next season I. What to do? Should you tell the implementing partner not to use this year’s trial to inform their intervention design until the following year? Or provide them with preliminary (and likely incomplete) results? Obviously, there is no right and wrong in this space, so we will try not to be (too) normative. Because DIME works mostly on programs with partners at scale, we will not speak to the issue of moving from efficacy to effectiveness; excellent examples of this are provided by our colleagues at DIV and their grantees (http://www.usaid.gov/div/portfolio).

Two examples from our work
Two examples stem from our own experience with two RCTs: a savings innovation trial and a feedback tool for farmers. In the case of the savings experiment, we worked with local banks to promote new products targeted towards savings for agricultural inputs. The process of designing the savings products, the enthusiasm from the local banks, and initial evidence of appeal to farmers (high attendance at trainings and significant take-up) generated a lot of interest from other actors in the rural finance sector. We were approached for advice by an organization with a mandate to train local banks and launch new financial products. They were launching a large intervention, right away, and our ‘results’ on take-up and usage were sufficient for them to judge that the products we were testing would be ‘good enough’ to scale up. As the costs of them scaling up the new products were small relative to their status quo intervention, this was an uncontroversial choice. But what if our intervention turned out not to be welfare-enhancing, or actually ‘hurt’ farmers in some way? We did not have data on agricultural production or income / welfare at that time and were not prepared to make a judgment call so early. Had we known of the demand at the design of the intervention, we could perhaps have adjusted data collection timelines.

In the case of the farmer feedback tools, we worked with a large NGO to increase attendance in agricultural extension trainings and take-up of their service (the service consists of agricultural inputs + training paid on credit). We tested various types of tools (scorecards, logbooks, and a hotline) to collect a mix of quantitative ratings and qualitative information (comments/complaints). A follow-up household survey was used to check attendance in and knowledge gains from the extension training, and we used the NGO’s administrative data to monitor take-up of their service. Our partner decided to scale up the use of the hotline, because of positive reports from the field that farmers “liked it” and because it helped them to uncover serious operational issues, much more so than any of the other treatments we had tested. These were not the intended metrics from the experiment, but scaling up the hotline improved their operations, and the costs of switching to another feedback tool as more results come in are small.

How should we think about scaling-up?
We argue that evaluation teams should put more thought into scale-up opportunities – both within and beyond the scope of the specific project – at the experimental design stage. Just as power calculations are central to deciding how many clusters/units should be assigned to each treatment arm and what sample size is required to detect an acceptable effect size, we should decide at design what metrics will be used to inform scale-up and when/how these decisions will be made. This can be built into the typical IE roadmap, with just a little re-tooling. We propose the idea of sufficient and necessary scale-up conditions:
 
  1. Necessary: set a clear timeline of what will be available / needed and when
    1. Input from researchers: If you are evaluating a demand-driven project, when do you expect to get take-up data (e.g., data from training attendance lists / registration drive)? When do you expect to measure intermediary outcomes (e.g., input purchase and application)? Final outcomes? Are you performing some monitoring exercise along the way? Is your survey work split into multiple rounds (e.g., at planting, and then at harvest)?
    2. Input from implementing partner: How does that timeline relate to the operational schedule? Usually, operations will require some heads up to amend the design of their interventions. When is the latest they could wait to make a decision on what to field in their next cycle?
    3. Final output: A clear timeline where researchers’ and operations’ schedules intersect, forming your universe of potential “decision points”. This is an opportunity to align these schedules as much as possible. For instance, if you are using a long survey instrument, can you split it into two visits across the year to have preliminary results?
  2. Sufficient: agree on a few markers the team is comfortable using to inform future design; this is where there may be more debate between research and implementation.
    1. Input from researchers: You are using a series of markers to test a series of assumptions along your causal chain, e.g., which intervention was most successful in securing high take up / attendance? What treatment arm elicited highest use? Overlapping these sequential insights with your partner’s operational timeline will allow you to suggest a menu of “decision points” to select from.
    2. Input from implementing partner: What do they care the most about? Is it purely take-up, or is it welfare gains? This is a little simplistic perhaps, but we find that this is a good starting point to devise a list of markers ranked in order of importance.
    3. Final output: First off, this allows for a clear vision of when, at earliest, our partner’s key questions can be answered. What’s the added value of having this discussion? It allows for the researchers to, sometimes, improve the partner’s decision at the margin. For instance, one of our partners told us: “I am happy as long as everything goes up.” That’s nice—but what if his “up” is someone else’s “down”? We had a good conversation on this, and ended up agreeing that some welfare measure would be useful in improving the quality of his decision.
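For readers who like to see this laid out mechanically, here is a minimal sketch of how the two timelines could be overlapped to list candidate “decision points”. Everything in it is an assumption for illustration: the marker names, the dates, the priority ranks, and the helper names (`Marker`, `Deadline`, `decision_points`) are hypothetical and do not come from either of our trials.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Marker:
    """A result the research team expects to deliver (hypothetical values)."""
    name: str
    available_on: date  # date the researchers expect this result to be ready
    priority: int       # partner's ranking; 1 = what they care about most


@dataclass(frozen=True)
class Deadline:
    """The latest date the partner can decide what to field in a given cycle."""
    cycle: str
    decide_by: date


def decision_points(markers, deadlines):
    """Map each planning cycle to the markers ready by its deadline,
    listed in the partner's order of importance."""
    points = {}
    for d in sorted(deadlines, key=lambda d: d.decide_by):
        ready = [m for m in markers if m.available_on <= d.decide_by]
        points[d.cycle] = [m.name for m in sorted(ready, key=lambda m: m.priority)]
    return points


# Illustrative inputs only; the dates are made up.
markers = [
    Marker("take-up", date(2014, 3, 1), priority=2),
    Marker("input use", date(2014, 7, 1), priority=3),
    Marker("welfare", date(2015, 2, 1), priority=1),
]
deadlines = [
    Deadline("next season I", date(2014, 9, 1)),
    Deadline("following season I", date(2015, 9, 1)),
]

points = decision_points(markers, deadlines)
for cycle, names in points.items():
    print(cycle, "->", names)
```

In this made-up case, the partner's top-ranked marker (welfare) is not ready by the first deadline, which is exactly the kind of gap the discussion with the partner is meant to surface early.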
Finally, it is important to ensure that implementing partners understand that scaling up your treatment(s) within your experimental sample before you have collected all the data required to perform the IE is problematic and would undermine all the work done! If treatment effects are expected to vary with length of exposure, following treatment and control after scale up has taken place could however be interesting…
Does anyone else have lessons or thoughts to share from the scale-up twilight zone?

Florence Kondylis and Maria Jones are both with the World Bank’s DIME team.
 

Comments

Submitted by Heather on

Thanks much for this post; it's an important contribution. On the 'necessary set,' you have researcher and implementing agency inputs. But what other stakeholders might fall in the ideal set-up? For one, the implementing agency may not be the decision-maker about scaling. For two, the example you give is about other organizations and banks. How might you have better understood demand and timelines for evidence from other such stakeholders during the formative and design phase?

Also, I think your post raises the importance of having a good monitoring plan and system in place.

Suvojit and I have a post over here (http://blogs.worldbank.org/publicsphere/relevant-reasons-decision-making) that might provide food for thought about the decision-points you discuss for scale-up. Always happy to hear your thoughts on it!

Thanks again for a good post.
