For many of us conducting RCTs to evaluate pilots of interventions, an exciting stage is when we are invited by government officials to present the results. After we have presented the main impacts of the intervention, government officials usually ask: "This program is great and I am interested in implementing it at the national level. How can I do that? What conditions need to be met to implement the program at scale and make sure that we can still get similar impacts?" These are fair questions, as there is already evidence showing that what works in controlled experimental settings does not always hold in weak public institutions.
One way to address those questions is to embed implementation research into our impact evaluations. To learn more about implementation research, I interviewed Emmanuel Adebayo, implementation research expert and lead of the Adolescent Learning, Action, and Benchmarking initiative (AdLAB), a collaboration between the Development Research Group and the Global Financing Facility.
I would like to start by asking: what is implementation research, and why does it matter for informing the scale-up of a pilot?
Implementation research (IR) studies how interventions are actually delivered in real-world systems. It looks at how implementation needs to be adapted across different settings, and what makes the difference between an intervention that can be sustained at scale and one that cannot. Impact evaluations (IE) ask whether a program worked. IR asks something else entirely: how did it work, why, under what conditions, and what would it take to replicate it somewhere else? These tend to be treated as secondary questions, but they are closely related to what economists call understanding “mechanisms,” that is, the pathways through which an intervention produces its effects. The difference is that IR also asks whether the intervention is actually feasible to deliver and sustain, not just in the pilot but when you try to scale it.
In practice, this means paying attention to the operational realities that shape whether programs succeed once they leave controlled settings. The goal is not just to produce evidence, but to produce evidence that real systems can use. For example, implementation research could assess whether, after bed nets have been distributed, a mass media campaign about their importance increases use, comparing usage before and after the campaign. Or it could assess whether electronic medical records reduce the time needed to fill out administrative forms.
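To make this concrete, here is a purely illustrative sketch in Python of the kind of before/after comparison described above; the households, usage rates, and sample size are invented, not drawn from any actual study.

```python
# Illustrative sketch only: hypothetical data for a before/after comparison
# of bed net use around a mass media campaign. Not from any actual study.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical: 1 = household observed using the bed net, 0 = not,
# for the same 500 households before and after the campaign.
use_before = rng.binomial(1, 0.55, size=500)
use_after = rng.binomial(1, 0.63, size=500)

# Simple pre/post comparison of usage rates, paired by household.
diff = use_after.mean() - use_before.mean()
t_stat, p_value = stats.ttest_rel(use_after, use_before)

print(f"Use before campaign: {use_before.mean():.2%}")
print(f"Use after campaign:  {use_after.mean():.2%}")
print(f"Change: {diff:+.2%} (paired t-test p = {p_value:.3f})")
```

A pre/post comparison like this is descriptive rather than causal, which is the point: it answers an implementation question about delivery and uptake, sitting alongside, not replacing, the impact evaluation.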
How do you think IR complements impact evaluation?
I strongly believe that impact evaluation, particularly RCTs, is one of the most important methodological developments. But there is a structural gap that I believe IR fills. A well-designed RCT achieves internal validity by controlling for context. However, context is also exactly what determines whether an intervention can be implemented, adapted, and sustained in real-world systems. When you scale from a pilot to a national rollout, or move an intervention from one country to another, the controlled conditions are gone. What you are left with is an average treatment effect, which tells you very little about what to do next. It does not on its own explain how to replicate or scale an intervention across settings. For example, a number of high-quality RCTs on Community Health Worker programs in Sub-Saharan Africa show modest or mixed effects. While some people may read this and conclude that the model does not work, studies that included IR strategies tell more nuanced stories about variations in implementation quality. IR is what surfaces these variations and gives you something to fix.
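To see what this can look like in the data, here is a hedged, simulated sketch in Python; every number in it is invented for illustration. It shows how a pooled average treatment effect can mask the difference between high- and low-fidelity delivery sites.

```python
# Illustrative simulation only: invented numbers showing how a pooled
# average treatment effect (ATE) can hide variation in implementation quality.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 2000

treated = rng.binomial(1, 0.5, size=n)
# Hypothetical implementation moderator: half the sites deliver with high fidelity.
high_fidelity = rng.binomial(1, 0.5, size=n)

# Assumed effects: +0.40 SD under high-fidelity delivery, +0.05 SD under low.
effect = np.where(high_fidelity == 1, 0.40, 0.05)
outcome = treated * effect + rng.normal(0, 1, size=n)

df = pd.DataFrame({"treated": treated, "high_fidelity": high_fidelity, "y": outcome})

ate = df.loc[df.treated == 1, "y"].mean() - df.loc[df.treated == 0, "y"].mean()
print(f"Pooled ATE: {ate:.2f} SD")  # looks 'modest or mixed'

for fid, grp in df.groupby("high_fidelity"):
    sub_ate = grp.loc[grp.treated == 1, "y"].mean() - grp.loc[grp.treated == 0, "y"].mean()
    label = "high-fidelity sites" if fid == 1 else "low-fidelity sites"
    print(f"{label}: ATE = {sub_ate:.2f} SD")
```

The pooled estimate alone would suggest a weak program; the fidelity breakdown, which is only possible if implementation quality was measured, points to a delivery problem instead.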
Based on what you are saying, many economists may say that they are already doing this in their research.
That is partially true. In reality, many impact evaluations in economics incorporate implementation research elements through a mixed-methods approach, providing qualitative information on the context and assessing why and how the results were obtained, whether positive, negative, or null.
However, I have observed two things. First, it is not done systematically or in a structured way. IR has its own frameworks, its own standards of evidence, and its own analytical rigor. IR should be systematic, structured, and auditable. You typically use purposive sampling to make sure you are talking to the right people, apply structured analytical frameworks, and follow saturation principles to know when you have collected enough data. Several organizations, such as the International Initiative for Impact Evaluation (3ie) and the WHO Alliance for Health Policy and Systems Research, have published detailed guidance on how to combine causal and implementation questions within a single study.
Second, and probably as a consequence of not following a framework, the research elements incorporated by economists often miss opportunities that could provide a deeper understanding of how context affects results, such as more measures of fidelity and compliance, greater assessment of assumptions, cost and cost-effectiveness analysis, service delivery and systems mapping, and more indicators along the causal chain.
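As one small example of what filling that gap can look like, a back-of-the-envelope cost-effectiveness calculation needs little more than the delivery cost and the impact estimate. All figures below are hypothetical.

```python
# Hypothetical back-of-the-envelope cost-effectiveness sketch; every figure is invented.

total_program_cost_usd = 250_000   # assumed full cost of delivering the pilot
children_reached = 10_000          # assumed number of beneficiaries
impact_attendance = 0.08           # assumed impact: +8 percentage points attendance

cost_per_child = total_program_cost_usd / children_reached
# Cost per additional child attending: cost per child divided by the impact.
cost_per_additional_attender = cost_per_child / impact_attendance

print(f"Cost per child reached: ${cost_per_child:.2f}")
print(f"Cost per additional child attending: ${cost_per_additional_attender:.2f}")
```

A rough figure like this speaks directly to the scale-up question government officials raise when they ask how to implement a program nationally.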
So what are the minimum steps or ingredients for IR? In other words, if I am designing an impact evaluation and want to include IR, what should I consider?
Although the design of IR should follow the implementation question you are trying to answer, there are a few core ingredients that I think matter almost every time. The starting point is the need to explicitly define the implementation questions. Then, design a process evaluation to measure implementation moderators, such as delivery, reach, feasibility, monitoring, feedback, adaptation, and relevance to providers and beneficiaries. You also need a theory of change about the behavior of individuals and organizations implementing the intervention (acceptability), as well as measurement and reporting of adherence to content and dose (fidelity), and standardized reporting of the intervention, including costing, sufficient for replication. These elements should be part of the original study design, not added later. In fact, they should be included in your pre-analysis plan.
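As a minimal sketch of what pre-specifying adherence to content and dose might look like, here is a hypothetical fidelity score; the session counts, content checklist, and the 0.8 cutoff are invented for illustration rather than taken from any IR framework.

```python
# Minimal illustrative sketch of a pre-specified fidelity (adherence) measure.
# Session counts, content checklist, and the cutoff are all hypothetical.
from dataclasses import dataclass

@dataclass
class SiteDelivery:
    site: str
    sessions_delivered: int      # dose actually delivered
    sessions_planned: int        # dose specified in the protocol
    content_items_covered: int   # checklist items observed during delivery
    content_items_required: int  # checklist items in the protocol

def fidelity_score(d: SiteDelivery) -> float:
    """Average of dose adherence and content adherence, each capped at 1."""
    dose = min(d.sessions_delivered / d.sessions_planned, 1.0)
    content = min(d.content_items_covered / d.content_items_required, 1.0)
    return (dose + content) / 2

sites = [
    SiteDelivery("Site A", sessions_delivered=12, sessions_planned=12,
                 content_items_covered=18, content_items_required=20),
    SiteDelivery("Site B", sessions_delivered=7, sessions_planned=12,
                 content_items_covered=11, content_items_required=20),
]

for s in sites:
    score = fidelity_score(s)
    flag = "high" if score >= 0.8 else "low"  # hypothetical pre-specified cutoff
    print(f"{s.site}: fidelity = {score:.2f} ({flag})")
```

Committing to a definition like this in the pre-analysis plan is what makes later heterogeneity analysis by implementation quality credible rather than post hoc.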
IR also works best when embedded throughout implementation, with feedback loops that allow learning during delivery rather than only after the program ends. If you are designing an IE, the key shift is to study, alongside outcomes, how the intervention moves through real systems, using IR frameworks and mixed-methods approaches.
I would like you to talk more about the timing. Often, we economists use some elements of IR to understand what happened when the impacts of the intervention were not in the expected direction. Is that a good practice?
I think IR is a useful tool across the entire project cycle, depending on the question you need to answer. Used prospectively, IR can strengthen both program design and the evaluation itself; this, in my opinion, is its most valuable contribution. For example, a conditional cash transfer (CCT) program can significantly reduce school dropout rates and increase school attendance. However, it could also lead to increased class sizes and more heterogeneous classrooms, placing additional strain on an already weak system. These second-order effects are not always captured in standard outcome measures. Anticipating them requires attention to implementation dynamics from the outset, and IR frameworks have been shown to help identify these barriers and facilitators early on.
That does not mean we should not do IR ex post. Ex post IR is valuable for understanding why programs produce uneven results, why implementation differed across settings, or why promising interventions struggled when scaled. My point is not that IR should happen only before or during implementation, but that limiting it to a “post-mortem” exercise misses one of its greatest strengths: its ability to identify barriers, facilitators, and system constraints early enough to improve both program delivery and the evaluation itself.
When is IR not the right tool?
IR is not the right tool when the primary question is simply whether an intervention works and there is not yet credible evidence of effectiveness. I believe that IR is most useful once there is an intervention that is being implemented or being considered for implementation and the key questions relate to delivery, adaptation, feasibility, scale-up, sustainability, or why results vary across contexts. In other words, IR does not replace the need to establish causal effects; it is most useful once effectiveness has been established.
You work on the AdLAB initiative. What does embedded IR actually look like in practice?
The AdLAB runs IR studies embedded alongside program delivery across multiple countries. We use IR methods to generate continuous feedback, identifying barriers to service delivery, testing adaptations in real time, and feeding evidence directly into operational decisions. In some contexts, this involves understanding how health systems deliver adolescent-responsive services and what is sustainable and scalable within existing operations. In other contexts, we focus on integrating new interventions into routine systems rather than parallel structures. The common thread is that evidence is generated during implementation and used by decision-makers as programs evolve.
What should impact evaluation researchers take away from this?
I am not asking everyone to become implementation researchers. I am asking them to think of IR as a complementary methodology and an investment in the durability of their work. Adding a research approach guided by IR frameworks can make a significant difference in formative research as much as in an impact evaluation. I do not think we need to choose between causal inference and implementation research. I think we need both, designed together, to complement one another. Ultimately, the goal is not only to know what works, but to ensure that what works can actually be sustained and delivered at scale.