Published on Development Impact

On experimental evaluations of systems interventions


A quick look at the burgeoning literature on policy evaluations will reveal a preponderance of evaluations of demand-side schemes such as conditional cash transfers. There is an obvious reason for this beyond the promise that such interventions hold: the technology of treatment allows for large-sample randomized evaluations, at either the household or the community/village level. As long as financing is sufficient to sample an adequate number of study units, study power will not be a concern.

Impact evaluations of systems-level interventions are scarcer (for two examples in the health sector, see here and here), and not because there is little interest in conducting such studies. An obvious stumbling block is the aggregate nature of such interventions. For example, there are very few study units available to evaluate health systems reform if the reform involves processes or management at the highest levels, such as provincial governments or the central ministry.

I was recently part of a team working in Zambia on interventions to improve population access to essential medicines. There are several ways sick people can access medicine, but in Zambia by far the most prevalent was through the local publicly managed primary health care clinics, where drugs were nominally free. There was one drawback with this channel, though: the drugs were often out of stock.

Typically, drug stock-outs are not a procurement problem at the national level; usually there are sufficient quantities of drugs sitting in the central store rooms in the capital. However, it is an infrastructural and logistics management challenge to get these drugs to the front-line clinics that need them. Often the drugs never arrive. In our baseline data we found stock-out rates of 40% for potentially life-saving pediatric anti-malaria drugs.

A team of health specialists and supply-chain experts designed an alternative supply-chain management system and wanted a credible evaluation of performance. That’s when I joined the team. The problem from the evaluation perspective, as you might expect, is that efficient supply chains often require effective management at a relatively high level of the system. In this case the proposed intervention was effectively a district-level intervention involving a new position of supply-chain specialist based in the district, as well as new tracking and ordering systems, etc.

The pilot project had enough resources to test the new approach in 16 districts. Even though the level of observation would be the stock-out rate at the clinic (and there are typically 15-20 clinics per district) the presence of district-wide management would surely create a high level of observational dependence among clinics within the same district.
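To make the dependence problem concrete, here is a rough sketch of the standard design-effect calculation for clustered data. The intra-cluster correlation (ICC) values below are purely illustrative assumptions; the post does not report the ICC for this study, and the function names are hypothetical.

```python
def design_effect(cluster_size: float, icc: float) -> float:
    """Variance inflation from clustering: 1 + (m - 1) * ICC."""
    return 1.0 + (cluster_size - 1.0) * icc


def effective_sample_size(n_clusters: int, cluster_size: float, icc: float) -> float:
    """Number of independent observations the clustered sample is 'worth'."""
    return n_clusters * cluster_size / design_effect(cluster_size, icc)


if __name__ == "__main__":
    # 16 districts with roughly 17 clinics each (the post says 15-20).
    # The ICC values are assumptions chosen only to show the range.
    for icc in (0.0, 0.1, 0.3, 1.0):
        ess = effective_sample_size(16, 17, icc)
        print(f"ICC={icc:.1f}: effective sample size ~ {ess:.0f} clinics")
```

At an ICC of zero, the clinics behave like independent observations; as the ICC approaches one, the sample collapses toward the number of districts, which is why the district count is the binding constraint here.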

Effectively we were looking at 16 treatment units in this evaluation. When I was told this, I calmed myself with the thought that this size may be minimally sufficient to identify a reasonable effect. But then my colleagues told me it was very important to test two contrasting supply chain models, each in 8 of the districts. Urk.

I immediately went to do some back-of-the-envelope power calculations. It wasn’t as bad as I initially feared, but we still wouldn’t be able to identify a standardized effect smaller than 0.54 (measured in standard deviations of the outcome of interest, drug stock-out rates). For the case of pediatric malaria drugs, if the interventions reduced the stock-out rate from 40% to 20%, a dramatic gain in availability, we would not be able to say definitively that this gain was due to the intervention, at least at standard levels of precision. And if both intervention models were at least partially successful, it would be difficult to claim that one option was definitively better than the other, again at standard levels of precision.
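A back-of-the-envelope calculation of this kind can be sketched as follows, using the textbook minimum detectable effect (MDE) formula for a two-arm cluster-randomized comparison with a normal approximation. This is not the calculation actually used in the study: the arm sizes, clinics per district, ICC, significance level, and power below are all assumptions, and the function names are hypothetical. The second helper converts a change in a proportion into standard-deviation units under the further assumption that the outcome is a binary stocked-out indicator for each clinic.

```python
from math import sqrt
from statistics import NormalDist


def mde_cluster_randomized(clusters_a: int, clusters_b: int, cluster_size: float,
                           icc: float, alpha: float = 0.05, power: float = 0.80) -> float:
    """Minimum detectable standardized effect between two arms of clusters.

    Uses a normal approximation; with this few clusters a t-based
    multiplier would give a somewhat larger (more pessimistic) MDE.
    """
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    # Variance of a cluster mean in SD units: between part + within part.
    var_cluster_mean = icc + (1 - icc) / cluster_size
    return multiplier * sqrt(var_cluster_mean * (1 / clusters_a + 1 / clusters_b))


def standardized_effect_binary(p_control: float, p_treated: float) -> float:
    """Change in a proportion expressed in control-group SD units (binary outcome)."""
    return abs(p_control - p_treated) / sqrt(p_control * (1 - p_control))


if __name__ == "__main__":
    # Assumed: 8 districts per arm, 17 clinics per district, ICC of 0.1.
    mde = mde_cluster_randomized(8, 8, 17, icc=0.1)
    effect = standardized_effect_binary(0.40, 0.20)
    print(f"MDE ~ {mde:.2f} SD; a 40% -> 20% drop is ~ {effect:.2f} SD")
```

Under these assumed inputs, a fall from 40% to 20% is smaller in standard-deviation terms than the MDE, which illustrates how even a halving of the stock-out rate can go statistically undetected when only a handful of districts are in each arm.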

I drove my team crazy stressing in every meeting that this study design was likely underpowered and that they shouldn’t expect us to be able to identify moderate (and important) improvements. Nevertheless, I still thought that we would gain valuable evidence on how to implement these reforms, even if the study was underpowered. I also prepared for the potential need to adopt a quasi-experimental evaluation design (a matched difference-in-differences) in order to improve power.

Fortunately, it turned out that one of the interventions was a smashing success in terms of reducing drug stock-out rates, achieving a standardized effect size of 0.74. For the case of pediatric malaria drugs, the end-line stock-out rate in clinics receiving the most promising intervention was 11%, while for clinics in control districts the rate was 48%. (The stock-out rate under the second intervention was 30%, which, as feared, was not significantly different from either the controls or the successful intervention.)

This study has been received with enthusiasm in operational circles and has led to planned major reforms to the public sector supply chain in Zambia. A happy ending. But I fear that this enthusiasm, if untempered, will lead to more evaluations that are effectively underpowered. We need to tread carefully while also exploring alternative or supplementary means of evaluation for these cases.

Future posts will continue the discussion of systems reforms and the challenges faced. If you know of any good examples, I would love to hear about them…


Jed Friedman

Lead Economist, Development Research Group, World Bank
