On experimental evaluations of systems interventions


A quick look at the burgeoning literature on policy evaluations will reveal a preponderance of evaluations of demand-side schemes such as conditional cash transfers. There is an obvious reason for this beyond the promise that such interventions hold: the technology of treatment allows for large-sample randomized evaluations, at either the household or the community/village level. As long as financing is sufficient to sample an adequate number of study units, statistical power will not be a concern.

Impact evaluations of systems-level interventions are scarcer (for two examples in the health sector, see here and here), and not because there is little interest in conducting such studies. An obvious stumbling block is the aggregate nature of such interventions. For example, there are very few study units available to evaluate health systems reform if the reform involves processes or management at the highest levels, such as provincial governments or the central ministry.

I was recently part of a team working in Zambia on interventions to improve population access to essential medicines. There are several ways sick people can access medicine, but in Zambia by far the most prevalent was through the local publicly managed primary health care clinics, where drugs were nominally free. This channel had one drawback, though: the drugs were often out of stock.

Typically, drug stock-outs are not a procurement problem at the national level; usually there are sufficient quantities of drugs sitting in the central store rooms in the capital. However, getting these drugs to the front-line clinics that need them is an infrastructural and logistical management challenge. Often the drugs never arrive. In our baseline data we found stock-out rates of 40% for potentially life-saving pediatric anti-malaria drugs.

A team of health specialists and supply-chain experts designed an alternative supply-chain management system and wanted a credible evaluation of performance. That’s when I joined the team. The problem from the evaluation perspective, as you might expect, is that efficient supply chains often require effective management at a relatively high level of the system. In this case the proposed intervention was effectively a district-level intervention involving a new position of supply-chain specialist based in the district, as well as new tracking and ordering systems, etc.

The pilot project had enough resources to test the new approach in 16 districts. Even though the unit of observation would be the clinic-level stock-out rate (there are typically 15-20 clinics per district), the presence of district-wide management would surely create a high degree of dependence among observations from clinics within the same district.
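To make the dependence point concrete, here is a minimal sketch of the standard design-effect calculation for equal-sized clusters. The intra-cluster correlation (ICC) values and the figure of 17 clinics per district are illustrative assumptions, not estimates from the Zambia data.

```python
def design_effect(clinics_per_district: float, icc: float) -> float:
    """Design effect for equal-sized clusters: 1 + (m - 1) * ICC."""
    return 1.0 + (clinics_per_district - 1.0) * icc


def effective_sample_size(n_districts: int, clinics_per_district: float, icc: float) -> float:
    """Number of independent clinics the clustered sample is 'worth'."""
    total_clinics = n_districts * clinics_per_district
    return total_clinics / design_effect(clinics_per_district, icc)


if __name__ == "__main__":
    # 16 districts comes from the post; 17 clinics per district is an assumed
    # value inside the stated 15-20 range; the ICC values are hypothetical.
    for icc in (0.0, 0.1, 0.3, 0.6, 1.0):
        n_eff = effective_sample_size(16, 17, icc)
        print(f"ICC = {icc:.1f}: effective number of clinics = {n_eff:.1f}")
```

As the ICC approaches 1, the effective number of clinics falls toward the number of districts, which is why the district rather than the clinic ends up being the unit that matters for power.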

Effectively we were looking at 16 treatment units in this evaluation. When I was told this, I calmed myself with the thought that this size may be minimally sufficient to identify a reasonable effect. But then my colleagues told me it was very important to test two contrasting supply chain models, each in 8 of the districts. Urk.

I immediately went to do some back-of-the-envelope power calculations. It wasn’t as bad as I initially feared; however, we still wouldn’t be able to detect a standardized effect smaller than 0.54 (measured in standard deviations of the outcome of interest, drug stock-out rates). For the case of pediatric malaria drugs, if the interventions reduced the stock-out rate from 40% to 20%, a dramatic gain in availability, we would not be able to say definitively that this gain was due to the intervention, at least at standard levels of precision. And if both intervention models were at least partially successful, it would be difficult to claim that one option was definitively better than the other, again at standard levels of precision.
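For readers who want to try this kind of back-of-the-envelope calculation, here is a minimal sketch of a common minimum detectable effect (MDE) formula for cluster-randomized designs. It is not the calculation used for the study: the number of comparison districts, the cluster size, and the ICC values below are all assumptions, and the normal approximation will understate the MDE somewhat when there are this few clusters.

```python
from math import sqrt
from statistics import NormalDist


def mde_cluster_rct(n_clusters: int, cluster_size: float, icc: float,
                    share_treated: float = 0.5, alpha: float = 0.05,
                    power: float = 0.80) -> float:
    """Minimum detectable effect, in standard deviations of the outcome.

    MDE = (z_{1-alpha/2} + z_{power})
          * sqrt( [1 / (P (1 - P) J)] * [icc + (1 - icc) / n] )
    where J is the total number of clusters, P the share treated,
    and n the number of units per cluster.
    """
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    allocation = 1.0 / (share_treated * (1 - share_treated) * n_clusters)
    clustering = icc + (1 - icc) / cluster_size
    return multiplier * sqrt(allocation * clustering)


if __name__ == "__main__":
    # Hypothetical comparison: 8 districts on one supply-chain model against
    # 8 comparison districts, 17 clinics each. The ICC values are assumed.
    for icc in (0.05, 0.10, 0.20, 0.40):
        mde = mde_cluster_rct(n_clusters=16, cluster_size=17, icc=icc)
        print(f"ICC = {icc:.2f}: MDE = {mde:.2f} SD")
```

The sketch mainly shows how sensitive the detectable effect is to the ICC: with so few districts, adding clinics per district does very little once the within-district correlation is non-trivial.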

I drove my team crazy stressing in every meeting that this study design was likely underpowered and that they shouldn’t expect us to be able to identify moderate (and important) improvements. Nevertheless, I still thought that we would gain valuable evidence on how to implement these reforms, even if the study was underpowered. I also prepared for the potential need to adopt a quasi-experimental evaluation design (a matched difference-in-differences) in order to improve power.

Fortunately it turned out that one of the interventions was a smashing success in terms of reducing drug stock-out rates, achieving a standardized effect size of 0.74. For the case of pediatric malaria drugs, the end-line stock-out rate in clinics receiving the most promising intervention was 11% while for clinics in control districts the rate was 48%. (The stock-out rate under the second intervention was 30% - and not significantly different from either the controls or the successful intervention, as feared.)
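A small sketch shows how stock-out percentages map onto standardized effect sizes. It assumes the outcome is a binary clinic-level stock-out indicator and that the standard deviation is taken from the comparison rate; the study may have defined the standardization differently. Under that assumption, the 48% versus 11% comparison works out to roughly the 0.74 reported above, while a drop from 40% to 20% falls short of the 0.54 threshold.

```python
from math import sqrt


def standardized_effect(p_comparison: float, p_treated: float) -> float:
    """Difference in stock-out rates divided by the SD of a binary outcome.

    Assumption: the SD is sqrt(p * (1 - p)) evaluated at the comparison rate.
    """
    sd = sqrt(p_comparison * (1 - p_comparison))
    return (p_comparison - p_treated) / sd


if __name__ == "__main__":
    # End-line rates quoted in the post: 48% in control, 11% in treatment.
    print(round(standardized_effect(0.48, 0.11), 2))  # -> 0.74
    # Hypothetical scenario from the power discussion: 40% down to 20%.
    print(round(standardized_effect(0.40, 0.20), 2))  # -> 0.41
```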

This study has been received with enthusiasm in operational circles and has led to planned major reforms to the public sector supply chain in Zambia. A happy ending. But I fear that this enthusiasm, if untempered, will lead to many more effectively underpowered evaluations. We need to tread carefully while also exploring alternative or supplementary means of evaluation for these cases.

Future posts will continue the discussion of systems reforms and the challenges faced. If you know of any good examples, I would love to hear about them…


Jed Friedman

Senior Economist, Development Research Group, World Bank

Join the Conversation

April 05, 2011

Nice post. In case readers are looking for more examples, two other evaluations of district-level health system interventions are IMCI and ACSD. Both evaluations resulted in a flood of publications, but here are two summary ones:
- The Multi-Country Evaluation of Integrated Management of Childhood Illness (MCE-IMCI): "Programmatic pathways to child survival: results of a multi-country evaluation of Integrated Management of Childhood Illness." Health Policy Plan. 2005 Bryce J, Victora CG et al. http://www.ncbi.nlm.nih.gov/pubmed/16306070
- ACSD: "The Accelerated Child Survival and Development programme in west Africa: a retrospective evaluation." Lancet, Volume 375, Issue 9714, Pages 572-582. J. Bryce, K. Gilroy, et al.

Love this new blog, by the way.

But can you switch to another CAPTCHA system? This one is ridiculously difficult - it took me 3+ tries to post this comment.

Berk Özler
April 06, 2011

Hi Brett,

Thanks very much for your comments. We raised this issue with the web administrators at the Bank as we would like this blog to be a forum for discussion and don't want spam deterrents that are so effective that they scare away the legitimate comments.

We were told that the current system is one of the best at reducing the inevitable (and large amounts of) spam -- World Bank blogs apparently used to get tons of spam from investment companies, book sellers, etc. prior to its installation. The Bank is currently moving towards a new web and they will look into this issue to make participating in this forum easier for readers like yourself.

Thanks again for bringing this up.

Mark Fredrickson
April 06, 2011

Did you consider blocking districts prior to randomization? For example, you could pair up districts by a baseline measure of stocking rate. If you select meaningful covariates that influence outcomes, you can get substantial power gains through blocking (as the within-block variance is much smaller than the pooled variance).

Alternatively, you could consider ex post covariate adjustment to increase precision, though this is generally less preferable than blocking before the design.
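The blocking idea described above can be sketched as a matched-pair randomization: sort districts by a baseline covariate, pair neighbours, and randomize within each pair. This is a minimal illustration, not the procedure used in the Zambia study; the district names and baseline rates are made up, and the function name is hypothetical.

```python
import random


def paired_assignment(baseline: dict, seed: int = 0) -> dict:
    """Pair districts with similar baseline values, then randomize within pairs.

    baseline maps district name -> baseline stock-out rate.
    Returns a dict mapping district name -> "treatment" or "control".
    """
    if len(baseline) % 2 != 0:
        raise ValueError("need an even number of districts to form pairs")
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    ordered = sorted(baseline, key=baseline.get)
    assignment = {}
    for i in range(0, len(ordered), 2):
        pair = [ordered[i], ordered[i + 1]]
        rng.shuffle(pair)  # coin flip within the pair
        assignment[pair[0]] = "treatment"
        assignment[pair[1]] = "control"
    return assignment


if __name__ == "__main__":
    # Made-up baseline stock-out rates for eight hypothetical districts.
    rates = {"D1": 0.55, "D2": 0.20, "D3": 0.42, "D4": 0.38,
             "D5": 0.61, "D6": 0.25, "D7": 0.47, "D8": 0.33}
    for district, arm in sorted(paired_assignment(rates, seed=42).items()):
        print(district, rates[district], arm)
```

Because each pair contributes one treated and one control district, the two arms are balanced on the baseline measure by construction, and the analysis can then compare within pairs.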

Here is a chapter from the upcoming Cambridge Handbook of Experimental Political Science (this chapter by Jake Bowers, University of Illinois) with more details on both strategies:


Berk Özler
April 06, 2011

This is true -- I knew blocking was helpful but was surprised to see how much expected power it adds in a recent IE design. We plan to have future posts on this here.

Jed Friedman
April 06, 2011

Mark, thanks for the excellent comments as well as the reference. We actually did block prior to randomization on the basis of a few observables although at that point in the study we had very little in the way of facility level information. Partly as a consequence, the gains from blocking were somewhat muted.

I kept the example simple so did not mention covariate adjustment, although we certainly have that approach in our back pocket... I'll link to the actual study when we get a working paper out.

These are topics I plan to return to in the future and will eagerly read the linked material. Much appreciation, Jed.

Jed Friedman
April 06, 2011

Brett, thanks so much for these articles - I'm really happy the audience is wider than solely economists. We have a lot to learn from/exchange with other disciplines.

April 07, 2011

Is it possible to incorporate repeated baseline and post-treatment measures to gain further power with your analysis? I'd be interested to hear your thoughts on the practicality of that strategy in this particular context. Thanks.
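As a rough illustration of why extra survey rounds can help, here is a sketch of the variance of a treatment-control comparison when each unit is measured several times. It assumes equal variance in every round and a constant correlation rho between any two rounds for the same unit, and it ignores clustering, so the "units" would be districts here. The parameter values are hypothetical.

```python
def var_post_only(n_per_arm: int, r_post: int, rho: float, sigma2: float = 1.0) -> float:
    """Variance of the difference in means using the average of r_post follow-up rounds."""
    return (2 * sigma2 / n_per_arm) * (1 + (r_post - 1) * rho) / r_post


def var_diff_in_diff(n_per_arm: int, m_pre: int, r_post: int, rho: float,
                     sigma2: float = 1.0) -> float:
    """Variance of a difference-in-differences using m_pre baseline and r_post follow-up rounds."""
    return (2 * sigma2 / n_per_arm) * (1 - rho) * (1 / m_pre + 1 / r_post)


if __name__ == "__main__":
    # Hypothetical setting: 8 districts per arm, round-to-round correlation 0.5.
    n, rho = 8, 0.5
    for rounds in (1, 2, 3, 6):
        post = var_post_only(n, rounds, rho)
        dd = var_diff_in_diff(n, rounds, rounds, rho)
        print(f"{rounds} pre / {rounds} post: post-only var = {post:.3f}, diff-in-diff var = {dd:.3f}")
```

Under these assumptions, additional post-treatment rounds give diminishing returns for the post-only estimator when rho is high, while the difference-in-differences variance keeps shrinking as rounds are added on both sides.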

Berk Özler
April 07, 2011

David McKenzie has a paper on this, but I am not sure whether it is out as a working paper yet or not. I'll ping him to respond...

David McKenzie
April 07, 2011

A preliminary version of this paper is available at:

I'm currently revising the paper, and will blog on it when it is revised further.

April 07, 2011

Thanks very much; I look forward to it and appreciate all of your responses and the blog.