
How Systematic Is That Systematic Review? The Case of Improving Learning Outcomes

David Evans's picture
With the rapid expansion of impact evaluation evidence has come the cottage industry of the systematic review. Simply put, a systematic review is supposed to “sum up the best available research on a specific question.” We found 238 reviews in 3ie’s database of systematic reviews of “the effectiveness of social and economic interventions in low- and middle- income countries,” seeking to sum up the best evidence on topics as diverse as the effect of decentralized forest management on deforestation and the effect of microcredit on women’s control over household spending.

But how definitive are these systematic reviews really? Over the past two years, we noticed that there were multiple systematic reviews on the same topic: how to improve learning outcomes for children in low- and middle-income countries. In fact, we found six! Of course, these reviews aren’t precisely the same: Some include only randomized controlled trials (RCTs) and others also include quasi-experimental studies. Some examine only how to improve learning outcomes and others include both learning and access outcomes. One includes only studies in Africa. But they all have the common core of seeking to identify what improves learning outcomes.

Here are the six studies:
  1. Identifying Effective Education Interventions in Sub-Saharan Africa: A Meta-Analysis of Rigorous Impact Evaluations, by Conn (2014)
  2. School Resources and Educational Outcomes in Developing Countries: A Review of the Literature from 1990-2010, by Glewwe et al. (2014)
  3. The Challenge of Education and Learning in the Developing World, by Kremer et al. (2013)
  4. Quality Education for All Children? What Works in Education in Developing Countries, by Krishnaratne et al. (2013)
  5. Improving Learning in Primary Schools of Developing Countries: A Meta-Analysis of Randomized Experiments, by McEwan (2014)
  6. Improving Educational Outcomes in Developing Countries: Lessons from Rigorous Evaluations, by Murnane & Ganimian (2014)
Between them, they cover an enormous amount of educational research. They identify 227 studies that measure the impact of some intervention on learning outcomes in the developing world. Of those, 134 are RCTs. There are studies from around the world, with many from China, India, Chile, and (you guessed it) Kenya. Yet when we read the abstracts and introductions of the reviews, we found some overlap in their conclusions but also quite a bit of divergence. One highlighted that pedagogical interventions were the most effective; another that information and computer technology interventions raised test scores the most; and a third highlighted school materials as most important.

What’s going on? In a recent paper, we try to figure it out.

Differing Compositions. Despite having the same topic, these reviews don’t cover the same papers. In fact, they don’t even come close. Out of 227 total studies that have learning outcomes across the six reviews, only 3 studies are in all six reviews, per the figure below. That may not be surprising since there are differences in the inclusion criteria (RCTs only, Africa only, etc.). Maybe some of those studies aren’t the highest quality. But only 13 studies are even in the majority (4, 5, or 6) of reviews. Of the total, 159 studies (70 percent!) are included in only one review. Of those 159, 74 are RCTs and so are arguably of higher quality and should be included in more reviews. (Of course, there are low-quality RCTs and high-quality non-RCTs. That’s just an example.) The most comprehensive of the reviews covers less than half of the studies.

If we do a more parsimonious analysis, looking only at RCTs with learning outcomes at the primary level between 1990 and 2010 in Sub-Saharan Africa (which is basically the intersection of the inclusion criteria of the six reviews), we find 42 total studies, and the median number included in a given systematic review is 15, about one-third. So there is surprisingly little overlap in the studies that these reviews examine.
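The overlap counts above (studies in all reviews, in a majority, in only one) amount to simple set arithmetic. Here is a minimal sketch of that kind of tally. The review names and study IDs are made-up toy data, not the actual study lists, and the function names are ours for illustration only.

```python
from collections import Counter

# Hypothetical toy data: review name -> set of study IDs it includes.
# These are NOT the real reviews or studies; they only illustrate the tally.
reviews = {
    "Review A": {"s1", "s2", "s3", "s4"},
    "Review B": {"s1", "s2", "s5"},
    "Review C": {"s1", "s6"},
}


def inclusion_counts(reviews):
    """Map each study ID to the number of reviews that include it."""
    counts = Counter()
    for studies in reviews.values():
        counts.update(studies)
    return counts


def overlap_summary(reviews):
    """Summarize how widely studies are shared across reviews."""
    counts = inclusion_counts(reviews)
    n_reviews = len(reviews)
    total = len(counts)
    in_one = sum(1 for c in counts.values() if c == 1)
    return {
        "total_studies": total,
        # Studies that appear in every review.
        "in_all": sum(1 for c in counts.values() if c == n_reviews),
        # Studies that appear in strictly more than half of the reviews.
        "in_majority": sum(1 for c in counts.values() if c > n_reviews / 2),
        "in_only_one": in_one,
        "share_in_only_one": in_one / total,
        # Share of all studies covered by the single largest review.
        "largest_review_coverage": max(len(s) for s in reviews.values()) / total,
    }


summary = overlap_summary(reviews)
print(summary)
```

With six real reviews, "majority" under this definition means 4, 5, or 6 reviews, which matches the cutoff used in the post.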

What about categorization? The reviews also vary in how they classify the same studies. For example, a program providing merit scholarships to girls in Kenya is classified alternatively as a school fee reduction, a cash transfer, a student incentive, or a performance incentive. Likewise, a program that provided computer-assisted learning in India is alternatively classified as “computers or technology” or “materials.”

What drives the different conclusions? Composition or categorization? We selected one positive recommendation from each review and examined which studies were driving that recommendation. We then counted how many of those studies were included in other reviews. As the figure below shows, the proportion varies enormously, but the median value is 33%: In other words, another review would likely have just one third of the studies driving a major recommendation in a given review. So composition matters a lot. This is why, for example, McEwan finds much bigger results for computers than others do: The other reviews include – on average – just one third of the studies that drive his result.
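To make the "median of one third" idea concrete, here is a rough sketch of one way such a figure could be computed: take the studies driving a recommendation, check what share of them each other review includes, and take the median. The data are invented and the exact procedure in the paper may differ; treat the names and the calculation as assumptions.

```python
from statistics import median


def share_covered(driving_studies, other_review):
    """Share of the driving studies that another review also includes."""
    if not driving_studies:
        raise ValueError("driving_studies must not be empty")
    return len(driving_studies & other_review) / len(driving_studies)


# Hypothetical toy data: the studies behind one review's recommendation,
# and the study lists of three other reviews.
driving = {"s1", "s2", "s3"}
other_reviews = {
    "Review B": {"s1", "s5"},
    "Review C": {"s1", "s2", "s6"},
    "Review D": {"s7"},
}

shares = {
    name: share_covered(driving, studies)
    for name, studies in other_reviews.items()
}
median_share = median(shares.values())
print(shares, median_share)
```

A low median share means a typical other review could not have reached the same recommendation from the same evidence, because it never saw most of that evidence.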

At the same time, categorization plays a role. One review highlights the provision of materials as one of the best ways to improve test scores. But several of the key studies that those authors call “materials,” other authors categorize as “computers” or “instructional technology.” While those are certainly materials, not all materials are created equal.

The variation is bigger on the inside. Systematic reviews tend to group interventions into categories (like “incentives” or “information provision” or “computers”), but saying that one of these delivers the highest returns on average masks the fact that the variation within these groups is often as big as or bigger than the variation across groups. When McEwan finds that computer interventions deliver the highest returns on average, it can be easy to forget that the same category of interventions includes a lot of clunkers, as you can see in the forest plot from his paper, below. (We’re looking at you, One Laptop Per Child in Peru or in Uruguay; but not at you, program providing laptops in China. Man, there’s even heterogeneity within intervention sub-categories!) Indeed, out of 11 categories of interventions in McEwan’s paper, 5 have a bigger standard deviation across effect sizes within the category than across effect sizes in the entire review sample. And for another 5, the standard deviation within category is more than half the standard deviation of the full sample. This is an argument for reporting effectiveness at lower levels of aggregation of intervention categories.

Source: McEwan (2014)
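The within-category versus full-sample comparison can be sketched as below. The effect sizes are invented for illustration and are not McEwan's estimates; the use of the sample standard deviation is also our assumption, since the post does not say which variant was used.

```python
from statistics import stdev

# Hypothetical effect sizes (in standard deviations of test scores),
# grouped by intervention category. NOT taken from McEwan (2014).
effects_by_category = {
    "computers": [0.45, -0.05, 0.02, 0.30],
    "materials": [0.05, 0.08, 0.06],
    "incentives": [0.10, 0.12, 0.09],
}


def sd_ratios(effects_by_category):
    """Ratio of each category's SD to the SD of all effect sizes pooled."""
    pooled = [e for effects in effects_by_category.values() for e in effects]
    full_sample_sd = stdev(pooled)
    return {
        category: stdev(effects) / full_sample_sd
        for category, effects in effects_by_category.items()
        if len(effects) >= 2  # stdev needs at least two observations
    }


ratios = sd_ratios(effects_by_category)
for category, ratio in ratios.items():
    print(f"{category}: within-category SD is {ratio:.2f}x the full-sample SD")
```

A ratio above 1 means the category varies more internally than the whole sample does, which is the situation described for 5 of the 11 categories.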

What does this tell us? First, it’s worth investing in an exhaustive search. Maybe it’s even worth replicating searches. Second, it may be worthwhile to combine systematic review methodologies, such as meta-analysis (which is very systematic but excludes some studies) and narrative review (which is not very systematic but allows inclusion of lots of studies, as well as examination of the specific elements of an intervention category that make it work, or not work). Third, maintain low aggregation of intervention categories so that the categories can actually be useful.

Finally, and perhaps most importantly, take systematic reviews with a grain of salt. What they recommend very likely has good evidence behind it; but it may not be the best category of intervention, since chances are, a lot of evidence didn’t make it into the review.

Oh, and what are the three winning studies that made it into all six systematic reviews?
  1. Many Children Left Behind? Textbooks and Test Scores in Kenya, by Kremer, Glewwe, & Moulin (2009)
  2. Retrospective vs. Prospective Analysis of School Inputs: The Case of Flip Charts in Kenya, by Glewwe, Kremer, Moulin, and Zitzewitz (2004)
  3. Incentives to Learn, by Kremer, Miguel, & Thornton (2009)
Tomorrow, we’ll write briefly on what kinds of interventions are recommended most consistently across the reviews.

Future work. Can someone please now do a systematic review of our systematic review of the systematic reviews?

Credit: xkcd

Update (March 3, 2015): This post is part one of a two-part series. The second part is here.


Submitted by Andrew on

Fourth lesson: context matters! Effect sizes depend on a lot more than intervention category.

Submitted by Helen Abadzi on

What really matters is the rules of learning, which apply to all members of Homo sapiens. When you don't know them, you don't know why you get the results you do. Then context appears to be critical.

Imagine you want to increase the gas mileage that cars can get. How much information about that would you get by counting the cars sold by dealerships? The Bank's learning studies and results are a very similar concept.

Submitted by Neal Haddaway on

Only one of those six reviews has anything to do with an actual systematic review (the 3ie one)...

I would be very interested in your definition of a systematic review. Conn, McEwan, Glewwe et al., and Murnane & Ganimian all established a search protocol with keywords and sources, as did the Krishnaratne et al. (3ie) review.

Submitted by Neal Haddaway on

It's typical in most subject areas for them to at least call themselves SRs in the title or abstract. We see a lot of studies referring to a systematic literature search but that doesn't make them SRs. Obviously there IS no clear definition of where a review becomes systematic, but Cochrane, Campbell, CEE and EPPI are good organisations to base a rough definition on.

Agreed that those are good starting points. But I would propose that several of these qualify despite not being within those particular series and despite not having "systematic review" in the title or abstract. When they have "review" (or "meta-analysis") in the title or abstract and it is executed systematically, I would count it. But I'd agree to definitely go on a case-by-case basis in the absence of the Campbell / Cochrane / etc. imprimatur.

Submitted by Neal Haddaway on

OK, but that just depends on your definition of 'systematically'. Does that include any literature review that employs a search string rather than prior knowledge? It would be good to hear your definition of systematic review so we can know how widely to apply your conclusions. Thanks for replying.

We looked at studies that claimed to review the literature and that either did a meta-analysis, a vote count, or a narrative review of a systematically searched literature (and there, I mean "systematically" as in a wide set of keywords and sources, well documented).

Submitted by Neal Haddaway on

Yes, I read that. I wonder how many people would agree with that definition...

Submitted by Neal Haddaway on

This is a good start:

Great examples! Cochrane's is super: "Each systematic review addresses a clearly formulated question. ... To answer this question, we search for and collate all the existing primary research on a topic that meets certain criteria; then we assess it using stringent guidelines, to establish whether or not there is conclusive evidence about a specific treatment."

And Campbell's "A systematic review must have: Clear inclusion/ exclusion criteria; an explicit search strategy; systematic coding and analysis of included studies; meta-analysis (where possible)".

Thanks for sharing!

Submitted by Neal Haddaway on

This is also a useful resource for appraising reviews:

But still, some of those six articles are definitely not systematic reviews. Have you asked people from any of the coordinating bodies whether they agree with you?

I have not. But I'll definitely draw on these guidelines in ongoing analysis. Thanks for sharing them.

Submitted by Lee Crawfurd on

Speaking of which, Campbell apparently have one in the works:

"Education Interventions for Improving the Access to, and Quality of, Education in Low and Middle Income Countries: A Systematic Review"

Submitted by Jennifer Ambrose on

I completely agree that lower levels of aggregation of intervention categories are more useful, and I think the same goes for outcomes (e.g., not lumping academic tests & cognitive tests in one analysis, or "ever sexually active" & "sexually active this year," but instead estimating effects on the two outcomes separately). Outcome definitions can also be a source of heterogeneity in effect size on comparable interventions, if one study counts "regular school attendance" as above 80% and another counts it as above 90%, for example. Of course, that points to the bigger problem of inconsistency in how things are measured and reported, which makes quality reviews much more difficult.

Submitted by Iain Campbell on

The DRC: Does anyone have any new statistics about employment, sectors, corruption etc etc please?
Iain Campbell

Submitted by Greg Norfleet on

Anyone else feel like we're asking an overly general question here? I've always learned to use a PICO (population, intervention, control, outcome) question at the onset for a systematic review.

I've gotten through a couple of these and so far they seem to miss the point of a consistent comparison to assess impact. What's the control here? I'm positive it varies with context...

Without a consistent comparison, you could basically "prove" something is effective just bc the alternative option is terrible.

Submitted by Martina Vojtkova, Birte Sniltveit and Daniel Phillips on

David and Anna’s work raises a number of interesting questions - so much so that 3ie’s Synthesis and Reviews team have written a blog of our own exploring a few of them in depth. You can find our blog at:

Martina, Birte, and Daniel raise interesting points in their post; I recommend it.

Based on their comments and others, we prepared a table using the Campbell Collaboration's criteria for a systematic review and providing a provisional rating of each of our systematic reviews on each criterion. As several commenters have highlighted, not all reviews are equally systematic. That said, I'd argue that several do "pass the bar".

At the same time, it may be important for us researchers to recognize that the policymakers we seek to influence may not be in touch with the nuances of different types of reviews, and so it may be worthwhile to think about how to characterize both these reviews and others in the future.

Submitted by Martina Vojtkova, Daniel Phillips on

David's table provides a nice summary of the extent to which the included reviews meet some of the criteria for a systematic review. One criterion that could be added to the table is the (intended) comprehensiveness of the search. Potential bias can arise from accidental or purposive omission of evidence from synthesis. Therefore, a comprehensive search for all relevant published and unpublished research is an important characteristic of a systematic review. The Campbell Collaboration refers to this as identifying the 'relevant' and 'best available' research. 3ie refers to the use of ‘explicit and transparent procedures to identify all available research evidence relevant for a specific question’.

Submitted by Neal Haddaway, Laurenz Langer and Magnus Land on

Like Martina, Birte and Daniel, we were concerned by some fundamental misunderstandings about systematic reviews and have sought to clarify the matter in a blog on the Africa Evidence Network site, here:

Submitted by Stuart Cameron on

One more for your list from UNU-WIDER

WP/2015/033 What works to improve the quality of student learning in developing countries?
Serena Masino and Miguel Niño-Zarazúa

This one explicitly claims to be systematic and following Cochrane.

Submitted by Marguerite Berger on

Hi, David,
Thank you very much for this great and demystifying contribution. I am wondering about how many person hours, days or months it took to get all the information that you boiled down into this blog post. Also, did you use a lot of proprietary databases to search for studies/references?
The reason that I ask is that I am trying to understand how much of a hurdle it is for an organization that wants to do a study on a subject like this (or even a narrower one), and would want to begin with the existing research/evidence base before carefully setting up the study. Let alone a PhD student. Do you think that this constitutes a barrier to entry for organizations or researchers who come from outside of heavily resourced universities, multilateral or bilateral agencies and the spin-off nonprofits and consulting firms of the above? If so, what tips would you have for jumping the hurdle?

Marguerite, thanks so much for your query. I'm not sure of the exact person-hours (the two of us, Anna Popova and I, each spent part of our time over a couple of months), but most of what we researched was available without financial outlays. For example, we spoke with experts, drew on our own knowledge, and used Google Scholar to identify the reviews. All of the reviews were available for free. When we were looking for the underlying studies, many existed in the form of reports or working papers (about one third?) or were posted on the authors' websites (many more). Some were accessed through journals that we access through our library network, and some we used inter-library loan for. So those latter sources are more proprietary. I think that the work would still have been possible without those: It just would have slowed us down a bit.

One of the meta-analyses, the one done by Conn, was actually a PhD dissertation. PhD students have time (often, at least), and they have a university affiliation, so it may be easier.

I think the key is establishing a well defined question and a clear strategy for how to answer it. For that, it can be worth getting other, experienced researchers on board to weigh in. We also benefited greatly from sharing an early draft with colleagues and with the authors of the reviews that we reviewed.

Good luck!
