
How can health systems “systematic reviews” actually become systematic?


From Karl Pillemer’s post on Cornell’s Evidence-Based Living blog

In my post “Should you trust a medical journal?” I think I might have been a bit unfair. Not to The Lancet, which, I have since discovered via comments on David Roodman’s blog, has something of a track record of publishing sensational but not exactly evidence-based social science articles, but rather to Ernst Spaan et al., whose systematic review of health insurance impacts in developing countries I challenged as insufficiently systematic. It’s not that I now think Spaan et al. did a wonderful job. It’s just that I think they probably shouldn’t have been singled out in the way they were.

Last week saw the publication in Health Policy and Planning of another health systems review – albeit not billed as a “systematic review” – by Taghreed Adam and others. The review was restricted to work published in 2009 and 2010, and was broader than health insurance. I became suspicious of it when I saw that it omitted the 2009 and 2010 articles listed at the end of my “Should you trust a medical journal?” blog post, as well as several others I know of, including 2009 and 2010 papers by Rodrigo Moreno-Serra and me on health financing and hospital payment reforms in Europe and Central Asia. I’m sure – for reasons explained below – there are lots of other key papers omitted as well.

I’m beginning to see a pattern, and it’s a worrying one: systematic reviews in the health systems field seem to be systematically unsystematic!

There are two bits of good news. First, the field’s a young one, and now’s a good time to nip this unsystematicness in the bud. Second, last week also saw the publication of an excellent paper in the Journal of Development Effectiveness by Hugh Waddington and others on “How to do a good systematic review of effects in international development: a tool kit”. This toolkit is a goldmine for all writers of systematic reviews. It also helps us see where these two health systems systematic reviews went wrong.

Pitfall #1 – too few databases
A key idea behind a systematic review is to do a thorough and comprehensive literature search. Waddington et al. urge authors of systematic reviews in international development to start off with no fewer than four types of database: (a) multidisciplinary databases like Web of Science and Google Scholar, (b) international development databases like the Joint Libraries of the World Bank and the IMF, (c) social science databases like EconLit and RePEc for economics, and PsycINFO for behavioral studies, and (d) subject-specific databases like Medline for health.

To give them credit, Spaan et al. trawled through no fewer than 20 databases, including ScienceDirect, a couple of international development databases, EconLit, RePEc, the International Bibliography of the Social Sciences, CSA Sociological Abstracts, and Medline. By contrast, Adam et al. searched just two online databases – Medline and Embase. The latter’s new to me. I’m not surprised – it covers the biomedical literature, and I’m an economist. Adam et al. justify their limited search on the grounds that their review wasn’t really supposed to be a systematic review as such, but just an illustration of how the field has evolved since they last looked at it. I must say I’m not at all convinced, especially as the caveat is buried toward the end of the paper. Much better to have done a proper systematic review, or, failing that, to have given the paper a subtitle like “A review of a spectacularly thin sliver of the literature”.

Pitfall #2 – overreliance on automation using search terms
Waddington et al. make another really important point: while the abstracts of medical journals tend to conform to a predefined structure, and require the use of a restricted vocabulary to describe the setting, the type of study, the outcomes studied, etc., social science journals adopt a much more laissez-faire approach. An impact evaluation may not be pitched as such in either the title or the abstract. The type of intervention may not be spelled out in generic terms – proper names of programs are commonly used. And the outcomes used in the study may not be listed in the title or abstract. Even the setting may not be specified in the abstract. This means that in the social science literature it is highly risky to rely on a semi-automated search. Rather, a lot of manual searching is needed, with a lot of snowballing, including chasing up citing articles using Google Scholar and other citation tools like Scopus and Web of Science.

My sense is that both studies probably fell down at this point. Spaan et al. spell out their search terms. They seem to have required that a study contain the terms “health insurance” and “developing country”, or variations thereof. A study referring to the proper name of a health insurance program (like “New Cooperative Medical Scheme”) and the name of a specific country (e.g. “China”) would presumably not get picked up. Adam et al. included their search terms in an online web annex. Unfortunately, as of 9:35 pm EST on Tuesday 9 October 2012, the page “could not be found”. My guess is that there was overreliance here too on a semi-automated search strategy well suited to medical journals but ill suited to nonmedical journals. Of course, since very few nonmedical journals were searched, the point is rather academic.
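To make the failure mode concrete, here is a minimal sketch of how a generic-keyword filter can miss a relevant study that names a program rather than the intervention. The record titles and the matching rule are invented for illustration; Spaan et al.’s actual query syntax is not reproduced here.

```python
# Two hypothetical study titles: one uses generic terms, the other names
# a specific program ("New Cooperative Medical Scheme") and country.
records = [
    "The impact of health insurance on utilization in developing countries",
    "The New Cooperative Medical Scheme and out-of-pocket spending in China",
]

def naive_match(text: str) -> bool:
    """Stylized query: require both generic terms to appear."""
    t = text.lower()
    # "developing countr" catches both "country" and "countries"
    return "health insurance" in t and "developing countr" in t

hits = [r for r in records if naive_match(r)]
# The China study is missed even though it is clearly relevant,
# because neither generic term appears in its title.
print(hits)
```

This is exactly why manual snowballing and citation chasing are needed on top of any semi-automated search in the social science literature.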

Pitfall #3 – naïve critical appraisal
Studies vary in the degree to which they provide compelling evidence on causal effects, and an important task facing the author of a systematic review is to grade studies in this regard. Some will be well below the inclusion threshold, and should be excluded. One nice approach is to group included studies according to the methods used (some may have a larger expected bias than others) and show the estimated effects of each of the studies graphically, clustered by group: Yu-Chu Shen et al. do this to great effect in their NBER working paper on hospital ownership and financial performance.

Again Waddington et al. make some good points. They emphasize the importance of not falling into the trap of treating the RCT as the gold standard. In fact, as they point out, an RCT can perfectly easily be implemented badly and do worse than a non-randomized study. Yet despite this, there is little understanding – especially in the medical journals, and I suspect health journals too – of quasi-experimental methods. For a couple of great introductions to the field of impact evaluation, see here and here.

How did the two health systems systematic reviews fare? Spaan et al. developed – but did not report in the paper – a composite index of quality based on 19 indicators. One is bias. However, the criterion for assessing bias isn’t explained, and – much to my surprise – bias gets the same 0-2 points as each of the other 18 indicators. So a study scoring 0 on bias but 36 across the other indicators would overall be considered “high-quality”. I find this mind-boggling. If the methods used in a study were likely to result in biased estimates, the authors should have thrown the study out, however strong it was on other dimensions.
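The arithmetic of the equal-weighting problem can be sketched in a few lines. The indicator names and the “high-quality” threshold below are invented for illustration (the actual rubric is not reported in the Spaan et al. paper); the point is simply that with 19 equally weighted 0-2 indicators, a study can score zero on bias and still clear almost any plausible cutoff.

```python
# Stylized composite quality index: 19 indicators, each scored 0-2,
# with risk of bias weighted no differently from the rest.
N_INDICATORS = 19
MAX_PER_INDICATOR = 2

scores = {"bias": 0}                      # methods likely to yield biased estimates
for i in range(N_INDICATORS - 1):         # perfect marks on everything else
    scores[f"indicator_{i}"] = MAX_PER_INDICATOR

total = sum(scores.values())              # 0 + 18 * 2 = 36 out of a possible 38
# Hypothetical cutoff: call a study "high quality" at 80% of the maximum.
is_high_quality = total / (N_INDICATORS * MAX_PER_INDICATOR) >= 0.8
print(total, is_high_quality)             # 36 True
```

Under any such equal-weight scheme, bias acts as just one of 19 interchangeable points of merit rather than as a gate that a study must pass before its other strengths count.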

By contrast, Adam et al. don’t offer a view about what constitutes credible evidence on impacts. They list studies using an RCT design alongside studies labeled “plausibility” designs and studies labeled “comparative cross-country analysis”. As far as I can make out from the paper, a “plausibility” design is a before-and-after study. In my book, that’s not very plausible at all, given that most variables of interest in the health sector vary over time. Cross-country comparative studies can be compelling, but only if a credible strategy has been used to overcome the likely endogeneity of the program indicator. A simple comparison – as I suspect the studies in the paper are – typically isn’t credible. Yet in the review, the three types of study are presented on a par with one another, as if they were all equally reliable.

Some parting thoughts
Systematic reviews hold great promise: because they’re systematic, they’re likely to carry a lot of weight with policymakers. But these two health systems systematic reviews and the Waddington et al. toolkit make it clear that it’s perfectly possible to do a systematic review unsystematically, and in the process give a potentially biased view of the literature.

Doing a systematic review according to the Waddington et al. gold standard requires care. But it also requires the right team, and should include someone who knows statistics and econometrics.

I also wonder whether one element of good practice might be to get authors to put the results of their preliminary literature search up on the Internet for a month or two and actively seek feedback on them from the global research community. It’s the analogue of the priest asking at a wedding whether anyone knows any reason why the couple shouldn’t be married! The systematic review team would be asking whether anyone knows of relevant work they’ve missed. Editors could require such a public consultation as a condition of submission.

One final thought. How do systematic reviews get updated as new studies get published? One session caught my eye on the online program of the Cochrane Colloquium that just finished in New Zealand. The authors are developing software that “could be used to develop living reviews using online author communities to continually update reviews as new evidence becomes available.” That sounds really promising. A bit of a threat to traditional journals, of course, but if it means fewer published one-off unsystematic systematic reviews, and more systematic reviews that are systematic, kept updated, and have passed the test of online scrutiny, then I think I’m for it!


Submitted by M. Over on

Great post, Adam. I agree that when a literature review pertains to a topic in health economics or any social science, reliance on PubMed keyword searches is likely to mislead the reviewer. I like your idea of requiring such reviews to be posted for comment before they are set in stone. Health and medical journals are coming late to an appreciation of the value of quasi-experimental methods, but don't forget that propensity score matching, one of the favorite tools in the modern econometrician's impact evaluation toolkit, was developed outside economics, with its earliest applications being to the medical sciences.

Also, including an econometrician is not enough, since two equally well-qualified econometricians can reasonably disagree on the ranking of the internal (not to mention external) validity of any two quasi-experimental studies.

Since you cite my colleague David Roodman, let me mention an alternative way of reviewing a set of papers that apply quasi-experimental methods to infer a causal impact: replication. When I taught econometrics at Williams College, I required students to replicate the results in a published journal article. This meant they had to ask the original authors for their data and, sometimes, their code so that the student could ideally replicate their results - and then test them for robustness. As a professional economist, I was shocked at how many authors (a) failed to answer queries from my students; or, if they answered, (b) failed to provide the data in a usable format. And when authors did respond and provide data, my students not infrequently failed to replicate the exact same regressions the authors had run. Or discovered a transcription error or a sign reversal the author had apparently missed. Or discovered extraordinary vulnerability of the results to a single questionable assumption.

So my suggestion is this: for a real review of quasi-experimental impact evaluations, the job should be crowd-sourced to econometrics students throughout the world! The review organizer and lead author would compile the results from all these student replication papers and note whether the results were replicable at all, and how fragile they were. Authors who failed to provide their data and code would be downgraded or excluded altogether. Now THAT would be a good systematic review of impact evaluation studies.

Thanks, Mead, and a great idea! I guess it would be good to agree on checklists of (a) things students absolutely should do (including replicating the results), (b) things they might usefully do (e.g. robustness to dropping observations, etc.), and (c) things they could perhaps usefully do with some guidance from their professor but where there might be some disagreement about the merits of the changes. Adam

Submitted by Sara Bennett on
Adam, I was quite alarmed when I first read your post. Heck, I appeared to have co-authored a systematic review without even knowing that I had done so… I went back and re-read the Adam et al. paper: the words “systematic review” do not occur anywhere in the paper other than in the bibliography. It seems a little unfair to hold the paper to the standards of a systematic review if it does not purport to be one? Perhaps akin to criticizing benefit incidence analyses for not ensuring that findings were validated with the poor: a standard hallmark of qualitative research (but typically not econometric research).

Your post, however, raises some deeper concerns in my view. There is currently momentum to document and standardize research approaches used within the health systems field, both for systematic review methods (as you describe) and for health systems research methods more broadly. The PLOS Medicine series on health systems guidance for policy makers (see Bosch-Capblanch et al. 2012) reflects this. As David Peters and I argued in our commentary on the PLOS Medicine series, though there is undoubtedly a need to build rigor and standards in the field, there is also a real danger of discrediting important but perhaps less well recognized approaches within the HSR field and closing off promising avenues of enquiry in an attempt to ensure standardization.

To apply these thoughts to systematic reviews: the Cochrane Collaboration has developed rigorous standards for the effectiveness reviews included in its database. But the standards around which there is broad agreement are only applicable to research questions that concern the effect of an intervention.
As we know, this is not the only thing policy makers are interested in: they are also concerned about how feasible it may be to implement a reform within their particular health system, what the likely reaction of the population may be to a reform, or how a reform may affect other aspects of their health system (as discussed by Adam et al.). It would be a major mistake to employ the same inclusion criteria (in terms of study design) in responding to this diversity of questions: while a good ethnographic study may illuminate how people react to a reform, it will not be very good at revealing impacts on service utilization, for example. No one could support “naïve critical appraisal”, but getting agreement around appropriate critical appraisal for anything other than a straight effectiveness review appears difficult.

I have had several conversations recently with people in policy and decision-making positions in international or donor organizations who have expressed frustration with the rather limited insights that they find in systematic reviews of health systems questions. This goes much deeper than missing some studies, or not using quite the right search terms. The primary concern I have heard expressed is that such reviews can be time consuming and costly, and yet, despite significant efforts to search and extract data from the literature in a systematic fashion, they fail to deliver new insights and often remain frustratingly inconclusive. I am not convinced that all reviews need to be systematic reviews or exhaustive in their search. I think that in addition to Cadillac effectiveness reviews there is a need for scoping reviews, and for relatively quick and dirty reviews that provide relevant evidence in a timely fashion.
Even more importantly, we need to be thinking more carefully about how we engage stakeholders in systematic reviews, whether this means filtering questions and evidence to ensure their relevance, or asking stakeholders to help interpret review findings based on their own experiences. The work of Sandy Oliver and colleagues at the EPPI-Centre is very helpful in this respect. Clearly any research endeavor needs to be systematic, explicit about the methods employed, and rigorous (in the sense that the methods used should be appropriate to the question asked). In this vein we clearly need guides to systematic review methods and processes, but we also need open and enquiring minds that are willing to experiment with alternative types of review questions and systematic review approaches.

Sara, my apologies for the delay in your comment being posted and in my responding. Bill Savedoff has also posted some thoughtful comments. I hope it’s ok with you if I reply to yours and his together below.

Submitted by Bill Savedoff on
I appreciated Adam's blog because I do think that the growing number of systematic reviews for research in applied policy (not just health) needs some attention to standards and quality. I also think Sara is right in saying that this particular article is not the most appropriate one for Adam to use in highlighting this point. I say this because the paper's focus, as I understand it, was not to draw policy conclusions from the evaluations; rather, it was a critique of the literature for failing to address health system questions in ways relevant to policy. The paper is very carefully qualified from beginning to end about what it is, and isn't, trying to do.

For all that, it still looks like a systematic review. It is written in the format of a systematic review, and it draws conclusions by referring to included studies which are supposed to be representative in some way. The methods section presents information that is normally the basis for judging whether the included studies are comprehensive and representative. I suspect that including searches of RePEc and Google Scholar would have strengthened the conclusion that very little "holistic" analysis is being published. Yet that is exactly what is necessary for the validity of the paper - evidence from the entire body of literature, not just from Medline and Embase.

I'm more troubled by the paper's dismissal of the need to assess study design and methods. I noted this line: "The objective is not to appraise the quality of evidence, e.g. whether the evaluation used appropriate study design or methods. It is rather to assess whether they ask a broader set of questions relevant for policy making." Here, Adam's critique is quite accurate. If the paper had found that 50% of the studies *did* address relevant questions but had failed to consider whether those studies were done well or not, the reader would be left with a very skewed sense of the condition of the literature on this subject.
I'm sympathetic to the straitjacket argument that Sara references. A well-done complex system study is unlikely to pass muster with a systematic review process tailored for clinical trials. But I don't think the recommendations made by Waddington or Wagstaff preclude systematic reviews that incorporate such studies. All they demand is that the evolving standards for addressing bias and establishing proof be explicitly treated in the inclusion criteria and appropriately incorporated when reaching conclusions. For systems analysis to move forward, we need to have ways of judging the quality of a study. And to do that, the *concepts* of bias, validity, and reliability still hold even if, currently, we don't have a consensus on which feasible methods will get us there.

Bill and Sara, thanks both of you for these great comments. I do appreciate that Adam et al. wasn’t billed as a systematic review, and like Bill I have a lot of sympathy with Sara’s point that not all reviews should fit into the systematic review straitjacket. That said, like Bill, I wonder whether there aren’t some key principles of the systematic review approach that ought to be a part of any review. True, Adam et al. weren’t seeking to review the evidence on impacts that the studies unearthed, but rather to ask whether the studies were trying to get at a broad set of impacts. But does that justify looking at just a couple of databases? It tells us whether a specific set of journals is publishing holistic studies. But unless we can be sure that the journals indexed by other databases are similar, doesn’t it raise some doubts about the generalizability of the article’s “key messages”?

Like Bill, I’m also not convinced that assessing holism eliminates the need to judge the validity of a study’s methods. Would we be interested in a bunch of holistic studies using discredited methods? This raises a broader question about the objective of Adam et al. Isn’t there scope to assemble a comprehensive picture of a health system from partial studies? Might we not prefer to put together an overall assessment of health system X from a bunch of partial but well executed studies than rely on a bunch of holistic but poorly executed studies of the same health system? Would we really want to discourage someone from doing an opportunistic retrospective evaluation of a reform simply because the data they had stumbled on didn’t allow them to shed light on all outcomes of interest, or didn’t allow them to tell a fully fleshed-out story about the process? Just a thought!

Submitted by Rich Mallett on
Late to this, but great post. I especially like the idea of living, breathing reviews - it certainly makes a lot of sense given the speed at which certain evidence bases grow. Some colleagues at the Overseas Development Institute and I wrote a briefing paper on exactly this topic earlier this year (although our focus was on the use of systematic reviews within development studies more broadly). We came to some very similar conclusions... We also put out an accompanying blog, which attracted a fairly animated debate about evidence-making and -building. (I realise these come across as a couple of very shameless plugs...)
