In my post “Should you trust a medical journal?” I think I might have been a bit unfair. Not to The Lancet, which I have since discovered, via comments on David Roodman’s blog, has something of a track record of publishing sensational but not exactly evidence-based social science articles, but rather to Ernst Spaan et al., in challenging the systematicness of their systematic review of health insurance impacts in developing countries. It’s not that I now think Spaan et al. did a wonderful job. It’s just that I think they probably shouldn’t have been singled out in the way they were.
Last week saw the publication in Health Policy and Planning of another health systems review – albeit not billed as a “systematic review” – by Taghreed Adam and others. The review was restricted to work published in 2009 and 2010, and was broader than health insurance. I became suspicious of it when I saw that it omitted the 2009 and 2010 articles listed at the end of my “Should you trust a medical journal?” blog post, as well as several others I know of, including 2009 and 2010 papers by Rodrigo Moreno Serra and me on reforms in Europe and central Asia in health financing and in hospital payment methods. I’m sure – for reasons explained below – there are lots of other key papers omitted as well.
I’m beginning to see a pattern, and it’s a worrying one: systematic reviews in the health systems field seem to be systematically unsystematic!
There are two bits of good news. First, the field’s a young one, and now’s a good time to nip this unsystematicness in the bud. Second, last week also saw the publication of an excellent paper in the Journal of Development Effectiveness by Hugh Waddington and others on “How to do a good systematic review of effects in international development: a tool kit”. This toolkit is a goldmine for all writers of systematic reviews. It also helps us see where these two health systems systematic reviews went wrong.
Pitfall #1 – too few databases
A key idea behind a systematic review is to do a thorough and comprehensive literature search. Waddington et al. urge authors of systematic reviews in international development to start off with no fewer than four types of database: (a) multidisciplinary databases like Web of Science and Google Scholar, (b) international development databases like the Joint Libraries of the World Bank and the IMF, (c) social science databases like EconLit and RePEc for economics, and PsycINFO for behavioral studies, and (d) subject-specific databases like Medline for health.
To give them credit, Spaan et al. trawled through no fewer than 20 databases, including ScienceDirect, a couple of international development databases, EconLit, RePEc, the International Bibliography in Social Sciences, CSA Sociological Abstracts, and Medline. By contrast, Adam et al. searched just two online databases – Medline and Embase. The latter’s new to me. I’m not surprised – it covers the biomedical literature, and I’m an economist. Adam et al. justify their limited search on the grounds that their review wasn’t really supposed to be a systematic review as such, but just an illustration of how the field has evolved since they last looked at it. I must say I’m not at all convinced, especially as the caveat is buried toward the end of the paper. Much better to have done a proper systematic review, or failing that, to have given the paper a subtitle like “A review of a spectacularly thin sliver of the literature”.
Pitfall #2 – overreliance on automation using search terms
Waddington et al. make another really important point: while the abstracts of medical journals tend to conform to a predefined structure, and require the use of a restricted vocabulary to describe the setting, the type of study, the outcomes studied, etc., social science journals adopt a much more laissez-faire approach. An impact evaluation may not be pitched as such in either the title or the abstract. The type of intervention may not be spelled out in generic terms – proper names of programs are commonly used. And the outcomes used in the study may not be listed in the title or abstract. Even the setting may not be specified in the abstract. This means that in the social science literature it is highly risky to rely on a semi-automated search. Rather, a lot of manual searching is needed, with a lot of snowballing, including chasing up citing articles using Google Scholar and other citation tools like Scopus and Web of Science.
My sense is that both studies probably fell down at this point. Spaan et al. spell out their search terms. They seemed to require the study contain the terms “health insurance” and “developing country” or variations thereof. A study referring to the proper name of a health insurance program (like “New Cooperative Medical Scheme”) and the name of a specific country (e.g. “China”) would presumably not get picked up. Adam et al. included their search terms in an online web annex. Unfortunately as of 9:35 pm EST on Tuesday 9 October 2012, the page “could not be found”. My guess is that there was overreliance here too on a semi-automated search strategy well suited to medical journals but ill suited to nonmedical journals. Of course, since very few nonmedical journals were searched, the point is rather academic.
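To see how easily this can happen, here’s a minimal sketch of a keyword filter of the kind described above. Everything in it is an assumption for illustration: the term lists are my guesses at the style of search, not the strings Spaan et al. actually used, and the two records are made up.

```python
# Hypothetical search terms; NOT the actual strings used in either review.
INSURANCE_TERMS = ["health insurance"]
SETTING_TERMS = [
    "developing country",
    "developing countries",
    "low-income country",
    "low-income countries",
]


def contains_any(text, terms):
    """Return True if any of the terms appears in the text (case-insensitive)."""
    lowered = text.lower()
    return any(term in lowered for term in terms)


def keyword_hit(record):
    """A record is retrieved only if it mentions BOTH an insurance term
    and a setting term somewhere in its title or abstract."""
    text = record["title"] + " " + record["abstract"]
    return contains_any(text, INSURANCE_TERMS) and contains_any(text, SETTING_TERMS)


# Made-up records for illustration only.
records = [
    {
        "id": "A",
        "title": "Health insurance and access to care in developing countries",
        "abstract": "We review the evidence on enrolment and utilization.",
    },
    {
        "id": "B",
        "title": "The New Cooperative Medical Scheme and household spending in rural China",
        "abstract": "We estimate the effect of the scheme on out-of-pocket payments.",
    },
]

hits = [r["id"] for r in records if keyword_hit(r)]
missed = [r["id"] for r in records if not keyword_hit(r)]

print("retrieved:", hits)
print("missed:", missed)
```

Record B is squarely about health insurance in a developing country, but because it uses the program’s proper name and the country’s name, the filter never sees it. Manual searching and snowballing from citations are what catch studies like this.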
Pitfall #3 – naïve critical appraisal
Studies vary in the degree to which they provide compelling evidence on causal effects, and an important task facing the author of a systematic review is to grade studies in this regard. Some will be well below the inclusion threshold, and should be excluded. One nice approach is to group included studies according to the methods used (some may have a larger expected bias than others) and show the estimated effects of each of the studies graphically, clustered by group: Yu-Chu Shen et al. do this to great effect in their NBER working paper on hospital ownership and financial performance.
Again Waddington et al. make some good points. They emphasize the importance of not falling into the trap of assuming that the RCT is the gold standard. In fact, as they point out, an RCT can perfectly easily be implemented badly and do worse than a non-randomized study. Yet despite this, there is little understanding – especially in the medical journals, and I suspect health journals too – of quasi-experimental methods. For a couple of great introductions to the field of impact evaluation, see here and here.
How did the two health systems systematic reviews fare? Spaan et al. developed – but did not report in the paper – a composite index of quality based on 19 indicators. One is bias. However, the criterion for assessing bias isn’t explained, and – much to my surprise – bias gets the same 0-2 points as the other 18 indicators. So a study scoring 0 on bias but getting 36 across the other indicators would overall be considered “high-quality”. I find this mind-boggling. If the methods used in a study were likely to result in biased estimates, the authors should have thrown the study out, however strong it was on other dimensions.
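The arithmetic is worth spelling out. The sketch below assumes 19 indicators in total, each scored 0 to 2, and simply adds them up; the indicator names and the “gated” alternative are my own illustration, not anything taken from Spaan et al.

```python
N_INDICATORS = 19        # total number of quality indicators, including bias
MAX_PER_INDICATOR = 2    # each indicator is scored 0, 1 or 2


def additive_score(scores):
    """Composite index: a plain sum, so bias counts no more than any other item."""
    return sum(scores.values())


def gated_score(scores, gate="bias"):
    """Hypothetical alternative: a study scoring 0 on the gate indicator
    is excluded (None) regardless of how it does elsewhere."""
    if scores[gate] == 0:
        return None
    return sum(scores.values())


# A study with top marks on the 18 non-bias indicators and 0 on bias.
scores = {f"indicator_{i}": MAX_PER_INDICATOR for i in range(1, N_INDICATORS)}
scores["bias"] = 0

total = additive_score(scores)
max_total = N_INDICATORS * MAX_PER_INDICATOR
share = total / max_total

print(f"additive score: {total} out of {max_total} ({share:.1%})")
print("gated score:", gated_score(scores))
```

Under the additive index, the study gets 36 out of a possible 38, or about 95% of the maximum, despite scoring zero on bias. Treating bias as a gate rather than as one item among many would exclude it.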
By contrast, Adam et al. don’t offer a view about what constitutes credible evidence on impacts. They list studies using an RCT design alongside studies labeled “plausibility” designs and studies labeled “comparative cross-country analysis”. As far as I can make out from the paper, a “plausibility” design is a before-and-after study. In my book, that’s not very plausible at all, given that most variables of interest in the health sector vary over time. Cross-country comparative studies can be compelling, but only if a credible strategy has been used to overcome the likely endogeneity of the program indicator. A simple comparison – as I suspect the studies in the paper are – typically isn’t credible. Yet in the review, the three types of study are presented on a par with one another, as if they were all equally reliable.
Some parting thoughts
Systematic reviews hold great promise: because they’re systematic, they’re likely to carry a lot of weight with policymakers. But these two health systems systematic reviews and the Waddington et al. toolkit make it clear that it’s perfectly possible to do a systematic review unsystematically, and in the process give a potentially biased view of the literature.
Doing a systematic review according to the Waddington et al. gold standard requires care. But it also requires the right team, and should include someone who knows statistics and econometrics.
I also wonder whether one element of good practice might be to get authors to post the results of their preliminary literature search on the Internet for a month or two and actively seek feedback on them from the global research community. It’s the analogue of the priest asking at a wedding whether anyone knows any reason why the couple shouldn’t be married! The systematic review team would be asking whether anyone knows of relevant work they’ve missed. Editors could require such a public consultation as a condition of submission.
One final thought. How do systematic reviews get updated as new studies get published? One session caught my eye on the online program of the Cochrane Colloquium that just finished in New Zealand. The authors are developing software that “could be used to develop living reviews using online author communities to continually update reviews as new evidence becomes available.” That sounds really promising. A bit of a threat to traditional journals, of course, but if it means fewer published one-off unsystematic systematic reviews, and more systematic reviews that are systematic, kept updated, and have passed the test of online scrutiny, then I think I’m for it!