
If you want your study included in a systematic review, this is what you should report


This post is co-authored with Birte Snilstveit of 3ie
Impact evaluation evidence continues to accumulate, and policy makers need to understand the range of evidence, not just individual studies. Across all sectors of international development, systematic reviews and meta-analysis (the statistical analysis used in many systematic reviews) are increasingly used to synthesize the evidence on the effects of programmes. These reviews aim to identify all available impact evaluations on a particular topic, critically appraise studies, extract detailed data on interventions, contexts, and results, and then synthesize these data to identify generalizable and context-specific findings about the effects of interventions. (We’ve both worked on this, see here and here.)
But as anyone who has ever attempted to do a systematic review will know, getting key information from included studies can often be like looking for a needle in a haystack. Sometimes this is because the information is simply not provided, and other times it is because of unclear reporting. As a result, researchers spend a long time trying to get the necessary data, often contacting authors to request more details. Often the authors themselves have trouble tracking down some additional statistic from a study they wrote years ago. In some cases, study results can simply not be included in reviews because of a lack of information.

This is of course a waste of resources: funds are spent on studies where the results are unusable, systematic reviewers waste time chasing standard deviations or programme descriptions, and the results of systematic reviews end up being less useful. This waste can be avoided easily by better reporting.
In this post we summarize the information researchers need to report for impact evaluations to be more useful and easily included in a systematic review of intervention effects.
The numbers we need to use your results most effectively in a systematic review
Studies typically use different scales and measures to assess the same or similar outcome constructs. This makes it difficult to combine and compare results across studies, which is one of the objectives of systematic reviews. Therefore, a key step in a systematic review is to convert results from individual studies into a common metric – a standardized effect size.

This is essential for meta-analysis, but even systematic reviews that don’t use meta-analysis benefit from more easily comparable effect sizes. That said, standardized effect sizes aren’t automatically comparable either, due to differences in underlying populations – discussed here – or, in education evaluations, due to differences in test make-up – discussed here. So they should be used with discretion.

To help systematic review authors calculate a standardized effect size researchers should report the following:

  • Outcome data separately for treatment and control group (means for continuous outcomes, frequencies for binary outcomes, regression coefficients for adjusted estimates);
  • Sample standard deviation pooled across treatment and control groups;
  • Standard error or confidence intervals of the treatment effect (for cluster RCTs, standard errors should be adjusted for clustering, and the intra-cluster correlation should be provided – here is a simple way to calculate the ICC in Stata);
  • Sample size in treatment group (if clustered, number of clusters + average number of students per cluster) at baseline and at follow up;
  • Sample size in control group (if clustered, number of clusters + average number of students per cluster) at baseline and at follow up.
That’s it. By our count, that’s just 6 variables for a non-clustered impact evaluation, and 9 for a cluster randomized trial. Not so hard. Now that your study is in the review, you can help us make the review better.
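To make this concrete, here is a minimal sketch (in Python, with hypothetical numbers) of what a reviewer does with those reported quantities: convert them into a standardized mean difference – Cohen’s d with Hedges’ small-sample correction – and its standard error, using the standard meta-analysis formulas.

```python
import math

def standardized_effect_size(mean_t, mean_c, sd_pooled, n_t, n_c):
    """Hedges' g: standardized mean difference with small-sample correction."""
    d = (mean_t - mean_c) / sd_pooled            # Cohen's d
    df = n_t + n_c - 2
    j = 1 - 3 / (4 * df - 1)                     # Hedges' correction factor
    g = j * d
    # Approximate standard error of g (standard meta-analysis formula)
    se = math.sqrt((n_t + n_c) / (n_t * n_c) + g ** 2 / (2 * (n_t + n_c)))
    return g, se

# Hypothetical reported values: group means, pooled SD, group sample sizes
g, se = standardized_effect_size(0.65, 0.50, 0.30, 120, 115)
```

With these few numbers a reviewer can compute both the effect size and its precision; if the pooled SD or the group sample sizes are missing, neither calculation is possible.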

Methodological details that will help with appraisal
Systematic reviewers also need methodological details to ensure studies are combined appropriately and to critically appraise the risk of bias of individual studies. The risk of bias assessment allows reviewers to gauge the certainty of systematic review findings by evaluating whether the researchers were able to avoid factors that are known to bias results, whether due to selection, attrition, selective reporting, or other sources of bias.
The figure below provides the result of such a risk of bias assessment from a recent systematic review. Studies are rated as having high, low or unclear risk of bias across seven categories of bias. An unclear rating is typically given when the study does not provide enough information for reviewers to judge. As the figure highlights, for many categories the risk of bias remains unclear for almost 40 per cent of studies. This shows how limitations in study reporting can limit our ability to make clear statements about the certainty of the evidence.
To help this assessment researchers should report the following:
  • Unit of allocation and unit of analysis (and if they differ, whether standard errors were corrected for clustering)
  • The type of treatment estimate provided (e.g., is it an average treatment effect or treatment on the treated?) 
  • Details about treatment allocation, including how any randomization was implemented and if it was successful (baseline balance)
  • Clearly report and justify methods of analysis, especially if you use unusual approaches. (In other words, convince readers that you’re not just reporting the results of the analysis with the most interesting results.)
  • Describe the conditions in the comparison group, including distance to the groups receiving the intervention and any steps to address risks of contamination.
  • Report results for all primary and secondary outcomes clearly, including results that were not statistically significant or negative.
Figure: Result of risk of bias assessment (Snilstveit et al., 2015)
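To illustrate why the unit-of-allocation details in the first bullet above matter, a standard back-of-the-envelope correction inflates the variance by the Kish design effect, computed from the intra-cluster correlation (ICC) and the average cluster size. The numbers below are hypothetical:

```python
def design_effect(icc, avg_cluster_size):
    """Kish design effect: how much clustering inflates the sampling variance."""
    return 1 + (avg_cluster_size - 1) * icc

def effective_sample_size(n, icc, avg_cluster_size):
    """Sample size a simple random sample would need for the same precision."""
    return n / design_effect(icc, avg_cluster_size)

# Hypothetical cluster RCT: 1,000 students in clusters of 25, ICC = 0.10
deff = design_effect(0.10, 25)                  # variance inflated 3.4x
n_eff = effective_sample_size(1000, 0.10, 25)   # roughly 294 effective observations
```

If a study analyses clustered data as though observations were independent, its standard errors can be dramatically too small – which is exactly the kind of problem the risk of bias assessment is trying to detect, and why the ICC and cluster counts need to be reported.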
Intervention design and implementation: What is the what that works?

The phrase ‘what works’ is commonly used in debates about the use of evidence to inform decision making. But often the description of intervention design and implementation is so vague in impact evaluations that even if the ‘what’ is found to be effective, it would be difficult to replicate the programme elsewhere. This limits the usefulness of individual studies in and of themselves, but in systematic reviews the issue is magnified.

In the worst case this can lead to misleading conclusions; more routinely, it limits the usefulness of systematic review findings. This can be avoided if researchers report details of intervention design, implementation and context.

Finally, it is not enough to know if something is effective. We also need to know what it cost and to be able to compare costs across different programme options. Information on resource use and costs is rarely provided in impact evaluation reports, so few systematic reviews are able to say anything about costs. (J-PAL has some useful resources on how to do cost analysis.)
To make study results more useful, consider the following:
  • Describe the intervention design in sufficient detail for replication (what was delivered, by whom, for how long). If you can’t put it all in the body of the paper, use a technical appendix.
  • Describe what actually happened: Was everything delivered as planned? If not why not?
  • Provide a description of the context where the programme was delivered, including demographic characteristics of the population and relevant social, cultural, political and economic characteristics of the context.
  • Report details about resource use and costs to facilitate cost-effectiveness analysis. (What to report? There are good resources here and here.)
Most existing checklists are generic and the specific details necessary to address each point above will vary between types of programmes. But the In-Service Teacher Training Survey Instrument (ITTSI) developed by Popova et al. provides an example of a tool developed specifically to document the design and implementation details of in-service teacher training programmes.
Better reporting is a cheap way of increasing the value of research
Studies will be more useful if they are better reported. And improving study reporting is a relatively cheap way of increasing the value of research. It will require some additional effort by researchers in documenting study procedures. We may need to develop or adapt existing guidelines. But the cost of doing so is very low compared to the waste of resources on studies that cannot be used because of poor reporting.

  • In health, the issues of inconsistent reporting are being addressed through the development of reporting guidelines and their enforcement by journal editors and research funders. Such guidelines have yet to be developed for the social sciences more broadly, but the CONSORT guidelines for pragmatic trials are a good place to start. Researchers from the University of Oxford are also working on a CONSORT extension for social and psychological interventions.


Submitted by Jason Kerwin on

This is a very helpful post from the perspective of someone running impact evaluations, thanks for writing it. A question about one point:

>Sample standard deviation pooled across treatment and control group

Do you mean the pooled SD as defined here:

Or the overall SD of the dataset?

Do you advocate for effect sizes to be computed in terms of the pooled SD, the overall SD, or the SD of the control group? I have found standard references on this to be pretty unclear, and an intervention that affects not just means but higher moments of the outcome distribution can have very different effect sizes depending on which of those three options you choose.

Submitted by Birte Snilstveit on

Thanks for your question Jason. You are right to highlight that the choice of SD can influence results. Our list of data to report is based on the formula provided in most standard reference texts, which calls for the SD pooled across the treatment and control groups. This assumes you have a two-group design and that the SDs of the two groups are similar.

But as you note, if the treatment changes not just the mean but also the distribution, then using the pooled SD could bias results. On the other hand, if you use the control SD only, you have a smaller sample, so the SD will be estimated less precisely. In that case the baseline SD for the full sample may be a better choice.

We do not advocate for a specific approach to calculating effect sizes. This is best assessed by researchers based on the study design and characteristics of the data, with a consistent approach to effect size calculation within a single review based on what is appropriate and feasible.

The 'short list' provided in our post is really that - the minimum researchers should report to allow us to use their studies more easily. The better the reporting the easier it is for others to appraise and use studies in the most appropriate way. So if studies report both the pooled SD and SD for separate groups, as well as baseline SD for the full sample (assuming you have baseline data) that would be better.
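To see how much the choice of SD can matter when, as Jason notes, the intervention also changes the spread of outcomes, here is a small illustration with entirely made-up data:

```python
import math
import statistics

# Hypothetical outcomes: the treatment shifts the mean and widens the spread
treat = [2.0, 2.5, 3.0, 3.5, 4.0, 5.0]
control = [2.0, 2.2, 2.4, 2.6, 2.8, 3.0]

mean_diff = statistics.mean(treat) - statistics.mean(control)
sd_t = statistics.stdev(treat)
sd_c = statistics.stdev(control)
# Pooled SD (equal group sizes, so a simple average of the variances)
sd_pooled = math.sqrt((sd_t ** 2 + sd_c ** 2) / 2)

d_pooled = mean_diff / sd_pooled    # about 1.03
d_control = mean_diff / sd_c        # about 2.23: more than double
```

Reporting both group SDs (and the baseline SD where available) lets reviewers make, and readers check, whichever choice is most defensible for a given study.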
