In a world saturated with research and policy recommendations, it's more important than ever to synthesize evidence well. A single study in a single setting may not yield estimates of impact that can be generalized to other contexts; we may worry about spurious effects. Aggregating results across multiple studies can give us more confidence in a finding or highlight important heterogeneity in impacts. But here’s the catch: evidence aggregation is not automatically bias-proof. In fact, if done poorly, it can reproduce or even amplify the biases we hope to avoid.
This is the motivation behind our ongoing work on evidence aggregation and our meta-analysis of preprimary education programs around the world. In this blog post, we explain how bias can sneak into the evidence synthesis process, share strategies to minimize it, and describe our own experience applying these principles.
By combining treatment effects and their precision from multiple studies, meta-analysis and other types of systematic reviews can offer more precise estimates of impact when individual studies lack sufficient sample size, indicate how effects may exhibit systematic heterogeneity (see here for a deworming meta-analysis and here for a growth mindset meta-analysis), and increase confidence in the robustness of findings that we might otherwise worry are context-specific. For those giving policy advice (and those acting on it), this type of analysis can be invaluable.
But simply combining studies isn’t enough. If we aggregate biased studies or aggregate them in biased ways, the resulting conclusions may be just as flawed as any single weak study. Even well-meaning researchers can unintentionally introduce bias at multiple stages of the synthesis process.
Types of bias in synthesis
Bias can sneak in at any stage:
- Study-level bias: If the studies included in the review have design flaws, such as non-random assignment, high attrition, weak compliance, or poor measurement, those flaws will influence the average effect size. Bias in, bias out.
- Synthesis-level bias: Even if individual studies are strong, the review can be biased by how studies are selected, how data are extracted, or how study attributes are classified. This includes publication bias (only including published studies), the omission of relevant studies (for example, including studies solely from a discipline familiar to the researcher or only those from the first ten pages of search results), selective outcome reporting (only extracting “headline” results), or ignoring important variation when classifying study attributes (not distinguishing among different interventions or outcome measures).
- Analytical bias: Even if a researcher manages to find all relevant studies and extracts all important attributes at multiple levels of disaggregation (and they haven’t reached retirement age when they’re finished), the way in which they aggregate results can lead to very different conclusions. Pooling estimates that differ in key ways—different populations, different outcome measures, different counterfactuals—can obscure meaningful variation. Neglecting to standardize effect sizes across studies can make some interventions appear to have high impact simply because of the properties of the outcome measure rather than the true effectiveness of the intervention. Likewise, only eyeballing effects in a forest plot (or in a narrative review) or taking a simple average ignores the uncertainty associated with each estimate of impact; the toy example below makes this concrete.
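To see why, consider a minimal sketch with made-up numbers: a single imprecise study with a large effect pulls a naive average upward, while the standard fixed-effect approach of weighting each estimate by its precision (the inverse of its variance) keeps the more precise studies in charge. The effect sizes and standard errors below are purely illustrative.

```python
# Toy illustration (hypothetical numbers): simple average vs. inverse-variance
# weighting. One noisy outlier dominates the naive mean but not the weighted one.
import numpy as np

effects = np.array([0.10, 0.15, 0.12, 0.80])  # standardized effect sizes (made up)
ses = np.array([0.04, 0.05, 0.06, 0.40])      # their standard errors (made up)

naive_mean = effects.mean()

weights = 1.0 / ses**2                         # precision weights
pooled = np.sum(weights * effects) / np.sum(weights)
pooled_se = np.sqrt(1.0 / np.sum(weights))     # SE of the fixed-effect pooled estimate

print(f"Simple average:          {naive_mean:.3f}")
print(f"Precision-weighted mean: {pooled:.3f} (SE {pooled_se:.3f})")
```

In this invented example the simple average is roughly 0.29, while the precision-weighted estimate stays close to 0.12, which is what the three precise studies actually suggest.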
This is why global efforts like the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines and the GRADE (Grading of Recommendations Assessment, Development and Evaluation) approach have emerged. These initiatives emphasize transparency, standardization, and good practice in how we search for, select, analyze, and report on studies and how we assess the bias associated with an evidence review.
A case in point: Meta-analysis of preprimary education programs
We designed our meta-analysis of preprimary education interventions to address these types of biases. Our review aggregated evidence from 56 experimental and quasi-experimental studies evaluating center-based preprimary education programs for children aged 3–6. We focused on outcomes related to cognitive and social-emotional development during the preprimary and basic education years, and then on educational attainment, health, labor market participation, and criminal behavior in adulthood. Here’s how we tried to minimize bias at each stage of the aggregation process.
Step 1: Systematic search to minimize selection bias
We began with a transparent and replicable search process. Using multiple academic databases and expert consultations, we compiled a list of nearly 400 studies. We followed PRISMA’s guidance to document our search terms, sources, and inclusion criteria, and we shared a flowchart that shows how we narrowed our sample from 397 studies to 56 high-quality ones.
We cast a wide net. Our review sought to include multiple disciplines, both published and working papers, and studies from both high-income and low- and middle-income countries.
All four authors separately searched for evidence to minimize the risk of excluding relevant studies. We also searched the bibliographies of all the studies that passed our screening criteria and asked known experts to identify studies we may have missed.
Step 2: Screening and quality review to avoid inclusion of weak designs
We screened studies to assess their quality and relevance, with each study reviewed independently by at least two authors. Studies were excluded if they lacked a credible comparison group or had poorly defined counterfactuals, exhibited high or unreported attrition, failed to transparently report outcomes or estimation methods, or used ambiguous or nonstandard measures. Ultimately, 75 percent of the included studies were randomized controlled trials. This emphasis on high-quality evidence helped minimize threats to internal validity and ensured that our conclusions were grounded in rigorous research.
Step 3: Data extraction in a way that prevents selective reporting
A key risk in evidence aggregation is the temptation to extract only the most favorable results—i.e., the largest or most significant impacts. We avoided this by committing to extract all relevant outcomes across domains and time periods. Specifically, our extraction process followed three principles:
- Comprehensiveness: We extracted 699 treatment effects across the 56 studies, of which 572 were accompanied by sufficient information to be standardized and included in the meta-analysis.
- Consistency: We mapped outcomes into standardized domains (e.g., language, math, social-emotional skills) using a predefined classification.
- Double coding: At least two researchers verified each extracted estimate, and disagreements were resolved by consensus.
Step 4: Aggregation and analysis to account for variation
After extracting the data, we standardized effect sizes across studies, following Wilson (2011) for binary outcomes and other standard approaches for non-binary outcomes.[1] We then applied robust variance meta-regression to estimate average effect sizes, accounting for the likely correlation among multiple outcomes extracted from the same study. This approach allowed us to weight effects appropriately and adjust standard errors for within-study correlation.
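The sketch below shows the flavor of these two steps; it is not our code or exact specification (the appendix has that). It standardizes continuous outcomes as Hedges' g and pools them with inverse-variance weights and standard errors clustered by study, a simplified stand-in for robust variance meta-regression. The input file and column names (mean_t, sd_t, n_t, study_id, and so on) are hypothetical, and the binary-outcome standardization per Wilson (2011) is not shown.

```python
# Simplified sketch of standardization and pooling (not the authors' code).
import numpy as np
import pandas as pd
import statsmodels.api as sm

def hedges_g(mean_t, sd_t, n_t, mean_c, sd_c, n_c):
    """Standardized mean difference with the small-sample (Hedges) correction."""
    sd_pooled = np.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / (n_t + n_c - 2))
    d = (mean_t - mean_c) / sd_pooled
    j = 1 - 3 / (4 * (n_t + n_c) - 9)           # small-sample correction factor
    g = j * d
    var_g = j**2 * ((n_t + n_c) / (n_t * n_c) + d**2 / (2 * (n_t + n_c)))
    return g, var_g

df = pd.read_csv("extracted_effects.csv")        # hypothetical extraction file
df["g"], df["var_g"] = hedges_g(df["mean_t"], df["sd_t"], df["n_t"],
                                df["mean_c"], df["sd_c"], df["n_c"])

# Intercept-only weighted regression: the constant is the pooled average effect,
# with standard errors clustered by study to allow for within-study correlation.
X = np.ones((len(df), 1))
wls = sm.WLS(df["g"], X, weights=1.0 / df["var_g"])
res = wls.fit(cov_type="cluster", cov_kwds={"groups": df["study_id"]})
print(res.summary())
```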
We disaggregated the effects by country income level (comparing high-income to low- and middle-income countries), by type of intervention (distinguishing between expansions in access and efforts to improve quality), and by the timing of outcome measurement (preprimary, school-age, and adulthood). These disaggregations helped us examine whether the overall average effects concealed important variation—and to better understand where and how preprimary education programs are most effective.
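In the same spirit as the sketch above, disaggregation can be expressed as adding a moderator to the weighted regression rather than estimating a single pooled effect. The `lmic` indicator for low- and middle-income countries and the input file below are hypothetical.

```python
# Sketch of a moderator analysis (hypothetical data, not the authors' code).
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("standardized_effects.csv")     # hypothetical file with g, var_g, lmic, study_id
X = sm.add_constant(df["lmic"].astype(float))    # intercept + income-level dummy
res = sm.WLS(df["g"], X, weights=1.0 / df["var_g"]).fit(
    cov_type="cluster", cov_kwds={"groups": df["study_id"]}
)
print(res.params)  # const: average effect when lmic == 0; lmic: difference for LMIC studies
```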
Step 5: Transparency and replicability
Transparency is essential to building trust in empirical findings. In our review, we made all data and code publicly available in an open repository and documented every assumption and decision made during data cleaning, coding, and analysis. In addition, we shared comprehensive metadata on the interventions, samples, estimation strategies, and outcome domains used in the review. To assess the certainty of our findings, we applied the GRADE tool, which confirmed that the body of evidence was strong, consistent, and precise.
A call for better aggregation practices
Evidence aggregation is here to stay—and that's a good thing. But if we want the conclusions of reviews to be reliable, we need to apply the same level of rigor and transparency to synthesis as we do to individual studies.
Here’s what we recommend:
- Follow PRISMA when conducting and reporting systematic reviews.
- Use tools like GRADE to assess evidence certainty.
- Share your data and code so others can replicate or extend your analysis.
While these practices may not eliminate all biases associated with evidence aggregation, they can increase confidence in your findings and the take-up of your recommendations. When it comes to informing policy, how we synthesize evidence matters just as much as what the evidence says.
[1] See the “Methods” section of the paper’s appendix for more details.