Open science practices have reshaped development research over the past 15 years, with the adoption of data and code sharing, pre-analysis plans, and reproducible research standards. That progress is now at an inflection point, as artificial intelligence (AI) is rapidly reshaping how development evidence is produced, synthesized, and communicated. AI tools offer new opportunities to strengthen transparency, reproducibility, and access to research, but their opacity and non-determinism also challenge existing open science norms. The 12th annual Measuring Development conference — organized by the World Bank, the Center for Effective Global Action (CEGA) at UC Berkeley, and the Becker Friedman Institute at the University of Chicago — explored how open science must evolve to keep development research transparent, reproducible, and credible in the age of AI.
Edward Miguel, Distinguished Professor of Economics and CEGA Faculty Director, delivered a keynote that situated these debates within the broader evolution of the open science movement over the past decade. While practices such as preregistration, replication initiatives, and data-sharing norms have significantly reshaped empirical research, longstanding challenges remain. Miguel presented new evidence showing that only 42 percent of pre-registered hypotheses from sampled economics studies had publicly available results years later. He argued that AI tools could help scale transparency efforts, but that the “latest ChatGPT model is not a substitute for cultural change or institutional change.”
The featured panel homed in on how institutional leadership can foster open research while safeguarding privacy. Haishan Fu, the World Bank Group’s Chief Statistician; Arianna Legovini, Director of Development Impact at the World Bank Group; and Markus Goldstein, Vice President and Senior Fellow at the Center for Global Development, emphasized that transparency today requires more than simply making data available. Data must be not only open but also AI-ready: structured, documented, machine-readable, and accompanied by safeguards to ensure that AI systems use it responsibly. Meeting that standard is essential to maintaining public trust and sustaining open data. Participants warned that without deliberate investments in high-quality, representative data, particularly in low- and middle-income countries, AI risks reinforcing existing inequalities rather than reducing them. As Haishan Fu put it, “Open data is no longer enough at this age of AI. Our push now is open and AI-ready data.”
Scaling transparency and reproducibility with AI
AI tools are already reshaping reproducibility and evidence synthesis practices. Jonas Weinert presented MetaScreener, an open-source tool that uses large language models (LLMs) to assist with systematic literature reviews. The approach combines AI-assisted screening with structured human adjudication and recall-risk estimation to maintain transparency while dramatically reducing manual review time. The work shows that the value of AI lies not only in automation, but also in creating auditable review processes where humans remain focused on difficult boundary cases.
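To make the idea of AI-assisted screening with human adjudication concrete, here is a minimal Python sketch. It is not MetaScreener's implementation: the `Record` class, the `score` field, the 0.2 and 0.8 thresholds, and the audit-based missed-rate estimate are all assumptions chosen for illustration. Confident cases are handled automatically, boundary cases go to a human queue, and a human-audited sample of the auto-excluded records gives a rough estimate of how many relevant studies may have been missed.

```python
from dataclasses import dataclass


@dataclass
class Record:
    record_id: str
    score: float  # hypothetical model-estimated probability of relevance, in [0, 1]


def triage(records, low=0.2, high=0.8):
    """Split records into auto-include, auto-exclude, and human-review queues.

    The thresholds are illustrative assumptions, not values from any real tool.
    """
    queues = {"include": [], "exclude": [], "human_review": []}
    for record in records:
        if record.score >= high:
            queues["include"].append(record.record_id)
        elif record.score <= low:
            queues["exclude"].append(record.record_id)
        else:
            # Boundary cases are left to human reviewers.
            queues["human_review"].append(record.record_id)
    return queues


def estimated_missed_rate(audited_labels):
    """Share of a human-audited sample of auto-excluded records that were relevant.

    audited_labels holds one boolean per audited record (True = actually relevant).
    """
    if not audited_labels:
        raise ValueError("audit sample is empty")
    return sum(audited_labels) / len(audited_labels)


records = [Record("a", 0.95), Record("b", 0.05), Record("c", 0.5), Record("d", 0.1)]
queues = triage(records)
```

Logging the thresholds, the queue assignments, and the audit sample alongside the final inclusion decisions is one way such a process could stay auditable.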
Reproducibility standards themselves will need to evolve as AI becomes more embedded in research. Lars Vilhuber introduced TRACE, a framework designed to verify computational workflows involving confidential or restricted-access data through standardized manifests and cryptographically signed records. Bruno Barbarioli presented the AI Replication Engine, an autonomous verification process designed to scale reproducibility checks through automated code execution and structured audit reporting. Aubrey Jolex similarly showed that while AI agents can now reproduce published analyses end-to-end, prompts, generated code, and validation checks may themselves need to become part of the reproducibility record.
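As a rough illustration of what a standardized manifest with a verifiable record might look like, the Python sketch below hashes each file in a replication package and authenticates the resulting manifest. It does not reflect TRACE's actual format or signing scheme: the manifest layout and function names are hypothetical, and a keyed hash (HMAC with a shared secret) stands in here for a true public-key signature, which the Python standard library does not provide.

```python
import hashlib
import hmac
import json


def build_manifest(files):
    """Build a manifest from a dict mapping file names to their byte contents."""
    return {
        "files": [
            {"name": name, "sha256": hashlib.sha256(content).hexdigest()}
            for name, content in sorted(files.items())
        ]
    }


def sign_manifest(manifest, key):
    """Return a hex HMAC-SHA256 tag over a canonical JSON encoding of the manifest."""
    payload = json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_manifest(manifest, signature, key):
    """Check the tag in constant time; any change to a file hash invalidates it."""
    return hmac.compare_digest(sign_manifest(manifest, key), signature)


key = b"demo-secret"  # placeholder key for illustration only
manifest = build_manifest({"analysis.do": b"regress y x", "data.csv": b"y,x\n1,2\n"})
signature = sign_manifest(manifest, key)
```

A verifier who cannot see the confidential data can still confirm that the files used in a run match the recorded hashes, which is the general property such records are meant to provide.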
AI-era open science will require new norms, but new tools will make it easier for authors to document and certify the credibility of their own research.
Balancing openness, privacy, and trust in the age of AI
The more development research relies on large datasets, the harder it becomes to balance openness and privacy. Methods previously considered sufficient for anonymization may no longer be enough if AI tools can infer or reconstruct sensitive information. Conference speakers presented new tools to support openness while maintaining confidentiality. Nitin Kohli presented a framework for targeted differential privacy, developed for humanitarian targeting programs used during the COVID-19 pandemic in Togo and Nigeria, demonstrating that privacy protections and program effectiveness are not necessarily opposing goals if designed together from the outset. Kaitlyn Webb similarly presented privacy-preserving tools for randomized controlled trial replication packages that preserve statistical validity while protecting participant confidentiality.
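For readers unfamiliar with differential privacy, the sketch below shows the textbook Laplace mechanism applied to a simple count. It is not the targeted differential privacy framework presented at the conference; the function name, the default sensitivity of 1, and the example values are assumptions for illustration. Smaller values of `epsilon` add more noise and give stronger privacy.

```python
import random


def private_count(true_count, epsilon, rng, sensitivity=1.0):
    """Return true_count plus Laplace noise with scale sensitivity / epsilon.

    Adding or removing one person changes a count by at most 1, so the
    sensitivity defaults to 1.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    scale = sensitivity / epsilon
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


rng = random.Random(0)  # fixed seed so the illustration is repeatable
released = private_count(true_count=100, epsilon=1.0, rng=rng)
```

Because the noise averages to zero, aggregate statistics computed from many such releases remain approximately correct even though any single released value is perturbed.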
AI tools for development evidence
AI is rapidly changing how policymakers access research findings by distilling large bodies of research into easily digestible formats. Tools such as ImpactAI and 3ie’s DevChat combine large evidence repositories with AI-assisted retrieval and synthesis to help users navigate thousands of impact evaluations and systematic reviews.
Evaluating AI tools is critical, yet evaluation too often focuses on technical performance alone. In their “Living Playbook” (jointly produced with the Center for Global Development and IDinsight), the Agency Fund proposed assessing generative AI applications at four levels: model, product, user, and impact. The framework highlights the need to evaluate not only accuracy but also usability, behavioral change, and policy relevance.
The conference also explored the reliability and interpretability of AI tools themselves. Joao Pedro Azavedo shared a benchmarking exercise using UNICEF data that evaluated nearly 45,000 LLM interactions and found that more than 63 percent of responses were outright refusals, underscoring the limitations of relying on ungrounded models for retrieving official statistics. Virginia Ziulu extended these concerns beyond language models to geospatial applications used in development analysis, showing how models trained on well-instrumented urban environments can systematically underperform in rural and informal contexts while remaining difficult to audit or interpret. Together, these presentations underscored the importance of grounded data infrastructure, layered validation frameworks, and decision-level accountability for trustworthy AI.
Across the conference, participants returned repeatedly to the same underlying challenge: AI systems are advancing far more quickly than the institutions and standards designed to govern them. MeasureDev 2026 made clear that maintaining trust in development research will require more than stronger models. It will depend on whether institutions can build the infrastructure, incentives, and governance frameworks needed to ensure that AI-enabled evidence is transparent, reproducible, and accountable.
MeasureDev 2026 was organized by the World Bank’s Development Impact Group (DECDI), Development Data Group (DECDG), and Data Academy in collaboration with the Center for Effective Global Action (CEGA) and the University of Chicago’s Becker Friedman Institute for Economics.