
The transformative role of AI for development data

Recent advances in Artificial Intelligence (AI) have shown promise as a transformative tool to drive societal progress and economic development. In 2018, projections of AI’s impact on the global economy anticipated an additional 1.2% of annual growth through 2030, amounting to around US$13 trillion in additional economic output. With the new state of AI, current estimates suggest that generative AI could contribute a further US$2.6 trillion to US$4.4 trillion. The launch of generative AI tools such as ChatGPT, a large language model (LLM), has helped mainstream AI awareness, catalyzing public discourse on the benefits and risks of AI.

The positive outlook on AI is accompanied by potential labor shocks, which could strain developing economies. The environmental impact of the computational demand required to train large AI models poses a setback to ongoing efforts to abate the climate crisis. Inequity in access to and benefits from AI technologies, driven by the digital divide, disproportionately affects the poor. Given AI's continuing disruptive potential, new data-driven knowledge must guide how we navigate the uncertainties and challenges AI poses in the context of development. This calls for reliable, timely, and relevant data to be accessible.

The Office of the Chief Statistician and the Development Data Group are working with partners across the World Bank to establish a microcosm of innovative and responsible AI applications that benefit data. This is achieved through collaborative efforts to develop and promote best practices and AI-driven solutions that enhance every stage of the data lifecycle. We aim to leverage AI to innovate and bridge existing gaps in several key areas: evaluating data quality, enhancing metadata, disseminating data, and measuring data utilization—AI for Data. Richer, AI-assisted metadata gives us more information about the data itself. High-quality metadata and well-documented data, in turn, unlock opportunities to create better AI systems and solutions—Data for AI. This positive feedback loop can improve how we provide reliable, timely, and relevant data. Successfully leveraging AI for data will make data more robust, trusted, discoverable, and reusable.

Join data experts and AI innovators from government, the private sector, foundations, international organizations, civil society, and academia to explore opportunities, success stories, and investment priorities at the intersection of data, statistics, and AI in our two-day event “Data and AI for Sustainable Development.” Follow the livestream.

This post is the first in the series “AI for Data, Data for AI,” where we will share highlights of activities related to AI around development data. Here, we give an overview of five topics where we actively explore and leverage AI to improve development data, from metadata augmentation to synthetic data.

AI for Metadata Augmentation: Transforming how data is documented

While data drives knowledge, metadata ensures that researchers can find the best available data from which to extract knowledge. A significant obstacle to making data understandable, discoverable, and reusable lies in the availability and quality of metadata. Documenting data and curating metadata, crucial for these purposes, are often manual, tedious, and time-consuming tasks. As a result, many datasets lack the detailed metadata needed to ensure accessibility and reusability. This scarcity of rich metadata limits our ability to find and use valuable datasets effectively, leaving them hidden and underexploited and hindering the generation of evidence-based knowledge.

AI is poised to revolutionize data documentation by directly addressing the challenge of poor metadata quality, offering a transformative solution with the potential to automate the generation and enhancement of metadata—metadata augmentation.

Generative AI advancements are unlocking novel ways to manage and document data—enabling the generation of abstracts, descriptions, and relevant themes from survey reports and data dictionaries, automating keyword extraction to improve search, and even harmonizing metadata. These advancements benefit everyone along the data value chain: producers gain richer metadata that increases data utilization and reuse, curators experience reduced cognitive burden and improved efficiency, and users can better discover relevant data and understand its context and scope.
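
As an illustration, here is a minimal sketch of LLM-assisted metadata augmentation that drafts an abstract and search keywords from a raw dataset description. It assumes the openai Python package and an API key in the environment; the model name, prompt, and dataset description are illustrative, not a description of our production pipeline.

```python
# A minimal sketch of LLM-assisted metadata augmentation. Assumes the openai
# package and OPENAI_API_KEY in the environment; model and prompt are illustrative.
import json
from openai import OpenAI

client = OpenAI()

def augment_metadata(description: str) -> dict:
    """Draft an abstract and search keywords for a sparsely documented dataset."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a data curator. Given a raw dataset description, "
                    "return JSON with keys 'abstract' (2-3 sentences) and "
                    "'keywords' (5-10 search terms). Output only JSON."
                ),
            },
            {"role": "user", "content": description},
        ],
    )
    # A real pipeline would validate the output before ingesting it.
    return json.loads(response.choices[0].message.content)

draft = augment_metadata(
    "Household survey, 2021, national coverage; modules on income, "
    "food consumption, and child anthropometry."
)
print(draft["keywords"])
```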

AI for Data Discovery: Delivering the most relevant data

The value of data lies in the knowledge it can provide. If data remains “invisible” due to suboptimal discovery systems, then the potential of data is curtailed.

Traditional systems often rely on lexical or keyword search, where the available metadata is indexed so that users can find data by entering specific search terms. This approach has a limitation: discovery is confined to the exact keywords in the metadata, potentially overlooking related data. Such keyword-based systems also disadvantage users unfamiliar with specialized terminology, who may miss relevant data simply because they do not know the exact terms to search for.

For instance, a search for “child malnutrition” should intuitively include results for “stunting,” even if the metadata does not explicitly mention “malnutrition.” To overcome these challenges, data catalogs need to implement more sophisticated systems. These systems should not only index metadata but also understand the context and relationships between terms, ensuring that searches yield comprehensive and relevant results regardless of the user's familiarity with the field's jargon.
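
To illustrate, here is a minimal sketch of semantic search over catalog metadata using sentence embeddings. It assumes the sentence-transformers package; the model name and catalog entries are illustrative.

```python
# A minimal sketch of semantic search over dataset metadata.
# Assumes the sentence-transformers package; model and data are illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Toy metadata records standing in for a data catalog.
catalog = [
    "Prevalence of stunting among children under five, DHS 2019",
    "Quarterly national accounts: GDP growth estimates",
    "Household expenditure on food and beverages",
]
catalog_embeddings = model.encode(catalog, convert_to_tensor=True)

query_embedding = model.encode("child malnutrition", convert_to_tensor=True)

# Cosine similarity should rank the stunting dataset highest even though
# the word "malnutrition" never appears in its metadata.
scores = util.cos_sim(query_embedding, catalog_embeddings)[0]
for score, entry in sorted(zip(scores.tolist(), catalog), reverse=True):
    print(f"{score:.3f}  {entry}")
```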

Data systems can be significantly enhanced by implementing various AI-enabled technologies, such as hybrid search, semantic search, knowledge graphs, and recommendation systems. Richer metadata is the foundation for these improvements, ensuring that relevant and timely data are easily discoverable, which in turn improves user experiences through increased access to pertinent information.

In practice, addressing data discoverability remains challenging because one must also accurately capture users' information needs. AI-enabled methods such as query parsing and query expansion—neither is new, but advances in AI unlock novel strategies for both—can help us better understand users’ searches.
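
For illustration, the sketch below uses an LLM to parse a free-text catalog query into structured filters plus expanded search terms. It assumes the openai package; the model, prompt, and output schema are illustrative.

```python
# A minimal sketch of LLM-based query parsing and expansion. Assumes the
# openai package; model, prompt, and schema are illustrative.
import json
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Parse the user's data-catalog query. Return JSON with keys "
    "'topic' (string), 'country' (string or null), 'year' (int or null), "
    "and 'expanded_terms' (up to five related search terms). Output only JSON."
)

def parse_query(query: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": query},
        ],
    )
    # A real system would validate the parsed fields before searching.
    return json.loads(response.choices[0].message.content)

print(parse_query("child malnutrition in Kenya since 2015"))
# Possible output: {"topic": "child malnutrition", "country": "Kenya",
#                   "year": 2015, "expanded_terms": ["stunting", "wasting", ...]}
```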

We envision an ideal data platform as a place where users always have a “delightful experience” and can find the data they need with as little effort as possible. Implementing an AI-centric data discovery system is a step towards this ideal data platform.

AI for Data Use and Release: Frontiers

So far, we have focused on AI applications that help make data more discoverable—through metadata augmentation and the development of discovery systems. In the next sections, we briefly discuss AI’s potential for measuring how data is used and explore where synthetic data can be helpful for development data.

Reporting on how data is utilized

Investments in data have resulted in the generation of data-driven knowledge and policies. However, measuring how or whether data has been used remains a challenge: how researchers identify or mention what data they used in the literature can vary significantly.

The current state of Large Language Models (LLMs) offers impressive performance in extracting structured information from unstructured text. Leveraging Natural Language Processing (NLP) and LLMs can help us scale the assessment of data use and harmonize the many ways a single dataset is mentioned. This will allow us to create a “Database of Data Use,” enabling analysis of the impact of data on informing and generating knowledge.
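
As a simplified illustration of the harmonization step, the sketch below maps free-text dataset mentions to canonical names with standard-library fuzzy matching; in practice, NLP/LLM-based extraction would handle acronyms and richer variation. All names and the threshold are illustrative.

```python
# A minimal sketch of harmonizing dataset mentions via fuzzy string matching
# (stdlib only); names and threshold are illustrative.
from difflib import SequenceMatcher

CANONICAL = [
    "Demographic and Health Survey 2019",
    "Living Standards Measurement Study 2020",
]

def harmonize(mention: str, threshold: float = 0.6) -> str | None:
    """Map a free-text dataset mention to its closest canonical name."""
    best_name, best_score = None, 0.0
    for name in CANONICAL:
        score = SequenceMatcher(None, mention.lower(), name.lower()).ratio()
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None

# Different ways the same survey might be cited in the literature.
print(harmonize("the 2019 Demographic & Health Survey"))
print(harmonize("DHS 2019"))  # acronyms may fall below the threshold,
                              # which is where LLM-based matching helps
```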

Understanding where data is used, in what context, and which themes and policies the data helped inform will help close the data lifecycle by providing a concrete picture of how data is leveraged to generate knowledge.


Synthetic data generation

While it may not be immediately apparent, synthetic data is not new and has long been part of many analytical outputs; modeled estimates and predictions from models, for example, are synthetic data. Synthetic data generation has evolved from univariate prediction to multiple imputation and, more recently, to fully generative models capable of producing entire synthetic records.

The United Nations Economic Commission for Europe (UNECE) recently published the guide Synthetic Data for Official Statistics, which discusses how generative AI models, among other approaches, can be used to generate synthetic data. Advances in synthetic data generation using more capable AI models provide the means to create realistic data while minimizing disclosure risks.
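
As a toy illustration of the generative idea, the sketch below fits a simple parametric model (a multivariate Gaussian) to mock microdata and samples synthetic records from it; real applications would use richer generative models and formal disclosure-risk assessment. All variables and values are illustrative.

```python
# A minimal sketch of synthetic data generation: fit a multivariate Gaussian
# to (toy) microdata, then sample new records. Values are illustrative.
import numpy as np

rng = np.random.default_rng(42)

# Toy "real" microdata: columns = household income, food expenditure.
real = rng.multivariate_normal(
    [500.0, 200.0],
    [[900.0, 300.0],
     [300.0, 400.0]],
    size=1_000,
)

# Fit: estimate the mean vector and covariance matrix from the real data.
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# Sample: synthetic records preserve means and correlations without
# reproducing any actual household's values.
synthetic = rng.multivariate_normal(mean, cov, size=1_000)
print(np.corrcoef(synthetic, rowvar=False).round(2))
```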


Why consider synthetic data?

Before discussing further why synthetic data can be useful, we must emphasize that data should be made public when no privacy or security constraints prevent it, and the actual data should always be used whenever possible. However, synthetic data can help fill gaps in dissemination, such as when data are not readily available due to privacy or other constraints.

Researcher access to data unlocks knowledge. However, some data contain sensitive information, which often hinders dissemination. Various methods have been developed to address this problem: statistical disclosure control methods employing anonymization techniques such as k-anonymity are typically used, and, increasingly, generative models capable of creating high-utility synthetic data are being considered for the same purpose.
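
As a small illustration of the disclosure-control side, the sketch below checks whether a toy dataset satisfies k-anonymity over a set of quasi-identifiers using pandas; the columns and threshold k are illustrative.

```python
# A minimal sketch of a k-anonymity check over quasi-identifiers with pandas;
# columns and threshold are illustrative.
import pandas as pd

df = pd.DataFrame({
    "age_group": ["30-39", "30-39", "40-49", "40-49", "40-49"],
    "region":    ["North", "North", "South", "South", "South"],
    "income":    [520, 480, 610, 590, 640],
})

QUASI_IDENTIFIERS = ["age_group", "region"]
K = 3

# Every combination of quasi-identifier values must describe at least K records.
group_sizes = df.groupby(QUASI_IDENTIFIERS).size()
print(group_sizes)
print("k-anonymous:", bool((group_sizes >= K).all()))
```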

We can highlight three use cases where synthetic data can be considered. First, researchers gain faster access to data: if synthetic data is sufficiently realistic, researchers can begin analysis and prepare pipelines to process the data and generate preliminary insights, and, depending on the data provider's policies, the same pipeline can then be run on the actual data to derive empirical insights. Second, synthetic data can be used for training and educational purposes. For example, creating synthetic data from sensitive real-world datasets would be beneficial for training on statistical disclosure control methods. Synthetic data can also enhance data science education by offering datasets that simulate the characteristics of sensitive data, enabling students to gain practical experience managing and safeguarding data that resembles real-world sensitive information. Third, synthetic data may be generated as input to simulations when no dataset containing all the necessary variables is available.

Moving forward

In the next few posts, we will discuss in detail some of the applications we have developed that leverage AI for the use cases discussed here.

We are just scratching the surface of what AI can do for data. As the technology evolves, we expect AI performance to become more reliable, addressing critical issues such as hallucinations. More reliable AI will, in turn, enable more sophisticated systems that further enhance the experience of data producers, curators, and users.

Ultimately, knowledge can only be helpful when backed by data. Therefore, we must continue to find ways to make relevant data discoverable and timely—AI provides us with the means to do this.

Haishan Fu

Chief Statistician of the World Bank and Director of the Development Data Group

Olivier Dupriez

Deputy Chief Statistician, World Bank

Craig Hammer

Senior Program Manager, Development Data Group, World Bank

Aivin Solatorio

Data Scientist, Development Data Group, World Bank
