Published on Data Blog

Beyond keywords: AI-driven approaches to improve data discoverability


This blog is part of “AI for Data, Data for AI”, a series that aims to unwrap, explain, and foster the intersection of artificial intelligence and data. This post is the third installment of the series; for further reading, see the first and second installments.

 

Data is essential for generating knowledge and informing policies. Organizations that produce large volumes of diverse data face challenges in managing and disseminating it effectively. One major challenge is ensuring users can easily find the most relevant data for their needs, a problem known as data discoverability.

Organizations like the World Bank have systems to make their data assets discoverable. Traditionally, these systems use lexical or keyword search applications, indexing available metadata to enable data discovery through search terms. However, this approach limits discovery to the keywords in the accompanying metadata documentation, returning nothing beyond those terms.
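To make this limitation concrete, here is a minimal sketch of a keyword search built on an inverted index. The record ids and the tiny set of titles are assumptions chosen for illustration, not an actual catalog, and real lexical search engines add ranking, stemming, and more.

```python
from collections import defaultdict

# Hypothetical metadata records: the ids and this small sample are illustrative.
INDICATORS = {
    "ind1": "Military expenditure (% of GDP)",
    "ind2": "CO2 emissions (metric tons per capita)",
    "ind3": "Armed forces personnel, total",
}

def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return cleaned.split()

def build_index(records):
    """Map each token to the set of record ids whose title contains it."""
    index = defaultdict(set)
    for rec_id, title in records.items():
        for token in tokenize(title):
            index[token].add(rec_id)
    return index

def keyword_search(index, query):
    """Return ids of records that contain every query token (AND semantics)."""
    token_sets = [index.get(tok, set()) for tok in tokenize(query)]
    return set.intersection(*token_sets) if token_sets else set()

index = build_index(INDICATORS)
print(keyword_search(index, "military expenditure"))  # finds ind1
print(keyword_search(index, "defense spending"))      # no shared tokens: empty set
```

The second query describes the same concept as the first, but it shares no tokens with the metadata, so nothing is returned.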

Artificial intelligence (AI), primarily large language models (LLMs), can enhance data systems so that relevant and timely data are discoverable. Combined with richer metadata, AI-enabled solutions make it possible to use semantic search, hybrid search, knowledge graphs, and recommendation systems.

In this post, we explore how simple AI applications can overcome the limitations of keyword-based search. We also discuss AI-enabled techniques that improve our understanding of users' information needs, leading to a better data search experience.

 

Searching with meaning: Semantic search

Semantic search is a simple yet powerful way to leverage AI for improved data discovery. Unlike traditional keyword searches, semantic search interprets the meaning of the text, enabling more expressive queries and yielding relevant results. It provides a means to find information beyond keywords and jargon.
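The core mechanics can be sketched in a few lines: texts are mapped to vectors (embeddings), and results are ranked by how close their vectors are to the query vector. In the sketch below, the three-dimensional vectors are hand-assigned assumptions for illustration only. A real system would obtain them from a trained embedding model with hundreds of dimensions.

```python
import math

# Toy 3-dimensional "embeddings", hand-assigned for illustration only.
# The dimensions loosely stand for (environment, economy, defense).
DOC_VECTORS = {
    "Greenhouse gas emissions": [0.9, 0.1, 0.0],
    "GDP per capita": [0.1, 0.9, 0.0],
    "Military expenditure": [0.0, 0.5, 0.8],
}

def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def semantic_search(query_vector, doc_vectors, top_k=2):
    """Rank documents by cosine similarity to the query, best first."""
    scored = [(cosine(query_vector, vec), title) for title, vec in doc_vectors.items()]
    scored.sort(reverse=True)
    return [title for _, title in scored[:top_k]]

# Hypothetical vector standing in for a query about spending on the military.
query_vector = [0.0, 0.4, 0.9]
results = semantic_search(query_vector, DOC_VECTORS)
print(results)
```

Because ranking depends on vector proximity rather than shared words, the query can surface "Military expenditure" even if its wording never mentions either term.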

For example, the image below compares search results for “livable planet indicators” in the World Bank's World Development Indicators (WDI). System A, powered by semantic search, provides highly relevant indicators, while System B, using traditional search, fails to do so.

Figure 1. A comparison of a semantic search system's output (System A) with an existing system (System B) that likely does not leverage semantic search for the same query “livable planet indicators”.



Notably, the semantic search system identified “Population living in areas where elevation is below 5 meters (% of total population)” as a relevant indicator for a “livable planet,” alongside greenhouse gas emissions. This demonstrates how AI-powered search can yield diverse and relevant results.

Below, we present an interactive application that uses an AI embedding model we trained to demonstrate semantic search in action. This model is small (only 23 million parameters), so its representations are less expressive than those of larger models. For “livable planet indicators,” it won't return results comparable to those of a larger model with 335 million parameters. However, the smaller model allows the application to run entirely in the user’s browser. This highlights the tradeoff between model size, computational resources, and performance.

The application also shows results from a keyword-based system for comparison. Testing it reveals the strengths and weaknesses of both methods. For instance, the small semantic model struggles with queries like “SDG,” where keyword-based search performs better. Conversely, for “spending in the armed forces,” the semantic model excels by understanding related concepts like expenses and the military, whereas the keyword system returns irrelevant results despite matching the keywords.

These limitations in semantic search are largely due to model size—larger models perform better. Ultimately, we aim to create a hybrid search system that leverages the strengths of both semantic and keyword-based approaches.
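One common way to combine the two approaches is reciprocal rank fusion (RRF), which merges ranked lists using only the position of each result. The sketch below is an illustration of that general technique, not a description of the system we are building. The document ids and both rankings are hypothetical, and k=60 is the constant conventionally used with RRF.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into a single list, best first.

    Each id receives 1 / (k + rank) from every list it appears in.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score, breaking ties alphabetically for stability.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical outputs of a keyword system and a semantic system.
keyword_ranking = ["sdg_index", "military_expenditure", "gdp"]
semantic_ranking = ["military_expenditure", "armed_forces_personnel", "sdg_index"]

fused = reciprocal_rank_fusion([keyword_ranking, semantic_ranking])
print(fused)
```

Results that both systems rank highly rise to the top, while results found by only one system are still retained further down the list.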
 


Understanding what users need: Query parser

While semantic search greatly enhances data discovery compared to simple keyword search, capturing specific information from user queries in a structured way can further improve precision. This process, known as query parsing, maximizes the value of metadata.

For example, when a user searches for “GDP philippines 2023,” a query parser should identify “philippines” as a country, “2023” as a year, and “GDP” as an indicator. Structuring the query this way allows the search to focus precisely on relevant data about the Philippines for 2023. Recent advancements in AI have made query parsers more reliable.
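The target output of a query parser can be illustrated with a simple rule-based stand-in. This is a sketch under stated assumptions: the lookup tables are hypothetical and tiny, and an AI-based parser would handle misspellings, multi-word names, and ambiguity that these rules cannot.

```python
import re

# Hypothetical lookup tables. A production parser would use a complete country
# list and the catalog's own indicator vocabulary, or a trained model.
COUNTRIES = {"philippines": "PHL", "indonesia": "IDN"}
INDICATORS = {"gdp": "GDP", "population": "Population"}
YEAR_PATTERN = re.compile(r"^(19|20)\d{2}$")

def parse_query(query):
    """Split a query into structured entities plus any leftover tokens."""
    parsed = {"indicator": None, "country": None, "year": None, "other": []}
    for token in query.lower().split():
        if token in COUNTRIES:
            parsed["country"] = COUNTRIES[token]
        elif token in INDICATORS:
            parsed["indicator"] = INDICATORS[token]
        elif YEAR_PATTERN.match(token):
            parsed["year"] = int(token)
        else:
            parsed["other"].append(token)
    return parsed

parsed = parse_query("GDP philippines 2023")
print(parsed)
```

Tokens that match no rule are kept under "other" so they can still be passed on to the search step.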

Figure 2. This shows the output of our query parser. The system can capture relevant entities, which can be used to improve the precision of the search.


Understanding users' information needs is crucial, especially in data catalogs with limited “information real estate”—space for displaying results. By leveraging precise query information, we can filter out semantically relevant but contextually irrelevant data. For instance, GDP data for Indonesia, while semantically relevant, is not useful if the user is searching for GDP data for the Philippines.
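As a minimal sketch of this filtering step, the function below drops candidates that do not match the parsed country or year. The candidate records, their field names, and the year ranges are hypothetical and only illustrate the idea.

```python
# Hypothetical candidates, assumed to be already ranked by a semantic search step.
candidates = [
    {"title": "GDP, Indonesia", "country": "IDN", "start_year": 1990, "end_year": 2023},
    {"title": "GDP, Philippines", "country": "PHL", "start_year": 1990, "end_year": 2023},
    {"title": "GDP growth, Philippines", "country": "PHL", "start_year": 1990, "end_year": 2020},
]

def filter_by_entities(candidates, country=None, year=None):
    """Keep candidates matching the parsed country and covering the parsed year.

    A filter left as None is not applied, and the input order is preserved.
    """
    kept = []
    for record in candidates:
        if country is not None and record["country"] != country:
            continue
        if year is not None and not (record["start_year"] <= year <= record["end_year"]):
            continue
        kept.append(record)
    return kept

filtered = filter_by_entities(candidates, country="PHL", year=2023)
print([record["title"] for record in filtered])
```

The Indonesia record is removed even though it is semantically close to the query, and the Philippines series that ends in 2020 is removed because it does not cover the requested year.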


Potential of large language models (LLMs) for dynamically improving searches: Query expansion

We must anticipate the diverse ways users search for data. Some users know specific jargon, which helps narrow down information. However, others may not know the relevant keywords. This could result in users being unable to find the data they need even though it is present in the system. To address this, we can use generative AI (LLMs) for query expansion. Query expansion takes a user's query and generates variations, improving the system's ability to find the correct data.

In the context of development data, we can use structured metadata and an LLM to generate field-specific search information from a given query. This enhances data discoverability and enables the development of hybrid search systems—combining semantic and lexical search for more accurate and explainable results.
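The plumbing around field-specific query expansion might look like the sketch below. Everything here is an assumption for illustration: the metadata field names are hypothetical, the prompt wording is one of many possibilities, and `stub_llm` is a placeholder that returns a canned reply instead of calling a real model.

```python
import json

METADATA_FIELDS = ["name", "definition", "topic"]  # hypothetical field names

def build_prompt(query, fields):
    """Ask for search phrases per metadata field, returned as JSON."""
    field_list = ", ".join(fields)
    return (
        "For the search query below, suggest search phrases for each metadata "
        f"field ({field_list}). Reply with a JSON object keyed by field name, "
        "where each value is a list of strings.\n"
        f"Query: {query}"
    )

def stub_llm(prompt):
    """Placeholder for a real LLM call: returns a canned reply for the demo."""
    return json.dumps({
        "name": ["military expenditure", "defense spending"],
        "definition": ["government spending on armed forces"],
        "topic": ["public sector", "security"],
        "unexpected_field": ["ignored"],
    })

def expand_query(query, fields, llm=stub_llm):
    """Return {field: [phrases]} using only known fields and string values."""
    reply = json.loads(llm(build_prompt(query, fields)))
    expansions = {}
    for field in fields:
        values = reply.get(field, [])
        if not isinstance(values, list):
            values = []
        expansions[field] = [v for v in values if isinstance(v, str)]
    return expansions

expansions = expand_query("spending in the armed forces", METADATA_FIELDS)
print(expansions)
```

Validating the reply against the known field list keeps unexpected keys out of the search step. Each field's phrases can then be matched against the corresponding metadata field with either lexical or semantic search.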

Figure 3. This shows how we can leverage an LLM to create targeted query expansion for each field in the metadata. This can potentially unlock tremendous value in improving the way data can be discovered in data catalogs.



A well-implemented query expansion system highlights data with richer metadata, encouraging producers to release better-documented data to ensure it is discoverable and reusable.

Though current LLMs may be limited by generation speed, advances in more efficient, smaller LLMs could make query expansion practical and accessible even in low-resource settings.
 

Moving forward

As we tackle the challenges of managing and disseminating data, it is exciting to see how AI can make finding data easier and more relevant. Technologies like semantic search, query parsing, and query expansion can simplify and improve the process of sifting through large datasets. These advancements not only make search results more accurate but also enhance the user experience by making important information more accessible.

Our journey to optimal data discoverability is ongoing. By continuously improving and rethinking how we leverage AI models, we can build systems that truly understand and meet users' needs. This ensures researchers, decision-makers, and other AI applications can find relevant data to generate robust knowledge, make informed choices, and drive meaningful use cases. We must continue finding ways to make relevant data discoverable and timely, and AI provides us with the means to do this.

Ultimately, the various improvements in data discoverability that AI can help with will not be possible without rich and high-quality metadata. In a subsequent post, we will discuss how we leverage metadata standards, innovative tools and methods, and AI-enabled solutions to produce useful metadata efficiently.


Aivin Solatorio

Data Scientist, Development Data Group, World Bank

Olivier Dupriez

Deputy Chief Statistician, World Bank
