Published on Data Blog

Tracking dataset use at scale with AI: How synthetic data helps overcome data scarcity

Mock code for an AI Large Language Model (LLM) that could intelligently answer questions. / Photo: Shutterstock

Understanding how data is used in research is key to strengthening the evidence base that informs policy and investment decisions. Yet despite the immense value of development data, we often lack visibility into how datasets are cited, reused, or combined in policy-relevant literature. Systematically tracking data usage helps identify which datasets drive research and innovation, where gaps remain, and how knowledge circulates across disciplines and geographies — insights essential for shaping smarter data strategies, improving returns on data investments, and maximizing development impact.

Artificial Intelligence (AI) can play a transformative role in this effort. High-quality data are vital for building effective AI systems, yet the data needed to train models that recognize and track dataset citations in research are remarkably scarce. Unlike other AI domains with large annotated corpora, no comprehensive resource captures how datasets are referenced: authors use full names, acronyms, aliases, or vague descriptions that evolve over time, making systematic tracking difficult. This scarcity limits model generalization, as even carefully annotated examples represent only a fraction of citation patterns.

Methods leveraging large language models (LLMs) offer a way forward. Starting from a small set of high-quality annotations, LLMs can generate synthetic data that mirror diverse citation styles. With careful validation, this synthetic expansion transforms limited resources into scalable training data — enabling AI systems that more effectively map and understand how data are used in research literature.

 

The many faces of dataset mentions

Datasets are referenced in diverse ways across disciplines, regions, and time. Some papers use full formal titles; others rely on acronyms or looser descriptive terms that point to the data without naming it directly (Figure 1).

This variety makes it difficult to train AI systems from limited annotations. A small set of labeled examples cannot capture every reference style, so models may work only in narrow contexts and fail elsewhere. This underscores the challenge of data scarcity and the need for methods with broader coverage.

 

Figure 1. Illustrative styles of dataset mentions in research literature


 

The limits of manual annotation

Training data for AI is often created through manual annotation, where documents are reviewed and dataset mentions labeled. Though effective, this method is not scalable due to the wide variety of dataset references in research. Large annotation efforts still only cover a small portion of cases, missing many rare instances.

The result is that annotation alone yields models that are strong specialists but weak generalists — effective within the domains reflected in their training data but prone to failure when faced with unfamiliar citation styles. To overcome this limitation, a complementary approach is needed, one that can expand training data without requiring exhaustive manual labeling.

 

Expanding training data with synthetic examples

AI offers a way to overcome the limits of manual annotation. Large language models (LLMs) can generate synthetic data that reflects the many ways datasets are referenced in research literature.

The process starts with a small set of high-quality annotations that seed the generation of new dataset mentions. With well-designed prompts, an LLM can produce passages that span domains such as economics, health, and education, making synthetic expansion an effective route to wider coverage (Figure 2).

Synthetic data does not replace human annotation but builds on it. By transforming a small initial seed into a larger, more diverse training resource, it enables models to capture variations that manual labeling alone cannot reach — bridging the gap between data scarcity and the need for scalable, generalizable systems.
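To make the idea concrete, here is a minimal sketch of how a handful of seed mentions might be expanded into synthetic training sentences with an LLM. The client, model name, prompt wording, and seed examples are all illustrative assumptions rather than the actual pipeline; the prompt used in practice is the one shown in Figure 2.

```python
# Minimal sketch: expanding a small seed of annotated dataset mentions into
# synthetic training sentences with an LLM. The model name, prompt wording,
# and seed examples are illustrative assumptions, not the actual pipeline.
from openai import OpenAI  # assumes the openai>=1.0 Python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SEED_MENTIONS = [
    "Living Standards Measurement Study (LSMS)",
    "Demographic and Health Surveys (DHS)",
    "World Development Indicators",
]

PROMPT_TEMPLATE = (
    "You generate synthetic sentences for training a dataset-mention extractor.\n"
    "Given the dataset below, write {n} short research-style sentences that\n"
    "reference it in different ways (full title, acronym, informal description)\n"
    "across domains such as economics, health, and education.\n"
    "Return one sentence per line, with no numbering.\n\n"
    "Dataset: {mention}"
)

def generate_synthetic_mentions(mention: str, n: int = 5) -> list[str]:
    """Ask the LLM for n varied synthetic sentences citing `mention`."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(n=n, mention=mention)}],
    )
    text = response.choices[0].message.content
    return [line.strip() for line in text.splitlines() if line.strip()]

# Expand every seed mention into a pool of synthetic training sentences.
synthetic_sentences = [s for m in SEED_MENTIONS for s in generate_synthetic_mentions(m)]
```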


Figure 2. Synthetic data generation prompt


 

Identifying out-of-domain gaps

Even with high-quality annotations and synthetic expansion, a key question remains: does the model generalize to new material? To assess this, we examine how well the training and synthetic datasets cover the broader research landscape.

Texts are first encoded into a shared embedding space, allowing direct comparison between training data and unseen documents. These embeddings are then reduced in dimensionality, clustered, and labeled by prompting an LLM with representative samples, producing a map of the research landscape that reveals which themes and domains are covered (Figure 4).
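The sketch below illustrates this mapping step under assumed tooling: sentence-transformers for embeddings, UMAP for dimensionality reduction, k-means for clustering, and an LLM call for cluster labeling. The post does not name the actual stack, so all library and model choices here are stand-ins.

```python
# Minimal sketch of the coverage-mapping step. Library and model choices
# (sentence-transformers, UMAP, k-means, gpt-4o-mini) are illustrative
# assumptions, not the exact stack used in the pipeline.
from openai import OpenAI
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
import umap

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def map_research_landscape(texts: list[str], n_clusters: int = 20):
    """Embed passages into a shared space, reduce dimensionality, and cluster."""
    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative encoder
    embeddings = embedder.encode(texts, show_progress_bar=False)

    # Reduce to two dimensions so the clusters can also be plotted as a map.
    coords = umap.UMAP(n_components=2, random_state=42).fit_transform(embeddings)

    labels = KMeans(n_clusters=n_clusters, random_state=42, n_init=10).fit_predict(coords)
    return coords, labels

def label_cluster(samples: list[str]) -> str:
    """Ask an LLM for a short thematic label given representative samples."""
    prompt = (
        "Give a short (2-4 word) thematic label for research passages like these:\n- "
        + "\n- ".join(samples[:5])
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
```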

 

Figure 3. Dataset revalidation prompts


 

Clusters with no training or synthetic samples indicate out-of-domain regions — topics or citation styles the model has not encountered. These clusters provide natural test cases for evaluating the model’s generalizability.
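A minimal sketch of how such exclusive clusters might be flagged, assuming each document carries a source tag ("train", "synthetic", or "unseen"; the tag names are assumptions):

```python
# Minimal sketch: flag clusters that contain no training or synthetic samples.
# `labels` are the cluster assignments from the mapping step, and `sources`
# tags each document as "train", "synthetic", or "unseen" (tag names assumed).
from collections import defaultdict

def find_out_of_domain_clusters(labels: list[int], sources: list[str]) -> list[int]:
    """Return ids of clusters whose members are exclusively 'unseen' documents."""
    members = defaultdict(set)
    for cluster_id, source in zip(labels, sources):
        members[cluster_id].add(source)
    return [c for c, srcs in members.items() if srcs == {"unseen"}]

# Example with hypothetical assignments: cluster 2 contains only unseen documents.
labels = [0, 0, 1, 1, 2, 2]
sources = ["train", "synthetic", "train", "unseen", "unseen", "unseen"]
print(find_out_of_domain_clusters(labels, sources))  # -> [2]
```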

 

Figure 4. Embedding clusters of dataset mentions across sources


 

Distributional analysis with cluster labeling shows where synthetic data expands coverage and where gaps remain (Figure 5). This confirms that LLM-generated data improves generalization and identifies areas for further data generation or annotation.


Evidence of generalization

To test whether the system could move beyond its training distribution, we examined exclusive clusters: regions of the research landscape where no training data was present. Within these clusters, we extracted mentions and kept only those that met predefined validity criteria.
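As an illustration, the sketch below approximates such a validity filter with a single LLM yes/no check. The prompt wording, model, and candidate strings are assumptions; the actual revalidation prompts are those shown in Figure 3, and the real criteria may be richer than a binary judgment.

```python
# Minimal sketch of the revalidation step: ask an LLM whether a candidate
# string names an actual dataset. The prompt wording, model, and yes/no
# criterion are assumptions standing in for the prompts shown in Figure 3.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

REVALIDATION_PROMPT = (
    "Does the following string refer to a specific dataset (a named survey, "
    "census, indicator collection, or other data product)? "
    "Answer only 'yes' or 'no'.\n\nCandidate: {candidate}"
)

def is_valid_mention(candidate: str) -> bool:
    """Return True if the LLM judges the candidate to be a genuine dataset mention."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": REVALIDATION_PROMPT.format(candidate=candidate)}],
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")

# Hypothetical candidates extracted from exclusive clusters.
extracted_mentions = ["DHS 2018", "the data", "World Development Indicators"]
validated = [m for m in extracted_mentions if is_valid_mention(m)]
```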

The results are shown in Figure 6 and Figure 7. The word cloud in Figure 6 highlights the diversity of dataset mentions recovered in these unseen regions, from recurring forms that appear across documents to rare long-tail cases. Figure 7 presents a ranked view of the most frequent mentions, showing which references dominate outside the training set.

 

Figure 5. Out-of-domain cluster identification via embeddings


 

Figure 6. Word cloud of validated dataset mentions extracted from exclusive clusters


 

Figure 7. Top validated dataset mentions extracted from exclusive clusters


 

Together, these visualizations provide qualitative evidence of generalization: the model retrieves valid dataset mentions in regions where it had no prior exposure. This does not claim full coverage; rather, it shows that the system can engage with unfamiliar text and still surface valid dataset references.


From scarcity to scale

The scarcity of annotated data for tracking dataset use remains a core challenge. Research documents will continue to reference datasets in varied and evolving ways, and no fixed training set can anticipate every form. What we have shown is that synthetic expansion, combined with structured validation and analysis of unseen clusters, provides a path to scale beyond these limitations.

The next step is to strengthen this pipeline. Automating the capture of richer metadata around dataset mentions would improve their utility for downstream applications. Closing the loop between generation, validation, and out-of-domain detection would make the system more adaptive to new contexts. Expanding coverage further would help ensure that the full diversity of how datasets are written into research is reflected in the training data.

By moving from scarcity to scalability, synthetic expansion and validation demonstrate how AI can extend coverage responsibly, supporting systems that generalize better and remain reliable as new forms of dataset mentions continue to emerge.


Rafael Sevilla Macalaba

AI and ML Engineer Consultant, Development Data Group, World Bank

Aivin Solatorio

Program Manager, Development Data Group, World Bank
