Published on Data Blog

When is there enough data to create a global statistic?

May 10, 2022

Nighttime image of the Earth, with lights illuminating the countries — Earth at Night. Image: Dima Zel/Shutterstock.com

Open a newspaper and chances are you will find some global statistic referring to how the world is faring: “Global growth is projected to recover”, “the number of refugees worldwide is set to increase for the third straight year”, “global CO2 emissions are reaching an all-time high.” The demand for global statistics is perhaps best embodied in the Sustainable Development Goals, whose 231 indicators for the most part can be and are aggregated to the global level.

There is rarely complete global data behind such statistics. Due to lack of resources, capacity, and political will, some countries do not produce information on the indicators of interest. When creating global statistics, estimates for these countries are either imputed or simply ignored.

This inevitably creates a trade-off between the availability of global statistics and their accuracy. If global statistics are only published when data are universally or near-universally available, many important topics cannot be illuminated. If global statistics are published even when data coverage is weak, their accuracy may be doubtful, in the sense that they are likely to deviate from the figure that would have been obtained had all data been available.

In a new Policy Research Working Paper, we quantify this trade-off using the World Development Indicators. We select 165 indicators spanning a wide range of topics where data are available for at least 99% of the world’s population. For these 165 indicators, we randomly delete a subset of the data, calculate the new global mean value, and compare it to the mean value when all data are used. This gives us an estimate of the error when only a fraction of the global population has data. By repeating this exercise more than 10 million times using different indicators and different probabilities of missingness, we can calculate the expected error as a function of population coverage.
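The delete-and-recompute exercise described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code behind the paper: the 200 countries, their populations, and their indicator values are synthetic, the global mean is assumed to be population-weighted, and every country is assumed to go missing independently with the same probability. The function names are hypothetical.

```python
import random


def weighted_mean(values, weights):
    """Population-weighted mean of a list of country values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def standardize(values, weights):
    """Rescale values to weighted mean 0 and weighted variance 1 (assumed weighting)."""
    mu = weighted_mean(values, weights)
    var = weighted_mean([(v - mu) ** 2 for v in values], weights)
    return [(v - mu) / var ** 0.5 for v in values]


def simulate_once(z, pop, p_missing, rng):
    """Drop each country with probability p_missing.

    Returns (share of population without data, absolute error in standard
    deviations), or None if every country was dropped.
    """
    keep = [rng.random() >= p_missing for _ in z]
    if not any(keep):
        return None
    kept_z = [v for v, k in zip(z, keep) if k]
    kept_pop = [w for w, k in zip(pop, keep) if k]
    share_missing = 1 - sum(kept_pop) / sum(pop)
    # The full-data mean of a standardized indicator is 0, so the error is
    # simply the distance of the partial-data mean from 0.
    return share_missing, abs(weighted_mean(kept_z, kept_pop))


def expected_error(z, pop, p_missing, draws, rng):
    """Average population share missing and average error over many deletions."""
    results = [simulate_once(z, pop, p_missing, rng) for _ in range(draws)]
    results = [r for r in results if r is not None]
    mean_share = sum(r[0] for r in results) / len(results)
    mean_error = sum(r[1] for r in results) / len(results)
    return mean_share, mean_error


# Synthetic stand-in for one indicator: 200 hypothetical countries.
rng = random.Random(0)
pop = [rng.lognormvariate(0, 1.5) for _ in range(200)]
z = standardize([rng.gauss(0, 1) for _ in range(200)], pop)

for p in (0.1, 0.3, 0.5, 0.7):
    share, err = expected_error(z, pop, p, draws=500, rng=rng)
    print(f"p_missing={p:.1f}  population missing={share:.2f}  mean error={err:.3f} SD")
```

Repeating this over many real indicators and missingness probabilities, rather than one synthetic indicator, is what would yield the expected error as a function of population coverage.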

To compare indicators in different units, we standardize all variables to have mean 0 and variance 1. This allows us to express the error in standard deviations from the mean. Since most data producers may not be used to thinking of their indicator in terms of standard deviations from the mean, the table below shows what one standard deviation implies for five indicators. If one is one standard deviation away from the true mean when creating a global statistic, one could get life expectancy off by 7 years, global growth off by 3 percentage points, and the share using at least basic sanitation services off by 24 percentage points. Even if these errors are cut to a quarter, so that one is 0.25 standard deviations off the truth, they remain large.
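To make the conversion concrete, the short Python sketch below translates an error expressed in standard deviations back into original units. It uses only the three one-standard-deviation values quoted in the paragraph above; the dictionary and function names are hypothetical.

```python
# One standard deviation in original units, as quoted in the text.
ONE_SD = {
    "life expectancy (years)": 7.0,
    "global growth (percentage points)": 3.0,
    "basic sanitation coverage (percentage points)": 24.0,
}


def error_in_units(sd_error, one_sd):
    """Convert an error in standard deviations to the indicator's own units."""
    return sd_error * one_sd


for name, one_sd in ONE_SD.items():
    print(f"{name}: 0.25 SD = {error_in_units(0.25, one_sd):g}")
```

An error of 0.25 standard deviations thus corresponds to 1.75 years of life expectancy, 0.75 percentage points of growth, and 6 percentage points of sanitation coverage.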

The figure below shows the results of our simulations. The expected error increases linearly with the share of the population without data. The linear fit suggests that if the share of the world’s population on which one lacks data is x, then one should expect to be 0.37*x standard deviations off the true mean, with the upper bound of this estimate being about x standard deviations off the true mean. Put differently, if one is willing to tolerate being y standard deviations away from the true mean, then one can tolerate missing data for a share y*2.7 (=y*1/0.37) of the global population. The wide confidence interval reflects that when one only has data for some of the population, one might be lucky and get the mean right, or unlucky and be far off.
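The linear rule of thumb can be written as two small helper functions. This is a sketch built only on the slope of 0.37 and the approximate upper bound of x quoted above; the function names, and the choice to cap the tolerable share at 100% of the population, are assumptions made for illustration.

```python
SLOPE = 0.37        # expected error in SD per unit share of population missing (from the text)
UPPER_SLOPE = 1.0   # approximate upper bound: "about x standard deviations"


def expected_error_sd(share_missing, slope=SLOPE):
    """Expected error in standard deviations for a given share of population without data."""
    if not 0 <= share_missing <= 1:
        raise ValueError("share_missing must be between 0 and 1")
    return slope * share_missing


def tolerable_missing_share(max_error_sd, slope=SLOPE):
    """Largest share of population that can lack data for a given error tolerance.

    Capped at 1.0 because a share cannot exceed the whole population.
    """
    return min(1.0, max_error_sd / slope)


print(expected_error_sd(0.5))                         # half the population missing
print(tolerable_missing_share(0.25))                  # using the linear fit
print(tolerable_missing_share(0.25, UPPER_SLOPE))     # using the upper bound
```

With a tolerance of 0.25 standard deviations, the linear fit allows roughly two thirds of the population to lack data, while the more cautious upper bound allows only about a quarter.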

Figure: Relationship between global data coverage and accuracy

In further results, we show how these errors change (i) if one is interested in regional statistics, (ii) if data are imputed, (iii) if the probability of data missing is correlated with the indicator of interest, (iv) if one uses the share of countries rather than the share of population as a coverage threshold, and (v) if one has specific coverage requirements for populous countries, such as India.

In conclusion, we offer some advice on how to decide when there is sufficient data to create global statistics. The most important point is that no single threshold can determine whether to publish a global statistic. The decision depends on the context. In particular, we think data producers should ask themselves the following questions:

How large an error am I willing to tolerate?
How pervasive is missing data in my indicators of interest?
Is the probability of a country not having data likely correlated with the indicator of interest?
[If producing time series] How much do the global statistics change from year to year and do the same countries consistently have missing values?
[If missing data is imputed] How confident am I in the precision of the imputations?
[If producing sub-global statistics] How large are the groups and how much of the variation happens between subgroups rather than within subgroups?

Judging from the table comparing standard deviations with original units, our (admittedly subjective) take is that the expected error should never exceed 0.25 standard deviations. Even in the less optimistic cases presented in the paper, this roughly corresponds to not publishing statistics when data are available for less than half of the relevant population.

Authors

Daniel Gerszon Mahler

Umar Serajuddin

Hiroko Maeda
