The widespread availability of free geospatial data, together with advances in computing power and machine learning, has accelerated interest in data fusion. The process augments traditional sample surveys with geographically comprehensive data to generate more precise and accurate estimates of socioeconomic indicators at more granular levels.
Much of the World Bank’s recent work on this topic has focused on using data fusion techniques to obtain small area estimates for districts or subdistricts, one or two levels below what sample surveys typically report (recent examples come from Malawi, Mexico, Tanzania, and Sri Lanka). In contrast, an important recent study by researchers at UC Berkeley and Meta generated hyperlocal estimates of wealth at the village level and released them publicly as the Meta Relative Wealth Index. Evaluations indicate that this wealth index is good at predicting asset ownership at the village level. For example, the correlation between the Meta wealth index estimates and an asset index calculated from an independent rural microcensus in Kenya is about 0.84. But this correlation falls to about 0.40 when the index is compared with a predicted poverty measure from the same survey.
Why is the Meta index so much worse at predicting poverty than asset ownership?
Poverty is based on household consumption or income and adjusts for household size. All else equal, larger households are more likely to be poor, because poor households spend most of their income on goods like food or clothes that can only be used by one household member at a time. Assets, on the other hand, are assumed to be shared equally by all household members. Because of this and other conceptual differences between asset ownership and poverty, the Meta wealth index is a much better measure of the former than the latter.
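To make the household-size point concrete, here is a minimal sketch in Python. The two households, the asset index values, and the poverty line are all made up for illustration; they do not come from the study.

```python
def is_poor(total_consumption, household_size, poverty_line):
    """Flag a household as poor if per capita consumption is below the line."""
    return total_consumption / household_size < poverty_line


# Two hypothetical households with identical assets and total consumption,
# differing only in size.
households = [
    {"name": "A", "size": 2, "consumption": 1000.0, "asset_index": 0.5},
    {"name": "B", "size": 6, "consumption": 1000.0, "asset_index": 0.5},
]

POVERTY_LINE = 300.0  # hypothetical per capita poverty line

flags = {
    h["name"]: is_poor(h["consumption"], h["size"], POVERTY_LINE)
    for h in households
}
print(flags)
```

An asset-based index treats the two households identically, while the per capita measure classifies only the larger one as poor.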
Hyperlocal estimates for poverty, based on traditional consumption or income-based measures, would be very useful for targeting purposes. That’s where data fusion comes in. In a new study, we compared four methods for predicting average village consumption without a full census, in a sample of villages across 10 Malawian districts. The four methods we tested were:
1. A proxy means test (PMT) score calculated from the Unified Benefit Registry collected by Malawi’s National Statistics Office
2. The Meta Relative Wealth Index
3. Household survey data combined with publicly available geospatial data
4. Household survey data combined with publicly available geospatial data and a hypothetical partial registry
The hypothetical partial registry in the fourth option would collect proxy means test indicators from all households in a small sample of villages. We simulated one by sampling a portion of the census.
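A simulation of this kind could be sketched roughly as follows. This is an illustration only, not the study's code: the record layout (`village_id`, `household_id`, `roof_metal`), the `village_share` parameter, and its 10 percent default are assumptions chosen for the example.

```python
import random


def draw_partial_registry(census, village_share=0.1, seed=0):
    """Keep every household in a random subset of villages.

    census: list of dicts, each with a 'village_id' key (hypothetical schema).
    village_share: fraction of villages to include (assumed value).
    """
    villages = sorted({hh["village_id"] for hh in census})
    n_keep = max(1, round(village_share * len(villages)))
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    chosen = set(rng.sample(villages, n_keep))
    return [hh for hh in census if hh["village_id"] in chosen]


# Toy census: 20 villages with 5 households each and one made-up PMT indicator.
census = [
    {"village_id": v, "household_id": h, "roof_metal": (v + h) % 2}
    for v in range(20)
    for h in range(5)
]

registry = draw_partial_registry(census, village_share=0.1)
sampled_villages = {hh["village_id"] for hh in registry}
```

The key design feature is that sampling happens at the village level: once a village is chosen, all of its households are retained, unlike a household survey that interviews only a few households per village.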
The rank correlation between the partial registry estimates and the benchmark census estimates was 0.61, while the rank correlation with the next best option, the survey plus geospatial data, was only 0.19. The relative wealth index did worse, with a rank correlation of 0.14, followed by the Unified Benefit Registry. We also tried to predict the average per capita consumption of the bottom half of the households in each village, which is arguably a better measure of poverty than the average over all households. In that case, the partial registry did even better at ranking villages, with a rank correlation of 0.75 with the gold standard estimates constructed analogously using the census. The next best option was the Meta relative wealth index at 0.20. These are stark differences in predictive accuracy.
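For readers who want to run this kind of comparison on their own data, the sketch below shows one way to compute a Spearman rank correlation and a bottom-half village average using only the Python standard library. The function names and toy numbers are illustrative assumptions, and the exact estimator used in the study may differ.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def bottom_half_mean(consumption):
    """Mean per capita consumption of the poorer half of households."""
    ordered = sorted(consumption)
    half = max(1, len(ordered) // 2)
    return sum(ordered[:half]) / half


# Toy village-level values (made up): benchmark vs. predicted consumption.
benchmark = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.2, 1.9, 3.5, 3.1, 5.2]
rho = spearman(benchmark, predicted)
```

Because the rank correlation depends only on how villages are ordered, it suits targeting applications, where the goal is to identify the poorest villages rather than to pin down exact consumption levels.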
Why did the partial registry do so well?
Mostly because it provided new and better data that could unlock the potential of the publicly available geospatial data. To use geospatial data effectively, we need good training data that can accurately identify which villages are poor. It’s hard to get that training data from household surveys, because they typically sample a small number of households in each village. In addition, using the census data created a measure of predicted consumption which stripped away the random noise and transient shocks in measured consumption. Overall, the partial registry and publicly available geospatial data delivered 60 to 75 percent of the accuracy of a full census, at about 10 percent of the cost.
What does this imply for data systems?
So far, data systems have mainly emphasized household surveys, periodic census data, and occasionally full registries to guide targeting. Why not add partial registries, which collect a small number of easily obtained and predictive proxies for all households in a village, to the mix? These could perhaps piggyback on the existing infrastructure for community surveys and thereby improve the evidence base for targeting and evaluating interventions.