
From Jordan to Liberia: Imputing, modeling and measurement in a world of imperfect data

By Kristen Himelein

Simply stated, we never have enough data. This is true from the smallest low-income countries in Africa to the largest, most complex economies in the West. And the need grows continuously as interconnected world markets and leapfrogging technologies smash through any remaining notions of a standard path to prosperity. For many countries in the developing world, the unfortunate paradox is that they have the greatest needs but the fewest resources, both financial and in terms of capacity. In this setting, researchers in statistics and economics have been developing new techniques to expand the usefulness of limited data. The broad body of work is collected under the umbrella “survey-to-survey imputation” and includes two recently published papers in the World Bank Policy Research Working Paper series, “Updating Poverty Estimates at Frequent Intervals in the Absence of Consumption Data: Methods and Illustration with Reference to a Middle-Income Country,” by Hai-Anh Dang, Peter Lanjouw, and Umar Serajuddin, and “Estimating Poverty in the Absence of Consumption Data: The Case of Liberia,” by Andrew Dabalen, Errol Graham, Kristen Himelein, and Rose Mungai. (Fortunately the authors are much more creative in their approach to analysis than in their approach to naming papers.)

The principle behind imputation is deceptively simple. If poverty is correlated with other household and individual characteristics, and if we know these characteristics from another more recent or more frequent source, then we can predict poverty. In a simple example, if we discovered from previous analysis that the only two things that matter for household consumption were the age of the members and their years of education, then instead of spending tedious hours collecting consumption data, we could just quickly collect age and education and use our model to predict consumption.
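To make the age-and-education example concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration, not something taken from either paper: the data are simulated, the coefficients, noise level and poverty line are invented, and the model is a plain least-squares regression of log consumption on the two characteristics. The residual draw added to each prediction stands in for the variation the model cannot explain; real applications typically repeat such draws many times.

```python
import math
import random

random.seed(0)

def fit_ols(rows, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting on the augmented matrix.
    a = [xtx[i] + [xty[i]] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * p for x, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

def simulate_household():
    """Hypothetical household: age of the head and years of education."""
    return random.uniform(20, 70), random.uniform(0, 16)

# Invented "true" relationship, used only to generate the toy data.
TRUE_BETA = [1.0, 0.01, 0.08]   # intercept, age, education
NOISE_SD = 0.3
POVERTY_LINE = 1.8              # in log consumption, hypothetical

# Step 1: the older survey has consumption AND characteristics.
base_x, base_y = [], []
for _ in range(500):
    age, education = simulate_household()
    base_x.append([1.0, age, education])
    base_y.append(TRUE_BETA[0] + TRUE_BETA[1] * age
                  + TRUE_BETA[2] * education + random.gauss(0, NOISE_SD))

# Step 2: estimate the model and the spread of its residuals.
beta = fit_ols(base_x, base_y)
residuals = [y - sum(b * x for b, x in zip(beta, row))
             for row, y in zip(base_x, base_y)]
sigma = math.sqrt(sum(e * e for e in residuals) / (len(residuals) - len(beta)))

# Step 3: the newer survey has characteristics only; impute consumption.
target_x = [[1.0, *simulate_household()] for _ in range(500)]
imputed = [sum(b * x for b, x in zip(beta, row)) + random.gauss(0, sigma)
           for row in target_x]

# Step 4: the poverty rate is the share of imputed values below the line.
poverty_rate = sum(value < POVERTY_LINE for value in imputed) / len(imputed)
print(f"estimated coefficients: {[round(b, 3) for b in beta]}")
print(f"imputed poverty rate: {poverty_rate:.2f}")
```

The whole approach rests on the relationship estimated in step 2 still holding when the second survey is collected.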

The devil always lurks in the details, however. First, it is not always easy to develop a good model of consumption. Even when the predictive power is high, spurious correlation and weird interactions can throw off results. And there is always the danger of over-fitting. The world also stubbornly refuses to stand still for us. Take the infamous case of cell phones. In the early 2000s in most parts of Africa, cell phones were a luxury item. Handsets were expensive and cell towers were found only in major urban centers. A model developed at this time would assign a very high predicted level of consumption to households with cell phones. But as time passed, prices dropped and coverage expanded. The relationship between cell phones and wealth changed dramatically. The outdated model would then predict the poorest farmer with an obsolete Nokia to be a rich man.
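The cell phone problem can be shown with a deliberately stylized simulation. The setup below is assumed purely for illustration: log consumption is drawn from the same distribution in both periods, and phone ownership is decided by a hard cutoff that falls over time, so that only the richest households own phones early on while nearly everyone does later. The "model" is simply the average log consumption of owners and non-owners in the early period.

```python
import random

random.seed(1)

def simulate(n, phone_threshold):
    """Hypothetical households: those above the threshold own a phone."""
    households = []
    for _ in range(n):
        log_cons = random.gauss(0.0, 1.0)
        households.append((log_cons, log_cons > phone_threshold))
    return households

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Early period: phones are a luxury, owned only by the richest households.
early = simulate(5000, phone_threshold=1.0)
# Later period: prices have dropped and almost everyone owns a phone.
late = simulate(5000, phone_threshold=-1.0)

# "Model" estimated on the early data: mean log consumption by ownership.
mean_with_phone = mean(c for c, phone in early if phone)
mean_without_phone = mean(c for c, phone in early if not phone)

# Apply the outdated model to the later households.
predicted_late = mean(mean_with_phone if phone else mean_without_phone
                      for _, phone in late)
actual_late = mean(c for c, _ in late)

print(f"predicted mean log consumption: {predicted_late:.2f}")
print(f"actual mean log consumption:    {actual_late:.2f}")
```

Actual consumption has not changed between the two periods in this toy setup, yet the old model predicts a much richer population simply because more households now own a phone.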

A diverse group of researchers across many fields is constantly working to develop new and better methods. Dang and co-authors gallantly attempt to corral the wide-ranging discussion into a coherent framework, including bridging some of the perceived canyon between statistics and economics. To do this, they try to keep the restrictive assumptions to a minimum and focus on simple variance formulas. They also “propose formal tests for our general assumption as well as for this traditional but more restrictive assumption,” which would very helpfully assist in spotting issues similar to the cell phone problem discussed above. And finally, because no applied econometrics meal would be complete without a quantitative dessert, there is a case study using the Household Expenditure and Income Survey and the Unemployment and Employment Survey in Jordan.

In contrast to the focus by Dang and co-authors on methods and assumption testing, the Dabalen et al. paper works through a more practical example with the development and application of a model for Liberia. It starts by comparing the survey-to-survey imputation techniques to another common methodology for well-being estimation in the absence of consumption data, the asset index. While the asset-based model has definite uses in the health and education fields, in the type of ranking comparisons done in poverty research, it comes up a bit short. The remainder of the paper focuses on the choice of variables for an imputation model. Using variance decomposition, the paper finds first that the model works less well in rural areas than in urban areas. This is true even when the models are developed completely separately. In both areas, demographics (mainly household size) were important, as were the characteristics of the dwelling (roof, floor, walls, etc.). In urban areas, though, the characteristics of the household head (including education) and the household assets were able to explain more of the variance, giving urban areas a better fit overall. Land holdings also explained more of the variance in urban areas, which seemed contradictory in a highly agrarian society. The authors further tried to incorporate auxiliary information from outside of the dataset into the model, using rainfall and ‘greenness’ measures, but rainfall varied so much between the years that model estimation failed.

Both of these papers note that there is more to be done (probably not coincidentally, as both sets of researchers have more ongoing work on the subject). The hope is that one day soon we will be able to develop a clear set of guidelines on approaching survey-to-survey imputation, including the necessary assumptions and the tests needed to be confident of the results. A better understanding of what modeling can and cannot do has the potential to change how we think of data collection (though perhaps not quite to the degree that we think we have enough).
