# Robustly wrong? New methods for cleaning survey data on incomes: Guest post by Martin Ravallion

Survey responses to questions on incomes (and other potentially sensitive topics) are likely to contain errors, which could go in either direction and be found at all levels of income. There is probably also non-random selection in terms of who agrees to be interviewed, implying that we get the weights wrong too (as used to "gross-up" the sample estimates to the population).

Naturally we do not know a lot about the extent of these errors, though we do know they have implications for measures of poverty and inequality, as often used in assessing development impact. Some statistics are more robust than others to data contamination. Alas, standard inequality indices are not particularly robust, and may even be quite sensitive to errors in measuring incomes. (For a recent discussion see the paper by Frank Cowell and Emmanuel Flachaire in the Journal of Econometrics 2007.)
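A small illustration of this sensitivity, using made-up numbers rather than any real survey: the sketch below computes an unweighted Gini index and a median for ten hypothetical incomes, then repeats the calculation after a single value is mis-recorded with an extra zero. The data and the function names are assumptions for illustration only.

```python
def gini(incomes):
    """Unweighted Gini index of a list of non-negative incomes."""
    xs = sorted(incomes)
    n = len(xs)
    total = sum(xs)
    # Rank-based formula: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    ranked = sum(i * x for i, x in enumerate(xs, start=1))
    return 2.0 * ranked / (n * total) - (n + 1.0) / n


def median(values):
    """Median of a list of numbers."""
    xs = sorted(values)
    n = len(xs)
    mid = n // 2
    return xs[mid] if n % 2 else 0.5 * (xs[mid - 1] + xs[mid])


# Hypothetical incomes (illustrative only).
clean = [10, 12, 15, 18, 20, 22, 25, 30, 35, 40]
# Hypothetical recording error: the top income gains an extra zero.
contaminated = clean[:-1] + [400]

gini_clean = gini(clean)
gini_contaminated = gini(contaminated)
median_clean = median(clean)
median_contaminated = median(contaminated)
```

In this toy sample one bad data point more than doubles the Gini index while leaving the median unchanged, which is the sense in which some statistics are more robust than others.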

Understandably, analysts have often asked: what might be done to “clean” the data before calculating the poverty or inequality measures, so as to obtain more reliable measures?

Amongst many interesting sessions at the International Statistical Institute’s 58th World Congress in Dublin in August there was a presentation of five papers from the AMELI project. AMELI is a cute acronym for (not-so-cute) “Advanced Methodology for European Laeken Indicators.” The project is based at University of Trier in Germany but brings together a team of statisticians across a number of European universities and governmental statistics offices. Their common aim is to find better ways of cleaning socio-economic data on incomes for measuring poverty and inequality, notably the income-based measures within the European Council’s set of so-called "Laeken indicators" of poverty and social inclusion. The AMELI project is funded by the European Commission. The session at the ISI conference was organized by Beat Hulliger from the University of Applied Sciences, Switzerland.

The AMELI researchers are coming up with some ingenious methods of cleaning “data that do not look good” (as one presenter in the ISI session put it). There are some curve-fitting models for income distributions (using both established parametric and new semi-parametric methods). Such methods have long been used with grouped data. But the idea here is different. By these methods, the actual data points are replaced by the model-based predicted values even when the micro-data are available. For example, one paper studied the fit of the Generalized Beta Type 2 distribution, while another paper was concerned with fitting a Pareto distribution to the upper tail. (The Cowell-Flachaire paper proposed a semiparametric method for this purpose and showed that it is less sensitive to errors.)
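To make the idea of replacing data points with model-based values concrete, here is a minimal sketch for the Pareto case. It is not the AMELI procedure: it assumes a simple maximum-likelihood (Hill-type) estimate of the Pareto shape above a hand-picked threshold, and replaces the observed tail by Pareto quantiles at evenly spaced plotting positions. The sample, the threshold and the function names are all hypothetical.

```python
import math


def fit_pareto_tail(incomes, threshold):
    """Maximum-likelihood (Hill-type) estimate of the Pareto shape above threshold."""
    tail = [x for x in incomes if x > threshold]
    if not tail:
        raise ValueError("no observations above the threshold")
    alpha = len(tail) / sum(math.log(x / threshold) for x in tail)
    return alpha, len(tail)


def pareto_predicted_tail(alpha, threshold, k):
    """Model-based tail values: Pareto quantiles at plotting positions (i - 0.5) / k."""
    return [
        threshold * (1.0 - (i - 0.5) / k) ** (-1.0 / alpha)
        for i in range(1, k + 1)
    ]


# Hypothetical incomes (illustrative only).
incomes = [8, 11, 14, 17, 21, 26, 33, 45, 70, 300]
threshold = 30.0  # assumed cut-off for the "upper tail"

alpha, k = fit_pareto_tail(incomes, threshold)
predicted_tail = pareto_predicted_tail(alpha, threshold, k)
# Keep the observed values below the threshold, swap in fitted values above it.
smoothed = [x for x in incomes if x <= threshold] + predicted_tail
```

With these made-up numbers the largest reported income is replaced by a smaller fitted value, so the smoothing pulls the very top of the distribution downward.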

Other work presented in the ISI session used an “epidemic algorithm” to detect outliers and impute seemingly better values. This method simulates an epidemic in the cloud of data points on incomes. Starting in the middle, the epidemic spreads through the point cloud by infecting the close data points first, with those points that are infected late being deemed to be “outliers” that are “corrupting” the implied poverty and inequality measures. The idea comes from a paper by Cédric Béguin and Beat Hulliger in the Journal of the Royal Statistical Society Series A, 2004. A paper by Hulliger and Tobias Schoch presented at the ISI session proposed a reverse epidemic algorithm, which uses the same idea for imputing a seemingly better value for the supposed outlier, by starting a new epidemic at each outlier and using the earliest infected units for imputation. (I found a version of their presentation here.)
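The flavor of the detection step can be conveyed with a deliberately simplified toy, which should not be read as the Béguin-Hulliger algorithm itself. The sketch assumes one-dimensional data (log incomes), a deterministic spread in which each newly infected point infects every uninfected point within a fixed distance `reach`, and it treats "never infected" as the extreme case of "infected late". The data, `reach` and the function name are assumptions.

```python
import math


def epidemic_infection_times(values, reach):
    """Deterministic toy epidemic on a 1-D point cloud.

    Starts at the point closest to the middle of the sorted values. At each
    step, the newly infected points infect all uninfected points within
    `reach`. Returns each point's infection time (None if never reached).
    """
    n = len(values)
    center = sorted(values)[n // 2]
    start = min(range(n), key=lambda i: abs(values[i] - center))
    times = [None] * n
    times[start] = 0
    frontier = [start]
    step = 0
    while frontier:
        step += 1
        newly_infected = [
            j for j in range(n)
            if times[j] is None
            and any(abs(values[j] - values[i]) <= reach for i in frontier)
        ]
        for j in newly_infected:
            times[j] = step
        frontier = newly_infected
    return times


# Hypothetical incomes (illustrative only); the epidemic runs on log incomes.
incomes = [8, 11, 14, 17, 21, 26, 33, 45, 70, 300]
log_incomes = [math.log(x) for x in incomes]

times = epidemic_infection_times(log_incomes, reach=0.5)
outliers = [x for x, t in zip(incomes, times) if t is None]
```

In this toy sample the only point the epidemic fails to reach is the highest income, which is exactly the kind of observation such a rule would flag.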

There are some nagging questions left begging in all this. How do we know that predicted values from a model calibrated to the survey data are better than the actual data? How do we know that the extreme values for incomes that are detected by methods such as the epidemic algorithm are really errors? How do we know that these high reported incomes are too high? While we can agree that inequality indices over time may be “under-smoothed,” given measurement errors, how do we know just how smooth they should be? Could we even “smooth-away” important economic changes and shocks with distributional impacts?

Plainly, an extreme value for income is not necessarily an error, and (even if it is) it is not necessarily an over-estimate of the true value. However (as I pointed out in the open discussion at the ISI session on the AMELI project), methods such as the reverse epidemic algorithm will never lead the analyst to impute an even larger income to a high income data point. The method will impute lower incomes to very rich people.
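The one-sidedness is easy to see in a stylized version of donor-based imputation. The sketch below swaps in a plain nearest-neighbour rule as a stand-in for the reverse epidemic algorithm (it is not that algorithm): the flagged value is replaced by the mean of its `k` closest other observations. The data, `k` and the function name are hypothetical.

```python
def impute_from_nearest(values, outlier_index, k=2):
    """Replace one flagged value by the mean of its k nearest other observations."""
    target = values[outlier_index]
    donors = sorted(
        (v for i, v in enumerate(values) if i != outlier_index),
        key=lambda v: abs(v - target),
    )[:k]
    return sum(donors) / len(donors)


# Hypothetical incomes (illustrative only).
incomes = [8, 11, 14, 17, 21, 26, 33, 45, 70, 300]
top = incomes.index(max(incomes))
imputed = impute_from_nearest(incomes, top)
```

Because every donor for the sample maximum lies at or below it, the imputed value here cannot exceed the reported income; it can only stay the same or fall.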

There’s the rub: We have a strong prior that the rich do not participate in surveys as readily as do the poor, and the rich may often under-state their incomes when they do participate. If so, then the raw survey data under-weight high incomes. Yet these sophisticated cleaning methods risk making things even worse, by attenuating (or down-weighting) the high incomes in survey data. We could well end up with an even larger under-estimate of the extent of inequality.

Other data can probably help. If reliable data are available on covariates of incomes from the same survey then one could use a regression-adjustment, focusing instead on the residuals. Of course, it remains that the residuals will invariably include true but unobserved factors as well as errors. Bringing in non-survey data from other sources might also help. Comparisons of high incomes from surveys with those from income tax returns—which are presumably more reliable given penalties—have regularly confirmed that surveys under-state incomes of the rich (as was pointed out by the discussant for the ISI session, Yves Tillé from the University of Neuchâtel).

An alternative approach might be to try to better understand the psychological and economic behavior of survey respondents, as a means of re-weighting the data. My paper with Anton Korinek and Johan Mistiaen in the Journal of Econometrics 2007 developed such a method, in which we showed how to estimate the micro probability of compliance in a randomized assignment based on the spatial distribution of ex-post survey response rates and the (mis-measured) area-specific income distributions. It is a tricky problem, since we cannot observe incomes of non-compliers. But under certain conditions one can still retrieve a unique micro-compliance function consistent with the observed data on response rates and income distributions across areas. Armed with such a model, one can re-weight the data to cancel out the effect of selective compliance with the randomized assignment. Our application for the US indicated that individual response rates fell sharply at high incomes and that re-weighting the data indicated substantially higher inequality in the US than has been thought: an extra 5 percentage points in the Gini index. Measures of poverty are more robust, though still lower after our re-weighting, as expected.
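The re-weighting step alone (not the estimation of the compliance function, which is the hard part) can be sketched as follows. The sketch assumes the compliance probability is already known and takes a hypothetical logistic form that declines in log income; the parameters `a`, `b` and `scale`, the data and the function names are invented for illustration and are not the estimates from the paper.

```python
import math


def weighted_gini(incomes, weights):
    """Gini index with weights, from the weighted Lorenz curve (trapezoid rule)."""
    pairs = sorted(zip(incomes, weights))
    total_w = sum(w for _, w in pairs)
    total_y = sum(x * w for x, w in pairs)
    cum_y = 0.0
    area = 0.0  # area under the Lorenz curve
    for x, w in pairs:
        prev_y = cum_y
        cum_y += x * w
        area += (w / total_w) * (prev_y + cum_y) / (2.0 * total_y)
    return 1.0 - 2.0 * area


def compliance_probability(income, a=2.0, b=0.8, scale=30.0):
    """Hypothetical logistic response probability that falls as income rises."""
    z = a - b * math.log(income / scale)
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical incomes of survey respondents (illustrative only).
incomes = [8, 11, 14, 17, 21, 26, 33, 45, 70, 300]

# Each respondent stands in for 1 / P(compliance) people like them.
weights = [1.0 / compliance_probability(x) for x in incomes]

unweighted = weighted_gini(incomes, [1.0] * len(incomes))
reweighted = weighted_gini(incomes, weights)
```

Under these assumptions the high-income respondents receive larger weights and the re-weighted Gini index comes out above the unweighted one, which is the direction of the correction described above.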

Income distribution data invariably do not “look good,” given their skewed distributions and fat tails. That is the reality. So too is the problem of measurement error in survey data. We should not assume that the extreme (high) values for incomes are over-estimates. Indeed, it would be more reasonable to assume that high reported incomes are under-weighted rather than over-weighted. In my view, the most promising route to convincingly correcting the data is to bring in other information (such as covariates or data from independent income tax records) and insights (such as from modeling survey response behavior). Short of that, we have to be wary of measures of poverty and inequality that have gone through heavy massaging of the primary survey data.
