# Robustly wrong? New methods for cleaning survey data on incomes: Guest post by Martin Ravallion

Survey responses to questions on incomes (and other potentially sensitive topics) are likely to contain errors, which could go in either direction and be found at all levels of income. There is probably also non-random selection in terms of who agrees to be interviewed, implying that we get the weights wrong too (as used to "gross-up" the sample estimates to the population).

Naturally we do not know a lot about the extent of these errors, though we do know they have implications for measures of poverty and inequality, as often used in assessing development impact. Some statistics are more robust than others to data contamination. Alas, standard inequality indices are not particularly robust, and may even be quite sensitive to errors in measuring incomes. (For a recent discussion see the paper by Frank Cowell and Emmanuel Flachaire in the Journal of Econometrics 2007.)

Understandably, analysts have often asked: what might be done to “clean” the data before calculating the poverty or inequality measures, so as to obtain more reliable measures?

Amongst many interesting sessions at the International Statistical Institute’s 58th World Congress in Dublin in August there was a presentation of five papers from the AMELI project. AMELI is a cute acronym for (not-so-cute) “Advanced Methodology for European Laeken Indicators.” The project is based at the University of Trier in Germany but brings together a team of statisticians across a number of European universities and governmental statistics offices. Their common aim is to find better ways of cleaning socio-economic data on incomes for measuring poverty and inequality, notably the income-based measures within the European Council’s set of so-called "Laeken indicators" of poverty and social inclusion. The AMELI project is funded by the European Commission. The session at the ISI conference was organized by Beat Hulliger from the University of Applied Sciences, Switzerland.

The AMELI researchers are coming up with some ingenious methods of cleaning “data that do not look good” (as one presenter in the ISI session put it). There are some curve-fitting models for income distributions (using both established parametric and new semi-parametric methods). Such methods have long been used with grouped data. But the idea here is different. By these methods, the actual data points are replaced by the model-based predicted values even when the micro-data are available. For example, one paper studied the fit of the Generalized Beta Type 2 distribution, while another paper was concerned with fitting a Pareto distribution to the upper tail. (The Cowell-Flachaire paper proposed a semiparametric method for this purpose and showed that it is less sensitive to errors.)
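To make the tail-fitting idea concrete, here is a minimal sketch in Python. It is not the AMELI estimator, whose details are not given here. It assumes a known threshold, uses the textbook maximum-likelihood estimate of the Pareto shape, and replaces each tail observation by the fitted quantile at its rank. The function names, the threshold and the rank-to-probability rule `(rank + 0.5) / k` are all illustrative assumptions.

```python
import math
import random


def fit_pareto_shape(incomes, threshold):
    """Maximum-likelihood estimate of the Pareto shape for incomes above `threshold`."""
    tail = [y for y in incomes if y > threshold]
    if not tail:
        raise ValueError("no incomes above the threshold")
    return len(tail) / sum(math.log(y / threshold) for y in tail)


def pareto_quantile(p, threshold, shape):
    """Income at cumulative probability p (0 <= p < 1) of a Pareto tail."""
    return threshold * (1.0 - p) ** (-1.0 / shape)


def replace_tail_with_fitted(incomes, threshold, shape):
    """Swap each income above `threshold` for the fitted quantile at its rank.

    The plotting position (rank + 0.5) / k is an assumption of this sketch.
    """
    ranked = sorted(
        (i for i, y in enumerate(incomes) if y > threshold),
        key=lambda i: incomes[i],
    )
    cleaned = list(incomes)
    for rank, i in enumerate(ranked):
        cleaned[i] = pareto_quantile((rank + 0.5) / len(ranked), threshold, shape)
    return cleaned


# Synthetic tail drawn from a Pareto law with a known shape (illustrative values).
rng = random.Random(0)
TRUE_SHAPE, THRESHOLD = 2.0, 100_000.0
sample = [pareto_quantile(rng.random(), THRESHOLD, TRUE_SHAPE) for _ in range(5000)]

shape_hat = fit_pareto_shape(sample, THRESHOLD)
cleaned = replace_tail_with_fitted(sample, THRESHOLD, shape_hat)
```

Note that the cleaned tail can never exceed the fitted quantile at the highest rank, so the most extreme reported incomes are pulled toward whatever the model predicts.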

Other work presented in the ISI session used an “epidemic algorithm” to detect outliers and impute seemingly better values. This method simulates an epidemic in the cloud of data points on incomes. Starting in the middle, the epidemic spreads through the point cloud by infecting the close data points first, with those points that are infected late being deemed to be “outliers” that are “corrupting” the implied poverty and inequality measures. The idea comes from a paper by Cédric Béguin and Beat Hulliger in the Journal of the Royal Statistical Society Series A, 2004. A paper by Hulliger and Tobias Schoch presented at the ISI session proposed a reverse epidemic algorithm, which uses the same idea for imputing a seemingly better value for the supposed outlier, by starting a new epidemic at each outlier and using the earliest infected units for imputation. (I found a version of their presentation here.)
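The mechanics can be illustrated with a toy version in Python. This is a deliberately simplified sketch and not the published algorithm: transmission is deterministic within a fixed `radius` (an assumption), the starting point is the observation closest to the coordinate-wise median, and the "reverse" step simply borrows the nearest non-outlier as donor. The data are made up, with the first coordinate standing for log income.

```python
import math
import statistics


def infection_times(points, radius):
    """Spread a deterministic 'epidemic' from the most central point.

    Each round, every uninfected point within `radius` of a point infected
    in the previous round becomes infected. Returns the round in which each
    point was infected, or None if the epidemic never reached it.
    """
    center = tuple(statistics.median(coord) for coord in zip(*points))
    start = min(range(len(points)), key=lambda i: math.dist(points[i], center))
    times = [None] * len(points)
    times[start] = 0
    frontier, step = [start], 0
    while frontier:
        step += 1
        newly = [
            j for j in range(len(points))
            if times[j] is None
            and any(math.dist(points[j], points[i]) <= radius for i in frontier)
        ]
        for j in newly:
            times[j] = step
        frontier = newly
    return times


def impute_from_nearest_clean(points, outliers):
    """Toy 'reverse' step: each outlier takes the value of its nearest non-outlier."""
    clean = [i for i in range(len(points)) if i not in outliers]
    return {
        o: points[min(clean, key=lambda i: math.dist(points[o], points[i]))]
        for o in outliers
    }


# Made-up observations: (log income, a second survey variable).
points = [(10.0, 1.0), (10.2, 1.1), (10.4, 0.9), (10.6, 1.0), (10.8, 1.2), (14.0, 1.0)]
times = infection_times(points, radius=0.3)
outliers = {i for i, t in enumerate(times) if t is None}
imputed = impute_from_nearest_clean(points, outliers)
```

In this toy data set the epidemic never reaches the household with log income 14.0, and the donor it receives is the observation with log income 10.8. By construction the donor comes from the bulk of the data, so a high reported income can only be revised downward.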

There are some nagging questions left unanswered in all this. How do we know that predicted values from a model calibrated to the survey data are better than the actual data? How do we know that the extreme values for incomes that are detected by methods such as the epidemic algorithm are really errors? How do we know that these high reported incomes are too high? While we can agree that inequality indices over time may be “under-smoothed,” given measurement errors, how do we know just how smooth they should be? Could we even “smooth-away” important economic changes and shocks with distributional impacts?

Plainly, an extreme value for income is not necessarily an error, and (even if it is) it is not necessarily an over-estimate of the true value. However (as I pointed out in the open discussion at the ISI session on the AMELI project), methods such as the reverse epidemic algorithm will never lead the analyst to impute an even larger income to a high income data point. The method will impute lower incomes to very rich people.

There’s the rub: We have a strong prior that the rich do not participate in surveys as readily as do the poor, and the rich may often under-state their incomes when they do participate. If so, then the raw survey data under-weight high incomes. Yet these sophisticated cleaning methods risk making things even worse, by attenuating (or down-weighting) the high incomes in survey data. We could well end up with an even larger under-estimate of the extent of inequality.
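A small arithmetic illustration of the concern, using made-up incomes and a hypothetical cleaning rule (the top income is declared an outlier and replaced by the next-highest reported value):

```python
def gini(incomes):
    """Gini index of equally weighted incomes (0 means perfect equality)."""
    ordered = sorted(incomes)
    n, total = len(ordered), sum(ordered)
    rank_weighted = sum(rank * y for rank, y in enumerate(ordered, start=1))
    return 2.0 * rank_weighted / (n * total) - (n + 1) / n


# Made-up survey incomes; the last respondent is the only rich household.
reported = [10, 15, 20, 25, 30, 35, 40, 50, 60, 300]

# Hypothetical cleaning step: treat the top income as an outlier and
# replace it with the next-highest reported value.
cleaned = reported[:-1] + [60]

gini_reported = gini(reported)
gini_cleaned = gini(cleaned)
```

Here the Gini index falls from roughly 0.53 to roughly 0.28 after cleaning. If the reported 300 was in fact correct, or even an under-statement, the cleaned figure moves away from the truth rather than toward it.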

Other data can probably help. If reliable data are available on covariates of incomes from the same survey then one could use a regression-adjustment, focusing instead on the residuals. Of course, it remains that the residuals will invariably include true but unobserved factors as well as errors. Bringing in non-survey data from other sources might also help. Comparisons of high incomes from surveys with those from income tax returns—which are presumably more reliable given penalties—have regularly confirmed that surveys under-state incomes of the rich (as was pointed out by the discussant for the ISI session, Yves Tillé from the University of Neuchâtel).

An alternative approach might be to try to better understand the psychological and economic behavior of survey respondents, as a means of re-weighting the data. My paper with Anton Korinek and Johan Mistiaen in the Journal of Econometrics 2007 developed such a method, in which we showed how to estimate the micro probability of compliance in a randomized assignment based on the spatial distribution of ex-post survey response rates and the (mis-measured) area-specific income distributions. It is a tricky problem, since we cannot observe incomes of non-compliers. But under certain conditions one can still retrieve a unique micro-compliance function consistent with the observed data on response rates and income distributions across areas. Armed with such a model, one can re-weight the data to cancel out the effect of selective compliance with the randomized assignment. Our application for the US indicated that individual response rates fell sharply at high incomes and that re-weighting the data indicated substantially higher inequality in the US than has been thought: an extra 5 percentage points in the Gini index. Measures of poverty are more robust, though still lower after our re-weighting, as expected.
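The re-weighting step itself (not the harder task of estimating the compliance function from area-level response rates) can be sketched as follows. The response function below is simply assumed: it is logistic in log income with made-up parameters `midpoint` and `slope`, and the incomes are invented for illustration.

```python
def response_probability(income, midpoint=300.0, slope=1.0):
    """Assumed chance that a sampled household responds; falls as income rises."""
    return 1.0 / (1.0 + (income / midpoint) ** slope)


def weighted_gini(incomes, weights):
    """Gini index from the weighted Lorenz curve (trapezoid rule)."""
    pairs = sorted(zip(incomes, weights))
    total_weight = sum(w for _, w in pairs)
    total_income = sum(y * w for y, w in pairs)
    cumulative, area = 0.0, 0.0
    for y, w in pairs:
        previous = cumulative
        cumulative += y * w
        area += w * (previous + cumulative)
    return 1.0 - area / (total_weight * total_income)


# Made-up incomes of the households that agreed to be interviewed.
respondents = [10, 15, 20, 25, 30, 35, 40, 50, 60, 300]
equal_weights = [1.0] * len(respondents)

# Each respondent stands in for 1 / P(respond) sampled households.
corrected_weights = [1.0 / response_probability(y) for y in respondents]

gini_raw = weighted_gini(respondents, equal_weights)
gini_reweighted = weighted_gini(respondents, corrected_weights)
```

With these numbers the Gini index rises from roughly 0.53 to roughly 0.56 once the rich respondent is given a weight of 2. The size, and in small samples even the direction, of the change depends on the assumed response function and on the data, so this shows only the mechanics.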

Income distribution data invariably do not “look good,” given their skewed distributions and fat tails. That is the reality. So too is the problem of measurement error in survey data. We should not assume that the extreme (high) values for incomes are over-estimates. Indeed, it would be more reasonable to assume that high reported incomes are under-weighted rather than over-weighted. In my view, the most promising route to convincingly correcting the data is to bring in other information (such as covariates or data from independent income tax records) and insights (such as from modeling survey response behavior). Short of that, we have to be wary of measures of poverty and inequality that have gone through heavy massaging of the primary survey data.

## Join the Conversation

Thanks to Martin Ravallion for starting this blog and for the discussion at the ISI meeting! I am glad that the issue of outliers gets the attention it deserves! The AMELI team has had heated discussions on issues Martin Ravallion raises. A few remarks hopefully clarify some issues and bring the discussion forward.

1) The discussion of outliers is to a large extent a discussion about models. The basic idea of robust statistics is not to model the outliers but to have methods which remain useful even if the model is not perfectly right. If there is enough data to model the outliers then of course this might lead to better results. However, we are talking here of a few outliers and fitting models to a few observations might not be a stable procedure.

2) The distinction between representative and non-representative outliers introduced by Chambers (JASA 1986) is important also for the AMELI project. The AMELI universes contain representative outliers since they are constructed from real data and outliers are considered correct. The targets for the poverty and inequality measures of the AMELI project consider these representative outliers fully. Chasing the non-representative outliers, which are added later to the universes as mistakes, and comparing methods with the original targets shows how vulnerable the standard methods are.

3) Among the few multivariate methods that are capable of coping with multivariate outliers when missing values are present and when the data stem from a complex sample survey are the BACON-EEM (Survey Methodology 2008: http://www.statcan.gc.ca/cgi-bin/af-fdr.cgi?l=eng&loc=http://www.statca…) and the Epidemic Algorithm (JRSS A 2004). Since only non-representative outliers should be detected, the imputation should not create non-representative outliers either. Of course there might be room to be less stringent when imputing observations than when detecting them. However, the basic question in the face of an "outlier" is always whether it is in fact a correct observation or not. In practice this is difficult to establish, costs quite a lot, and in addition is usually not repeatable/reproducible.

4) Even representative outliers which are correct might have to be down-weighted in order to control the variability of inequality measures. My article on the robustified Horvitz-Thompson estimator (Hulliger 1995, Survey Methodology) and others explains this basic trade-off between variance estimation and bias. To put it plainly, it is not useful to have a Gini-Index or a Quintile Share Ratio which jumps up and down every year just because the sample contains a few more outliers in one year than in another! The problem is where exactly to settle the trade-off.

5) Of course it is best if we get more information, be it tax information or other auxiliary information. We then can model our data. But do we really believe that the models apply to all observations? To the very rich in particular? Tax information has been tested in Switzerland and compared with SILC. For the moment it seems neither possible to use tax information due to privacy concerns nor advisable because differences between the standardized income variable of the SILC surveys and the income declared to the tax administration are too large.

6) Modelling the participation process is part of the survey estimation procedures used for the SILC surveys. The SILC surveys try to do their best. But of course the derived weights, be it by Propensity Score Adjustments or from Calibration methods etc., never fully capture the complex participation process in particular of rich people. Up-weighting rich people further might be advisable when they are under-reported. However the amount of up-weighting of the rich is probably as difficult as deciding on a tuning constant of a robust procedure. Anyway, the procedures used for outlier detection and imputation in the AMELI project take the survey weights and non-response adjustments fully into account. The simulations show how heavy the impact of outliers is beyond non-response adjustments and that robust procedures give tools to deal with them.

7) All income data which is used for inequality measures goes through heavy massaging before any inequality measure is calculated. The massaging might be so heavy and so irreproducible that we prefer not to know. Robust procedures have the advantage that they are transparent and reproducible and they put the statistician before the problem of choosing tuning constants to address the trade-off between non-representative and representative outliers and between bias and variance.

We hope that the AMELI project contributed to the solution of the problems about extreme incomes and inequalities. There should be a series of deliverables available soon on http://ameli.surveystatistics.net and a series of articles should be published in journals. However, and this might be the main point: There is no simple solution to the problem of outliers and different approaches may be appropriate in different circumstances.

A final remark on inequality measures: They are inherently non-robust since they measure a very sensitive functional of the population distribution. Should we not change our attitude and replace a theoretically nice functional like the Gini-Index or the Quintile Share Ratio by a more reasonably estimable functional, which does not cater for the very extremes of a distribution?