Checking survey quality with Benford's law


Last week I blogged about faked data in household surveys and a neat paper which told us how it might matter for results, but also gave us some tools to find the fakes. This week, I want to focus on one of the tools that paper used: Benford’s Law.
 
I am basing my discussion today on a nice paper in the Journal of Human Resources in 2009 by George Judge and Laura Schechter (gated and non-gated (older) versions). Judge and Schechter take us through a number of datasets and give us a bunch of things to chew on as we try to think about data quality.
 
Let’s start with a recap of Benford’s law. It turns out that while it is named for Benford (who discussed it in a 1938 paper), the idea came from the mathematician Simon Newcomb in 1881. The idea is that not every digit has an equal chance of being the first digit of the numbers reported in datasets. In fact, first digits often (more on this later) follow the pattern:

P(first digit = d) = log10(1 + 1/d)

So a 1 shows up as the first digit about 30 percent of the time, while a 9 shows up only about 4.6 percent of the time.
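
If you want to see where those numbers come from, a quick sketch in Python (mine, not from the paper) just evaluates the formula for each digit:

```python
import math

# Benford's law: P(first digit = d) = log10(1 + 1/d) for d = 1, ..., 9
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

for d, p in benford.items():
    print(f"P(first digit = {d}) = {p:.3f}")
# 1 comes out around 0.301, 9 around 0.046
```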
 
So what can this law tell us? Judge and Schechter lay out some cautions. First, this obviously doesn’t work for all kinds of variables (e.g. categorical variables). Judge and Schechter also point out that it applies to cases where there is a “dynamic mixture of data outcomes whose resulting combination is unrestricted in terms of the possibility of spanning the nine-digit space.” Even with this condition, not all datasets will comply – some of the datasets in Benford’s own paper didn’t meet his exact distribution. But as Judge and Schechter point out, most of them show monotonically decreasing probabilities of individual numbers showing up as the first digit. So that’s the math side of things. On the practical side, we can use Benford’s law to see if data is being faked (as we saw last week) and/or if individual questions are getting less than accurate answers from the respondent.
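
To make the practical side concrete, here is a rough Python sketch (my own, not the procedure Judge and Schechter use) of checking one variable’s first digits against Benford’s distribution with a chi-square goodness-of-fit test; the variable name in the commented example is made up:

```python
import math
from collections import Counter

from scipy.stats import chisquare  # chi-square goodness-of-fit test

def first_digit(x):
    """Return the leading digit of a positive number, e.g. 0.042 -> 4."""
    return int(f"{abs(x):.15e}"[0])  # scientific notation puts the leading digit first

def benford_test(values):
    """Compare the first-digit distribution of `values` to Benford's law."""
    digits = [first_digit(v) for v in values if v > 0]
    counts = Counter(digits)
    observed = [counts.get(d, 0) for d in range(1, 10)]
    expected = [len(digits) * math.log10(1 + 1 / d) for d in range(1, 10)]
    return chisquare(f_obs=observed, f_exp=expected)

# Hypothetical use on a survey variable, e.g. reported crop production:
# stat, pvalue = benford_test(df["crop_production_kg"])
```

A small p-value flags a variable whose first digits look nothing like Benford’s distribution; it is a prompt to dig deeper, not proof of fabrication.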
 
Judge and Schechter take the law to nine datasets – ranging from US Department of Agriculture surveys to smaller academic endeavors. And their findings give us not only guidance on when to test, but also some pointers on where things could go wrong.
 
The first point is that some questions or categories of questions are going to get worse responses than others. I know this comes as a shock to those of you who collect data, but the neat thing here is that Benford’s law gives you a quick way to check your priors. Judge and Schechter illustrate this with a 2002 dataset from Paraguay and show that crop production data doesn’t do a good job of following Benford, but hectares of land and number of animals do. When they compare the 2002 crop data with 1999 data that didn’t capture as many different crop types (i.e. wasn’t as broad), they find that the more important crops (across a number of potential definitions) comply better with Benford’s law than the less important ones in both years.
 
So agricultural yield estimates are likely to be off because recall is hard, or because respondents are getting annoyed. There are also some things that we the researchers think are innocuous, but aren’t. Sure, you can pretest your questions, but the answers you get may be sufficiently plausible that it is hard to make a call as to whether the question is working. Judge and Schechter show that in Paraguay, reports of church contributions were out of line with Benford’s law. Now it could be that folks forgot, but it’s more likely that this was not something they wanted to talk about accurately.
 
Another interesting point that comes out of the Paraguay examination is that, since this survey has more detailed metadata, they can look at whether having the principal investigator present makes a difference. Turns out: not in this case.
 
Now sometimes I throw a question on the end of a survey (or tag it to a particular question) asking the enumerator for their impression of whether the respondent is answering truthfully, or ask the enumerator to gauge the respondent’s responsiveness. Judge and Schechter take a look at reporting in a Bangladesh dataset where enumerators were asked to judge the quality of the interview. Comparing the enumerators’ responses to this with data on crops produced and animals owned, it seems like the enumerators get it exactly…wrong. The worse they think the interview went, the better those variables comply with Benford’s law. Hmmmm.
 
Do women or men give answers more in line with Benford’s law? Five of Judge and Schechter’s datasets allow them to take a crack at this. It turns out that for the developing countries, there is no difference between male and female farmers’ responses. But in the US, female answers (for whatever reason) are less in line with Benford’s law.
 
Finally, Judge and Schechter take a look at quantities of crops harvested and numbers of animals owned across their nine datasets. The datasets collected by academics score among the top, while some of the government/international organization datasets don’t do as well. Now they don’t have a massive number of datasets, so it would be unfair to draw wide conclusions from this. And, of course, the one variable they seem to be missing is whether or not a grad student whose dissertation was riding on the data worked on the dataset. Two of their higher-performing datasets meet this condition, which makes me think that I need to find more grad students because it’s clearly all downhill after your PhD.
 
Postscript – if you would like to try Benford’s Law on a dataset you have, John Morrow has a web-based application that you can find here.     
 

Authors

Markus Goldstein

Lead Economist, Africa Gender Innovation Lab and Chief Economists Office

Jason Kerwin
July 18, 2014

There is also a Stata package called firstdigit that will run Benford's Law tests for you: http://ideas.repec.org/c/boc/bocode/s456842.html. I used firstdigit as a first pass at evaluating the UNODC's official statistics on murder rates for African countries (http://nonparibus.wordpress.com/2013/01/20/can-the-unodcs-murder-statis…). That analysis led me to the eventual discovery that they are interpolated by WHO (based mostly on South African data) and not actual measurements.
In my experience from playing with different datasets, it's not always clear that the problems with survey data arise from intentional faking. For example, I was looking at some income data, and the departure from the expected monotone decrease was due to a spike at 5 - consistent with heaping/rounding in respondents' reports. On the other hand, that is still a problem with data quality that we should be concerned with.

Markus Goldstein
July 21, 2014

Jason,
your point on faking vs rounding is spot on. Judge and Schechter make this point in their paper -- Benford is useful both for checking for faking (identifying deviant data across enumerators on the same variable) and for flagging questions which attract a less than precise answer (looking at variables which stand out in a dataset for all enumerators).
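
A crude Python sketch of that first check (mine, with made-up column names) would score each enumerator's deviation from Benford on the same variable:

```python
import math
from collections import Counter

from scipy.stats import chisquare

def benford_stat(values):
    """Chi-square statistic of a variable's first digits against Benford's law."""
    digits = [int(f"{abs(v):.15e}"[0]) for v in values if v > 0]
    counts = Counter(digits)
    observed = [counts.get(d, 0) for d in range(1, 10)]
    expected = [len(digits) * math.log10(1 + 1 / d) for d in range(1, 10)]
    return chisquare(observed, f_exp=expected).statistic

# With a pandas DataFrame df of responses (column names here are hypothetical):
# per_enum = df.groupby("enumerator_id")["crop_production_kg"].apply(benford_stat)
# per_enum.sort_values(ascending=False)  # enumerators with the largest deviations first
```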

AT
July 16, 2014

See Birnbaum et al. (2013)
"Using behavioral data to identify interviewer fabrication in surveys"
http://dl.acm.org/citation.cfm?id=2481404
Greetings from IPA!