When bad people do good surveys

By Markus Goldstein
So there I was, a graduate student doing my PhD fieldwork.    In the rather hot office at the University of Ghana, I was going through questionnaire after questionnaire checking for consistency, missed questions and other dimensions of quality.   All of a sudden I saw a pattern:  in the time allocation questions, men in one village seemed to be doing the exact same things, for the same amount of time, on two very different days of the week.  
To make a long story short, the male enumerator in that village was fabricating that data.    Further investigation revealed that he was not asking the respondents about the second day, but just copying over their answers from the first day.   This was my first introduction to the production of faked data.  
Now, if you would rather not think about where our data comes from and what might be wrong with it, you should stop reading this post now.    But if you do want to know, an interesting new paper by Arden Finn and Vimal Ranchhod gives us a lot to think about. 
Finn and Ranchhod tackle this issue in the context of South Africa.   And they are working with a dataset where this issue was caught, and the respondents re-interviewed.     This allows them to show us how the fabrication might matter for analyses.   But before we get to this, there are a bunch of other things to consider.  
First:  What are we talking about when we say faking data?     Data can go wrong for a host of reasons.    Here the focus is on cheating or negligence on the part of the enumerators (fieldworkers) collecting the data.   This can manifest in a number of ways:  skipping whole interviews, skipping sections, changing one answer so that you get to skip a whole section (job? no no, you don't have a job), and ignoring the existence of household members.   Enumerators can do this for a bunch of reasons:  some questions can be hard to ask (e.g. have you been unfaithful to your partner), some sections are really long, households can be far away or in a less than pleasant area, and the survey remuneration structure may prioritize speed over accuracy (e.g. payment for surveys completed and no large penalties for cheating).  
So is this a widespread problem?   Finn and Ranchhod take us on a tour of a number of large, prominent surveys in South Africa.  Two observations.  First, I am psyched to see that a significant number of people are paying attention to this and have the courage to be open about it.   Second, we should be worried.   They document errors ranging from enumerators gaming the payoffs for time preference questions (in likely collusion with the respondents) to enumerators shirking on the within-household sampling (crippling the representativeness of the sample).  And the enumerators are sophisticated about this.   I used to think telephoning a subset of interviewees after the interview was a good way to check, but Finn and Ranchhod cite one example of an enumerator who set up her sister-in-law to answer these calls.   So yes, cheating happens.   And when it does, it can do everything from reducing sample size to wiping out certain variables to killing the representativeness of a survey.  
Finn and Ranchhod provide a really useful set of tools to track enumerator cheating when they turn their attention to wave 2 of the National Income Dynamics Study (NIDS), with which they were both involved.    They provide 9 possible methods to detect cheating.   Only two of them prove useful in their context, but given technology differences as well as contextual differences, it is worthwhile going through all of them (keep in mind that the application here is to the second wave of a panel survey):
  1. Number of deaths across waves.  Fewer people = faster surveys.   Enumerators with abnormally high mortality among their respondents are suspect.
  2. Number of refusals/not available.   No faster way to get through your list than not finding people.  (and don't think that paying only for completed interviews solves this -- the hard to get households might still not be worth it)
  3. Look for folks disproportionately activating skip codes that get them out of a lot of additional questions.  (South African problem with this:  high unemployment means that a lot of legitimate interviews use the skip codes)
  4. Look at the length of the interview.    (Even though the NIDS wave 2 was CAPI (computer based), the time stamp for finishing an interview was manually activated and a bunch of the enumerators only activated it when they were done for the day and ready to upload).
  5. Use the GPS coordinates to check where the interview happened.   (this will work if this is automated in your data collection, if it's manual the main flaw is that the enumerator needs the GPS location to find the household in wave 2)
  6. Compare the signatures on consent forms across waves  (way too labor intensive)
  7. Look for low-rates of people joining the household through in-migration or birth.
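
Most of the checks in the list above boil down to the same exercise: compute a rate per enumerator (deaths, refusals, skip codes) and see who sits far away from everyone else. Here is a minimal sketch of that idea, not the procedure from the paper. The function name `flag_enumerators`, the record layout and the cutoff of 3 are all my own assumptions for illustration.

```python
from collections import defaultdict
from math import sqrt


def flag_enumerators(records, threshold=3.0):
    """Flag enumerators whose rate on some indicator is unusually high.

    records: iterable of (enumerator_id, hit) pairs, where hit is True if
    the interview showed the indicator (e.g. a refusal or a skip code).
    Returns {enumerator_id: z_score} for enumerators above the threshold.
    The threshold of 3.0 is an arbitrary illustrative choice.
    """
    counts = defaultdict(lambda: [0, 0])  # enumerator -> [hits, interviews]
    for enum_id, hit in records:
        counts[enum_id][0] += int(hit)
        counts[enum_id][1] += 1

    total_hits = sum(h for h, _ in counts.values())
    total_n = sum(n for _, n in counts.values())

    flagged = {}
    for enum_id, (hits, n) in counts.items():
        rest_n = total_n - n
        if rest_n == 0:
            continue  # nobody to compare against
        # Benchmark: the pooled rate among all *other* enumerators.
        p_rest = (total_hits - hits) / rest_n
        se = sqrt(p_rest * (1 - p_rest) / n)
        if se == 0:
            continue  # benchmark rate is 0 or 1, z-score undefined
        z = (hits / n - p_rest) / se
        if z > threshold:
            flagged[enum_id] = round(z, 2)
    return flagged
```

For a check like number 7, where the suspicious direction is a rate that is too low, you would flip the sign of the comparison. And as the skip-code example shows, a flag is only a reason to go verify: enumerators who work in different areas can have legitimately different rates.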
So while these didn’t work in the NIDS context, the two ways that actually paid off were to use Benford's law and to compare anthropometric measures across survey waves.  
Benford's law is interesting.   Going back to Benford's original 1938 paper, the basic argument is that in the data that is out there, naturally occurring, there is a pattern to the first digit of numbers.   And while you might think the pattern is a uniform distribution -- it isn't. It turns out to follow a logarithmic distribution, ranging from 30.1% of the first digits being 1 to 4.6% of them being 9.   Using this distribution, you can find deviations and hence enumerators to target for verification.     (I am going to do my next post on Benford's law, so I will save more detailed discussion for later)
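
To make the logarithmic distribution concrete: the expected share of first digit d is log10(1 + 1/d), which gives the 30.1% and 4.6% figures above. The sketch below computes that distribution and a simple chi-square distance between one enumerator's reported values and Benford's law. The helper names are hypothetical, and the chi-square comparison is just one common way to measure the deviation, not necessarily the test used in the paper.

```python
from collections import Counter
from math import log10

# Expected share of each first digit under Benford's law.
BENFORD = {d: log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(value):
    """Return the leading non-zero digit of a number, or None if it has none."""
    text = str(abs(value)).lstrip("0.")
    return int(text[0]) if text and text[0].isdigit() else None


def benford_chi2(values):
    """Chi-square distance between the first digits of values and Benford's law.

    Larger numbers mean a worse fit. With 8 degrees of freedom, the usual
    5% critical value is about 15.5.
    """
    digits = [d for d in map(first_digit, values) if d is not None]
    n = len(digits)
    if n == 0:
        return None
    observed = Counter(digits)
    return sum(
        (observed.get(d, 0) - n * p) ** 2 / (n * p) for d, p in BENFORD.items()
    )
```

In practice you would run `benford_chi2` separately on each enumerator's income or expenditure amounts and look more closely at the enumerators with the largest values. Keep in mind that the law only fits variables that span several orders of magnitude, so something like age or household size is not a good candidate.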
The anthropometric checks Finn and Ranchhod use are: 1) looking for systematic (at the enumerator level) outliers in BMI values, 2) a lot of adults who changed height from round 1 to round 2, 3) mean BMI change from round 1 to round 2 (by enumerator) and 4) spikes in the weight distribution by enumerator (here they are looking for heaping around easy numbers). 
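
Two of these checks are easy to sketch in a few lines: the share of adults whose height changed between rounds, and heaping of weights on easy numbers. The code below is an illustration under my own assumptions, not the paper's implementation. The field names, the 2 cm tolerance and the choice of "multiples of 5 kg" as the easy numbers are all hypothetical.

```python
from collections import defaultdict


def heaping_share(weights_kg):
    """Share of recorded weights that are exact multiples of 5 kg.

    Assumes the scale reports to 0.1 kg, in which case honest measurement
    should put only a small share of values exactly on 5, 10, 15, ...
    """
    on_heap = [abs(w / 5 - round(w / 5)) < 1e-9 for w in weights_kg]
    return sum(on_heap) / len(on_heap)


def height_change_share(pairs, tolerance_cm=2.0):
    """Share of adults whose height differs across rounds by more than tolerance.

    pairs: (height_round1_cm, height_round2_cm) per adult. The 2 cm
    tolerance is an assumed allowance for ordinary measurement error.
    """
    return sum(abs(h2 - h1) > tolerance_cm for h1, h2 in pairs) / len(pairs)


def by_enumerator(rows, key, stat):
    """Group rows by their 'enumerator' field and apply stat to each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["enumerator"]].append(key(row))
    return {enum_id: stat(values) for enum_id, values in groups.items()}
```

For example, `by_enumerator(rows, key=lambda r: r["weight_kg"], stat=heaping_share)` gives one heaping share per enumerator, which you can then compare across the team. An enumerator who stands out may be fabricating, or may simply be rounding or using the equipment incorrectly, so the number tells you where to look rather than what happened.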
These checks reveal some enumerators where the data looks decidedly dodgy.   And, of interest for those of us who work on surveys, the dodginess was concentrated in teams (including the team’s leader).  So the NIDS oversight team did some intensive call backs (no sisters-in-law included).    They managed to contact 781 of the 991 households that needed verification.   Of these, 234 were ok.   But 547 had problems: 223 partial fabrications, 322 total fabrications and 2 unclassifiable.   And that's about 7.3 percent of the respondents for the wave (or 10 percent of the enumerators).  
The NIDS team went back and re-interviewed these folks, and this gives Finn and Ranchhod both a clean and dirty version of the data to compare in the analysis.   They look at employment and health variables.   They find that the univariate statistics (e.g. the mean) are fairly unaffected by this level and form of falsification.    For transition matrices and first difference regressions, the fabrication matters more: while not resulting in large changes in absolute values, it can lead to qualitatively different conclusions.  
These results give us one set of insights into how falsification of data might matter.    Clearly, the way in which the data goes wrong matters.    In the examples further above, there are some cases where the answer is totally wrong, and others where the sample is smaller but the variables aren't compromised (particularly when enumerators cover similar distributions of respondents).  
All in all, this is further insight into how the sausage we call data is made.    Unfortunately, in this case, we can't skip the meat entirely.   It's about how to get a better hot dog -- and I will tackle that further in my next post.   Have a good July 4th. 


Submitted by Jeff Weaver on

Great post, and one that most grad students starting out in development research could really benefit from. Something that this post brings to mind, and that researchers might do well to note, is that it can be really beneficial to check the distribution of responses to questions by surveyor for all questions, not just the ones that activate skip patterns or might be prone to fraud. There are some questions where a certain amount of probing is required to get a proper answer, such as "how many loans have you taken in the last month?" or "If you needed a quick loan of $100 who could you ask?". While we try to avoid that as much as possible in survey design, it is usually unavoidable. Surveyors may exert different amounts of effort in getting a response, ranging from too little (no loans) to too much (respondent feels pressured to make up loans). Probing too much or too little isn't necessarily a sign of fraud, but it is something that you want to standardize across surveyors; looking at your data after a week can help you figure out which surveyors you need to retrain to probe more or less. A second benefit is that surveyors sometimes misunderstand subtle distinctions in questions without deliberately meaning to commit fraud. For example, I had a survey where we asked women if they had received antenatal care. When we looked at our data, we realized that one surveyor was putting "yes" much more often than others because she had misunderstood what antenatal care entailed. Our supervisors/backcheckers hadn't yet picked this up, since the misunderstanding was only apparent some of the time. Finally, this can help in giving specific feedback to surveyors in a way that lets them know that you are checking on them carefully, helping to lower the incidence of fraud, but is more constructive than conversations revolving around backchecks, which in my experience often create a lot of tension.

In general, I think that this is one of the main benefits of electronic surveying: getting the data back instantly allows you to find patterns that otherwise could be missed in scrutiny and even backchecking. With backchecking, the sample size is usually too small to detect these subtle errors, especially since backcheck forms often omit the questions where we know the responses are likely to be unstable. And researchers are looking into ways to improve fraud detection within survey software in some pretty interesting ways, e.g.

Great comments Jeff, thanks.    Indeed, checking for anomalous distributions is key -- one example that was in the Finn and Ranchhod paper was of one enumerator who wasn't using the anthropometric equipment right -- that wasn't fabrication, it was a case for retraining.    Getting this early is really helpful.   

I also wholeheartedly second your point on optimal probing -- it's a tough spot to find -- if you push too much not only do respondents make things up, they sometimes get really annoyed.   We had this with an information question we were asking -- where we really wanted the team to probe.   It turns out there was an underlying norm about not explicitly discussing certain things (between individuals in the community, not with the survey team) and that to admit to discussing this was a no-no. 