Published on Development Impact

False positives in sensitive survey questions?

Berk Özler

July 17, 2017

This is a follow-up to my earlier blog on list experiments for sensitive questions, which, thanks to our readers, generated many responses via the comments section and emails: more reading for me – yay! More recently, my colleague Julian Jamison, who is also interested in the topic, sent me three recent papers that I had not been aware of. This short post discusses those papers and serves as a coda to the earlier post…

Randomized response techniques (RRT) are used to obtain more valid data than direct questioning (DQ) on sensitive topics, such as corruption, sexual behavior, etc. Using a randomizing device, such as dice, they introduce noise into the respondent’s answer, concealing her individual answer to the sensitive question while still allowing the researcher to estimate the overall prevalence of the behavior in question. These are attractive in principle, but in practice, as we have been trying to implement them in field work recently, one worries about implementation details and the cognitive burden on the respondents: in real life, it’s not clear that they provide enough of an advantage over DQ to warrant their use.

A variant of RRT, I came to learn from the papers I will summarize below, is the crosswise model (CM RRT), which makes use of two “yes/no” questions presented to the respondent simultaneously: e.g. “Last time you had sex, was it unprotected?” and “Was your mother born in January or February?” The respondent can either say, “the answers to both questions are the same” (i.e. both “yes” or both “no”), or “the answers are different” (one “yes” and one “no”). This, at least to me, seems easier to explain, does not require a randomizing device, and is a nice way around the problem of forcing someone to say “yes” to the socially undesirable statement in the forced response RRT variants.
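To make the mechanics concrete, here is a minimal sketch of how a prevalence estimate can be backed out of crosswise answers. It assumes that everyone follows the instructions, that the unrelated question is independent of the sensitive one, and that the probability of a “yes” on the unrelated question is known. The value 1/6 for the birth-month question is my rough assumption (two months out of twelve), and the function name and the counts are made up for illustration.

```python
import math


def crosswise_estimate(n_same, n_total, p_unrelated):
    """Estimate the prevalence of the sensitive trait from crosswise answers.

    n_same      -- respondents who said "the answers are the same"
    n_total     -- all respondents
    p_unrelated -- assumed known P("yes") on the unrelated question
    """
    if abs(2 * p_unrelated - 1) < 1e-12:
        raise ValueError("p_unrelated = 0.5 makes the model unidentified")
    lam = n_same / n_total  # observed share of "same" answers
    # P(same) = pi * p + (1 - pi) * (1 - p), solved for pi:
    pi_hat = (lam + p_unrelated - 1) / (2 * p_unrelated - 1)
    # Sampling standard error under full compliance
    se = math.sqrt(lam * (1 - lam) / n_total) / abs(2 * p_unrelated - 1)
    return pi_hat, se


# Hypothetical data: 700 of 1,000 respondents answer "the same"
pi_hat, se = crosswise_estimate(700, 1000, 1 / 6)
print(f"estimated prevalence: {pi_hat:.3f} (SE {se:.3f})")
```

No individual answer reveals anything for sure, because a “same” can come from either a (yes, yes) or a (no, no) pair; only the aggregate share is informative.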

In one of a series of three papers, Höglinger and co-authors (Diekmann in two and Jann in two) show that the CM RRT works pretty well in increasing the prevalence of reported sensitive behaviors. They have three important conclusions:

First, variations of RRT (even changing the randomizing device) can affect the estimates of sensitive behaviors. This is obviously undesirable for researchers…
Second, some of the commonly used variants of RRT, such as the forced response, did not produce higher prevalence than DQ: in fact, in some cases, they produced negative differentials. The cognitive burden and the reluctance to (falsely) say “yes” may have been the culprits…
Third, the CM did produce higher prevalence estimates than DQ, statistically significant in some cases, making this variant quite promising – with a caveat that it was also the noisiest of the methods trialed, producing high standard errors.
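Part of the noisiness in the third point is built into the design: the unrelated question adds variance by construction. The sketch below compares the textbook sampling standard error of the crosswise estimator with that of a direct question at the same sample size. It assumes full compliance and a known “yes” probability on the unrelated question; the numbers are illustrative, not taken from the papers, and the standard errors reported there may well reflect more than this.

```python
import math


def cm_to_dq_se_ratio(prevalence, p_unrelated):
    """Ratio of crosswise-model SE to direct-question SE at equal sample size.

    Both standard errors scale with 1/sqrt(n), so n cancels out.
    """
    if not 0 < prevalence < 1:
        raise ValueError("prevalence must be strictly between 0 and 1")
    if abs(2 * p_unrelated - 1) < 1e-12:
        raise ValueError("p_unrelated = 0.5 makes the model unidentified")
    # Expected share of "same" answers under full compliance
    lam = prevalence * p_unrelated + (1 - prevalence) * (1 - p_unrelated)
    se_cm = math.sqrt(lam * (1 - lam)) / abs(2 * p_unrelated - 1)
    se_dq = math.sqrt(prevalence * (1 - prevalence))
    return se_cm / se_dq


# Assumed values: true prevalence 20%, unrelated "yes" probability 1/6 or 0.4
for p in (1 / 6, 0.4):
    print(f"p = {p:.2f}: CM standard error is "
          f"{cm_to_dq_se_ratio(0.2, p):.2f} times the DQ one")
```

The closer the unrelated probability is to 0.5, the better the privacy protection, but the larger the standard error: that trade-off is inherent to the method.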

Great, right? Unfortunately, not so fast: the conclusion that CM RRT works better than DQ is based on the assumption of what the authors call “more is better.” Because we’re asking about behaviors undesirable to report, the assumption is that they’re underreported, so that the higher prevalence is interpreted as closer to the truth. But what if, for some reason, the RRT methods were not only reducing false negatives (our primary aim in employing them) but also increasing false positives (not our aim at all: people reporting sensitive behaviors they have not engaged in)? Let’s leave the question of why on earth that would happen aside for a moment and see what the authors do to unearth the existence of “false positives.”

For the second paper, the authors recruited respondents online using the Amazon Mechanical Turk (AMT) platform. In addition to the regular sensitive questions about shoplifting, tax evasion, and not voting in the 2012 US elections, the participants were offered a chance to win $2 in two dice games. In the “prediction game,” the respondents are asked to predict the roll of a digital die and are rewarded on their self-reported success. Because the prediction is private, cheating cannot be detected. In the “roll-a-six game,” the participants are rewarded if the roll of the digital die is a six. They’re rewarded on their report of whether they rolled a six or not, but this time the researchers know the outcome. The respondents were not told that the actual rolls of the die would be tracked, but it was clear to them that this was possible. They are then asked about whether they cheated in each of these games, with the expectation that people would be more likely to give truthful answers in the “roll-a-six game” than the “prediction game.”

What happened? Well, comparing the CM RRT with the other variants on these two questions, if we were simply looking at overall prevalence, we would have concluded that the CM RRT works better than the others. However, comparing the two games, in one of which the truth is known, the authors show that the CM RRT produced false positive rates of 11-12% (rates of reported cheating among people who did not cheat) for the two games: the reason prevalence is consistently higher in this variant is largely due to this phenomenon. While the true positive rates are higher in this variant, the false positive rates are also higher, making the correct classification rates lower in the CM RRT than in DQ.
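A small worked example shows how a method can look better on aggregate prevalence and still classify people worse. All the numbers below are hypothetical, chosen by me only to mimic the pattern described above (a higher true positive rate combined with a false positive rate of about 12%); they are not the estimates from the paper.

```python
def evaluate(true_prevalence, true_positive_rate, false_positive_rate):
    """Return (reported prevalence, share of respondents correctly classified)."""
    actual_yes = true_prevalence
    actual_no = 1 - true_prevalence
    # Reported prevalence mixes true positives with false positives
    reported = actual_yes * true_positive_rate + actual_no * false_positive_rate
    # Correct classification counts true positives and true negatives
    correct = (actual_yes * true_positive_rate
               + actual_no * (1 - false_positive_rate))
    return reported, correct


TRUE_PREVALENCE = 0.20  # assumed share who actually cheated

# Hypothetical rates: DQ misses many cheaters but accuses no one falsely
dq_reported, dq_correct = evaluate(TRUE_PREVALENCE, 0.40, 0.00)
# Hypothetical rates: CM catches more cheaters but flags 12% of the honest
cm_reported, cm_correct = evaluate(TRUE_PREVALENCE, 0.50, 0.12)

print(f"DQ: reported {dq_reported:.3f}, correctly classified {dq_correct:.3f}")
print(f"CM: reported {cm_reported:.3f}, correctly classified {cm_correct:.3f}")
```

With these made-up rates, the CM's reported prevalence lands almost exactly on the true 20%, yet it classifies fewer respondents correctly than DQ, because most of the extra “yes” answers come from people who did not cheat.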

While this is distressing, one could argue that the fact that the respondents knew that the researchers knew they were lying makes this a bit of an odd case that may not generalize to other settings. The authors are aware of this criticism and acknowledge it in the concluding section. However, they also state:

“The CM provided a prevalence estimate that came closest to the true prevalence. Hence, one could again conclude that the CM has superior validity. The analysis at the individual level, however, revealed that this is a false conclusion. The CM came close to the true prevalence primarily because it misclassified some of the non-cheating respondents as cheaters. That is, our study not only shows that the CM might not be as promising as suggested by previous studies …, it also points to a general weakness in past research on sensitive question techniques. Because complicated misreporting patterns are possible, we must be very cautious when interpreting results from comparative evaluation studies employing the more-is-better assumption, from validation studies that rely on aggregated prevalence validation, or from one-sided validation studies in which the sensitive trait or behavior applies to all or none of the respondents. We argue that an integral evaluation of the performance of a sensitive questioning technique is only possible if answers can be validated at the individual level so that false negatives and false positives can be disentangled.”

In a final paper to address some of the concerns with the study above, the authors introduce a study design that can detect false positives without requiring validation data at the individual level. In this “(near) zero-prevalence sensitive questions” technique, the authors include a couple of questions, the population prevalence of which (in Germany) is known to be (near) zero: whether the respondent has ever received an organ donation and whether the respondent has suffered from Chagas disease. Lo and behold, they get significant positive prevalence for these events, around 4-8%, which is enough to cast doubt on the higher prevalence obtained on other items such as excessive drinking (12% over and above DQ). Again, this does not solve all the problems: the unrelated questions could be problematic (meaning that they don’t produce the correct prevalence); “yes/no” is a bad design for some of these questions, which should include a “do not know” option (say, for the Chagas question); the respondents may get confused about the instructions for some of the variants; etc. Better designed studies can address some of these, but it is not clear that they’d be enough to remove the uncertainty surrounding the topic.
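Why would a zero-prevalence item come back positive? One candidate mechanism, and it is only my illustrative assumption rather than something established in this post, is that a share of confused respondents pick “same” or “different” at random. The sketch below computes what the crosswise estimator would return on average in that case; the function name and parameter values are hypothetical.

```python
def expected_cm_estimate(true_prevalence, p_unrelated, share_random):
    """Expected crosswise estimate when some respondents answer at random.

    share_random -- assumed share who pick "same"/"different" with a coin flip
    """
    if abs(2 * p_unrelated - 1) < 1e-12:
        raise ValueError("p_unrelated = 0.5 makes the model unidentified")
    # Share of "same" answers among respondents who follow the instructions
    lam_compliant = (true_prevalence * p_unrelated
                     + (1 - true_prevalence) * (1 - p_unrelated))
    # Random responders say "same" half of the time
    lam_observed = (1 - share_random) * lam_compliant + share_random * 0.5
    # Standard crosswise estimator applied to the contaminated share
    return (lam_observed + p_unrelated - 1) / (2 * p_unrelated - 1)


# Zero-prevalence item, 12% random responders, unrelated P("yes") of 1/6
estimate = expected_cm_estimate(0.0, 1 / 6, 0.12)
print(f"expected estimate on a zero-prevalence item: {estimate:.3f}")
```

Algebraically, the expected estimate is (1 - r) times the true prevalence plus r/2, where r is the share answering at random, so random answering pulls every estimate toward 50%. Under this assumption, a false positive rate of 4-8% on a zero-prevalence item would correspond to roughly 8-16% of respondents answering at random, and for a rare behavior the same contamination would look like “more is better.”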

That uncertainty? Whether employing these methods is worth your time and effort given that they may not produce more valid answers than direct questioning – even though they will almost certainly produce different answers, likely in the expected direction.

Authors

Berk Özler

Lead Economist, Development Research Group, World Bank
