
Sex, Lies, and Measurement: Do Indirect Response Survey Methods Work? (No…)

Berk Ozler

Smart people, mainly with good reason, like to make statements like “Measure what is important, don’t make important what you can measure,” or “Measure what we treasure and not treasure what we measure.” It is rumored that even Einstein weighed in on this by saying: “Not everything that can be counted counts and not everything that counts can be counted.” A variant of this has also become a rallying cry among those who are “anti-randomista,” to agitate against focusing research only on questions that one can answer experimentally.

However, I am confident that all researchers can generally agree that there is not much worse than the helpless feeling of not being able to vouch for the veracity of what you measured. We can deal with papers reporting null results, we can deal with messy or confusing stories, but what gives no satisfaction to anyone is to present some findings and then have to say: “This could all be wrong, because we’re not sure the respondents in our surveys are telling the truth.” This does not mean that research on sensitive topics does not get done, but like the proverbial sausage, it is necessary to block out where the data came from and how it was made.

“But, hold on,” you might say: “don’t we have survey methods to elicit the truth from people when we’re asking them about sensitive topics?” You would be right that such methods do exist, they are increasingly used by researchers, and we have written about them in this blog before. My general conclusion was that while various indirect response survey methods produce different results than direct response methods, we’re not quite sure which of those methods, if any, is producing the correct answer. Now, a simultaneously clever and humble paper by Chuang, Dupas, Huillery, and Seban (2019) adds to the pessimism about the promise of these methods, but at the same time provides researchers with simple tools to assess whether they’re working or not. [The title of this blog is the same as the title of the paper (I simply added the answer to the question): try as I might, I could not come up with a better title for the blog than the authors…]

What is a List Experiment (LE)?

I’ll let the authors tell you:

“In LE, a sample is randomized into two groups, A and B – one which receives a list of non-targeted, non-sensitive (“baseline”) statements (e.g. I1, I2, I3, I4), the other which receives the same list of baseline statements plus one extra sensitive statement (S). S is the object of interest: the researcher wants to gauge the prevalence of S behaviors. The respondent provides the enumerator the number of statements that are true without indicating how true any one statement is, and the difference between the means of the sensitive versus non-sensitive list recipients is the estimated prevalence of the behavior the researcher is trying to measure.”
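To make the estimator in the quote concrete, here is a minimal sketch in Python. The function name and the toy answer counts are made up for illustration; they are not taken from the paper.

```python
from statistics import mean


def list_experiment_prevalence(control_counts, treatment_counts):
    """Estimated prevalence of S: mean count in the group that saw the
    baseline statements plus S, minus the mean count in the group that
    saw only the baseline statements."""
    return mean(treatment_counts) - mean(control_counts)


# Hypothetical answers: how many statements each respondent says are true.
control = [1, 2, 2, 3, 1, 2]    # saw I1..I4 only
treatment = [2, 2, 3, 3, 1, 3]  # saw I1..I4 plus S

estimate = list_experiment_prevalence(control, treatment)
print(f"Estimated prevalence of S: {estimate:.2f}")
```

Nobody ever reveals their answer to S itself: the estimate comes only from comparing the two group means, which is why the method needs randomization and a reasonably large sample.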


Can we test the validity of LE?

There is no reason why each group needs to get only one list. For the same question S, you could give group A (I1, I2, I3, I4) and group B (I1, I2, S, I3, I4), while at the same time giving group A (e.g. I5, S, I6, I7, I8) and group B (e.g. I5, I6, I7, I8). Since groups A and B are randomly selected (and, of course, you are running a statistically well-powered study), you should get similar prevalence levels for S using B-A from the first set and A-B from the second set.
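One convenient way to think about the comparison, sketched below under my own assumptions rather than as the authors' procedure: the estimate from the first set is mean_B(Y1) - mean_A(Y1) and the estimate from the second set is mean_A(Y2) - mean_B(Y2), so their difference is simply mean_B(Y1 + Y2) - mean_A(Y1 + Y2). Testing whether the two estimates agree therefore reduces to a two-sample comparison of each respondent's total count across both lists. The function name, the toy data, and the normal approximation for the p-value are all illustrative choices.

```python
import math
from statistics import mean, variance


def double_list_test(total_a, total_b):
    """Test whether the two list-experiment estimates of S agree.

    total_a, total_b: per-respondent totals (count on list 1 plus count
    on list 2) for groups A and B. Returns the difference between the
    two prevalence estimates, a z statistic, and a two-sided p-value
    based on a normal approximation.
    """
    diff = mean(total_b) - mean(total_a)
    se = math.sqrt(variance(total_a) / len(total_a)
                   + variance(total_b) / len(total_b))
    z = diff / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return diff, z, p_value


# Hypothetical totals for six respondents per group.
group_a = [3, 4, 5, 4, 3, 5]
group_b = [4, 4, 5, 3, 4, 4]
diff, z, p_value = double_list_test(group_a, group_b)
print(f"difference = {diff:.2f}, z = {z:.2f}, p = {p_value:.3f}")
```

Working with per-respondent totals sidesteps the fact that the two estimates come from the same people and are therefore not independent of each other.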

Furthermore, as you might have more than one sensitive question you might be interested in, such as S1, S2, S3, …SN, you can vary the degree to which the innocuous baseline statements are non-sensitive across these lists. If the innocuous statements are too silly (I like papayas), the sensitive statement (I cheated on my taxes) might stand out too much. Having statements that are not (or mildly or borderline) sensitive, but close to the subject matter, may then change the prevalence estimates.

Acknowledging that the idea of providing two lists per sensitive question is not new, the authors make two useful contributions to the field. First, using 12 sensitive questions, i.e. 12 lists for each group, they examine whether the prevalence of S is statistically indistinguishable between the two sets. Testing the difference between the two sets rather than averaging them to reduce variance is what is novel here. Second, by analyzing the data according to how sensitive the baseline questions are, they reveal clear patterns that are interesting and valuable.

Using data from a survey in Côte d’Ivoire, Table 1 shows that, of the 12 hypothesis tests of A-B=B-A among females (males), five (four) are rejected with p-values<0.01, and another two (two) are rejected with p-values<0.1. Setting statistical significance aside, the differences in the prevalence of each S between the two sets are substantive. Furthermore, set B produces, on average, higher prevalences of sensitive behaviors, likely because it contains more sensitive baseline questions.

This is not good news for LE, in that it is not internally consistent. The fact that the answer is dependent on what the baseline questions are does not bode well for having confidence that the method has moved us closer to the truth than direct response (DR). The reader should note that validity is a necessary condition, but not sufficient to get at the "truth." The value of the test the authors propose is clearest when you fail. You get to say: “we tried the LE method, but it wasn’t even internally consistent in our sample.”

What is the Randomized Response Technique?

Again, the authors:
 

“Randomized Response technique (RRT) was first proposed by Warner (1965). In this method, a surveyor gives respondents a list of questions. The respondent is then given an instrument, for example a six-sided die, and instructed to tell the truth for the question(s) given if the die lands on a particular side, such as six–otherwise lie. In order to preserve anonymity of responses, the survey should be implemented such that the surveyor cannot see or learn which side the respondent landed on. As long as the probability p that the respondent is asked to be truthful (e.g., p = 1/6  for a six-sided die) is different from 50%, and assuming that people comply 100% with the protocol, it is possible to back-out the true prevalence s of the sensitive behavior as follows: the share r of individuals who report engaging in the behavior will be the sum of those that truly did it and those that did not do it but were told to lie..."


(see page 5 of the paper for the simple formula)
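For readers without the paper at hand, the relationship described in the quote can be reconstructed as follows. This is my own derivation from the protocol as quoted (truth with probability p, lie otherwise, full compliance assumed), so the paper's notation on page 5 may differ:

```latex
\begin{align}
r &= p\,s + (1-p)(1-s) \\
s &= \frac{r - (1-p)}{2p - 1}, \qquad p \neq \tfrac{1}{2}
\end{align}
```

The first line adds up those who did it and were asked to tell the truth, p·s, and those who did not do it and were asked to lie, (1-p)(1-s). The second line just solves for s, which is only possible when p is different from one half.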

Can we test the validity of RRT?

From the formula, it is obvious that if p = ½, then r should always be equal to 0.5 – regardless of the prevalence of s. So, if the researcher could implement the RRT with p=0.5 (using a coin toss or a die) for one of their sensitive questions, they could test whether r=0.5 or not. Note that, unlike the LE test, this simple test (which may require you to randomly split your sample for some loss in power) is necessary and sufficient to say whether this method worked in your setting.
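A rough sketch of both pieces in Python, assuming full compliance and using the relationship r = p·s + (1-p)(1-s) implied by the protocol quoted above. The function names, the sample size of 100, and the normal approximation are my assumptions for illustration, not the authors' code or data.

```python
import math


def rrt_prevalence(r, p):
    """Back out the prevalence s from the share r of 'yes' answers when
    respondents tell the truth with probability p and lie otherwise."""
    if p == 0.5:
        raise ValueError("with p = 0.5, r carries no information about s")
    return (r - (1 - p)) / (2 * p - 1)


def check_r_equals_half(n_yes, n):
    """Two-sided z-test of H0: r = 0.5, the share of 'yes' answers we
    should see under full compliance when p = 0.5."""
    r_hat = n_yes / n
    se = math.sqrt(0.25 / n)  # standard error of r_hat under H0
    z = (r_hat - 0.5) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return r_hat, z, p_value


# Hypothetical: 36 'yes' answers out of 100 respondents with a fair coin.
r_hat, z, p_value = check_r_equals_half(36, 100)
print(f"r_hat = {r_hat:.2f}, z = {z:.2f}, p = {p_value:.4f}")
```

With small samples an exact binomial test would be a safer choice than the normal approximation used here.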

Using data from an experiment some of the authors ran in Cameroon, the authors test this hypothesis for four sensitive statements about sexual behavior (Table 3). None of the four ratios even come close to 0.5, with the highest equal to 0.36 on average, and 0.38 when self-administered. When the authors disaggregate the findings by those who admitted to the behavior (rightly or wrongly) vs. those who did not (again, truth or not) when asked directly, the low prevalence is largely due to those who say they have not engaged in the sensitive behavior in question. This group includes those who are non-pseudo incriminators (those who will not like to say they did it) and self-protectors (those who do not trust the procedure to hide their true behavior) in unknown proportions. [In yet another useful exploration, the authors investigate this issue further in the Appendix.]

As in my previous blog post, Chuang et al. also find evidence for false positives, this time among those who have said that they engaged in the sensitive behavior: 58% of those people say they had sex without a condom, significantly and substantively above the expected 50% if people complied with the RRT design. In panel C, the authors find that the treatment (HIV education interventions) in their main experiment did not differentially affect the underreporting of these behaviors under RRT and go on to propose that experimenters should incorporate their proposed tests into their studies as treatments have the potential to affect compliance with RRT – just as they have the potential to affect DR.

The authors conclude by emphasizing that it is easy and relatively low-cost to include their proposed tests in surveys, but also that both techniques can easily fail to fulfill the validity conditions. They state that “Requiring self-incriminating responses from people who behaved according to the acceptable social norm may be the most important challenge for this technique,” which reminds me of the finding in Manian’s paper on the reluctance of sex workers for certification in Senegal, partly because of their unwillingness to accept, even if only admitting to themselves, “sex worker” identity.

The paper is short and has a lot of interesting side discussions, so you should definitely read it for yourself. I am already applying some lessons from it to ongoing field work...
 

Comments

Submitted by Aurelia on

Thanks for sharing this paper.
As you know, my co-author Carole Treibich and I have applied the list randomisation quite extensively recently and hence we were interested in the paper you posted.
We have read it and we do not think that the paper provides enough robust evidence to claim that the list experiment method has poor internal consistency. Looking at the paper, we saw several issues in the design of those different list randomisations and we are not convinced that any difference in the prevalence should be interpreted as failure of the list experiment method but we think it should be interpreted as failure in the design of these specific lists. Actually we believe that the main reason for discrepancy in results is the fact that the design of some of the lists did not provide confidentiality to respondents. For instance, it is surprising to see that sometimes the two lists give a P (prevalence) with an opposite sign; this can occur if participants from the treated group did not truthfully answer the non-sensitive items in order to keep confidentiality. If this is the case, it is a violation of one of the three hypotheses on which the list randomisation is based, which is called the "no design effect". The "no design effect" assumption can be verified by checking that the difference between the proportion of individuals in the treated group and the one in the control group who agreed with at least k statements is always positive. In addition, in order to make sure the list randomisation provided enough confidentiality to the respondent, one needs to check for the absence of floor and ceiling effects, in other words the proportion of individuals in the control group who disagree with all non-sensitive items or who agree with all of them must remain very low. If we look at the proportion of respondents who disagree with all items (Table A2), it is high for many lists, as high as 48% (list 7, set 1) while for the same behaviour elicited with set 2 only 4.7% of participants disagreed with all items, hence for them confidentiality was better guaranteed than for participants who answered list 1.
To avoid this issue, some of the non-sensitive items should be negatively correlated. This has been done in list 5 set 1 but not in the other list looking at the same behaviour which is list 17 set 2. In addition, another issue we could see is that the two lists do not have the same type of non-sensitive items. Some have non-sensitive items that are linked to the topic of the sensitive item (set 2) and some are completely disconnected (set 1). I agree with you about the fact that asking participants if they agree with the statement "I often eat fruits during the rainy season" and then asking about prevalence of transactional sex may lead to a different response on transactional sex than if the non-sensitive item is something that relates to sexual behaviours. So the two lists designed (set 1 and set 2) are not identical in their design, hence it is not surprising that they lead to different prevalence of the sensitive behaviour.
We recently implemented a double list randomisation to elicit the prevalence of unprotected sex among 600 sex workers in Senegal (the same setting as in the other paper you share). The prevalence of unprotected sex with list 1 was 22% and it was 21.6% with list 2. While we cannot claim that this is the true prevalence of unprotected sex, we believe that the list randomisation works very well if it is correctly designed, even in contexts where stigma is very high and where answering such questions might provide some disutility to participants.

Submitted by Pascaline Dupas on

Thanks for your comment Aurelia. We completely agree with you: the design of the list is key. What we argue is that it is difficult to be sure ex ante that the list was designed well -- but it is possible to check ex post. So our paper makes the very simple point that people should systematically use double list randomization as a way to check that their list was not bad -- or to hedge. In other words: if you have two different lists, and they give you the same result, as in your case, it's not sufficient to be *sure* it's correct, but at least it's reassuring. If they give different results, then you can try to think through why one list worked better than the other possibly, and at a minimum you know to be cautious in interpreting the results. Once enough people have done DLE we can do meta-analysis and start understanding even better how to design lists that work. In our case, the fact that we see a lot of "floor effects" with Set 1 (when the baseline items are totally innocuous) is evidence, we think, of the fact that the sensitive item stood out so much that people froze and answered "0" (even if the prevalence of the baseline items was far from zero). From our study we learn that having the sensitive items stand out too much is probably not a good idea -- we don't think this point had been made before. But let us know if you think this is already well known and what reference we should cite. Thanks again.
