Smart people, mainly with good reason, like to make statements such as “Measure what is important, don’t make important what you can measure,” or “Measure what we treasure and not treasure what we measure.” It is rumored that even Einstein weighed in on this by saying: “Not everything that can be counted counts and not everything that counts can be counted.” A variant of this has also become a rallying cry among those who are “anti-randomista,” to agitate against focusing research only on questions that one can answer experimentally.
However, I am confident that all researchers can generally agree that there is not much worse than the helpless feeling of not being able to vouch for the veracity of what you measured. We can deal with papers reporting null results, we can deal with messy or confusing stories, but what gives no satisfaction to anyone is presenting some findings and then having to say: “This could all be wrong, because we’re not sure the respondents in our surveys are telling the truth.” This does not mean that research on sensitive topics does not get done, but like the proverbial sausage, it is necessary to block out where the data came from and how it was made.
“But, hold on,” you might say: “don’t we have survey methods to elicit the truth from people when we’re asking them about sensitive topics?” You would be right that such methods do exist, they are increasingly used by researchers, and we have written about them in this blog before. My general conclusion was that while various indirect response survey methods produce different results than direct response methods, we’re not quite sure which of those methods, if any, is producing the correct answer. Now, a simultaneously clever and humble paper by Chuang, Dupas, Huillery, and Seban (2019) adds to the pessimism about the promise of these methods, but at the same time provides researchers with simple tools to assess whether they’re working or not. [The title of this blog is the same as the title of the paper (I simply added the answer to the question): try as I might, I could not come up with a better title for the blog than the authors…]
What is a List Experiment (LE)?
I’ll let the authors tell you:
“In LE, a sample is randomized into two groups, A and B – one which receives a list of non-targeted, non-sensitive (“baseline”) statements (e.g. I1, I2, I3, I4), the other which receives the same list of baseline statements plus one extra sensitive statement (S). S is the object of interest: the researcher wants to gauge the prevalence of S behaviors. The respondent provides the enumerator the number of statements that are true without indicating how true any one statement is, and the difference between the means of the sensitive versus non-sensitive list recipients is the estimated prevalence of the behavior the researcher is trying to measure.”
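To make the arithmetic in that description concrete, here is a minimal sketch of the difference-in-means estimator. The function name and the counts are hypothetical illustrations, not the authors' code or data, and the standard error assumes two independent random groups.

```python
from math import sqrt
from statistics import mean, variance


def list_experiment_estimate(baseline_counts, sensitive_counts):
    """Estimate the prevalence of S as the difference in mean counts.

    baseline_counts: number of true statements reported by respondents
        who saw only the baseline list (e.g. I1..I4).
    sensitive_counts: same, for respondents who saw the baseline list plus S.
    """
    estimate = mean(sensitive_counts) - mean(baseline_counts)
    # Standard error of a difference in means between two independent groups.
    se = sqrt(
        variance(sensitive_counts) / len(sensitive_counts)
        + variance(baseline_counts) / len(baseline_counts)
    )
    return estimate, se


# Hypothetical answers, invented purely for illustration.
baseline = [2, 1, 3, 2, 2, 1]
sensitive = [2, 2, 3, 3, 2, 2]
est, se = list_experiment_estimate(baseline, sensitive)
print(f"estimated prevalence of S: {est:.2f} (se {se:.2f})")
```

No individual answer reveals whether S is true for that respondent; only the gap between group means carries the information.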
Can we test the validity of LE?
There is no reason why each group needs to get only one list. For the same question S, you could give group A (I1, I2, I3, I4) and group B (I1, I2, S, I3, I4), while at the same time giving group A (e.g. I5, S, I6, I7, I8) and group B (e.g. I5, I6, I7, I8). Since groups A and B are randomly selected (and, of course, you are running a statistically well-powered study), you should get similar prevalence levels for S using B-A from the first set and A-B from the second set.
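One way to sketch this consistency check in code, under my own assumptions rather than the authors' exact procedure: because each respondent answers both lists, the two estimates are not independent, but their difference equals the difference between groups in the per-respondent total count across the two lists. So a simple two-sample z-test on those totals tests B-A (first set) against A-B (second set). The function name and data below are hypothetical, and the normal approximation is only reasonable in large samples.

```python
from math import erfc, sqrt
from statistics import mean, variance


def double_list_consistency(group_a, group_b):
    """Test whether the two list sets imply the same prevalence of S.

    group_a: (count on list 1 WITHOUT S, count on list 2 WITH S) per respondent
    group_b: (count on list 1 WITH S, count on list 2 WITHOUT S) per respondent
    """
    est1 = mean([y1 for y1, _ in group_b]) - mean([y1 for y1, _ in group_a])
    est2 = mean([y2 for _, y2 in group_a]) - mean([y2 for _, y2 in group_b])

    # est1 - est2 is algebraically the B-minus-A difference in mean totals.
    totals_a = [y1 + y2 for y1, y2 in group_a]
    totals_b = [y1 + y2 for y1, y2 in group_b]
    diff = mean(totals_b) - mean(totals_a)
    se = sqrt(variance(totals_a) / len(totals_a)
              + variance(totals_b) / len(totals_b))
    z = diff / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided, normal approximation
    return est1, est2, z, p_value


# Hypothetical answers, invented purely for illustration.
group_a = [(2, 3), (1, 2), (3, 3), (2, 2)]
group_b = [(3, 2), (2, 1), (3, 3), (2, 2)]
est1, est2, z, p_value = double_list_consistency(group_a, group_b)
print(est1, est2, z, p_value)
```

A small p-value here would mean the two sets disagree about the prevalence of S, which is the failure of internal consistency discussed in this section.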
Furthermore, as you might have more than one sensitive question you might be interested in, such as S1, S2, S3, …SN, you can vary the degree to which the innocuous baseline statements are non-sensitive across these lists. If the innocuous statements are too silly (I like papayas), the sensitive statement (I cheated on my taxes) might stand out too much. Having statements that are not (or mildly or borderline) sensitive, but close to the subject matter, may then change the prevalence estimates.
Acknowledging that the idea of providing two lists per sensitive question is not new, the authors make two useful contributions to the field. First, using 12 sensitive questions, i.e. 12 lists for each group, they examine whether the prevalence of S is statistically indistinguishable between the two sets. Testing the difference between the two sets rather than averaging them to reduce variance is what is novel here. Second, by analyzing the data according to how sensitive the baseline questions are, they reveal clear patterns that are interesting and valuable.
Using data from a survey in Côte d’Ivoire, Table 1 shows that, of the 12 hypothesis tests of A-B=B-A among females (males), five (four) are rejected with p-values
This is not good news for LE, in that it is not internally consistent. The fact that the answer is dependent on what the baseline questions are does not bode well for having confidence that the method has moved us closer to the truth than direct response (DR). The reader should note that validity is a necessary condition, but not sufficient to get at the "truth." The value of the test the authors propose is clearest when you fail. You get to say: “we tried the LE method, but it wasn’t even internally consistent in our sample.”
What is the Randomized Response Technique?
Again, the authors:
“Randomized Response technique (RRT) was first proposed by Warner (1965). In this method, a surveyor gives respondents a list of questions. The respondent is then given an instrument, for example a six-sided die, and instructed to tell the truth for the question(s) given if the die lands on a particular side, such as six–otherwise lie. In order to preserve anonymity of responses, the survey should be implemented such that the surveyor cannot see or learn which side the respondent landed on. As long as the probability p that the respondent is asked to be truthful (e.g., p = 1/6 for a six-sided die) is different from 50%, and assuming that people comply 100% with the protocol, it is possible to back-out the true prevalence s of the sensitive behavior as follows: the share r of individuals who report engaging in the behavior will be the sum of those that truly did it and those that did not do it but were told to lie..."
(see page 5 of the paper for the simple formula)
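For readers without the paper at hand, the verbal description in the quote can be written out as follows. This is the standard algebra for a Warner-style design, using the quote's symbols p, s and r; the paper's own presentation may differ in notation.

```latex
% Share reporting the behavior = truthful "yes" + instructed-to-lie "yes"
r = p\,s + (1 - p)(1 - s)
\quad\Longrightarrow\quad
s = \frac{r - (1 - p)}{2p - 1}, \qquad p \neq \tfrac{1}{2}.
```

The denominator makes clear why p must differ from 50%: at p = 1/2 the reported share r carries no information about s.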
Can we test the validity of RRT?
From the formula, it is obvious that if p = ½, then r should always be equal to 0.5, regardless of the true prevalence s: half of those who did it are told to tell the truth and half of those who did not are told to lie, so half of the sample says "yes" either way. So, if the researcher could implement the RRT with p=0.5 (using a coin toss or a die) for one of their sensitive questions, they could test whether r=0.5 or not. Note that, unlike the LE test, this simple test (which may require you to randomly split your sample for some loss in power) is necessary and sufficient to say whether this method worked in your setting.
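A minimal sketch of such a check, assuming a simple one-sample z-test of the reported share against 0.5 (the authors' actual test statistic may differ). The function name is hypothetical, and the counts are invented: 360 "yes" answers out of 1,000 is chosen only to mimic a reported share of 0.36, not taken from the paper's sample.

```python
from math import erfc, sqrt


def rrt_coin_flip_test(n_yes, n_total, expected=0.5):
    """Test whether the share reporting the behavior equals `expected`.

    With p = 1/2 and full compliance, the reported share r should be 0.5
    whatever the true prevalence of the sensitive behavior is.
    """
    r_hat = n_yes / n_total
    # Standard error under the null hypothesis r = expected.
    se = sqrt(expected * (1 - expected) / n_total)
    z = (r_hat - expected) / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided, normal approximation
    return r_hat, z, p_value


# Hypothetical counts, invented purely for illustration.
r_hat, z, p_value = rrt_coin_flip_test(n_yes=360, n_total=1000)
print(f"r = {r_hat:.2f}, z = {z:.2f}, p = {p_value:.2g}")
```

A rejection here says respondents did not follow the protocol, so any prevalence backed out from the other RRT questions in the same survey deserves suspicion.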
Using data from an experiment some of the authors ran in Cameroon, the authors test this hypothesis for four sensitive statements about sexual behavior (Table 3). None of the four ratios comes even close to 0.5, with the highest equal to 0.36 on average, and 0.38 when self-administered. When the authors disaggregate the findings by those who admitted to the behavior (rightly or wrongly) vs. those who did not (again, truth or not) when asked directly, the low prevalence is largely due to those who say they have not engaged in the sensitive behavior in question. This group includes non-pseudo incriminators (those unwilling to say they did it) and self-protectors (those who do not trust the procedure to hide their true behavior) in unknown proportions. [In yet another useful exploration, the authors investigate this issue further in the Appendix.]
As in my previous blog post, Chuang et al. also find evidence for false positives, this time among those who have said that they engaged in the sensitive behavior: 58% of those people say they had sex without a condom, significantly and substantively above the expected 50% if people complied with the RRT design. In panel C, the authors find that the treatment (HIV education interventions) in their main experiment did not differentially affect the underreporting of these behaviors under RRT and go on to propose that experimenters should incorporate their proposed tests into their studies as treatments have the potential to affect compliance with RRT – just as they have the potential to affect DR.
The authors conclude by emphasizing that it is easy and relatively low-cost to include their proposed tests in surveys, but also that both techniques can easily fail to fulfill the validity conditions. They state that “Requiring self-incriminating responses from people who behaved according to the acceptable social norm may be the most important challenge for this technique,” which reminds me of the finding in Manian’s paper on the reluctance of sex workers in Senegal to become certified, partly because of their unwillingness to accept, even if only to themselves, a “sex worker” identity.
The paper is short and has a lot of interesting side discussions, so you should definitely read it for yourself. I am already applying some lessons from it to ongoing field work...