Published on Development Impact

Economists have experiments figured out. What’s next? (Hint: It’s Measurement)

“Everybody lies.” This is the famous refrain of Dr. Gregory House, repeated in almost every episode of the TV show House. But we need not take our guidance from an eccentric TV character: academics have been heard expressing similar sentiments. Susan Watkins once said in a meeting on study designs for HIV prevention (and I am paraphrasing) that everyone has their own reasons to give the answers they give to a survey enumerator’s questions. The comment came in the context of some HIV prevention researchers arguing that they had to use self-reported sexual behavior data to evaluate the effectiveness of their behavior change interventions because low levels of HIV incidence made it prohibitively costly to detect any meaningful effects. When some of us, including myself, objected that we had no way of knowing whether the program effects estimated on self-reported sexual behavior data were a good reflection of the program effects on HIV incidence, the researchers questioned why people would lie. Susan’s comment, coming from someone who has spent a long time listening to people for her research on the HIV epidemic in Africa and the role of social networks in this epidemic, was in response to that.

Well, count me in their camp. Recently, I have also become increasingly skeptical of measuring program impacts using answers people give to survey questions. That people will often provide untruthful answers to researchers’ questions is likely to be obvious to those who actually answer those questions or to those who observe people lying. Yet somehow, although we have recently developed a small dose of skepticism, we’re still too quick to accept findings based on such answers in the absence of better data.

For example, the point was obvious to a bar manager in Malawi, who advised Francine van den Borne to use ‘mystery clients’ in her research on the sexual behavior of bar girls and freelance sex workers, described in this paper (gated), because:

Even if you introduce yourself correctly and are completely honest about your study activities, they might not believe you. They will still doubt you and think that you have been sent by, for example, the Ministry of Women and Children’s Affairs to take their children away. Or they might think that you, in disguise as a social researcher, are from the police coming to put them in prison. You shouldn’t forget that society judges these women harshly. Thus, approaching them openly as a researcher would harm them even more and would stigmatize them further.

I can hear you saying “You and three other people in development economics are working on a sensitive subject like sexual behavior: of course you have to figure out a better way of measuring your outcomes.” But, you’d be mistaken. Innocent topics such as school attendance or hand washing can suffer from a combination of social desirability bias and intervention effects on self-reporting. Don’t believe me? See these papers:

·         Barrera-Osorio et al. (2011) show that because everyone overstates their school attendance, the treatment effect of the CCT program is compressed, i.e. underestimated compared with estimates based on attendance data collected through random classroom visits. In this paper with Sarah Baird (gated), we show that the over-reporting of school attendance differs by treatment status in a cash transfer experiment and produces biased impact estimates.

·         Stephen Luby and co-authors show in this paper that self-reports of handwashing after defecation substantially overstate its prevalence compared with data obtained through structured observation, a method often used to evaluate handwashing behavior. Then, in another paper, they show (by the use of soap that contained acceleration sensors!) that sensor soap movements were 35% higher on days of structured observation (P=0.0004) than on days when there were no embedded observers. Social desirability bias is so strong that people at least picked up the soap much more when there was an observer present. (I wish this happened more when I am in a public bathroom. My presence hardly ever seems to cause other men to wash their hands with soap after coming out of a stall…) And these findings are in the absence of an intervention that encourages people to wash their hands, teaches them proper methods, talks about public health consequences, etc. Researchers suggest that even doing ethnographic work to decide what kind of an intervention would be appropriate is extremely difficult in the presence of such bias.

·         Going back to sexual behavior, Gong (2012) shows that, of people who contracted chlamydia or gonorrhea during the past 6 months, the percentage who reported being sexually active during that period was 93% among those in the control group of a randomized VCT experiment, but only 78% among those who had been tested for HIV 6 months earlier and had received their results and counseling. Among those who tested HIV-positive, the difference was larger (92% vs. 69%). As the author argues, the VCT intervention is providing the treatment group with ‘the “correct” responses to follow-up surveys.’

·         Many times, in the absence of HIV data, changes in pregnancy rates are used as a proxy for unobserved changes in HIV infection risk. Many public health specialists are aware of the perils of drawing such inferences from program effects on pregnancy. But many others are too accepting, and they shouldn't be. Duflo, Dupas, and Kremer (2011) find, in their experiment with multiple treatments, that interventions that reduce teen pregnancy and marriage do not reduce the risk of STI (and vice versa). In Malawi, while we found significant effects of a CCT program on the likelihood of being ever married or pregnant among school-age girls who had already dropped out of school before the program started in this paper, we found no effects on HIV in the same group in this paper with Baird, Garfein, and McIntosh.

I could go on, but many of you have already got the point. It is important to make a crucial distinction here. There is survey data, in which there’ll be all sorts of noise in the answers of the respondents. But, hopefully, there’ll also be enough signal to estimate relationships. If, and that is a big ‘if,’ misreporting of the variable of interest is uncorrelated with the other variables in the relationship you are estimating, you’ll be fine. It is a big ‘if’ because we already know, for example, that ‘recall errors’ (or any other misreporting) of consumption expenditure can be correlated with household characteristics, potentially biasing poverty profiles on which policymakers may act (Beegle et al. 2012).

However, use of self-reports in experiments is altogether something else. We then have every expectation that the intervention itself will affect reporting. So, evaluating a large government program using an unrelated routine government survey may be fine (although I suspect that they too will have biases depending on what the respondents think the survey is for, how large, important, and ‘in the news’ the intervention is, etc.), but evaluating your own experiment that aims to change some behavior by asking study participants whether they have changed that behavior is unacceptable.
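To make the distinction concrete, here is a minimal sketch of how over-reporting feeds into an estimated treatment effect. It is a stylized illustration, not taken from any of the papers above: the function names and all the numbers are hypothetical, and it assumes a yes/no outcome (say, attending school) where everyone who truly does the behavior reports it, while those who do not falsely say "yes" with some probability.

```python
def reported_rate(true_rate: float, overreport_prob: float) -> float:
    """Expected share answering 'yes' in a survey.

    Assumption (hypothetical model): people who truly do the behavior
    always report it; those who do not falsely report it with
    probability `overreport_prob`.
    """
    return true_rate + (1.0 - true_rate) * overreport_prob


def estimated_effect(p_control: float, p_treat: float,
                     q_control: float, q_treat: float) -> float:
    """Difference in reported rates between treatment and control arms.

    p_* are true behavior rates, q_* are over-reporting probabilities.
    """
    return reported_rate(p_treat, q_treat) - reported_rate(p_control, q_control)


if __name__ == "__main__":
    # Illustrative numbers only.
    true_effect = 0.75 - 0.60

    # Same over-reporting in both arms: the estimate is compressed
    # toward zero, by a factor of (1 - q).
    compressed = estimated_effect(0.60, 0.75, q_control=0.5, q_treat=0.5)

    # No true effect, but the intervention raises over-reporting in the
    # treatment arm: a spurious positive "impact" appears.
    spurious = estimated_effect(0.60, 0.60, q_control=0.5, q_treat=0.8)

    print(f"true effect:                {true_effect:.3f}")
    print(f"equal over-reporting:       {compressed:.3f}")
    print(f"differential, no true gain: {spurious:.3f}")
```

With equal over-reporting in both arms, the estimate shrinks but keeps its sign. Once the intervention itself changes what people say, the estimate can move in either direction and can be positive even when nothing real has changed.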

I have a few remaining thoughts on the issue. First, everyone working with self-reported data should consider very carefully whether the paper they’re about to release is better than not writing that paper at all. This is not unlike my worry about working papers: many of the caveats that you make about your outcome variable go out the window when a practitioner is skimming your paper. If you finished your abstract by stating that “X has important implications for Y” when actually your paper showed X reduced Z, which is a self-reported proxy of Y, many policymakers and development practitioners who repeat your results to others, or who design follow-up programs and policies, will only remember your bottom line. When it comes to important policy, no information may be better than misleading information.

Second, given the choice between raising more funds to be more ambitious about good data collection and settling for quick and dirty methods, we have to strive for the former. One really expensive study could be more valuable than 10 cheap ones. Designing a beautiful experiment, lining up all the funding, coordinating with various stakeholders, donors, and researchers, and overseeing a carefully implemented intervention, only to have it evaluated using data that cannot be objectively verified, is like building an Aston Martin and then putting in it a fuel tank designed for the Ford Pinto: your evaluation might soon blow up, wasting all that effort and perhaps even misleading policymakers.

I also wonder what other outcome variables that go unquestioned in the literature are leading us astray because we’re measuring them improperly or using them as proxies for other variables that go unobserved. So, here is my plea. Send me your candidates. All you have to do is describe the commonly used survey-based measure in your field of expertise and propose a more objective way in which it could be measured, and then propose a setting in which these two measures could be compared to each other (under an intervention and under no intervention). If I get enough of these that look promising, we might put together a multi-site funding proposal to examine the importance of measurement in a variety of fields.

P.S. If you are interested in the topic of measurement, the Journal of Development Economics has recently published a symposium on measurement and survey design. I commend the editors for their foresight and highly recommend the special issue.



Berk Özler

Lead Economist, Development Research Group, World Bank
