
Assessing the Severity of Experimenter Demand/Social Desirability Effects

Many impact evaluations rely on survey data collected from program participants to measure key outcomes: the profits and sales of informal businesses, gender attitudes, study habits, job search behavior, and so on. A key concern is then that participants tell you what they think you want to hear – known as social desirability bias, or, when this misreporting varies with treatment status, as experimenter demand effects. The typical approaches to dealing with this concern attempt to minimize it: separating the survey from the program as much as possible (both in time and in who conducts it); using methods for sensitive questions such as list experiments and randomized response; and using objective or action-based proxy measures. Two recent papers offer an alternative approach to assessing how bad such a bias might be: seeking to maximize it.

Mummolo and Peterson (2018)
A recent political science paper by Mummolo and Peterson measures how severe experimenter demand effects could possibly be by inducing experimental variation in how salient the purpose of the experiment is to participants. They conduct five online survey experiments that replicate published political science experiments. For example, one is a framing study about white supremacists planning a rally: the control group reads a news article about the request, the treatment group reads a similar article that highlights the First Amendment right to hold the rally, and both groups are then asked how likely they are to support the rally. Another study presents a hypothetical scenario in which the U.S. is deciding whether to use force against a nation developing nuclear weapons; the experiment lists attributes of the unnamed country, randomly assigning whether it is a democracy, and the outcome is support for the use of force.

They then randomly assign participants to receive information about the experimenter’s intent, aiming to maximize experimenter demand effects in a subset of the sample. The conditions include: i) explicitly telling participants the hypothesis being tested; ii) directional treatments, in which participants are randomly told either that the hypothesis is that treatment will induce a positive shift in the outcome, or that it will induce a negative shift; and, as an extreme, iii) a financial incentive to help the researcher confirm the hypothesis (e.g. “The researchers conducting this survey expect that individuals are more likely to choose a news story if it is offered by a news outlet with a reputation of being friendly towards their preferred political party. If your responses support this theory, you will receive a $0.25 bonus payment”).
The encouraging news is that revealing the purpose of the experiment/hypotheses does not change the results of the experiment – that is, experimenter demand effects aren’t a big deal in their settings. They can induce experimenter demand effects in a couple of samples when paying financial incentives, but even this is not robust.

Dhar, Jain and Jayachandran (2018)
A second approach comes from Dhar et al., who evaluate a school-based intervention intended to reduce support for restrictive gender norms in India. Their key outcome is adolescents’ gender attitudes (e.g. is it wrong for women to work outside the home?). Their approach is to include at baseline a version of the Crowne-Marlowe social desirability scale, which asks the respondent whether s/he has several too-good-to-be-true personality traits, such as never being jealous of others’ good fortune and always admitting when s/he makes a mistake. The authors then look to see whether there are heterogeneous treatment effects according to this measure – i.e. whether people who have a high propensity to give socially desirable answers have different treatment effects. In their setting they do not find any such heterogeneity.

My thoughts
These papers suggest experimenter demand effects may not be much of a concern in many settings. This fits with a question I am sometimes asked: “aren’t the control group really upset to have missed out on the program, and reluctant to respond?” I think the answer is that our programs are often much less important in people’s lives than we think they are, especially if we are asking after some time has passed. But it is easy to imagine settings where the stakes are higher and the chance of this bias is greater, and this is also an easy thing for reviewers to complain about in many studies. So can these approaches help you?
  • I think the Mummolo and Peterson approach could be a useful way to check for the presence of experimenter demand bias in your setting. When collecting your follow-up survey, you could randomly prime half the sample by making the research hypothesis or treatment more salient. This is great if you then still find no bias – it helps strengthen your findings. But what happens if you find a bias? Then it is unclear how to proceed: it might well be that there was no bias when people weren’t primed to think about the research hypothesis, in which case you would end up having to throw away the data from the primed half of the sample. But it could also mean that there was some bias in responses even without priming, which then gets bigger with the priming treatment. If we are prepared to assume there are no “annoy the experimenter” effects, whereby some respondents deliberately report the opposite of what they think you want to hear, the directional treatments could then be used to provide some bounds.
  • I have more concerns about the persuasiveness of examining treatment heterogeneity by baseline Crowne-Marlowe. A key reason is that treatment heterogeneity could arise for two reasons: a) the treatment genuinely has differential effects for people with the characteristic of seeking socially desirable responses; and/or b) there is no genuine treatment heterogeneity by this characteristic, only heterogeneity in the desire to please the interviewer. So finding that there is no significant heterogeneity might mean that there is no experimenter demand effect, or it might mean that genuine treatment heterogeneity and experimenter demand effects cancel one another out.
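The interpretation problem in the second bullet can be made concrete with a small simulation. The sketch below (all variable names, effect sizes, and the reporting model are hypothetical, not taken from either paper) estimates the treatment effect separately for respondents above and below the median of a baseline Crowne-Marlowe-type scale. In this simulated data, the extra 0.2 effect among high-scale respondents is pure reporting bias – but an analyst who sees only the survey outcomes cannot distinguish it from genuine effect heterogeneity:

```python
import random

random.seed(0)

# Simulated follow-up survey (names and magnitudes are illustrative):
# treat   = random program assignment
# high_sd = above-median baseline Crowne-Marlowe score
n = 10_000
data = []
for _ in range(n):
    treat = random.random() < 0.5
    high_sd = random.random() < 0.5
    # True effect of 0.3 on a (standardized) gender-attitudes index,
    # plus an extra 0.2 of pure reporting bias when a treated respondent
    # has a high social-desirability score.
    y = 0.3 * treat + 0.2 * (treat and high_sd) + random.gauss(0, 1)
    data.append((treat, high_sd, y))

def mean(xs):
    return sum(xs) / len(xs)

def treatment_effect(rows):
    """Difference in mean outcomes between treated and control rows."""
    treated = [y for t, _, y in rows if t]
    control = [y for t, _, y in rows if not t]
    return mean(treated) - mean(control)

te_high = treatment_effect([r for r in data if r[1]])
te_low = treatment_effect([r for r in data if not r[1]])
print(f"effect among high Crowne-Marlowe: {te_high:.2f}")
print(f"effect among low Crowne-Marlowe:  {te_low:.2f}")
print(f"estimated heterogeneity: {te_high - te_low:.2f}")
```

The same subgroup contrast would look identical if the 0.2 were a genuine differential treatment effect rather than misreporting, which is exactly why the null result in this check is hard to interpret on its own.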
From a methodological standpoint, I would be interested in seeing these approaches used in a situation where (unknown to the respondent) you have an objective measure of the outcome as well as a survey response. Then one could test directly whether there is systematic misreporting, and how this varies with the priming of the Mummolo and Peterson approach, or with the baseline Crowne-Marlowe. For example, if you had school admin data on attendance in a CCT program, and then also asked parents separately in a survey about sending kids to school, this could be done. But for the moment, these two papers at least offer some comfort that experimenter demand effects may not be as prevalent in reality as is often feared.
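The validation test described above can also be sketched in a few lines. This is a hypothetical example, not an analysis from either paper: the attendance rate, the reporting model, and the priming effect are all assumed for illustration. With both an admin record and a survey report per respondent, over-reporting is directly observable, and comparing its rate across randomly primed and unprimed arms measures how much priming inflates misreporting:

```python
import random

random.seed(1)

# Hypothetical validation exercise: for each parent we observe an admin
# record of the child's school attendance, the parent's survey report,
# and whether the parent was randomly primed with the research hypothesis.
n = 20_000
over_reports = {True: 0, False: 0}  # survey says attended, admin says not
arm_sizes = {True: 0, False: 0}
for _ in range(n):
    primed = random.random() < 0.5
    attended = random.random() < 0.7  # ground truth from admin data
    # Assumed reporting model: non-attending parents falsely report
    # attendance 15% of the time if primed, 5% otherwise.
    misreport_prob = 0.15 if primed else 0.05
    report = attended or (random.random() < misreport_prob)
    arm_sizes[primed] += 1
    over_reports[primed] += int(report and not attended)

rate_primed = over_reports[True] / arm_sizes[True]
rate_unprimed = over_reports[False] / arm_sizes[False]
print(f"over-reporting rate, primed:   {rate_primed:.3f}")
print(f"over-reporting rate, unprimed: {rate_unprimed:.3f}")
```

Here the unprimed rate recovers the baseline level of misreporting, and the primed-minus-unprimed gap is the experimenter demand effect on reporting – the quantity the survey-only designs can only bound.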


Submitted by Anna Tompsett

Thanks for this interesting post and the reflections. It seems like this paper, by my colleague in Stockholm Jon de Quidt and coauthors Johannes Haushofer and Chris Roth, would also be super relevant to those interested in how to bound these effects?
