More on Experimenter Demand Effects

David McKenzie
After my post on experimenter demand effects on Monday, several readers kindly pointed me to a paper on the same topic forthcoming in the AER (ungated version, 73-page appendix) by de Quidt, Haushofer, and Roth. Like the Mummolo and Peterson paper I blogged about then, de Quidt et al. propose a way to construct bounds on experimenter demand effects by seeking to induce this demand. Since they offer some more evidence on the issue, I thought I’d also briefly summarize what they do and offer some additional thoughts.

Online lab experiments with a large sample
They conduct seven online experiments with 19,000 participants (mostly on Mechanical Turk, along with one representative online panel), designed to elicit time, risk, and uncertainty preferences, and to have participants play a dictator game, lying game, ultimatum game, and trust game. They give “weak” demand treatments, in which one group of participants is told the hypothesis signed one way and another group the other way (e.g. “We expect that participants who are shown these instructions will invest more/less in the project than they normally would”), and “strong” demand treatments, in which participants are told e.g. “You will do us a favor if you give more/less to the other participant than you normally would”.

A few key results and points to note:
  • Subtle wording to avoid deception: The authors note that constructing bounds by telling one group you expect the treatment to increase the outcome, and another group that you expect it to decrease, would involve deception, since both statements cannot be true. By instead saying they expect “participants who are shown these instructions to do X”, they get around this, since they do indeed expect high actions from those in the high-demand treatment and low actions from those in the low-demand treatment.
  • Experimenter demand not a big deal when people just know the hypothesis: Responses to the weak demand treatments are modest, averaging 0.13 s.d., and are not statistically different from zero in most tasks. Responses to the strong demand treatments are much larger, averaging 0.6 s.d. (although do recall our previous blog posts on why standard deviations can be problematic measures of whether an effect size is modest or not).
  • Women seem to be more likely to try to please the interviewer than men – they respond 0.15 s.d. more to the strong treatments.
  • They show how you can combine these demand treatments with a structural model to uncover unconfounded estimates, and to measure the value of pleasing the experimenter.
  • Few defiers: Using a within-subjects design, they do not find much evidence of what I called an “annoy the experimenter” effect, suggesting a monotonicity assumption on the direction of demand effects is likely to be valid. To do this, they have people play a dictator game or investment game twice, where one time they get no demand treatment, and the other time a positive or negative demand treatment. Defiers – those who change their behavior in the opposite direction of the demand treatment – make up only 5% of participants.
There is a lot more in the paper, including a model, lots of robustness checks, and discussion of the conditions under which giving both the treatment and control group the same-sign demand treatment can be enough to difference out experimenter demand effects and still recover the treatment effect.
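The bounding logic behind these demand treatments can be illustrated with a toy calculation (all numbers here are hypothetical, and the paper’s actual estimators are more involved than this): if monotonicity holds, demand-free behavior lies between behavior under positive and negative demand, so comparing the demand-shifted group means gives a lower and an upper bound on the treatment effect.

```python
# Toy sketch of demand-effect bounds in the spirit of de Quidt et al.
# Hypothetical outcome data; the paper's structural estimators differ.

def mean(xs):
    return sum(xs) / len(xs)

# Outcomes under positive ("do more") and negative ("do less") demand,
# for treatment and control groups. Under monotonicity (few defiers),
# each group's demand-free mean lies between its two demand-shifted means.
treat_plus  = [0.62, 0.70, 0.66, 0.71]   # treated, positive demand
treat_minus = [0.48, 0.55, 0.50, 0.53]   # treated, negative demand
ctrl_plus   = [0.52, 0.57, 0.55, 0.54]   # control, positive demand
ctrl_minus  = [0.40, 0.44, 0.42, 0.43]   # control, negative demand

# Worst-case bounds on the treatment effect: pair the lowest plausible
# treated mean with the highest plausible control mean, and vice versa.
lower = mean(treat_minus) - mean(ctrl_plus)
upper = mean(treat_plus) - mean(ctrl_minus)
print(f"treatment effect bounds: [{lower:.3f}, {upper:.3f}]")
```

With these made-up numbers the bounds are roughly [-0.03, 0.25]: wide, which is why the paper’s finding that weak demand manipulations move behavior only modestly matters – small demand responses translate into tight bounds.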

So can we just ignore concerns about experimenter demand effects?
The authors conclude that “Across eleven canonical experimental tasks we find modest responses to demand manipulations that explicitly signal the researcher’s hypothesis...We argue that these treatments reasonably bound the magnitude of demand effects in typical experiments, so our findings give cause for optimism.” Along with the results of the two papers I discussed Monday, this gives further reason to think that perhaps we don’t need to worry too much about experimenter demand effects.

However, while some of the tasks in the paper are for real stakes, the payouts are trivial – e.g. the dictator game involves splitting $1, and the value-of-time question involves choosing between 10 cents in 7 days versus an amount today. The experiments are all done online, so the participants never meet those running the experiments, and there are no repeated interactions in the future, so participants are not worried about their responses today affecting their likelihood of benefiting later. Contrast these features with a field experiment measuring the impact of a schooling CCT or a business grant program – where the treatment group may have received life-changing amounts of cash, have had multiple interactions with the officials running the program, and may also think that what they answer today could affect future program eligibility. It is unclear how much the results of these lab-experiment and survey-based papers generalize to such field experiment settings (e.g. Sarah Baird and Berk find in their CCT that school attendance is over-reported (although, surprisingly, more among the control group), and I find (appendix 11) in my business plan evaluation that treated firms over-report employment to the government). It would therefore still be useful for future work to test further for experimenter demand effects in such field settings – as well as for researchers to otherwise take care in measurement to mitigate experimenter demand in these settings.