I have been designing and running field experiments for almost two decades now, but things do change fast, and your knowledge can get stale in the blink of an eye. So, when I saw that Muriel Niederle had a new working paper, titled “Experiments: Why, How, and A Users Guide for Producers as well as Consumers,” I thought I should have a quick read through it to see if there are some lessons I (and our readers) can learn.
The first thing I found out is that the paper concerns itself with laboratory experiments. “No worries,” I said as I pivoted: “There must be some lessons from lab experiments for those of us who do field ones.” As I started reading, I had a rude awakening: I really don’t know much about lab experiments at all. Having a few colleagues who call themselves “lab in the field” experts/practitioners, I thought I had a sense of what they do, but no, not really…
Well, it turns out that this paper is not lab experiments for newbies: if you don’t know what a lab experiment is, how it is run, the logistics, etc., you are not going to get that stuff here. This paper is for people who know which papers in this field led to Nobel prizes, as well as for graduate students and junior faculty who have already run a lab experiment (or been involved with one or more in the past, at least tangentially), so that they can do them much better. From what I can gather from this paper, labs exist at various institutions (presumably many more at universities in the US) and they recruit and pay university students to play games. The games are designed to tackle the identification problem using the power of random assignment to confirm or debunk a hypothesis – with the payments providing some stakes for the subjects to “play” these games seriously – simulating, as much as possible, the real-life settings that the researcher has in mind. As the opening quote by Georgia O’Keeffe is meant to foreshadow, lab experiments do this by creating a controlled environment, designing “controls” that account for alternative explanations, and eliminating confusing details and background noise, so that the “treatment”(s) can confirm or deny a hypothesis. The paper is really about the “how to” of this design phase.
While Section 1 offers some comparison of lab experiments with field experiments and surveys, this is not really the focus of the paper for its target audience. There is some confusing discussion here about external validity and how, supposedly, economists are not as worried about it for field RCTs: the definition that the paper provides for it in the context of lab experiments seems to apply equally to field ones, and many people I know are equally worried about it: “external validity, strictly speaking, is fulfilled if the result is found in a different laboratory with perhaps slightly changed instructions and a different environment.” Yes, perhaps US college students are not the Austrian farmers (they come up a lot) that we’re interested in, but just do your lab experiments with the latter (labs in their “literal” fields) and you’re there.
The bad and the good…
Before we get to the basics of designing lab experiments and the lessons they hold for the rest of us (Section 2), let me mention my one takeaway each for the good and the bad about lab experiments. First, there are questions for which lab experiments are simply not suitable: certain policy actions cannot be simulated in a laboratory; certain effects take a long time to materialize and, hence, cannot be observed in an hour or two in a laboratory; the target population may be really different from the one that can be brought to one of these labs; and so on. This is a major shortcoming for someone like me, as it limits the number of questions (that I am interested in asking, or that my audience/clients are asking me to answer) that I can investigate using this tool. But, in their defense, the people designing these experiments are not interested in asking those questions, either. So, this is a “bug” only because I am writing about it: it’s a “feature” for a lot of other people…
On the flip side, lab experiments are so much cheaper and quicker to carry out. This is not to minimize the amount of work that goes into designing them carefully and properly, but, as we will see in the discussion of Section 2 (the basics of how to design lab experiments) below, sometimes you can have an epiphany, change course, and run a whole new set of controls (or a treatment), tweak your game, change your eligibility criteria, etc. Very few field experiments have the luxury of changing tack midstream and conducting something significantly different. I can imagine organizations that do internal A/B testing being somewhat more nimble, but even they don’t come close to the convenience of thinking of another clever treatment that will nail the answer, convening a new group with new instructions, and being done in a matter of days/weeks. In contrast, if your field experiment blows up (due to factors under your control or not), you have lost a lot of money and time, it is very hard to salvage, and it is definitely not quick…
[When we were starting the listing and baseline surveys for one of our field experiments, our local field teams (enumerators, supervisors, drivers, etc.) got chased away from several villages in our study district with panga knives. Fortunately, everyone was safe, and we had spent less than 10% of our baseline survey budget at the time. I remember sitting on the side of the road with our field supervisor, census data on our laptops, looking for another district with characteristics similar to the one we were having to abandon. We were lucky to find one and continue the study, but had that happened halfway through the baseline surveys, we would have had to fold and start over.]
This is a big advantage of laboratory experiments. As we will see in the discussion below, it comes with the danger of what Niederle calls “g-hacking” (g for game, replacing the p, for p-values, in p-hacking): when it is that easy to run repeated experiments, it is tempting to run pilots and, consciously or subconsciously, select what to run, write up, and eventually report. However, this concern seems far outweighed by the low cost and the ease of changing course. That is, needless to say, not an invitation to be careless at the design stage: Niederle is clear on this point that experimentalists – both lab and field – should carefully design their trial beforehand, so that any unexpected findings we had failed to plan for become the subject of the next important experiment.
Some takeaways from Section 2 (with field experiments for comparison)
Lab experiments have validity outside the lab, mostly for “average” individuals.
The paper makes a point about not expecting the results of a lab experiment to predict how an expert might behave under the same conditions. In field experiments, we worry about the flip side of this: we might have some of those experts (and some very poor performers) in our experiment who, in the best case, increase the variance in outcomes and, in the worst case (when the trial is small), unduly influence/bias the estimated effects. In such situations, it is not uncommon to assign the top and bottom performers to a trial arm with 100% probability, while randomizing treatment within a band of “average” subjects (such as firms in a business plan competition; McKenzie 2017).
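To make that concrete, here is a minimal sketch of such an assignment rule – the data, cutoffs, and arm labels are invented for illustration, not taken from McKenzie (2017): the extremes go to an arm with certainty, and only the middle band is randomized.

```python
# Hypothetical sketch: deterministic assignment of extremes, randomization in the middle band.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Invented applicant pool with a baseline performance score
df = pd.DataFrame({"id": range(200), "score": rng.normal(50, 10, 200)})

top_cut = df["score"].quantile(0.90)
bottom_cut = df["score"].quantile(0.10)

# Extremes assigned with probability 1 (here, for illustration: top -> treatment, bottom -> control)
df["arm"] = np.select(
    [df["score"] >= top_cut, df["score"] <= bottom_cut],
    ["treatment", "control"],
    default="",
)

# The "average" middle band is randomized 50/50
middle = df["arm"] == ""
df.loc[middle, "arm"] = rng.choice(["treatment", "control"], size=middle.sum())

print(df["arm"].value_counts())
```

The experimental comparison then comes only from the randomized middle band; the deterministically assigned tails are kept in the program but not in the causal estimate.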
Stay away from “stuff happens” experiments and “compare competing hypotheses.”
Sometimes we get curious about something – an observation, a hunch, etc. If it is easy to study our hunch/curiosity with an inexpensive and relatively quick experiment, we might just do it. As Niederle puts it: “Such one treatment experiments will yield results, since participants must do something. Which is why I call them stuff happens experiments.” [As it turns out, this was also the prejudice of my undergraduate advisor (a math professor) about a chemistry doctorate: he dissuaded me from pursuing a PhD in Math by saying that “chemists can just run an experiment and write the results for their thesis. Math students must come up with and prove a theorem that no one has thought of before.” He really got me…] This view, of course, is mostly inaccurate, as it is very hard to interpret the results of such experiments – in the field just as much as in the lab…
Think through lots of treatments, lots of design parameters, but then edit ruthlessly and simplify
In experiments, there are always so many choices. First, there are different versions of the question we want to ask: what is it that we really want to know? Often, that question is not as sexy as the tweaks you can brainstorm, but the fun comes with the costs of increased noise, reduced power, logistical complications, etc. By far the most fun part of an experiment (and, honestly, most of the important work) is the design phase: having intense debates with your teammates, entertaining all possibilities, sleeping on things and coming back to them the next day (sometimes having switched sides in your arguments) is how breakthroughs happen and how good studies are designed while also enjoying one’s work. However, someone also has to be the fun police: you can’t run all the interesting treatments. In the end, science (and your budget and your other time commitments) requires that you make choices. Workshopping your design (or pre-pre-analysis plan) is a good idea at this stage.
These days, I often advise people to have fewer study arms and to forgo pure control groups when possible – i.e., simplify and focus. Paraphrasing Niederle: the audience does not need to know the origin story of your study – the pathways through which you got to the final design might be meaningful to you and perhaps fun to share long after the papers are written, but they don’t need to be apparent in your study design…
Of course, you still have to make a million more decisions – about important design parameters. Often, in the field, we don’t want to span the entire gamut of the relevant parameter space. It seems that this is also the case for lab experiments, but perhaps less so. Then the choices seem to get harder: what subset of the parameter space should my experiment focus on? What should the payment structure be? The exact wording of instructions, and so on… Often, in the field, we let relevance for policy be our guide (avoid self-selection into the study pool; mimic government/NGO actions; make transfers similar to what the government could afford; etc.). In the lab, the considerations are likely to be different – follow the literature, depart from the literature, test a new theoretical prediction – which brings us to…
Let theory be your guide but don’t let it become a hindrance…
Successful experiments have a theory, at least a conceptual framework, in mind. For example, in the early days of conditional cash transfers (CCTs), people evaluated these emerging programs against a control group that received no transfers. While useful, the evaluations left a theoretical ambiguity: CCTs cause an income effect and a price (substitution) effect, and such evaluations identify the sum of these two effects. What is the relative importance of each effect? This is a question that is pertinent to those who object to behavioral conditions on other, say ethical, grounds: is the condition really necessary? Perhaps the income effect does all the work, especially in poor settings. To answer that question, you need to include another arm in your trial that identifies the income effect alone, for example, by providing unconditional cash transfers (UCTs). Then, the substitution effect can be identified as the effect of the CCT minus the UCT.
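To spell out the identification logic in notation (a purely additive decomposition in my own shorthand, not taken from any specific paper), write each effect as a treatment-versus-control impact on the outcome of interest:

\[
\beta_{\text{CCT}} = \beta_{\text{income}} + \beta_{\text{substitution}}, \qquad
\beta_{\text{UCT}} = \beta_{\text{income}}
\quad\Longrightarrow\quad
\beta_{\text{substitution}} = \beta_{\text{CCT}} - \beta_{\text{UCT}}.
\]

This, of course, assumes that the CCT and UCT arms are comparable in transfer amounts and recipients, and that the two effects add up rather than interact.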
So, while being guided by theory is clearly important (as can be seen in the increasing number of RCTs with structural models and experiments designed to identify key parameters in them), Niederle warns us of two things. First, the theory should apply in the first place. While this has a different implication for lab experiments, I read it as a caution about applying theory to a particular place, especially if the researcher is not intimately familiar with that setting. For example, income effects on early marriage may diverge for adolescent girls and young women depending on whether they live in a bride price or dowry environment, or whether they are in a matrilineal or patrilineal society. It is not hard to encounter such settings close to each other, meaning that slight changes in study design (like moving to the neighboring district) may render your theory irrelevant.
There is a second shortcoming of taking theory too seriously: theories make assumptions, some of which are unobserved, untestable, or at least will not realistically be tested in your study. If you make an (untested) assumption about subjects’ behavior, on which your theory’s interpretation of the outcomes rests, you might end up with the wrong reading of the results – such as attributing behavioral changes to the mechanism predicted by theory when they were, in fact, due to other factors…
Don’t engage in deception
The definition of deception is restricted (and contested) in experimental economics, but there seems to be universal agreement that researchers should not engage in this (narrowly defined) practice: “Thou shalt not provide explicitly false information to study subjects.” This is a clear example of deception, and most agree that it should be avoided (Charness, Samek, and van de Ven, 2021). The reasons given for avoiding such practices generally revolve around ethical concerns or maintaining a clean experimental pool (if future subjects believe they will be deceived, they might select into experiments and/or behave differently than they would have otherwise).
But there are also gray areas, where information is omitted, or statements are misleading without being explicit lies. A good example of the latter is when people are told that their status in a later stage of the game depends on their performance in an earlier stage, but it is really randomized (technically not a lie because the randomization is stratified by the scores, i.e., status does depend on the scores, but not in the way a reasonable subject would interpret/expect/understand). There seems to be more disagreement on such practices, although researchers – across disciplines – seem to be OK with the omission of benign study details, especially if such details are important for the clean interpretation of key results and the question at hand is important enough. In other words, concerns about beneficence… A number of papers cite John Hey (1998): “there is a world of difference between not telling subjects things and telling them the wrong things. The latter is deception, the former is not.” I think that this is too general a statement, but I tend to agree that some omission is often necessary to avoid influencing, or “priming,” the subjects.[1] Reasonable people can disagree with this selective ban on deception – see, for example, Hersch (2015) on “Experimental economics' inconsistent ban on deception.”
Finally, watch out for G-hacking…
What Niederle means by g-hacking is, in her own words, as follows: “With g-hacking, for game-hacking, I want to denote the practice of selecting a specific game (or specific parameters) or a specific environment, perhaps using extensive piloting, while at the same time presenting results as if the game was randomly drawn from a larger set of possible parameters and games.” The example she gives is a researcher who tested many versions of a game before settling on the one presented in the paper. G-hacking is reminiscent of p-hacking: the former hacks through the choice of which data to collect in what environment, while the latter hacks through which outcome measures to report and what type of analysis to use.
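To see why reporting only the “best” piloted game can mislead, here is a minimal, hypothetical simulation (mine, not from the paper): several game variants are run under a true null effect, and only the variant with the smallest p-value is reported as if it were the only one ever run.

```python
# Hypothetical illustration of how selecting the best-performing game variant
# inflates false positives, much like p-hacking does with outcome measures.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def one_study(n_variants, n_per_arm=50):
    """Run n_variants 'games' with zero true treatment effect; return the
    smallest p-value across variants (the one a g-hacker would report)."""
    p_values = []
    for _ in range(n_variants):
        control = rng.normal(0, 1, n_per_arm)
        treated = rng.normal(0, 1, n_per_arm)  # no true effect
        p_values.append(stats.ttest_ind(treated, control).pvalue)
    return min(p_values)

n_sims = 2000
honest = np.mean([one_study(1) < 0.05 for _ in range(n_sims)])  # one pre-specified game
hacked = np.mean([one_study(5) < 0.05 for _ in range(n_sims)])  # best of five piloted games

print(f"False-positive rate, single game: {honest:.3f}")  # ~0.05
print(f"False-positive rate, best of 5:   {hacked:.3f}")  # ~0.20 or higher
```

With one pre-specified game, the false-positive rate sits near the nominal 5%; picking the best of five pushes it toward 20% or more – which is exactly why honest reporting of the piloting process matters.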
Obviously, there is nothing inherently wrong with piloting, with making sure that you have a potentially efficacious treatment, with carefully selecting the environment in which you want your study to take place, or with collecting multiple outcome measures. If you want to show effects on an outcome, you want to make sure that there is a deficit of that outcome in your study area (no point trying to increase a 98% primary school attendance rate). The trick is honesty and transparency in reporting. At least in field experiments, people have no expectation that you picked your design parameters randomly (location, target population, as well as key design aspects such as transfer amounts, frequency of transfers, the identity of the recipient, monitoring and enforcement of any conditions, or eligibility criteria, etc. in a social safety net program). Whereas I get the feeling that, for certain lab experiments, there are parameter spaces to be covered and the choice of the subset chosen for the study/paper is not made clear.
This also reinforces for me the need to carefully communicate these details to policymakers after field experiments – especially when we are trying to convince them to try an approach that worked before and/or elsewhere: if subjects’ responses to the intervention are not robust to the design parameters chosen in the previous studies, they could get no effects or even negative ones when expecting large positive impacts. It’s important to say, “the implementation fidelity was such and such; the target population differed slightly from yours; the amounts were larger and given less frequently,” etc.
Ultimately, this is about transparency in reporting, and it relates to pre-analysis plans, about which Niederle has also written.
The paper goes on to mention the rising prevalence of online experiments (instead of physical labs at universities) and some of the difficulties (as well as pros) associated with that: g-hacking becomes easier, and it may not even pollute the future participant pool that much. Section 3 then gets into more technical details about designing experiments, such as dealing with background noise. I ran out of steam at that point (and this post is over 3,000 words), so I will leave interested readers to dig into that part on their own…
In the end, as a development economist experienced in designing and running field experiments, I am not sure that I learned enough from this paper to justify the time I spent reading it. That does not mean that others, perhaps those getting started with field experiments, would not. My most important takeaway, however, is that the option to use lab experiments as a complement to field experiments, or simply to data collection, does exist and comes with its own set of rules and worries: I had hardly considered them in this way before, although I now realize that the paper that David McKenzie and I wrote on the impacts of economics blogs more than a decade ago – when this blog was just getting started – was a kind of online experiment (see Freakonomics coverage of it here). Otherwise, I have mostly been a consumer of the outputs of lab experiments – we probably all know some facts that can trace their origins to a famous lab experiment in the past.
[1] I was pleased to read this quote from Hertwig and Ortmann (2008), cited in Charness et al. (2021) above: “… a consensus has emerged across disciplinary borders that intentional provision of misinformation is deception and that withholding information about research hypotheses, the range of experimental manipulations, or the like ought not to count as deception.”