Mind Your Cowpeas and Cues: Inference and External Validity in RCTs

Berk Ozler

There is a minor buzz this week on Twitter and in the development economics blogosphere about a paper (posted on the CSAE 2012 Conference website) that discusses a double blind experiment providing different cowpea seeds to farmers in Tanzania. The paper has drawn “wows” and “yikes” and “!”, and I am hoping those are for the double blind development economics experiment and not for the findings.

The authors conducted two experiments in Morogoro, Tanzania. The first is an open RCT in which some farmers were explicitly given modern cowpea seeds while others were given traditional seeds; let’s call them MS and TS. In the second, as far as one can make out from the really (really) cryptic description of this experiment, half of the farmers were given MS and the other half TS, but without being told anything about the seeds, which looked identical (double blind, as the experimenter did not know which seeds were which variety, either). The paper provides no information about what the farmers were told, whether any mention of modern seeds was made, whether the probability of having received MS or TS was given, or what their prior beliefs were.

Anyway, whatever the farmers were told, the authors find that the farmers who received MS in the open RCT harvested significantly more cowpeas (9.9 KGs) than those who received TS (7.2 KGs). However, in the double blind experiment, both groups harvested as much as the farmers who knew they got MS (9.9 and 9.4 KGs, respectively). The authors call the difference between the two groups who received TS, i.e. 9.4 minus 7.2, a pseudo-placebo effect, for reasons that are not immediately clear. The fact that the benefits from a new technology may vary with effort, and/or that there may be interaction effects between effort and treatment, is hardly news, and it is a stretch to call these placebo effects – especially for economists. Effort and other inputs may be complements to or substitutes for any treatment, and being treated can alter their levels. In medicine, this is the difference between “efficacy” and “effectiveness”: the difference between something working under idealized conditions (as in a lab or a highly controlled clinical trial) and its working in the real world, where people react to treatments and incentives.

The authors also use the term “bias” in referring to the impact estimates from the open RCT, which confuses internal and external validity: because I would never implement a policy that gives a farmer some seeds without telling him what they are, the estimates from the open RCT pilot are exactly what any policymaker would care about in her setting. However, if she then wanted to apply the policy in a different setting, where beliefs about efficacy may be different, complementary inputs are unavailable, etc., then she’d like to know the mechanisms and would seek to disentangle the pure treatment effect from the effect of effort/inputs and their interaction with treatment. This experiment does not do that: for that, we can design different mechanism experiments, use structural models, or both. More on this later in the post, but let’s get back to the findings of the paper for now and see (a) whether they are credible; and (b) whether they tell a coherent story.

There are a number of things that seem to have gone wrong with the experiment, the attrition, and the empirical analysis. First, it turns out that the modern seeds sold in the market in Morogoro are treated with a purple powder (to prevent cheating and to protect the seed from insect damage during storage), so the experimenters sprayed the traditional seeds with the same purple powder. As you can immediately tell, this is less than ideal. First, as this is not a new product, farmers in the blind RCT are likely to infer that the seeds they were given are modern seeds. Given that beliefs are a major part of the story the authors seem to want to tell, this is not a minor detail. Second, if the purple powder really does protect the seeds from insect damage, the difference between MS and TS is now reduced.

Second, there is huge attrition in the study. The authors state that the attrition is more than 40%, but “rather equally spread across our 4 groups.” However, a close examination of the numbers presented in Table 1 suggests something different. 600 households were randomly selected for the study and allocated into four treatment groups; given no more information, I assume equally, i.e. 150 per group. In Table 1, the numbers of households in the four treatment groups are reported as 70 and 73 for MS and TS in the open RCT, and 63 and 56 for MS and TS in the double blind RCT, respectively. On the same assumption that the 600 households were equally distributed across the four treatment arms, the attrition rate in the double blind experiment, at 60%, is substantially higher than that in the open RCT (52%), which makes comparisons across the two experiments problematic. (There are also balance problems in Table 1, such as average land ownership in one treatment arm being more than 20% higher than in the other in the open RCT, but I’ll set those aside.)
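The attrition arithmetic is easy to check. A minimal sketch, taking the group sizes from Table 1 and assuming, as above, 150 households assigned per arm:

```python
# Assumed equal allocation: 600 households across 4 arms = 150 per arm.
ASSIGNED_PER_ARM = 150

# Households remaining in Table 1 (i.e., those who harvested cowpeas).
remaining = {
    "open_MS": 70, "open_TS": 73,    # open RCT arms
    "blind_MS": 63, "blind_TS": 56,  # double blind arms
}

def attrition(groups):
    """Attrition rate across a set of arms, assuming equal assignment."""
    kept = sum(remaining[g] for g in groups)
    return 1 - kept / (ASSIGNED_PER_ARM * len(groups))

print(f"Open RCT attrition:     {attrition(['open_MS', 'open_TS']):.0%}")
print(f"Double blind attrition: {attrition(['blind_MS', 'blind_TS']):.0%}")
```

This yields roughly 52% for the open RCT and 60% for the double blind experiment, which is what the comparison in the text rests on.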

If you think that attrition here (defined as households who did not harvest cowpeas from the seeds provided by the experimenter) is an academic’s quibble, you may want to think twice. The authors cite failed harvests and failure to plant the distributed seeds as the reasons for the attrition (without giving details of how much of this very large attrition each accounts for). This is obviously an outcome that would be affected by treatment status, and particularly by the information content that differs across the two experiments: if I am not sure what seeds I received, then I might be much less likely to plant them to begin with, or I might put less effort into protecting them from damage (by insects or water), which would account for the higher number of households who did not harvest cowpeas from the seeds provided in the double blind experiment. But the whole point of this paper was supposed to be disentangling the effort effects from the pure seed effects: it cannot do that if it excludes the households who did not harvest cowpeas.

There is a way to solve this problem: the outcome variable should equal “zero” for anyone who did not harvest cowpeas. That is literally how many KGs they harvested from the seeds provided by the experimenters, and it is also the correct intention-to-treat estimator. Given that I have the attrition rates for each group, I can re-calculate the impact sizes in Table 2 by multiplying them by the compliance rate in each group (which is equivalent to assigning zeros to all the people who did not harvest cowpeas). That calculation produces 4.6 KGs for MS vs. 3.52 KGs for TS in the open RCT, while the same figures are 4.2 and 3.51, respectively, in the double blind RCT: the so-called pseudo-placebo effect for those who received traditional seeds is no longer there (3.52 vs. 3.51)! Furthermore, the effects for the households that received modern seeds are now further apart (4.6 vs. 4.2) than they are in Table 2, as one would expect if effort is complementary to improved seeds and people who knew that they received MS put in more effort than those who received the same seeds but did not know it. The story makes a little more sense now than it does in Tables 2 and 3.
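The adjustment amounts to scaling each group’s reported mean by its compliance rate. A minimal sketch, using the means reported in the paper and again assuming 150 households assigned per arm (small discrepancies from the figures above come from rounding of the reported means):

```python
ASSIGNED_PER_ARM = 150  # assumed equal allocation of the 600 households

# (mean harvest in KGs among those who harvested, number who harvested)
arms = {
    "open MS":  (9.9, 70),
    "open TS":  (7.2, 73),
    "blind MS": (9.9, 63),
    "blind TS": (9.4, 56),
}

def itt_mean(mean_kg, n_harvested, n_assigned=ASSIGNED_PER_ARM):
    """Intention-to-treat mean: non-harvesters count as zero KGs."""
    return mean_kg * n_harvested / n_assigned

for arm, (mean_kg, n) in arms.items():
    print(f"{arm}: {itt_mean(mean_kg, n):.2f} KGs")
```

The open-RCT TS and blind TS arms come out essentially identical (about 3.5 KGs each), while the MS arms spread apart, matching the pattern described above.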

Of course, in this case, we really should not have to guess about whether effort (or other inputs) is a complement to or a substitute for modern seeds; perhaps only the size of the effort effect should be in question. There are other ways to establish efficacy. Why not have test plots that are farmed by highly trained people under extremely controlled circumstances? Of course, that may have no external validity, but the point is to prove efficacy first. We are told nothing about these modern seed varieties: hasn’t someone already shown that they are efficacious, which presumably led to the term “modern” seeds in the first place? Is there a biological pathway? In the biomedical sciences, some researchers show that a new intervention works under highly controlled conditions for a particular sub-group of the target population, while others figure out whether some form of scale-up would be effective, i.e. would work among the entire target population. Yes, it may sometimes be hard to establish efficacy if the effect of the treatment depends on unobservable actions (or characteristics) even in highly controlled environments, but this specific case does not seem to fit that mold. We know that fertilizers and condoms are efficacious, but people need to use them correctly and at the right time for them to be effective – I don’t see how this is different for cowpeas…

There are yet other questions raised by this paper, too many to delve into in this already long post, so I’ll quickly refer to just one instead. Why this particular experiment with cowpeas? What is the question the experiment is designed to answer? Why this setting/group of farmers? As far as I can tell, these are all farmers with plots that they would otherwise have used for farming other crops and/or cowpeas, rented out, or left fallow. If I now give them a free bag of cowpea seeds, what do I expect? They will choose to plant it or not based on the opportunity cost of land and the relative costs of the complementary inputs needed. So, examining the quantity of cowpeas harvested seems like a very narrow outcome: I would want to know whether the total value of all crops grown has increased or not. Depending on the goal, the study could have benefited from restricting itself to cowpea farmers, perhaps blocking on plot availability/size and other salient baseline characteristics. As the paper is written, it is very hard to understand why the experiment was designed the way it was.

For an excellent treatment of how to design experiments when we’re interested in the external validity of an efficacious intervention, i.e. what we can expect from scaling up a new treatment in a new setting among a new population with different beliefs, skills, constraints, etc., I highly recommend the paper by Chassang, Padró i Miquel, and Snowberg titled “Selective Trials: A Principal-Agent Approach to RCTs,” forthcoming in the American Economic Review (I discussed it here about a year ago, after Sylvain and Erik presented it at the World Bank). Using selective trials (in which the experimenter elicits the subjects’ willingness to pay for the treatment and assigns them to treatment and control accordingly), which can be open, blind, or incentivized (on final outcomes), they propose ways to disentangle the pure treatment effect of an intervention from the effect of effort and interaction effects. However, to do this, for example in blind selective trials, the probability of receiving the treatment has to be varied and, crucially, made known to the subject. As those probabilities approach zero or one, we can start disentangling effort from treatment. There are some practical (and ethical) hurdles to implementing these methods in the field (such as sample size and expense), but recent work suggests that complex experiments can be successfully conducted in the field and combined with lab experiments as well as structural modeling.
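To see the intuition behind varying a known treatment probability, here is a toy numerical illustration (my own stylized model with made-up numbers, not the one in the paper): if effort responds to the believed probability of having received MS, then arms with announced probabilities near zero or one reproduce the open RCT’s behavior, while intermediate probabilities mix the effort and seed channels.

```python
def harvest_kg(treated, effort):
    """Toy production function: baseline + pure seed effect
    + effort effect + an effort-seed interaction.
    All coefficients are made up for illustration."""
    return 7.0 + 2.0 * treated + 1.5 * effort + 0.5 * treated * effort

def effort(p_believed):
    """Toy behavior: effort rises with the believed probability of MS."""
    return p_believed

def expected_harvest(p_announced, treated):
    """Outcome when the subject knows the assignment probability."""
    return harvest_kg(treated, effort(p_announced))

# At p -> 1, treated subjects behave like the open RCT's MS group;
# at p -> 0, untreated subjects behave like its TS group. Comparing
# arms across announced probabilities separates the effort channel
# from the pure seed effect.
for p in (0.0, 0.5, 1.0):
    ms = expected_harvest(p, treated=1)
    ts = expected_harvest(p, treated=0)
    print(f"p={p}: MS={ms:.1f} KGs, TS={ts:.1f} KGs")
```

In this toy model, comparing the MS outcome at p near one against p near zero isolates the effort and interaction terms, which is the kind of variation a double blind experiment with a single, unstated probability cannot deliver.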

Another relevant paper in this area is by Ludwig, Kling, and Mullainathan in the Journal of Economic Perspectives, which discusses how to design experiments to make them as useful as possible for policy purposes. This is the literature within which I’d have preferred Bulte et al. to position their paper. By framing it instead as part of the debate between the randomistas and the anti-randomistas – half of the references are to papers in that debate – the authors are not doing themselves or anyone else any favors.