As this is a long post and only those who are really interested in this topic will get through the whole thing, I moved my summary and my main comment to the top – like one would do at an academic seminar with the first slide: I think that many of the criticisms or worries Martin raises in his post are the exception rather than the norm in development economics. Many of us spend weeks if not months worrying about these issues and adjust and readjust our study designs accordingly (while the people and organizations who asked for our help wait patiently while these discussions go on, seemingly ad infinitum). But, if more checks and balances would help, no one could argue against them. If that is to happen, let’s hold everyone (including governments and the World Bank) to the same standards, not just researchers conducting RCTs.
Suppose that I have a colleague in Malawi whose mother lived in a village near the millennium village in Malawi. When she asked her son why her village was not chosen to be a millennium village, what is the answer? I would have been comfortable saying something along the lines of: “All villages fitting certain poverty-related criteria were identified and a lottery was conducted to determine which ones can access the limited resources.” Many of the concerns raised by Martin about RCTs apply to project-level decisions: who holds governments, NGOs, and large donor organizations to account ethically on these types of decisions? One could argue that the presence of an RCT could make an intervention better (and more ethical) than it was without it…
Now that I have this mini rant out of the way, I want to address four issues that Martin raises:
- RCTs are the only way to learn anything in economics (what?);
- Ethically contestable RCTs;
- Using local knowledge to inform treatment assignment rather than using random assignment; and
- RCT design in light of important ethical considerations
1. RCTs are the only way to learn anything in economics
“Far more problematic is either of the following:
- Any presumption that an RCT is the only way we can reliably learn. That is plainly not the case, as anyone familiar with the full range of (quantitative and qualitative) tools available for evaluation will know.”
The reason I ask is that my impression of the modus operandi, when someone approaches me or one of my colleagues with an idea for evaluating a program, is that we do consider random assignment, but we also consider other causal inference methods: sometimes interventions cannot be randomly assigned for a variety of reasons, while other times it is undesirable to do so. In such cases, I start going down the list: could an ‘as good as randomized’ regression discontinuity design be employed? If not, what are the prospects for a good diff-in-diff exercise? How about matching? In some cases of a one-off policy change in one state/region, I might think of employing synthetic control methods. I am not going to apologize for starting by exploring the possibility (and sensibility) of random assignment, but if that option is ruled out, I don’t stop: if the question at hand is sufficiently interesting or important, you just keep exploring the identification methods and explain to your policy collaborator the conditions/assumptions under which something useful can be learned about her question given the chosen method.
Martin also singles out RCTs for changing how a program is implemented. I don’t think that this is exactly right. For example, if you were to think of evaluating a proxy-means tested anti-poverty program at the eligibility cutoff, you might consider adjusting your score to ‘behave,’ i.e. not be lumpy but more smooth or continuous. That could affect how many baseline characteristics you use to create a score, which might alter the beneficiary list (in an ethically defensible way). If you were thinking of using a diff-in-diff method, you’d want to make sure that there are (preferably multiple rounds of) pre-intervention data on the target population, so that the plausibility of the parallel trends assumption can be checked. But, baseline data collection might delay your intervention. I’d say that whenever policymakers (or organizations piloting new interventions) are planning to evaluate, how they implement their new program will inevitably change – even if only because they invite people with differing viewpoints on the research questions, methods, etc. to the table. That type of ex-ante questioning (akin to what Cartwright and Hardie call a pre-mortem) is bound to change intervention details. That’s a good thing…
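To make the diff-in-diff point concrete, here is a minimal sketch in Python of why multiple pre-intervention rounds matter. All of the numbers and the function names (`diff_in_diff`, `pre_trend_gap`) are made up for illustration; this is a toy comparison of group means, not a full regression-based analysis with standard errors.

```python
from statistics import mean


def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Two-group, two-period difference-in-differences on group means."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


def pre_trend_gap(treated_rounds, control_rounds):
    """Gap between the groups' changes across two pre-intervention rounds.

    A value near zero is consistent with parallel trends before the
    intervention; it does not prove that they would have continued.
    """
    treated_change = mean(treated_rounds[1]) - mean(treated_rounds[0])
    control_change = mean(control_rounds[1]) - mean(control_rounds[0])
    return treated_change - control_change


# Hypothetical outcome data (e.g. monthly consumption), invented for this sketch.
treated_pre1 = [10.0, 12.0, 11.0]
treated_pre2 = [11.0, 13.0, 12.0]
treated_post = [15.0, 17.0, 16.0]
control_pre1 = [9.0, 10.0, 11.0]
control_pre2 = [10.0, 11.0, 12.0]
control_post = [11.0, 12.0, 13.0]

# With only one baseline round, this check would not be possible.
gap = pre_trend_gap([treated_pre1, treated_pre2], [control_pre1, control_pre2])
did = diff_in_diff(treated_pre2, treated_post, control_pre2, control_post)

print(f"pre-trend gap: {gap:.1f}")
print(f"diff-in-diff estimate: {did:.1f}")
```

In this invented example both groups grow by the same amount between the two baseline rounds, which is the kind of evidence a second round of pre-intervention data buys you, at the cost of a later start.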
2. Ethically contestable RCTs
It would have been useful if Martin had given more specific examples of ethically objectionable cases vs. ethically OK ones. As he himself states in his piece (and given that he also uses RCTs to answer certain questions), RCTs are not inherently unethical.
For example, it seems that pilot programs satisfying the principle of equipoise (with equipoise carefully redefined for the social sciences, as discussed in Martin’s piece with reference to David’s earlier piece) certainly fit the bill for the OK category. See Mexico’s evaluation of PROGRESA back in the late 1990s or their Ministry of Education’s current experimentation with student and teacher incentives for learning.
Another example might be RCTs for interventions that would have never happened without the specific research question. Suppose that you, as the researcher, raised funds for cash transfers as part of a research proposal. Martin argues that if you know who needs cash the most, it might be unethical to transfer funds by random assignment for your study’s sake. But, what if the cash transfers would not have taken place in the first place if it were not for this research proposal? Is it still unethical to distribute this windfall by random assignment? If it is, then you should not have been granted those funds in the first place…It’d be hard to blame the researcher for thinking that her research grant will make some people significantly richer while leaving others no worse off. She might even generate some useful knowledge in the end…
[A side note here: I want to repeat something about the notion of equipoise in economics – even though others have made similar arguments recently, the most excellent of which was David’s post on the topic. As social scientists dealing with constrained optimization, we cannot be satisfied with the fact that something has been shown to work, in the sense that the null hypothesis of no effect has been rejected. As my colleague Winston Lin pointed out in a private exchange, RCTs aim to identify an average treatment effect with confidence intervals. It’s that effect size, combined with intervention costs and compared with alternative interventions, that concerns the policymaker. So, yes, we have all rolled our eyes at plain vanilla RCTs that seem to state the obvious: giving money to poor people increased total household consumption, or giving eyeglasses to students with poor eyesight improved learning. Stated that way, they do sound silly: but we should give the researchers an opportunity to make the case that there is something more valuable to be learned from their proposed experiment. Martin states that such scrutiny of their studies ‘…is clearly not the norm at present.’ I am going to disagree: I estimate (without hard evidence to back me up) that such projects form a small share of all development RCTs and are the exception rather than the norm. Martin, along with many of us and many critics of randomistas, is a reviewer of funding proposals at various stages. If useless or unethical research proposals receive public funding, we are all to blame. And, I agree with Martin that we should strive to do better…]
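As a rough illustration of the side note above, the following Python sketch computes a difference-in-means estimate of the average treatment effect with a confidence interval, and then sets it against cost. Everything here is an assumption made for the example: the outcome data, the per-person costs, the effect size of the alternative intervention, and the helper names `ate_with_ci` and `cost_per_unit_effect`. The interval uses a normal approximation, which is crude for samples this small.

```python
from math import sqrt
from statistics import mean, variance


def ate_with_ci(treated, control, z=1.96):
    """Difference in means with a normal-approximation 95% confidence interval."""
    effect = mean(treated) - mean(control)
    # Standard error for two independent samples with unequal variances.
    se = sqrt(variance(treated) / len(treated) + variance(control) / len(control))
    return effect, (effect - z * se, effect + z * se)


def cost_per_unit_effect(cost_per_person, effect):
    """Cost of one unit of outcome gain per person; requires a positive effect."""
    if effect <= 0:
        raise ValueError("effect must be positive for this ratio to be meaningful")
    return cost_per_person / effect


# Hypothetical outcome data, invented for this sketch.
treated = [52.0, 55.0, 58.0, 61.0, 54.0]
control = [50.0, 51.0, 49.0, 53.0, 47.0]

effect, (ci_low, ci_high) = ate_with_ci(treated, control)

# Hypothetical costs: this intervention costs 30 per person; an assumed
# alternative costs 12 per person and has an assumed effect of 3.0.
this_ratio = cost_per_unit_effect(30.0, effect)
alt_ratio = cost_per_unit_effect(12.0, 3.0)

print(f"ATE = {effect:.1f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
print(f"cost per unit of effect: this = {this_ratio:.1f}, alternative = {alt_ratio:.1f}")
```

In this made-up case the intervention clearly "works," yet the alternative delivers a unit of impact more cheaply, which is the comparison a policymaker facing a budget constraint actually cares about.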
3. Using local knowledge to inform treatment assignment rather than using random assignment
Martin also states that the researcher can use ‘knowledge on the ground’ about who needs the intervention the most. He is implicitly referring to this paper by Alatas et al. (2012), where local targeting is shown to be as effective as top-down targeting using proxy means tests. Point taken (and David will elaborate more on this study tomorrow), but two counterpoints: first, many econ RCTs, like the Alatas et al. paper itself, are cluster RCTs, meaning that the ranking would have to be made across communities rather than individuals. That is harder to do with local knowledge and requires data at the community level, like poverty maps. The cost of targeting is most certainly part of the cost-effectiveness calculations to follow, and Martin knows better than anyone in the world that, at some point, targeting can become more expensive than not targeting. Second, the typical operation of many programs, left to the devices of an NGO or the government (due to capacity or budgetary constraints), would miss not only the neediest communities but also the neediest individuals within program communities when program enrollment is done via village meetings, for example. Such meetings, to which the target population is invited, can end up producing beneficiary lists made up of individuals who are less poor, more informed, and more connected than the average person in the target population (please see this paper for an example). RCTs will generally list the target population going door to door, avoiding this type of selection bias.
4. RCT design in light of important ethical considerations
Martin concludes by stating:
“There may well be design changes to many RCTs that could assure their ethical validity, such as judged by review boards. One might randomly withhold the option of treatment for some period of time, after which it would become available, but this would need to be known by all in advance, and one might reasonably argue that some form of compensation would be justified by the delay...
The experiment might not then be as clean as in the classic RCT—the prized internal validity of the RCT in large samples may be compromised. But if that is always judged to be too high a price then the evaluator is probably not taking ethical validity seriously.”
These are important points, some of which speak to issues at the core of why we do random assignment in the first place, so some discussion is in order.
First, delayed treatment designs, such as those Martin describes above, complete with compensation, exist: for example, I am involved in one such intervention where the delayed treatment group gets additional resources as compensation for drawing the short straw. However, I would agree that they are not the norm. Governments and NGOs should consider this type of design more often, even if it means that the overall endeavor may end up being costlier. This is less of an issue for true pilot programs: if the pilot is successful, the control group will receive the intervention anyway. But, what is an NGO, which might have raised limited funds from an interested donor, to do? Without the RCT, they were going to implement the intervention in, say, 100 villages – period. Can they afford to treat another 100 villages after endline? If they cannot, is this unethical? Does the fact of conducting an RCT now require the NGO to provide the intervention to another 100 communities at the end of the trial? If the answer is yes, donors and implementation agencies will have to plan budgeting accordingly ahead of time. Or, they can choose another method of evaluation, but it is not clear that other methods absolve us of similar issues. Suppose that you’re using RDD to evaluate program impacts: should you not provide treatment to the ‘control’ group in your study (people/communities who were deemed barely ineligible but who were in the study sample and provided lots of data for the evaluation) after the intervention is over? At the threshold, they are equally deserving, just unlucky, hence Lee and Lemieux’s term ‘as good as randomized’…
Second, the tradeoffs between a clean RCT and providing delayed treatment with full information are all too real and can, at the extreme, override the choice to conduct an RCT to begin with. Many of us prefer RCTs to get a clean answer to a particular question of policy interest. Here are a few examples of conflicts that arise between study design and provision of delayed treatment with full information:
- There are limits on how much you would like to (or can reasonably) delay treatment to the control group, which may be shorter than the time required to see full impacts. Often, we are not interested in the immediate impacts but sustained longer-term ones. If you delay treatment for 12 months for an intervention where the ideal endline is 48 months after baseline, you will not get an answer to your original question experimentally. You can answer the question of 12-month effects, as well as 36- vs. 48-month effects, but they are different questions than your original one. Thinking about this stuff ahead of time also allows you to put quasi- or non-experimental pre-analysis plans in place, but that’s a topic for another day…
- The fact of waiting for the intervention can change behavior. For example, in a transfer program, the control group might (if they can) borrow against the impending transfers to smooth consumption (or investment).
- How much information should be provided to the control group can be a grey area. Suppose that you have a cluster-randomized control trial, in which towns are the unit of intervention, each of which contains the eligible target group. At the study enrollment stage (i.e. during informed consent), each individual is given information about the study and told of the possibility of immediate or delayed treatment via random assignment (for which they would provide separate consent if assigned to a treatment group). Does the researcher have an obligation to provide a detailed account of every aspect of the planned intervention or can the consent documents provide a generic description of the study and intervention goals without providing more details? In practice, such information can make a difference in findings, and there is a definite tension between providing the information required for informed consent and interpreting the impact estimates. [Jed has written about the admissibility of misleading study subjects here.]
- Delayed treatments with full information can still be unethical. Sometimes, any delay may be too much of a burden to the beneficiaries. That could be true for beneficiaries who are elderly, otherwise in poor health, or in need immediately. In such cases, as Martin suggests, random assignment into even a short-delay-treatment group has to be abandoned…
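The timing arithmetic in the first bullet above can be spelled out with a small Python sketch. The function name `experimental_contrasts` is hypothetical, and the sketch assumes that the immediate group is treated from baseline, that treatment is continuous once it starts, and that a survey taken at the delay date happens just before the delayed group begins receiving the intervention.

```python
def experimental_contrasts(delay_months, survey_months):
    """Describe what each survey round can identify in a delayed-treatment design.

    Returns tuples of (survey month, months of exposure in the immediate group,
    months of exposure in the delayed group, description of the contrast).
    """
    contrasts = []
    for month in survey_months:
        immediate_exposure = month  # treated from baseline (assumption)
        delayed_exposure = max(0, month - delay_months)
        if delayed_exposure == 0:
            label = f"{immediate_exposure}-month effect vs. no treatment"
        else:
            label = f"{immediate_exposure} vs. {delayed_exposure} months of exposure"
        contrasts.append((month, immediate_exposure, delayed_exposure, label))
    return contrasts


# The example from the text: a 12-month delay with an ideal endline at 48 months.
rounds = experimental_contrasts(delay_months=12, survey_months=[12, 48])
for month, immediate, delayed, label in rounds:
    print(f"survey at month {month}: {label}")
```

Under these assumptions, the 48-month survey no longer has an untreated comparison group, so the original question (the 48-month effect relative to no treatment) is not answered experimentally.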
Stay tuned for David and Markus’s comments tomorrow and Thursday, respectively. Jed has had multiple blog posts on the subject of ethics (see, in addition to the link above, here and here), so he may be taking a pass on this one…