
Incorporating participant welfare and ethics into RCTs

By Berk Ozler

One of the standard defenses of an RCT proposal to a skeptic is to invoke budget and implementation capacity constraints and argue that since not everyone will get the desired treatment (at least initially), the fairest course is to randomly allocate treatment among the target population. While this is true, it is also possible to maximize participants’ welfare, incorporating their preferences and expected responses to treatment, while designing an RCT that still satisfies the aims of the researcher (identifying unbiased treatment effects with sufficient precision). A recent paper by Yusuke Narita makes enough headway in this direction that development economists should take notice.

[Below, I summarize what I understood the paper to do, including some of the related literature, some of which goes back a long way – especially in clinical trials in medicine. I read this paper somewhat carefully, but it is clear that I have many more papers to read before I really understand this literature well enough to apply it in my own work. My apologies to those much better versed in it than I am; in your case, hopefully, simply becoming aware of this new paper is reward enough for reading on. As for the rest of us: as researchers age, they face questions of tool upgrades. Should I invest in learning Beamer to replace my crappy PPTs? How about R instead of Stata? Somehow, sinking my teeth into this literature seems much more first order than those questions. The paper I discuss here contains a lot, is clearly written, and is therefore a great entry point for a researcher like me.]

In Section 7 of his paper, Dr. Narita discusses existing designs and how they compare to his proposed method. We are all familiar with the classical RCT design, where the aim is to design an experiment with maximal power to test the null hypothesis of no treatment effect. There are also designs that take subjects’ preferences into account: randomized preference trials randomize subjects into two groups, the first of which is assigned entirely to control while the second group is allowed to choose its own treatment. A more generalized version, dubbed selective trials, allows subjects’ treatment probability to increase with their willingness to pay (WTP) for the treatment (a measure of their preference for it relative to control or other treatments). I discussed these briefly at the bottom of this former post. Response-based designs, on the other hand, use a patient’s predicted outcome under the treatment to vary the probability that they are assigned to treatment. None of these designs is new, although they are not yet commonly observed in social science experimentation (Narita provides some recent examples from development economics that at least implicitly take preferences into account).

What is new in Dr. Narita’s paper is that it provides a platform to integrate these two dimensions of subjects’ well-being (their preferences and their expected treatment effects) into the design of an RCT without compromising precise, unbiased estimates of simple treatment effects, while also maintaining (near-) incentive compatibility. In this sense, the proposed method shares “…much of its spirit with Multi-Armed Bandit (MAB) algorithms in computer science, machine learning, and statistics (Bubeck and Cesa-Bianchi, 2012), … which are popular in the web industry, especially for online ads, news, and recommendations.” Both MAB and EXAM attempt to strike a balance between exploration (information) and exploitation (subject or experimenter welfare). However, from my reading, among other differences, the exploitation in Narita’s work is concerned only with subject welfare, and it also differs from MAB in its handling of incentive issues. Finally, if you shut down either the preference or the response concerns (or both), you are back to one of the simpler designs described above, including the classical RCT: Narita’s proposed method thus nests the more familiar RCT designs within it.

So, what is the proposed algorithm? The following is a bit technical, and I definitely don’t understand all the details yet, but it is worth your while, because what is done here seems feasible in a non-negligible set of experiments that development economists design. The researcher needs two pieces of individual-specific information to stand in for preferences and responses to treatment. For the former, Narita chooses WTP, which may either be elicited directly from subjects or obtained from existing data on subjects’ choices and characteristics at baseline. For the latter, you need to observe outcomes with and without the treatment – prior to the current experiment – ideally from a previous experiment with different subjects, but otherwise from observational data. This second requirement may look silly – you need a more classical RCT to design a more ethical one – but there are actually many circumstances, including adaptive and sequential designs, in which the researcher may find herself in exactly such a position.
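To make this second requirement concrete, here is a hypothetical sketch of how one might turn prior data into predicted individual treatment responses: fit separate predictive models to the treated and control arms of an earlier experiment and difference the predictions for the new subjects. This is my own illustration (plain OLS via numpy; any predictive model would do), not a procedure taken from the paper.

```python
import numpy as np

def predict_effects(x_prior, y_prior, treated_prior, x_new):
    """Predict each new subject's treatment effect from a prior experiment.

    Fits one OLS regression on the prior treated arm and one on the
    prior control arm, then returns the difference in predictions
    for the new subjects' baseline covariates.
    """
    x_prior = np.column_stack([np.ones(len(x_prior)), x_prior])
    x_new = np.column_stack([np.ones(len(x_new)), x_new])
    t = np.asarray(treated_prior, dtype=bool)
    y = np.asarray(y_prior, dtype=float)
    # separate fits: outcome model under treatment and under control
    beta_t, *_ = np.linalg.lstsq(x_prior[t], y[t], rcond=None)
    beta_c, *_ = np.linalg.lstsq(x_prior[~t], y[~t], rcond=None)
    return x_new @ beta_t - x_new @ beta_c  # predicted effects
```

With noisy real data these predictions come with substantial uncertainty, which is exactly the concern raised in the questions at the end of this post.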

Given these two pieces of information, what Narita does is allocate a hypothetical budget b to every subject on a computer and let each of them “buy” a probability distribution over treatments given the prices of each treatment. Treatments are price-discriminated, meaning that someone with a higher (better) expected outcome under treatment faces a lower price for it than someone with a lower expected outcome. Treatment probabilities have to be strictly between 0 and 1, and they need to satisfy the implementer’s capacity constraints, so that treatment can be provided to everyone eventually assigned to receive it. Hence, the method treats experimental design as a market design problem (dubbed EXAM by Narita) that maximizes study participants’ welfare subject to the constraint that the researcher obtains as much information (and as strong incentives to reveal WTP) as a standard RCT provides. The end result is Pareto optimal, meaning that no ex ante improvement can be made to someone’s expected utility without hurting someone else’s. Note that this is generally not true for the classical RCT.
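For intuition, here is a toy numerical sketch of such a pseudo-market. To be clear, this is not Narita’s actual algorithm: the demand function, the linear price discount for predicted benefit, and all parameter values are my own illustrative assumptions. Each subject spends a common virtual budget on a treatment probability at a personalized price, and a base price adjusts (here by bisection) until expected take-up matches the implementer’s capacity.

```python
import numpy as np

def exam_sketch(wtp, effect, capacity, budget=1.0, eps=0.05,
                discount=0.5, tol=1e-6):
    """Toy EXAM-style lottery (an illustrative sketch, not Narita's method).

    Subjects with a higher predicted treatment effect face a lower
    personalized price; demand for treatment probability rises with
    WTP and falls with that price. Probabilities are clipped to
    (eps, 1 - eps) so everyone keeps a strictly positive chance of
    landing in either arm, preserving identification.
    """
    wtp = np.asarray(wtp, dtype=float)
    effect = np.asarray(effect, dtype=float)

    def demand(base_price):
        # price discrimination: cheaper for subjects who benefit more
        price = np.maximum(base_price - discount * effect, 1e-12)
        return np.clip(budget * wtp / price, eps, 1.0 - eps)

    # bisect the base price until expected take-up matches capacity
    lo, hi = 0.0, 1e6
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if demand(mid).sum() > capacity:
            lo = mid  # aggregate demand too high: raise the price
        else:
            hi = mid
    return demand(0.5 * (lo + hi))
```

Running this on four hypothetical subjects with a capacity of two treatment slots yields assignment probabilities that sum to the capacity while tilting towards subjects with higher WTP and higher predicted benefit.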

The reader familiar with block-stratified RCTs will notice that EXAM does something very similar. Subjects with the same WTP and expected outcome all have the same probability distribution over treatments (because they all have the same budget and solve the same maximization problem), and within each such stratum, treatments are distributed in exactly those shares – shares that are tilted towards those with a stronger preference for, and a greater expected benefit from, treatment. These strata provide the experimenter with conditional average treatment effects (CATEs) – conditional on the observable propensity scores. A weighted average of all the CATEs gives you the (unconditional) ATE.
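That last step can be sketched in a few lines, under the simplifying assumption that the propensity scores take finitely many values: within each propensity-score stratum, the difference in means is a CATE, and the stratum-share-weighted average of the CATEs recovers the ATE.

```python
import numpy as np

def strata_ate(outcome, treated, pscore):
    """Aggregate within-stratum CATEs into the unconditional ATE.

    Within each propensity-score stratum, the treated-control
    difference in means is a CATE; CATEs are then weighted by
    stratum population shares.
    """
    outcome, treated, pscore = map(np.asarray, (outcome, treated, pscore))
    ate, n = 0.0, len(outcome)
    for p in np.unique(pscore):
        m = pscore == p  # members of this stratum
        cate = (outcome[m & (treated == 1)].mean()
                - outcome[m & (treated == 0)].mean())
        ate += (m.sum() / n) * cate  # weight by stratum share
    return ate
```

For example, with two equal-sized strata whose CATEs are 1 and 3, the estimator returns an ATE of 2.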

Dr. Narita shows that EXAM is asymptotically incentive compatible (i.e., incentive compatibility improves as the study grows large), that unbiased estimates can easily be obtained, and that the power comparison with the classical RCT is theoretically ambiguous, with EXAM potentially more precise. Given these theoretical uncertainties, Narita then tries his algorithm on data from a previous RCT on protecting spring water sources from contamination and recovers similar effect sizes with a small decline in precision. More importantly, and to the point, there are substantive improvements under EXAM in average WTP among those assigned to treatment compared with the RCT, and expected benefits (in reduced diarrhea incidence) are also meaningfully higher.

There are other potential benefits to employing EXAM when possible: recruitment of subjects may be easier than under the classical RCT, which pays attention to neither welfare criterion within its target population. Among those recruited, compliance with assigned treatment may go up (although I find this empirically ambiguous if subjects know their treatment status, as opposed to receiving a placebo). Finally, attrition from the study may also be lower.

I have many questions, but pose just a couple here. Dr. Narita, or anyone else interested and/or knowledgeable, is more than encouraged to chime in via the comments section or a guest blog post:

  • Predicting outcomes for individuals is a noisy business. EXAM uses outcome predictions based on observable characteristics, but these must come with large uncertainty. The paper suggests that this is not a huge deal and refers the reader to one of the appendices, but I did not get a sense of whether such uncertainty about predicted outcomes (and preferences) is incorporated into the empirical example, or of how large such errors would be given the sample size of the earlier RCT or observational data set. The welfare optimality of EXAM depends on the experimenter eliciting preferences and predicting outcomes well…
  • The spring water protection program seems like a cluster-RCT, where water sources are assigned to treatment and people in their catchment areas are surveyed. But, unless I missed it, in the EXAM version, individuals are assigned to treatment based on their preferences and expected outcomes, not springs. Is this a problem?

I really enjoyed reading this paper and thinking through, in accessible detail, all the puzzles it discusses that are relevant for my own work. I leave you with this paragraph from the paper:

“Proposition 2 uses two welfare measures e_it and w_it, one outcome- or treatment effect-based and one WTP-based. Each has an established role in economic welfare analysis. The medical literature more frequently studies treatment effects but also acknowledges that patients often have heterogeneous preferences for treatments (even conditional on treatment effects). This is especially the case for psychologically sensitive treatments like abortion methods (Henshaw et al., 1993) and depression treatments (Chilvers et al., 2001). In response to these findings, a US-government-endorsed movement tries to bridge the gap between evidence-based medicine and patient-preference-centered medicine (Food and Drug Administration, 2016). According to advocates, “patient-centered care (...) promotes respect and patient autonomy; it is considered an end in itself, not merely a means to achieve other health outcomes” (Epstein and Peters, 2009). My welfare criterion echoes this trend and accommodates both outcome- and preference-based approaches.”

Comments and further contributions are most welcome…