I tried to come up with a catchier title for my blog post, but the authors have already titled their paper perfectly: Marcella Alsan and co-authors have a new working paper (apologies, gated, couldn’t find a version on authors’ websites), titled “Mean reversion in randomized controlled trials: Implications for program targeting and heterogeneous treatment effects.” It’s a problem many of us face but don’t have a lot of good ways to deal with…
Suppose that you want to evaluate a mental health intervention for people who are currently at risk of anxiety and depression. The goal at the end of the treatment is to get them to a place where we are no longer worried about them, i.e., no proposed treatment actions – not even watchful waiting. So, you have a binary cut-off for people to be screened into the trial (moderate or high risk of depression): how should you go about targeting people for treatment (and/or for your trial – the two sets don’t have to be identical, as I will touch on later)?
It turns out that this type of setting is quite common in RCTs: Alsan et al. give examples (hypothetical and real) of programs targeted to subjects with high (or low) lagged values of an outcome at a particular point in time (a study enrollment date, a test, or an exam): students who performed poorly on an assessment, unemployed individuals, ER super-utilizers who frequently visit hospitals, and – the subject of the trial they are analysing – individuals with high blood sugar.
What all these cases might have in common is the possibility that an individual’s outcome is transient: if we’re catching someone at a low point in the distribution of outcome values, there is a chance that they got there recently and will bounce back soon – possibly within the study window. Such mean reversion implies that individuals experiencing an Ashenfelter’s dip (or its inverse) can end up in programs that they don’t actually need. That is bad for everyone – the subject, the implementer, and the researcher.
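To see the mechanics, here is a minimal simulation sketch in Python – entirely made-up numbers, not anything from the paper: each person has a stable long-run level plus transient noise, and we screen people in based on a single noisy reading.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Each person has a stable long-run level plus fresh transient noise at each visit.
long_run = rng.normal(8.0, 1.5, n)            # person-specific "true" level (illustrative)
baseline = long_run + rng.normal(0, 1.0, n)   # noisy reading at enrollment
followup = long_run + rng.normal(0, 1.0, n)   # new noise later on, with NO treatment at all

for cutoff in [8, 9, 10, 11]:
    screened_in = baseline >= cutoff
    decline = baseline[screened_in].mean() - followup[screened_in].mean()
    print(f"cutoff {cutoff}: mean at enrollment {baseline[screened_in].mean():.2f}, "
          f"untreated decline {decline:.2f}")
```

Even without any treatment, the screened-in group improves on average, and the higher the cut-off, the larger the purely mechanical decline – the pattern the authors document in the real data below.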
Very quickly: the trial that failed (on average) …
The authors evaluate a food-as-medicine program designed to treat food-insecure diabetic patients. The eligibility criterion is HbA1c ≥ 8, which signifies elevated blood sugar averaged over the past 2–3 months. The outcome is HbA1c at six months. The average lagged outcome was approximately 10.3 in both the treatment and the control group at enrollment. By all accounts, take-up of and compliance with the program were high. But HbA1c levels in the two groups were indistinguishable from each other at the six-month follow-up, at 8.8. What happened?
Without further analysis, the finding is consistent with both (a) the program having no effect on anyone, and (b) the program having an effect on a subgroup of subjects who are not “always improvers.” While this type of treatment might have been ineffective for some (never improvers), it might have worked on “responders” to the treatment. These three groups, along with “derailers,” i.e., those who would have gotten worse with treatment, form the four groupings of people defined by their potential outcomes in the Imbens and Angrist (1994)/Angrist, Imbens, and Rubin (1996) framework to estimate local average treatment effects. Fortunately, if we assume away the existence of derailers, we can investigate the share and the mean characteristics of the responders, as well as the mean characteristics of the always improvers. That will help with targeting more effectively…
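To make the accounting concrete, here is a back-of-the-envelope sketch in my own notation (the paper may define improvement and estimate these quantities differently): let $I_i$ indicate that subject $i$’s HbA1c improves over the study window and $T_i$ denote random assignment. Ruling out derailers,

$$
\pi_{\text{always}} = \Pr(I=1 \mid T=0), \qquad
\pi_{\text{responders}} = \Pr(I=1 \mid T=1) - \Pr(I=1 \mid T=0),
$$

and, for any baseline characteristic $X$,

$$
\mathbb{E}[X \mid \text{responder}] = \frac{\mathbb{E}[X\,I \mid T=1] - \mathbb{E}[X\,I \mid T=0]}{\pi_{\text{responders}}}, \qquad
\mathbb{E}[X \mid \text{always improver}] = \frac{\mathbb{E}[X\,I \mid T=0]}{\pi_{\text{always}}}.
$$

All of these are estimable from the experiment, because assignment is random and derailers are assumed away.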
What did the study population look like before and after the enrollment date?
The clinics had information on HbA1c levels for a set of patients – tested every six months going back a few years. Selecting a mock enrollment date and applying various eligibility cut-offs, the authors plot the time trend of HbA1c scores over a period of 36 months:
We can see that, for each cut-off point between 8 and 11, the mean outcome indicator is at its peak at the mock enrollment date, declining by 10–20% – back to pre-enrollment levels or lower – within 12–18 months of the mock eligibility date. The higher the cut-off, the steeper the mean reversion…
Seeing this figure made me think, “wow, that’s cool. In many settings, we should probably target people with consistently high levels of the lagged outcome indicator, rather than people whose level happens to be high at a single point in time with no knowledge of their prior trend (there may well be other complementary eligibility criteria that would help).”
That thought was reinforced by this figure Alsan et al. produce next:
The program had no effect on those whose HbA1c was on an upward trajectory prior to enrollment; it had a large effect (a 1.4-point reduction) on those with a more or less stable trajectory; and it had a non-negligible (a 0.5-point increase) but imprecise effect on those who were on a downward trajectory – a finding the authors do not discuss.
A couple of thoughts here: first, many of us don’t have multiple pre-baseline data points in the trials that we design. This is despite the fact that David McKenzie made the case for “more T in experiments” more than 12 years ago. And, with a lot more administrative and other data sources available today, we are more likely to have this type of trend data before baseline. That would allow us to target the program more efficiently – potentially by excluding a good portion of the always improvers (a toy version of such a screen is sketched below).
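As a toy illustration of that kind of screen – the column names are invented and the rule (elevated at each of the last three semi-annual tests) is just one possibility, not something the paper proposes:

```python
import pandas as pd

# Toy HbA1c history: one row per patient-test (all names and values invented).
hba1c_history = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "test_date": pd.to_datetime(["2023-01-01", "2023-07-01", "2024-01-01"] * 3),
    "hba1c": [9.1, 9.4, 9.0,   # patient 1: stably elevated
              7.2, 7.5, 8.6,   # patient 2: a single recent spike
              8.8, 8.1, 7.4],  # patient 3: already trending down
})

# Keep each patient's three most recent tests.
recent = (hba1c_history.sort_values("test_date")
                       .groupby("patient_id").tail(3))

# Screen in only if HbA1c >= 8 at every recent test: stably elevated,
# rather than elevated at a single reading.
eligible = recent.groupby("patient_id")["hba1c"].min() >= 8
print(eligible)   # only patient 1 screens in under this toy rule
```

Under a rule like this, patients 2 and 3 – the likeliest always improvers – would not be screened in, though as the next point argues, they could still be offered treatment outside the trial.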
Second, we have to think carefully about screening into the trial vs. screening into treatment. It may be undesirable to discourage or deny treatment to those above a certain cut-off just because they have recently gotten to that point rather than having been there stably for a while. In such cases, the expectation that the subject might be experiencing an Ashenfelter’s dip (or its inverse) might be reason enough to exclude them from the trial, but the subject could then be counselled and given an option for treatment – based on their doctor’s professional judgement and patient preferences. If that sounds a lot like the mammography guidelines of the U.S. Preventive Services Task Force, that’s because it is. Amanda E. Kowalski has a wonderful paper (REStud 2022) on selection into mammography services: she argues that it is possible to examine the evidence of selection and treatment effect heterogeneity to develop better guidelines “to target treatments toward individuals most likely to benefit from them and away from individuals most likely to be harmed by them.”
Of course, we first need the badly targeted clinical trials to generate these data before we can improve the targeting. But the basic point remains: better targeting can make programs more effective and our trials more powerful, and it may require some lagged trend data on the outcome indicator…
Can we target better?
That leaves one question: can we predict the changes in the control group using baseline data? The authors do this by regressing the “… change in HbA1c from baseline to 6 months on measures observed at baseline: demographics, indicators for taking the four most common diabetes medications, indicators for the top 50 diagnoses, and the patient’s most recent biometrics for LDL cholesterol, triglycerides, weight and blood pressure using leave-out regressions.” Fortunately, some characteristics did predict improvement, including baseline HbA1c (consistent with mean reversion), a diagnosis of heart disease or chronic renal failure in the prior year, and an indicator that the subject was referred to the program by their primary care physician – a marker for active engagement via “usual care.”
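Here is a minimal sketch of the mechanics on made-up data – the variable names, the linear model, and the data-generating process are all my assumptions, not the authors’ code, and the paper’s covariate list and estimator surely differ:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(2)
n = 400

# Made-up experimental sample; names and data-generating process are mine.
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),
    "baseline_hba1c": rng.normal(10.3, 1.5, n),
    "age": rng.normal(55, 10, n),
    "pcp_referral": rng.integers(0, 2, n),
    "weight": rng.normal(90, 15, n),
})
# Untreated 6-month change: partly mean reversion in baseline HbA1c plus noise;
# in this toy world, treatment only helps those who would not improve on their own.
untreated_change = -0.5 * (df["baseline_hba1c"] - 10.3) + rng.normal(0, 1, n)
effect = -1.0 * (untreated_change > -0.5)
df["delta_hba1c"] = untreated_change + df["treated"] * effect

covars = ["baseline_hba1c", "age", "pcp_referral", "weight"]
ctrl = df[df["treated"] == 0]

# Step 1: leave-one-out predictions of the change for control subjects,
# so no one's own outcome informs their own prediction.
loo_pred = cross_val_predict(LinearRegression(), ctrl[covars],
                             ctrl["delta_hba1c"], cv=LeaveOneOut())

# Step 2: predict for treated subjects with the control-fitted model, then split
# everyone into quartiles of predicted change (quartile 1 = largest predicted
# improvement without treatment, in this toy labelling).
model = LinearRegression().fit(ctrl[covars], ctrl["delta_hba1c"])
df["pred_change"] = model.predict(df[covars])
df.loc[df["treated"] == 0, "pred_change"] = loo_pred
df["pred_quartile"] = pd.qcut(df["pred_change"], 4, labels=False) + 1

# Step 3: simple treatment-control contrasts in the change, within each quartile.
for q, g in df.groupby("pred_quartile"):
    te = (g.loc[g["treated"] == 1, "delta_hba1c"].mean()
          - g.loc[g["treated"] == 0, "delta_hba1c"].mean())
    print(f"quartile {q}: estimated effect {te:+.2f}")
```

The leave-one-out step matters because it keeps each control subject’s own outcome from leaking into their predicted improvement; the quartile split in the last step is the analogue of the figure below.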
This is the heterogeneity of treatment effects by the quartiles of the predicted improvement in HbA1c:
We see that the subjects in the bottom two quartiles of predicted improvement, i.e., those with the least expected improvement on their own, are the ones most likely to benefit from the program. While this analysis is exploratory and not pre-specified, it does give some hope for improved targeting (and, as far as I can tell, they did not even use the HbA1c trend data).
The authors conclude by stating that, going forward, machine learning methods could help with this prediction problem, aiding policymakers. We can remain agnostic about the relative power of these methods vs. the simpler prediction techniques that we currently employ. I’ll finish by giving a shoutout to a new working paper by my colleagues Daniel Mahler, Christoph Lakner, and others that “…develops a method to predict comparable income and consumption distributions for all countries in the world from a simple regression with a handful of country-level variables.” They find that “…a simple model relying on gross domestic product per capita, under-5 mortality rate, life expectancy, and rural population share gives almost the same accuracy as a complex machine learning model using 1,000 indicators jointly!” Being interested in the right question and going after it might be preferable to using the latest shiny technique…