# What’s new in the analysis of heterogeneous treatment effects?

If you’re like me, you have been doing heterogeneity analysis a certain way (let’s call it ‘old school’ to be facetious). In the good version of things, you have some prespecified characteristics of the population (X, unaffected by treatment) over which you investigate the heterogeneity of treatment effects (HTE). You either run a regression interacting the treatment with X or, if X is binary, you might show separate regressions for X=0 and X=1. A meaningful and statistically significant coefficient on the interaction term would suggest heterogeneity along that dimension. Hopefully, your prespecified X is either a vector of characteristics suggested by theory or a set of easily observed characteristics that program designers can use to target treatment. In the not-so-great version, you pick some variables and show heterogeneity over them, but for the better part of a decade now you have risked getting raked over the coals by referees and editors who see these choices as ad hoc and p-hacking adjacent. This is not simply out of malice on the part of the reviewers: it is still not uncommon to see papers that find a small average treatment effect (ATE) for the whole population but find effects for a subgroup, which is then highlighted in the abstract as the main finding. Never mind that if you have a precise zero (or small effect) for the ATE and a positive effect for one subgroup, the rest of the population must have a negative effect. Somehow, those negative effects are rarely highlighted, or they are glossed over as ‘weird.’
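To make the ‘old school’ version concrete, here is a minimal Python sketch under a few assumptions of mine: treatment and X are both binary, treatment is randomized, and the function names are hypothetical. In that fully saturated case, the coefficient on the interaction term is just the difference between the two subgroup differences in means. The second function does the accounting behind the ‘never mind’ point: the ATE is a share-weighted average of subgroup effects, so a positive subgroup effect alongside a zero ATE pins down a negative effect for everyone else.

```python
from statistics import mean


def subgroup_effects(y, t, x):
    """Treated-minus-control difference in mean outcomes within X=0 and X=1."""
    effects = {}
    for group in (0, 1):
        treated = [yi for yi, ti, xi in zip(y, t, x) if ti == 1 and xi == group]
        control = [yi for yi, ti, xi in zip(y, t, x) if ti == 0 and xi == group]
        effects[group] = mean(treated) - mean(control)
    return effects


def interaction_coefficient(y, t, x):
    """With binary T and X, the OLS coefficient on T*X in the saturated
    regression y ~ T + X + T*X equals this difference of differences."""
    effects = subgroup_effects(y, t, x)
    return effects[1] - effects[0]


def implied_effect_for_rest(ate, subgroup_effect, subgroup_share):
    """Solve ATE = share * subgroup_effect + (1 - share) * rest_effect."""
    return (ate - subgroup_share * subgroup_effect) / (1 - subgroup_share)


# Toy data: one unit per cell, just to show the mechanics.
y = [1.0, 3.0, 2.0, 7.0]
t = [0, 1, 0, 1]
x = [0, 0, 1, 1]
print(subgroup_effects(y, t, x))         # {0: 2.0, 1: 5.0}
print(interaction_coefficient(y, t, x))  # 3.0

# A subgroup making up 25% of the sample gains 0.3 while the ATE is zero:
# the other 75% must lose about 0.1 on average.
print(implied_effect_for_rest(0.0, 0.3, 0.25))
```

In practice you would run the regression to get standard errors; the sketch only shows what the point estimates are measuring.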

Anyway, that’s how we used to do this. Increasingly, researchers are now opting for machine learning (ML) inference to assess heterogeneity. In this version of the analysis, you stay agnostic as to the source of the heterogeneity (other than specifying the vector space in which to search, i.e., the observed baseline characteristics of your study population) and let the data tell you which groups are more (or less) likely to benefit from treatment. This usually involves estimating conditional average treatment effects (CATE) for each individual non-parametrically, by fitting causal forests (Athey, Tibshirani, and Wager 2019). In this paper (Athey et al. 2021), for example, we define an individual as relatively strongly affected by discounts if their estimated CATE on LARC adoption is above the median estimated CATE in the data, and relatively weakly affected otherwise. We hold out folds of data when estimating the CATE, so that an individual's own outcome does not influence their subgroup assignment. We then calculate the ATE for each subgroup and see whether the two are substantially (and significantly) different from each other. This is similar to the methods discussed in Chernozhukov et al. (2018) and Athey and Wager (2019). With this type of analysis, you might now know that there are some subgroups worth targeting the treatment to, but you have not shown who they are. So, you also need to describe or compare the salient characteristics of these subgroups to complete the picture. In the example above, we find that younger women, who are more likely to be single, to be students, and to wish to delay pregnancy by at least three years, are more likely than others to respond to discounts for contraceptives. It is possible, however, that the patterns found in your data, especially high-dimensional ones, will not be very useful in easily describing a population.
Even if you can describe them, it may not be feasible for the program designers to use that information to target the treatment. This is related to some of the ML skepticism that David mentioned in his post last Friday, which links to this really nice discussion between Andrews, Angrist, and Imbens on how machine learning will impact economics. In our case, we’re lucky enough to be able to suggest that policymakers target on age (people may lie a bit about it, which is fine) and on recent birth/post-partum status (people aren’t going to have babies just to get a discount on contraceptives). Normally, we might not be able to get such clean recommendations…
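The held-out median split described above can be sketched as follows. This is not the code from Athey et al. (2021): to keep it self-contained, I swap the causal forest for a deliberately crude stand-in learner (one fitted line per treatment arm on a single covariate), I assume a randomized treatment so that a within-subgroup difference in means estimates the subgroup ATE, and all function names and the simulated data are hypothetical. What it does preserve is the logic: each unit's CATE is predicted by a model that never saw that unit's outcome, and subgroups are then formed by comparing those predictions to their median.

```python
import random
from statistics import mean, median


def fit_line(x, y):
    """Ordinary least squares with one regressor; returns (intercept, slope)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - slope * mx, slope


def fit_cate(y, t, x):
    """Stand-in CATE learner (NOT a causal forest): one line per arm."""
    a1, b1 = fit_line([xi for xi, ti in zip(x, t) if ti == 1],
                      [yi for yi, ti in zip(y, t) if ti == 1])
    a0, b0 = fit_line([xi for xi, ti in zip(x, t) if ti == 0],
                      [yi for yi, ti in zip(y, t) if ti == 0])
    return lambda xi: (a1 - a0) + (b1 - b0) * xi


def out_of_fold_cate(y, t, x, n_folds=5, seed=0):
    """Predict each unit's CATE from a model fit on the other folds only."""
    n = len(y)
    shuffled = list(range(n))
    random.Random(seed).shuffle(shuffled)
    fold = [0] * n
    for position, i in enumerate(shuffled):
        fold[i] = position % n_folds
    cate = [0.0] * n
    for f in range(n_folds):
        train = [i for i in range(n) if fold[i] != f]
        predict = fit_cate([y[i] for i in train],
                           [t[i] for i in train],
                           [x[i] for i in train])
        for i in range(n):
            if fold[i] == f:
                cate[i] = predict(x[i])
    return cate


def diff_in_means(y, t, units):
    """Treated-minus-control mean outcome among the listed units."""
    treated = [y[i] for i in units if t[i] == 1]
    control = [y[i] for i in units if t[i] == 0]
    return mean(treated) - mean(control)


def median_split_ates(y, t, cate):
    """Subgroup ATEs for units above vs. at or below the median estimated CATE."""
    cut = median(cate)
    high = [i for i, c in enumerate(cate) if c > cut]
    low = [i for i, c in enumerate(cate) if c <= cut]
    return diff_in_means(y, t, high), diff_in_means(y, t, low)


# Simulated experiment (hypothetical): the true effect is 2 * x, so it grows with x.
rng = random.Random(1)
n = 2000
x = [rng.random() for _ in range(n)]
t = [rng.randint(0, 1) for _ in range(n)]
y = [1.0 + xi + ti * 2.0 * xi + rng.gauss(0, 0.5) for xi, ti in zip(x, t)]

cate_hat = out_of_fold_cate(y, t, x)
ate_high, ate_low = median_split_ates(y, t, cate_hat)
print(f"ATE, above-median CATE: {ate_high:.2f}; below-median CATE: {ate_low:.2f}")
```

The last step in the text, describing who is in each subgroup, would amount to comparing baseline characteristics across the `high` and `low` groups.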

You might be asking, “what’s new here?” After all, these methods have been around for a few years now and many of you are using them in your studies. One of my co-authors just alerted me to a new feature in R that proposes some new metrics to assess heterogeneity. Given that several people have recently asked me how we estimated HTE in Athey et al. (2021), and that they are likely to be using something like the *causal_forest* function in the R package *grf*, I thought it might be useful to point people to the new function in grf called *rank_average_treatment_effect*, which estimates a rank-weighted ATE, or RATE. This is based on a paper by Yadlowsky et al. (2021).

I like RATE because, as shown nicely in the tutorial/vignette here, one can visually see the pattern of HTE, and also because it serves as a first pass to discern whether there is anything much going on in the data in terms of heterogeneous effects (even though I don’t think Yadlowsky et al., who developed this method, necessarily thought of it that way). Once you estimate your CATEs, you can feed them into the function and estimate what the authors call the Targeting Operator Characteristic, or TOC (so named because you’re using these methods to prioritize/target treatment to certain subgroups). The TOC is a curve that compares the average benefit of treating only the fraction q of units with the highest priority (here, the highest estimated CATEs) to the overall average treatment effect.

If there are HTEs, the curve will start high for the individuals with the highest expected benefit and decline as q grows, until the average effect among the treated fraction equals the ATE at q=1, i.e., when everyone is included. Because the curve is measured relative to the ATE, it reaches zero at that point. This is nice because the area under this curve gives an immediate sense of whether there are HTE in the data or not: if the estimated targeting rule does well in identifying HTE, we’d expect the area under the curve to be large and positive. If it does not do so well, or there are no HTE, the area should be close to zero.
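Here is a minimal Python sketch of how such a curve can be computed, to make the definition concrete. It is not grf's implementation: I assume a randomized experiment with a known treatment probability `p`, and I use simple inverse-probability-weighted scores as noisy but unbiased stand-ins for each unit's treatment effect. The function names are hypothetical.

```python
import math
from statistics import mean


def ipw_scores(y, t, p=0.5):
    """Per-unit scores whose average estimates the ATE in a randomized
    experiment with known treatment probability p (an assumption here)."""
    return [yi * (ti / p - (1 - ti) / (1 - p)) for yi, ti in zip(y, t)]


def toc_curve(priority, scores, grid):
    """For each q in grid: average score among the top-q fraction of units
    (ranked by priority, highest first) minus the overall average score."""
    order = sorted(range(len(scores)), key=lambda i: priority[i], reverse=True)
    ranked = [scores[i] for i in order]
    overall = mean(ranked)
    curve = []
    for q in grid:
        k = max(1, math.ceil(q * len(ranked)))  # number of units treated at q
        curve.append(mean(ranked[:k]) - overall)
    return curve


# Toy example: priorities happen to line up with the effect scores.
priority = [4, 3, 2, 1]
scores = [4.0, 2.0, 1.0, 1.0]
print(toc_curve(priority, scores, [0.25, 0.5, 1.0]))  # [2.0, 1.0, 0.0]
```

In a real application, the `priority` values would be the out-of-fold CATE estimates and the scores would come from the experiment via something like `ipw_scores`. The toy output shows the shape described above: high at small q, falling to exactly zero at q=1.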

The Rank-Weighted Average Treatment Effect (RATE) is a weighted area under this curve. The authors suggest two alternative weightings, one of which has more statistical power to detect large treatment effects on a small subset of individuals, while the other has more power when the HTEs are more diffuse across the population. I like the first one because it helps answer the sharp null question of whether there is anyone who benefits from the program (technically, this is only true if the ATE=0; if not, it is simply answering whether anyone has effects that are substantially different from the ATE, whatever it is). The function allows you to estimate the TOC with CIs and plot it, as well as estimate the RATE using both metrics. Then, describing the population that highly benefits from the program is up to you…
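In grf the two weightings are called AUTOC, which weights every point on the curve equally, and Qini, which weights the point at fraction q by q itself. AUTOC therefore puts relatively more emphasis on the very top of the ranking. A rough Python sketch of the point estimates, evaluating the curve at q = 1/n, 2/n, …, 1, might look like the following. The function name and the discretization are my own assumptions, and this leaves out the standard errors that the grf function reports.

```python
from itertools import accumulate


def rate(priority, scores, target="AUTOC"):
    """Discrete approximation to a weighted area under the TOC curve.

    priority: values used to rank units (e.g., estimated CATEs).
    scores:   per-unit scores that estimate individual treatment effects.
    target:   "AUTOC" (equal weights) or "QINI" (weight q at fraction q).
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: priority[i], reverse=True)
    ranked = [scores[i] for i in order]
    overall = sum(ranked) / n
    running = list(accumulate(ranked))
    # TOC evaluated at q = k/n: mean score of the top k units minus the overall mean.
    toc = [running[k - 1] / k - overall for k in range(1, n + 1)]
    if target == "AUTOC":
        weights = [1.0] * n
    elif target == "QINI":
        weights = [k / n for k in range(1, n + 1)]
    else:
        raise ValueError("target must be 'AUTOC' or 'QINI'")
    return sum(w * v for w, v in zip(weights, toc)) / n


priority = [4, 3, 2, 1]
scores = [4.0, 2.0, 1.0, 1.0]
print(rate(priority, scores, "AUTOC"))
print(rate(priority, scores, "QINI"))

# With no heterogeneity in the scores, both metrics are exactly zero.
print(rate(priority, [1.0, 1.0, 1.0, 1.0], "AUTOC"))  # 0.0
```

A RATE near zero says the ranking does no better than treating people at random, either because there is little heterogeneity or because the estimated CATEs fail to pick it up.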
