My post earlier this week on dissipating effects of cash transfers on adults in beneficiary households has caused not only a fair amount of disturbance in the development community, but also a decent amount of confusion about the three-year impacts of GiveDirectly’s cash transfers, from a working paper by Haushofer and Shapiro (2018) – HS (18) from hereon. At least some, including GiveDirectly itself and some academics, seem to think that one can reasonably interpret the findings in HS (18) to imply that the short-term effects of GD, also by Haushofer and Shapiro (2016) – HS (16) from hereon – were sustained three years post treatment. Below, I try to clear up the confusion regarding the evidence and explain why I vigorously disagree with that interpretation.
In a cluster-randomized trial, which the GD experiment is, you have three main groups in two types of clusters:
T (treatment group): units randomly assigned to treatment in villages (clusters) assigned to treatment
S (spillover group or within-village controls): units randomly assigned to control in villages assigned to treatment
C (pure control group): units in villages assigned to control
The reason C is called the pure control group is that the assumption of no interference between units (more formally, the “stable unit treatment value assumption,” or SUTVA) is allowed to be violated within clusters but assumed to hold across them. Under this assumption, we can get unbiased estimates of treatment and spillover effects:
Intention-to-treat effect: ITT = T-C
Spillovers on the non-treated: SNT = S-C
You can see more on the estimands that can be identified in studies designed this way here. When there is a legitimate worry that one or more mechanisms for interference between units may be at play within clusters, ITT and SNT estimated as defined above will still yield unbiased estimates of treatment and spillover effects (and a total cluster-level effect as the weighted average of the two). In contrast, a treatment effect calculated by comparing treated and control units within villages will overestimate the ITT if benefits to the treated come at the expense of the non-treated (negative spillovers) or if there are general equilibrium effects negatively affecting non-beneficiaries; and it will underestimate the ITT if there are positive spillovers. So, one cannot generally define the intention-to-treat effect to be within-village, i.e. ITT=T-S, unless she is willing to assume that SUTVA holds within villages. Individually randomized trials do this and have to convince the audience that spillovers are unlikely, perhaps using some theory and data.
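To make the bias concrete, here is a minimal simulation sketch. All numbers and the design are hypothetical (not from the GD experiment): with a negative spillover, the within-village comparison T-S overstates the true ITT by exactly the size of the spillover, while T-C and S-C recover the true effects.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical design: 100 villages of 20 households; half the villages
# are treated, and within a treated village half the households are treated.
n_villages, n_hh = 100, 20
treated_village = np.repeat(rng.permutation(n_villages) < n_villages // 2, n_hh)
treated_hh = np.tile(np.arange(n_hh) < n_hh // 2, n_villages) & treated_village

# Illustrative true effects (SD units): +0.30 on recipients,
# -0.15 spillover on non-recipients in treated villages.
tau, spill = 0.30, -0.15
y = rng.normal(0, 1, n_villages * n_hh)
y = y + tau * treated_hh + spill * (treated_village & ~treated_hh)

T = y[treated_hh].mean()                     # treated households
S = y[treated_village & ~treated_hh].mean()  # within-village controls
C = y[~treated_village].mean()               # pure control villages

print(f"ITT = T - C    : {T - C:+.2f}  (targets +0.30)")
print(f"SNT = S - C    : {S - C:+.2f}  (targets -0.15)")
print(f"T - S (biased) : {T - S:+.2f}  (overstates the ITT by ~0.15)")
```

With a negative spillover, T-S exceeds T-C by exactly -(S-C); with a positive spillover the inequality flips, which is the sense in which the within-village contrast is biased in either direction.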
What did HS (16) do?
As I mentioned in my previous blog, HS (16) did define ITT=T-S, i.e. based their headline impact estimates on within-village comparisons. This was done despite the fact that villages, as well as households within villages, were randomized at baseline into treatment and control. So, why did they define ITT that way, knowing that it would be vulnerable to bias in the presence of interference? For reasons that are not clear, listing and baseline data collection were only conducted in treatment villages. The first time households in control villages were listed and surveyed was at the (approximately) nine-month follow-up. This, in turn, caused two issues that HS (16) unnecessarily had to deal with. First, they had to deal with endogenous selection into the pure control group, i.e. the comparability of the control group to the treatment or the spillover group. Second, and perhaps much more importantly, this study design error seems to have caused them to define ITT effects to be within-village, i.e. ITT=T-S, rather than ITT=T-C – as would have been standard. This would have been a fatal blow to many studies, but HS (16) was fortunate to escape that fate.
This crucial identification problem was evident to HS (16) and their audience:
“For the results reported in Table II to provide an unbiased estimate of the treatment effect, within-village spillovers of treatment on non-recipient households must be small” (HS 16, Section IV.B).
This is true: unlike the current interpretation in the GiveDirectly blog, which ignores the large negative spillover estimates, HS (16) realized that to have any chance of getting away with defining ITT in this non-standard way, they not only had to show that spillover effects were statistically insignificant, but also that they were estimated precisely enough to be close to zero – close enough not to matter for their within-village impact findings. In other words, they had to show that minimum detectable spillover effects (MDSE) were small and that spillovers of such magnitudes or larger were not present. They did that:
“More generally, however, we note that most of our spillover effect estimates are relatively precisely measured null effects. This finding alleviates the concern that we have low statistical power to detect spillover effects. The average standard error for the standardized variables is 0.08, which implies that the detectable effect size at a 5% significance level and 80% power was 0.22 std. dev. Thus, we can rule out small spillover effects with relatively high confidence.”
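The quoted figure is easy to verify with the standard back-of-envelope formula MDE = (z_{1-alpha/2} + z_{power}) x SE. With alpha = 0.05, power = 0.80, and the average standard error of 0.08:

```python
from statistics import NormalDist

se = 0.08                     # average standard error quoted in HS (16)
z = NormalDist().inv_cdf      # standard-normal quantile function
mde = (z(0.975) + z(0.80)) * se
print(f"MDE ≈ {mde:.2f} SD")  # ≈ 0.22 SD, matching the quoted figure
```

That is, (1.96 + 0.84) x 0.08 ≈ 0.22 SD, exactly the detectable spillover size the authors report.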
Great – the positive effects must be real if the authors are right. In fact, given this finding, there is no reason for the authors to stop at comparing S to C: they could also compare T with C and define ITT correctly, perhaps with a tad loss in precision due to intra-cluster correlation in outcomes. And, lo and behold, they do, but it is hidden in Appendix Table 38. As one would expect after all the hoops the authors successfully jumped through above (plus lots of work on comparability of control group households, differential attrition, etc.), T-C should be similar in size and significance to T-S (given that S-C is small and precise). And, again, they are:
“In addition, as described above and shown in Online Appendix Table 38, the within-village treatment effects are similar to the across-village treatment effects, in terms of both magnitude and statistical significance.”
Note here that if the only object of interest was the intention-to-treat effect, the authors could have skipped the spillover analysis entirely: they could have defined ITT across villages as being equal to T-C, done exactly the same robustness tests on comparability of T and C, differential attrition, etc., and come to the same conclusions. If, as they should have been, spillovers were of interest, they could have been shown to be mostly non-existent, as was done in the publication. For some reason that remains a mystery to me, this was not done, but that’s OK: they just went about the analysis the really hard way. They could have simply rejected T-C=0; instead, they rejected T-S=0 and then showed that S and C are really, really similar: fine. It really does not matter for the interpretation of the nine-month impacts, i.e. of HS (16), but it does set the stage for the interpretation of HS (18). Because now, you, the reader, have been primed to accept within-village comparisons as valid, i.e. ITT=T-S, for this study. That’s going to be problematic.
Wait, there were spillovers in the short-term?
On October 31, 2015 – after the release of the HS (16) working paper in 2013, but before the eventual journal publication of HS (16) – Haushofer, Reisinger, and Shapiro released a working paper titled “Your Gain is My Pain.” In it, they find large negative spillovers on life satisfaction (a component of the psychological wellbeing index reported in HS 16) and smaller, but statistically significant, negative spillovers on assets and consumption. The negative spillover effects on life satisfaction, at -0.33 SD and larger than the average benefit to beneficiaries, imply a net decrease in life satisfaction in treated villages. Furthermore, the treatment (ITT) effects are consistent with HS (16), but the spillover effects are not. For example, the spillover effect on the psychological wellbeing index in Table III of HS (16) is approximately +0.1, while Table 1 in HRS (15) implies an average spillover effect of about -0.175 (my calculations: -0.05 * (354/100)). There appear to be similar discrepancies between the spillovers on assets and consumption implied by HRS (15) and those in HS (16). I am not sure what to make of this, as HRS (15) is an unpublished paper – there must be a good explanation that I am missing. Regardless, however, these findings of negative spillovers foreshadow the three-year findings in HS (18), which I discuss next.
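For transparency, the parenthetical calculation above can be reproduced as follows. The interpretation is mine and is an assumption: that the HRS (15) coefficient of -0.05 is per USD 100 transferred, and that USD 354 is the relevant average transfer amount.

```python
# Back-of-envelope rescaling of the HRS (15) spillover coefficient.
# Assumption (mine): -0.05 is the effect per USD 100; USD 354 is the
# average transfer amount, so the implied average spillover is:
coef_per_100 = -0.05
avg_transfer = 354
avg_spillover = coef_per_100 * (avg_transfer / 100)
print(f"{avg_spillover:.3f}")  # -0.177, i.e. roughly the -0.175 cited
```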
What does HS (18) find at the three-year follow-up?
As I discussed earlier this week, HS (18) find that if they define ITT=T-S, virtually all the effects they found at the 9-month follow-up are still there. However, if ITT is defined in the more standard manner of being across villages, i.e. ITT=T-C, then, there is only an effect on assets and nothing else. Here is the abstract (emphasis added):
“This paper describes the impacts of unconditional cash transfers distributed on economic and psychological outcomes three years after the beginning of the program. Using a randomized controlled trial, we find that transfer recipients have higher levels of asset holdings, consumption, food security and psychological wellbeing relative to non-recipients in the same village. The effects are similar in magnitude to those observed in a previous study nine months after the beginning of the program. Comparing recipient households to non-recipients in distant villages, we find that transfer recipients have 40% more assets (USD 422 PPP) than control households three years after the transfer, equivalent to 60% of the initial transfer (USD 709 PPP). In contrast, other outcomes do not show significant treatment effects in the across-village analysis, possibly owing to lower power and within-village spillovers. We do find some spillover effects. Households impacted by spillovers have lower consumption and food security than pure control households, perhaps due to the sale of productive assets. Estimates of spillover effects on other outcomes are inconclusive due to differential attrition between spillover and pure control households. We also find little evidence of differential treatment effects depending on the transfer design (whether transfers are made men or women, in monthly payments or a single lump-sum, or a large or small transfer). Thus, cash transfers result in sustained increases in assets. Long-term impacts on other dimensions, and potential spillover effects, remain to be substantiated by future work.”
As you can see, things have now changed: there are spillover effects, so the condition for ITT=T-S being unbiased no longer holds. This is not a condition that you establish once in an earlier follow-up and stick with: it has to hold at every follow-up. Otherwise, you need to use the unbiased estimator defined across villages, ITT=T-C.
To nitpick with the authors here, I don’t buy that lower power is responsible for the finding of no significant treatment effects across villages. Sure, as in HS (16), the standard errors are somewhat larger for across-village estimates than for the corresponding within-village estimates. But the big difference between the short- and the longer-term impacts is the gap between the respective point estimates in HS (18), whereas the two sets of estimates were very stable (due to no/small spillovers) in HS (16). Compare Table 5 in HS (18) with Appendix Table 38 and you will see. The treatment effects disappeared mainly because the differences between T and C are now much smaller than they were at the nine-month follow-up – and even negative for some outcomes.
How did GiveDirectly interpret these findings in their blog?
“So, what’s the problem,” you might ask. “You told us all of this in short form in your last post. Why run through it in more detail?” Well, I, along with others, did get some push back on my interpretation that the nine-month impacts are no longer there. In particular, GiveDirectly got in touch to inform me that I had missed their blog post on HS (18), published on February 14, 2018. This is true: I had missed it and immediately updated my post to set the record straight. Then, I read their post. I could barely believe what I was reading. I am pasting a paragraph from it here, but the whole post is short: please read it in its entirety, so you don’t have to take my word.
“Overall the findings are encouraging. The treatment effects on all the main outcomes (assets, earnings, expenditure, food security, and psychological wellbeing) were sustained after 3 years. Gains on an education index that were not significant at 9 months also becomes significant at 3 years, driven by increased spending on school fees, uniforms, books and supplies. The size of these impacts at 3 years are broadly similar to those at 9 months – in fact, the impact on assets increases significantly, even though the value of assets owned by control households doubled over that time.”
Contrast this with the abstract of HS (18) above. Note, in particular, the lack of detail or nuance in the blog post. Whereas the HS (18) abstract mentions every time which estimate refers to what type of comparison, the above paragraph only gives us great news: all effects are sustained; new positive effects appeared; some effects are even larger now! Sigh…
Under normal circumstances, one could, perhaps, chalk this up to being an issue that has to do with language aimed at a non-academic audience, while summarizing the more formally and carefully-worded HS (18). And I completely would have. On Twitter, I responded to GiveDirectly by telling them that I disagreed with their take and contrasted screen shots of the HS (18) abstract with the paragraph above. They responded that they heard me on the disagreement, and I thought “happy ending.”
Unfortunately, for all sides involved, GiveDirectly, along with a few others, decided to continue the argument that the findings could reasonably be interpreted to imply that the short-term effects were sustained, by pointing out that the spillover effects were not robust to different ways of handling differential attrition and the choice of control sample. That changed the tone of our disagreement: this type of “clutching at straws” to claim sustained effects is qualitatively different from being simply a language issue, it’s actually incorrect. I could not stand by while they used the two main flaws of the study (not collecting baseline data in the control group and, now, differential attrition at the three-year follow-up) to argue that there were sustained effects! Talk about two wrongs making a right…
The problem is they cannot use a “spillovers are not robust to differential attrition” argument, and it is HS (16) themselves, who explained why perfectly:
“For the results reported in Table II to provide an unbiased estimate of the treatment effect, within-village spillovers of treatment on non-recipient households must be small.”
It's getting clearer now, right? The condition for ITT=T-S to be unbiased is NOT that the spillover effects (i.e. S-C) are statistically non-significant, or not robust to this or that specification: they have to be very small and precisely estimated nulls. Instead, what we have at the three-year follow-up are spillover effects that are quite large: just take a look at Table 7 in HS (18). You don’t even need to do that; HS (18) tell us that they find some spillover effects - right there in the abstract!
If we’re trying to say something about treatment effects, which is what the GiveDirectly blog seems to be trying to do, we already have the estimates we want – unbiased and with decent power: ITT=T-C. HS (18) already established a proper counterfactual in C, so just use that. Doesn’t matter if there are spillovers or not: there are no treatment effects to see here, other than the sole one on assets. Spillover estimation is just playing defense here - a smoke screen for the reader who doesn’t have the time to assess the veracity of the claims about sustained effects.
The Lee bounds on spillover effects in Table 8 have nothing to do with any of this, either: if we were primarily interested in establishing whether spillover effects exist in this setting or not, we could argue over how to interpret Table 8 and might reasonably conclude that the evidence is suggestive of negative spillovers – robust to differential attrition under one scenario regarding the control group sample (remember the original study design mistake?) and not robust in two alternative samples. Again, perhaps cause for some worry about effects on non-beneficiaries, but not relevant to the question of sustained treatment effects …
1. If you adopt, as you should, the standard definition of ITT across clusters, i.e. ITT=T-C, you have to conclude that there are no effects of UCTs on beneficiary households on any of the measured outcomes but assets at the three-year follow-up.
2. There may also be large, negative spillover effects on non-beneficiaries in treatment villages, but these are not robust to bounding the estimates for differential attrition and the choice of control sample used in the analysis.
3. If you insist on adopting the within-village comparison as your ITT, go back to 1.
As a lay person, who for some reason reads WB economist blogs, I think I managed to understand this. And it seems to me that even the abstract, not just the blog, misleads the reader by leading with T-S when, as the abstract itself then goes on to show, S-C is negative – which means T-S is not the comparison we should be considering first; we should be considering T-C first. The abstract's main points should be that T-C is positive only for assets and that S-C is negative.
But S-C being negative seems to me hugely alarming. This is where I'd like more explanation, if you'll indulge non-technical readers, because I don't understand why you seem to be concerned mostly with 'sustained treatment effects', i.e. T-C, even though S-C is negative. If S-C is negative, doesn't it suggest that T-C could be coming at the expense of S, meaning T-C can't be our primary concern?
Yes, you want to be concerned with the effect on the entire village, not just on the treated – which is the weighted average of T-C and S-C.
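In symbols, if p is the share of households in a treated village that are treated, the village-level effect is p*(T-C) + (1-p)*(S-C). A tiny illustration with purely hypothetical numbers:

```python
# Share-weighted village-level effect (all numbers are hypothetical).
p = 0.5      # share of village households treated
itt = 0.30   # T - C, effect on recipients (SD units)
snt = -0.15  # S - C, effect on non-recipients
total = p * itt + (1 - p) * snt
print(f"Total village effect ≈ {total:+.3f} SD")  # +0.075 here
```

Note that with a large enough negative S-C (as with the -0.33 SD life-satisfaction spillover discussed above), this weighted average can turn negative even when T-C is positive.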
This is an important discussion. But given discrepant readings of the same data, I wonder if it is appropriate that orgs such as GiveDirectly run evaluations of their own programmes (such as their new UBI experiment in, unsurprisingly, Kenya). Indeed, many researcher-led evaluations tend to conflate (combine) roles of intervention design/implementation and evaluation design/implementation, leading to potential conflicts of interest.
Ideally, the evaluators would be disinterested third parties. But if disinterested third parties aren't stepping up, charities evaluating themselves is a heck of a lot better than nothing.
And disinterested third parties are by definition not controlled by charities. So unless charities have some kind of guarantee that someone they trust will be studying their effectiveness, they should absolutely run evaluations of their own programmes.
Consider: if GD wasn't evaluating itself, this conversation wouldn't even be possible. And while I'm not sure who's right here, I'm confident that the discussion is worthwhile.
Always useful to come across this sort of writeup, to check over one's own predictions with! This post was shared to the ~15k member Effective Altruism group on Facebook, so one wonders if the arguments it presents will have any appreciable effect on Givewell's recommendations.
Great - thanks for that. Even the conversation that has followed the posts has been nice to see, so some impact - thanks to the readers of Development Impact...
Which charities do you recommend?