I did not want to write this blog post…
But after multiple colleagues, including one on a Zoom call about the possibility of combining cash transfers with an early-childhood-development intervention, asked me whether I had seen the finding about cash transfers and electroencephalography (EEG), I realized that, despite all the coverage (NYT, Vox, and David’s link from last Friday here) of the recent paper in PNAS by Troller-Renfree et al. (2022), there is still a need to explain to colleagues in development economics how strong the evidence in this paper actually is.
The intervention offered $333/month to low-income mothers of newborns for a duration of 52 months (I believe it is still ongoing). The monthly unconditional transfer amounted to roughly a fifth of the recipients’ household income, and the transfers run for more than four years, so this is a meaningful amount even in the U.S. The size of the transfer relative to household income or expenditures puts it in the range of CCT or UCT programs around the world, leaning towards the generous end of the spectrum. It also happens to be similar in absolute size to the child tax credit (per child, per year) that families received for a brief period here in the U.S., so the evidence from such a program is relevant: that credit is widely believed to have contributed to the large decline in child poverty in the U.S. last year. The control group in this study received (receives) $20/month.
There were 1,000 children in the study, and the authors planned to measure brain activity when the children reached 12 months, using mobile EEGs at the children’s homes. However, as happened to many of us, they were only part way through this measurement when the pandemic restrictions hit in March 2020. So, they ended up with a sample of 435 children with “sufficient artifact-free data” from EEGs. This will be important below, so I mention it here.
I will let the authors tell us what they find (from the abstract):
“We hypothesized that infants in the high-cash gift group would have greater EEG power in the mid- to high-frequency bands and reduced power in a low-frequency band compared with infants in the low-cash gift group. Indeed, infants in the high-cash gift group showed more power in high-frequency bands. Effect sizes were similar in magnitude to many scalable education interventions, although the significance of estimates varied with the analytic specification. In sum, using a rigorous randomized design, we provide evidence that giving monthly unconditional cash transfers to mothers experiencing poverty in the first year of their children’s lives may change infant brain activity. Such changes reflect neuroplasticity and environmental adaptation and display a pattern that has been associated with the development of subsequent cognitive skills.”
The question we want to tackle here is whether the evidence in the paper weighs in favor of the hypotheses stated above. I don’t want to pretend that I have a lot to add here: I read every line of Andrew Gelman’s reading, reanalysis (using the data and code nicely shared by the authors alongside the publication), and the comments in response – including one from one of the senior authors of the paper. He has done a superb, very thoughtful, and kind job of looking at all the data, and his conclusion is that he is skeptical. To be precise, he thinks this might have been better written as a “null results” paper, explaining that such papers do not show that the null hypothesis is true, but rather that there may be some signal towards the researchers’ hypotheses while the evidence is too noisy to conclude one way or the other. I fully agree with him.
So, what is the point of this blog post then, coming along a week late: what is my marginal contribution? I want to come at this a bit more from an experimentalist development economist’s perspective – how a referee reading this paper (without the benefit of the data at their fingertips) might react. I also want to reach World Bank and other development colleagues who might not have had the persistence to stick with other analyses of this paper: most will still only have seen the NYT article, with a much smaller subset having read the Vox one – neither of which scrutinized the evidence as closely…
The biggest warning sign about the findings, for any reader, should come from Table 2 of the paper, comparing columns 3 and 4: the former, which shows OLS regressions using only site fixed effects (FE; the authors recruited mothers in four sites), shows a coefficient estimate for (absolute) alpha of 0.294 (with a standard error, SE, of 0.381). So, the t-stat is less than one. In the very next column, when the authors adjust for a long list of baseline characteristics, the corresponding estimate (SE) is 0.720 (0.396), implying a p-value of 0.07 and a Westfall-Young adjusted q-value of 0.12 for multiple hypothesis testing (MHT). You’re probably focused on those final columns, but what worries me is the change that comes from the covariate adjustments. You see, in an RCT, the reason we use covariate adjustments (preferably pre-specified) is to soak up variation in the outcome: because random assignment balances the intervention and control groups, what we should obtain is a gain in precision, with little to no change in the point estimates. But what we have here is the opposite: the point estimate for alpha rose to roughly two and a half times its unadjusted value while the SE did not decline at all. So, the covariate adjustment is affecting inference by increasing the point estimate, rather than improving precision. This is true for the other outcomes as well: the point estimate for beta (gamma) increases by 33% (43%), while that for theta changes sign. Column 5, which reports effect sizes, uses the adjusted estimates from column 4: why not report both?
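For readers who want to poke at this themselves with the shared data, here is a minimal sketch of the column-3 vs. column-4 comparison. The file and variable names below are hypothetical placeholders, not the names in the authors’ replication files.

```python
# Minimal sketch: the same outcome regressed on treatment, once with site FE only
# (column 3) and once adding a baseline covariate set (column 4). In a
# well-behaved RCT the two treatment coefficients should be close and the
# adjusted SE smaller; the concern above is that the point estimate jumps
# while the SE does not shrink.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("analysis_sample.csv")  # hypothetical file/variable names throughout

# "Column 3": treatment + site fixed effects only
m_site = smf.ols("absolute_alpha ~ treat + C(site)", data=df).fit(cov_type="HC2")

# "Column 4": treatment + site FE + baseline covariates
covars = ["mother_age", "mother_schooling", "hh_income", "birth_weight", "child_female"]
m_adj = smf.ols("absolute_alpha ~ treat + C(site) + " + " + ".join(covars),
                data=df).fit(cov_type="HC2")

print("site FE only  :", round(m_site.params["treat"], 3), round(m_site.bse["treat"], 3))
print("fully adjusted:", round(m_adj.params["treat"], 3), round(m_adj.bse["treat"], 3))
```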
Now, in the authors’ defense, the trial, and a pre-analysis plan, are registered (you can access all the supplemental appendices and data from the links above). And it is not uncommon for authors to deviate from the plans when circumstances dictate. This is what the authors say about what they will do about covariate adjustments:
“We will estimate (1) without and then with baseline demographic child and family characteristics (X) to improve the precision of our estimates by accounting for residual variation. These baseline measures, all gathered prior to random assignment, will first be checked for adequate variation and sufficient independence from other baseline measures. They include: dummy variables for three of the four sites; mother’s age, completed schooling, household income, net worth, general health, mental health, race and Hispanic ethnicity, marital status, number of adult in the mother’s household, number of other children born to the mother, whether the mother smoked or drank alcohol during pregnancy and whether the father is currently living with the mother; and child’s sex, birth weight, gestational age at birth and birth order.” (emphasis added)
There are a few things to say here. First, the plan does not make it clear what we should infer when we find ourselves in this unfortunate situation where the unadjusted and covariate-adjusted impact estimates differ substantially: the effect size with just site FE comes out to 0.07 SD, compared to 0.17 SD in Table 2. [Technically speaking, the authors’ plan treats these fixed effects as part of the adjustment, so there should be a column reporting the raw effects: backing out the raw effect size from columns 1 & 2 shows that the site FE-adjusted estimates are also 30% higher than the purely unadjusted estimates – 0.294 vs. 0.226 – making the raw effect size closer to 0.05 SD for alpha! Gelman discusses the change from the site adjustments in his blog.] Should we believe the raw estimates, those adjusted only for site FE, or the fully adjusted ones? The rule of thumb for stratified RCTs (the researchers seem to have randomized within each metropolitan area) would be that the default estimates should include site FEs, as that is part of the design. The pre-specification leaves a fair amount of leeway here – both to the authors and to the reader: please note that, in a perfectly plausible parallel universe, people could be discussing a paper that has large and statistically significant-ish unadjusted estimates, which disappear when covariate adjustments are introduced. Would the authors of that paper also have listed the adjusted estimates as their preferred specification for the calculation of the effect size in column 5 of Table 2? Or would they have preferred the unadjusted estimates, noting that p-values increase when adjustments are introduced? We need some other guidance…
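[A quick bit of arithmetic on how these effect sizes line up, assuming – and this is my assumption about the convention, not something I verified in the authors’ code – that column 5 divides each treatment coefficient by the control-group standard deviation of the outcome, which here would be roughly 4.2:]

\[
\text{ES} = \frac{\hat{\beta}}{\text{SD}_{\text{control}}}: \qquad
\frac{0.720}{4.2} \approx 0.17, \qquad
\frac{0.294}{4.2} \approx 0.07, \qquad
\frac{0.226}{4.2} \approx 0.05.
\]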
For this topic, I only go to one person, Winston Lin, whose two guest posts on regression adjustments in RCTs 10 years ago (Part 1 and Part 2 here in Development Impact) had a huge influence on how I think about the issue. The development community has been playing catch up to his recommendations and has almost converged to the fully-interacted (saturated) adjustment models. For those interested in this topic, these are absolutely must-reads. There are a few things that we can glean from the guidance therein:
· The adjusted treatment effects exhibit small-sample bias, which converges to zero as N → infinity. With 435 observations divided into four sites and then into two treatment arms – unevenly in both – and a large number of adjustment variables (15-20), the possibility of overadjustment is there, which is something that Gelman refers to in his post. Generally, we want these covariate adjustments to use a small number of prognostic variables, much smaller than the sample size.
· If imbalance in at least one prognostic variable is present, however, then adjustments (under the conditions laid out by Lin) may not just be desirable but necessary. My favorite line in the two posts, in response to knowing that there is a large difference between treatment and control in baseline test scores in a hypothetical education RCT, is the following: “If your statistical consultant says, "That's OK, the difference in means is unbiased over all possible randomizations," you might find that a bit Panglossian.”
o So, we should perhaps use covariate adjustments (for a small and pre-specified number of covariates – the baseline value of the outcome indicator would be the best in many circumstances), but there is a caveat. If the design is imbalanced (the authors allocated 60% to control and 40% to treatment) and there are heterogeneous treatment effects for some of the covariates, then more refined treatment-by-covariate interactions are called for (a minimal sketch of this interacted adjustment follows this list). [We at least know that there are some covariates that are highly prognostic of the outcome and are imbalanced between the two groups, because otherwise the adjusted estimates in column 4 would not differ that much from the unadjusted ones in column 3. It’s not a stretch to think that treatment effects might also be heterogeneous along some of these dimensions.]
o At the least, we think that the authors should have (a) been more specific about the exact covariate adjustments (e.g., is age a continuous variable or should it have been saturated with week-of-age indicators?), (b) perhaps selected a more parsimonious set of baseline covariates (including the possibility of collecting baseline EEGs – a month in, or as early as feasible with newborns – and blocking the randomization on those measurements before starting the intervention), and (c) presented a fully-saturated (treatment x X) OLS model for their main analysis. I know that some people do not like pre-analysis plans that are written (or revised) after baseline data collection, but I am not opposed to them. Analyzing baseline data to inform the exact set of covariates for adjustment (as well as blocking random assignment) and how exactly they will be used in a regression framework, before any follow-up is conducted, would have been useful.
o Finally, I also refer you here to Tukey’s perspective on the issue of adjustments in RCTs: “The main purpose of allowing [adjusting] for covariates in a randomized trial is defensive: to make it clear that analysis has met its scientific obligations. (If we are very lucky ... we may also gain accuracy [precision], possibly substantially. If we must lose some sensitivity [power], so be it.)” In this sense, it’s a bit harder to argue that this study has met its obligations. The way I read Tukey is this: “Look, you’ve shown me that there is this effect, but I am worried about the imbalance in this important variable, implying that you might have just gotten lucky: can you please show me the adjusted analysis to convince me further?” That is very different from: “We have unadjusted estimates with t-statistics well below 1, but adjusted estimates are much larger and close to statistically significant before MHT adjustments.”
· There is another issue here with baseline imbalance: the authors don’t think that there is much to fuss about. In the main paper, they say the following: “Despite the relatively small departures in baseline balance between the high-cash and low-cash groups shown in Table 1, we note that some of the ITT estimates change when covariates are added to the models.” This is an understatement, to say the least, and I will have to respectfully disagree with the authors on the imbalance: Table S1.2 in this appendix presents the baseline balance for a large number of variables, and the joint (chi-squared) orthogonality test has a p-value of 0.09. That is not the reassuring evidence we need from this table. As Daniel Lakens outlines in his blog post on statistical power and interpreting p-values, which I also cited two weeks ago, a p-value of 0.09 constitutes evidence for the alternative hypothesis that the two treatment groups differ, especially because the ex-post analysis is substantially underpowered compared to the original plans: the study had 80% power to detect an effect of 0.22 SD for N=800 (1,000 children at baseline with 20% attrition). The final sample for this one-year analysis is 435, so with statistical power that much lower, even a p-value of 0.2 would have been evidence of some baseline imbalance.
o There is a lesson here for all of us: we often let researchers (in manuscripts and seminars) skip quickly over balance (and attrition, see below) across study arms in RCTs. Without knowing the (ex-post realized) power and without seeing a joint orthogonality test for a reasonable set of prognostic baseline variables with a p-value above, say, 0.5 for moderately-powered studies, we should not let them off the hook that easily.
· Another thing that worries me is the following: Gelman, dissatisfied that the lagged value of the EEG outcomes is not available, creates a predictor index (and, like Lin, suggests that the thing to do would be to control for that index and its interaction with treatment). His index does predict the outcome strongly and significantly, so it is prognostic. However, including it in the regression models does not change the treatment effects much – unlike the authors’ covariate set. Only after he adds the site fixed effects and the exact set of covariates the authors use does he get a doubling of the effect (on beta, I think?) from 0.07 to 0.14. This worries me: what is it about those exact adjustments that is causing this? Are there a few outliers in one group at a particular site that are very influential observations? Is it just one variable causing it and, if so, would including it in a different way matter? What about the author degrees of freedom in choosing this exact set of variables?
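As promised above, here is a minimal sketch of the Lin-style, fully interacted adjustment the bullets refer to: demean each baseline covariate and interact it with treatment, so that the coefficient on treatment remains a sensible estimate of the average effect even when effects are heterogeneous along those covariates. Variable names are hypothetical, and this is an illustration of the estimator, not the authors’ code.

```python
import pandas as pd
import statsmodels.api as sm

def lin_adjusted_ate(df, outcome, treat, covars):
    """OLS with treatment, demeaned covariates, and treatment x demeaned-covariate interactions."""
    X = df[covars] - df[covars].mean()              # demean the covariates
    interactions = X.mul(df[treat], axis=0)         # treatment x demeaned covariates
    interactions.columns = [f"{c}_x_{treat}" for c in covars]
    design = sm.add_constant(pd.concat([df[[treat]], X, interactions], axis=1))
    fit = sm.OLS(df[outcome], design).fit(cov_type="HC2")   # heteroskedasticity-robust SEs
    return fit.params[treat], fit.bse[treat]

# Hypothetical usage:
# ate, se = lin_adjusted_ate(df, "absolute_alpha", "treat",
#                            ["mother_age", "hh_income", "birth_weight"])
```

With an imbalanced 60/40 allocation and possibly heterogeneous effects, this interacted specification is the direction Lin’s posts point us toward, rather than simply entering the covariates linearly as in column 4.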
In the end, I do find that the authors generally followed Lin’s recommendations from 10 years ago: they do present the unadjusted estimates. They also registered and pre-specified their analysis. They also made their data and code available with the paper. They should be commended for all these things.
However, they also came up slightly short in some of these aspects: the pre-analysis plan did not specify which regression model would be the primary analysis, leaving the unfortunate (and, I believe, unintended) appearance of giving them leeway to pick the one that looks better for the intervention. As I said, what if the evidence had been reversed and the adjusted estimates were half as large, with p-values > 0.10? It is reasonable to ask whether those would still be the primary estimates used for the effect sizes presented in the main tables and discussed in the abstract. They also did not specify their covariate adjustments exactly, chose too many variables (leaving themselves room to discard some if they were correlated), and did not have a baseline measurement of the main outcome. Writing the analysis plan after collecting baseline data might have been preferable. Finally, they could have written a different abstract and paper that emphasized the uncertainty and the null findings, which they could then have amplified in interviews with the media. They did not do this, and I touch on this at the end. [I sympathize with the authors here and think that Gelman put it well in his answers to comments: “Indeed, in a slightly alternative world, the authors would’ve published their results as a null finding, and then I’d have to explain why this doesn’t imply the effect is exactly zero!” But I will come to this just a little later.]
Now, why is there such baseline imbalance? I can’t say exactly (because I did not come across a balance table for the originally-recruited 1,000 children), but I think the answer is attrition: the CONSORT diagram in SI1.1 of this appendix shows the attrition in stages. Even before the pandemic disrupted in-person data collection, the follow-up success rate was lower in control, at 91%, vs. 96% in treatment. Then, among these, the shares with in-person (rather than phone) data collection were 63% vs. 68%, respectively. Finally, of these, the shares with usable EEG data for the analysis were 73% vs. 70%. Overall, we have 42% of the 600 children in the control group and 46% of the 400 children in treatment. The large numbers lost to follow-up are cause for concern (and perhaps should have caused the authors to wait for the five-year findings, more on that below), but the difference in levels by treatment arm is also concerning. Assuming the randomization went according to plan and the groups were balanced at baseline for the 1,000 children, attrition must also have been differential in characteristics, causing worries about biased estimates.
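[For the record, the stage-by-stage rates multiply out to those overall shares:]

\[
0.91 \times 0.63 \times 0.73 \approx 0.42, \qquad 0.96 \times 0.68 \times 0.70 \approx 0.46,
\]

which, applied to the 600 control and 400 treatment children and allowing for rounding, is consistent with the 435-child analysis sample.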
Once we have this much attrition, we are no longer in the rarefied ‘rigorous randomized design’ world and must treat our analysis as such. Most importantly, any bounding exercise that the authors could have conducted to deal with attrition (Lee bounds, Kling-Liebman bounds, etc.) would have produced exactly the kind of uncertainty shown in Gelman’s graphs that include the EEG patterns of all 435 children, making it harder to simply rely on point estimates rather than showing reasonable ranges in which the true effect might lie.
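For readers who have not run these before, here is a rough sketch of Lee bounds under differential attrition: trim the arm with the higher response rate (treatment here, at roughly 46% vs. 42%) by the excess-response share, once from the top and once from the bottom of its observed outcome distribution, and recompute the difference in means each time. This is the textbook recipe with my own naming of the inputs, not the authors’ analysis.

```python
import numpy as np

def lee_bounds(y, treat, responded):
    """y: outcomes (NaN where unobserved); treat, responded: 0/1 arrays over the full randomized sample.
    Assumes the treatment arm has the higher response rate, as in this study."""
    p_t = responded[treat == 1].mean()
    p_c = responded[treat == 0].mean()
    q = (p_t - p_c) / p_t                              # share of treated respondents to trim
    y_t = np.sort(y[(treat == 1) & (responded == 1)])
    y_c = y[(treat == 0) & (responded == 1)]
    k = int(np.floor(q * len(y_t)))
    lower = y_t[: len(y_t) - k].mean() - y_c.mean()    # trim the top of treated outcomes -> lower bound
    upper = y_t[k:].mean() - y_c.mean()                # trim the bottom -> upper bound
    return lower, upper
```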
Still, it is harder to fault the authors for not sitting on this evidence for three to five years: they may have legitimately thought that, noise and all, it is better for the scientific community (and policymakers) to have access to this evidence. After all, we have been debating whether to make permanent transfers of very similar size to low-income parents in the U.S. In fact, one of the authors touches on this issue in his response to Gelman’s blog post:
“Preregistered hypotheses are properly accorded a great deal of weight in reaching conclusions from a piece of research. Gelman’s analyses of our data focused exclusively on those preregistered hypotheses, as have almost all of the blog post reactions. Results for them are featured in Table 2 of our paper, and we pointed to their p>.05 nature in several places in the paper, including our conclusion. Had we stopped our analysis with them, we probably would not have tried to publish them in such a high-prestige journal as PNAS.
But the paper goes on to present a great deal of supplementary analyses. Statisticians have a difficult time thinking systematically (i.e., statistically) about combining pre-registered and non-preregistered analyses and often choose to give 100% weight to the preregistered results. That is a perfectly reasonable stance — and is what classical statistics was designed to do.
For us, the appendix analysis of results by region (in SI6.1), coupled with the visual regional differences in Figure 2 and by results from the regional analyses in the past literature, led us to judge that there probably were real EEG-based power differences between the two groups (although we take Gelman’s point that baseline (i.e., the day after the birth or before) differences not controlled for by our host of demographic measures for the mother may account for them). Our thinking was reinforced by our non-preregistered alternative approach of aggregating power across all higher-frequency power bands. This gets around the problem of the rather arbitrary (and, among neuroscientists, unresolved) definition of the borders of the three high-frequency power bands and eliminates the need for multiple adjustments. Results (in Table SI7.1) show an effect size of .25 standard deviations and a p-value of .02 for this aggregated power analysis.”
I do have a problem here: what Greg Duncan seems to be saying is that they would not have submitted their paper had the (ex-post) aggregate measure across all higher-frequency bands not shown a large and statistically significant effect. This, for my taste, treats pre-registered analysis too nonchalantly. But we can set that aside, as Duncan addresses that issue in his comments. I do think, however, that Duncan is wrong when he says that this aggregated index does not need any MHT adjustments. After all, this is an aggregate that a large team of experts in child psychology and development did not think of at the detailed registration stage prior to the intervention. So, we can conclude that it is not an obvious measure – and it is likely meaningless to the experts (I venture to guess that no child specialist would use the actual values of this aggregate index for any purpose – other than as it is used here, as a standardized overall signal of something from this study). So, it is something that they tried, which panned out. What if this particular aggregate had not panned out, but one of the co-authors who is an expert in these EEGs had said, “You know, in my experience, the aggregate should be more like [(3*alpha) + (2*beta) + (0.25*gamma)]: we should try that”? As far as I know, that is equally legitimate – hard to say without a PhD on that topic…
The point here is that Greg Duncan is right that an aggregate index is a way around the need for costly multiple hypothesis testing adjustments only IF that is what the authors registered in their analysis plan. In other words, the aggregate would be specified as the primary analysis, and all of its components (the alpha, the beta, the theta, absolute and relative) would be secondary analyses to examine pathways. If you did not do that, then the aggregate is yet another t-test. And that (N+1)th t-test (N at least 8, if not higher) needs to be adjusted for MHT; that estimate will likely have an adjusted q-value > 0.05; and certainly so if we use the OLS-with-site-FE estimate, which is, once again, about 40% smaller than the covariate-adjusted estimate. So, if the authors are justifying their decision on the basis of stumbling onto this index after looking at follow-up data, they need to treat it as just another regression they ran and conduct Westfall-Young adjustments. Anything else, if it becomes normalized as common practice in the RCTs we run, risks making analysis plans less and less relevant: yes, we don’t have to follow them to the letter, but there are still some rules.
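To make the idea concrete, here is a stripped-down, permutation-based sketch in the spirit of Westfall-Young: re-randomize treatment many times, record the maximum |t| across all K outcomes (with any ex-post aggregate included as the (K+1)th outcome), and compare each observed |t| to that null distribution. This is the single-step max-T simplification of the procedure, not the authors’ exact step-down implementation, and the data structures are hypothetical.

```python
import numpy as np

def maxT_adjusted_pvalues(Y, treat, n_perm=5000, seed=0):
    """Y: (n, K) array of outcomes (aggregate index included as one column); treat: 0/1 vector of length n."""
    rng = np.random.default_rng(seed)

    def abs_tstats(t):
        diff = Y[t == 1].mean(axis=0) - Y[t == 0].mean(axis=0)
        se = np.sqrt(Y[t == 1].var(axis=0, ddof=1) / (t == 1).sum()
                     + Y[t == 0].var(axis=0, ddof=1) / (t == 0).sum())
        return np.abs(diff / se)

    observed = abs_tstats(treat)
    # Null distribution of the maximum |t| across outcomes under re-randomization
    max_null = np.array([abs_tstats(rng.permutation(treat)).max() for _ in range(n_perm)])
    # Adjusted p-value: share of permutations whose max |t| is at least the observed |t|
    return np.array([(max_null >= t_obs).mean() for t_obs in observed])
```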
Final thoughts
In the end, I fully agree with Andrew Gelman: this paper does not provide evidence that cash transfers change brain activity in infants. Nor does it provide evidence that such effects are zero. Given the circumstances created by the pandemic and perhaps a couple of less-than-ideal study design decisions, and given that the cash transfers are still ongoing and the authors always planned longer-term follow-ups, I like the ‘slightly alternative world’ that Gelman discusses better: I would have preferred this to be a working paper and/or to be written in an even more careful manner that makes it clear that we just don’t know the answer to this particular question yet.
I do find more blame when I look at the media outlets: there is no way the NYT should have led with this headline: “Cash Aid to Poor Mothers Increases Brain Activity in Babies, Study Finds: The research could have policy implications as President Biden pushes to revive his proposal to expand the child tax credit.” At least Vox updated its post with the criticism from Gelman and Scott Alexander, but the media outlets’ enthusiasm for any “unconditional cash transfers DID this!” news is still palpable and this, unfortunately, affects their coverage – what they write and whom they cite. There was a time when writers would call people like Gelman before publishing their pieces – I don’t know what happened. But, in a world in which the media outlets with disproportionate reach to the audiences that matter (compared to niche blogs like Gelman’s or even tinier ones like ours) did not “hype” studies without scrutiny, I would be OK with the authors not being as equivocal about the evidence as some would have liked. So, no, I am not cancelling my NYT subscription and I will still read Vox, but their coverage of this study, and of others like it, is problematic.