I’ve been asked several times what I think of Alwyn Young’s recent working paper “Channelling Fisher: Randomization Tests and the Statistical Insignificance of Seemingly Significant Experimental Results”. After reading the paper several times and reflecting on it, I thought I would share some thoughts, with a particular emphasis on what I think it means for people analyzing experimental data going forward.
What’s the paper do?
Alwyn takes the data from 2,003 regressions estimated in 53 experimental papers published in the AER, AEJ Applied, and AEJ Micro. He re-estimates these regressions doing three things:
- Corrects for multiple hypothesis testing within an equation when there are multiple treatments by testing the hypothesis that all treatments jointly have zero effect. He notes that 1036 of the regressions have more than one treatment in them, with a median of 3 treatments and 5 percent having an incredible 16 or more separate treatments! Out of these 1036 regressions, 604 have at least one coefficient significant at the 5 percent level, which would lead authors to conclude that some treatment was having an effect. However, he shows that in only 446 of these can you reject the null that all treatment effects are jointly zero at the 5 percent level – so about one-quarter of the regressions where you think you have an effect are ones where you can’t reject no effect.
- Uses randomization inference and bootstrapped standard errors in place of the usual standard errors produced by Stata, because finite-sample standard errors are susceptible to high leverage observations, particularly with clustered and robust standard errors. Here he is not doing rank-based permutation tests, but rather using randomization inference to obtain the set of possible realizations of the t and F or Wald test statistics under the hypothesis of no treatment effect for anyone, as well as bootstrapping based on a pivotal statistic. Doing this further reduces the 446 significant regressions to 364 (randomization inference) or 351 (bootstrap), so that roughly 18 to 21 percent of the regressions that would appear overall significant using conventional standard errors are not significant when using these other approaches. He shows that this correction does not make much difference when it comes to examining a single treatment effect without clustering or using robust errors. However, it makes more of a difference when testing multiple outcomes, and when clustered or robust standard errors are used.
- Corrects for multiple hypothesis testing across equations by developing an omnibus test for overall experimental significance. He notes that the average paper in the 53 paper sample has 10 treatment outcome equations with at least one .01 level statistically significant treatment coefficient and 28 equations without a statistically significant coefficient. Using this omnibus test, he finds that only 40 to 50 percent of all the experiments, evaluating the treatment regressions together, can reject the null of no experimental effect whatsoever. That is, there are a lot of papers claiming to show an effect for which, if you consider the full set of outcomes together, you can’t reject that all the effects are jointly zero.
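To make the joint test under randomization inference a bit more concrete, here is a minimal sketch in Python. It is not Alwyn's procedure (he works with the t and F or Wald statistics from each paper's actual regression); it assumes a simple design with no covariates and individual-level assignment, so the joint test that all treatment arms have zero effect reduces to a one-way ANOVA F statistic. The function names `f_stat` and `ri_joint_pvalue` and the simulated data are made up for illustration.

```python
import random

def f_stat(outcomes, arms):
    """One-way ANOVA F statistic for equality of mean outcomes across arms."""
    groups = {}
    for y, a in zip(outcomes, arms):
        groups.setdefault(a, []).append(y)
    n, k = len(outcomes), len(groups)
    grand = sum(outcomes) / n
    between = 0.0  # between-arm sum of squares
    within = 0.0   # within-arm sum of squares
    for ys in groups.values():
        m = sum(ys) / len(ys)
        between += len(ys) * (m - grand) ** 2
        within += sum((y - m) ** 2 for y in ys)
    return (between / (k - 1)) / (within / (n - k))

def ri_joint_pvalue(outcomes, arms, draws=2000, seed=0):
    """Randomization p-value for the sharp null of no effect for anyone.

    Under that null the outcomes are fixed, so we re-draw the assignment
    (here: shuffle the arm labels, keeping arm sizes fixed) and ask how
    often the re-drawn F statistic is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = f_stat(outcomes, arms)
    shuffled = list(arms)
    hits = 0
    for _ in range(draws):
        rng.shuffle(shuffled)
        if f_stat(outcomes, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (draws + 1)

# Simulated example (assumed data): control (arm 0) plus three treatment
# arms of 50 units each, where only arm 1 has a true effect.
rng = random.Random(42)
arms = [a for a in range(4) for _ in range(50)]
outcomes = [rng.gauss(0, 1) + (2.0 if a == 1 else 0.0) for a in arms]
p_joint = ri_joint_pvalue(outcomes, arms)
```

The shuffle should mimic how the experiment was actually randomized; with strata or clusters you would re-draw assignment within strata or at the cluster level rather than shuffling individual labels.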
In some sense these are not new issues – there are a number of papers written about the need for adjusting for multiple hypothesis testing (recent examples include List, Shaikh and Xu; and Fink, McConnell and Vollmer), as well as papers warning about the use of clustered standard errors in the size samples present in many experiments and recommendations to use the bootstrap or randomization inference instead. But Young’s paper goes further in actually re-analyzing real papers to see how much difference these issues make in practice, in explaining some of the theory behind why these issues arise, and in proposing a randomization inference-based solution to deal with multiple testing across equations.
He has chosen to keep the results at the aggregated level, rather than reporting paper by paper, so we don’t know which papers we might want to re-think the results of based on this analysis. Two of my papers are in his sample, and I don’t know whether his analysis would cause any rethinking of their results. But I think I, like many authors, might argue that:
- Not all treatments are created equal: suppose I am testing different interventions to improve firm profits. One group of firms I give a grant of $50,000 to, and then another 10 treatment groups I try different types of motivational SMS messages. I might find a strong and significant effect of the big grant, but that none of the SMS messages have any effect. But then if I test all treatments are jointly zero, I might not be able to reject that there is no effect. But going in I might have priors that said I think the really expensive treatment will work, but am interested in seeing if inexpensive treatments can have any effect.
- Not all outcomes are created equal: taking again a firm example. I might look first to see if the treatment increases profits, and then have a lot of supplementary analysis that tries to explore what channels this increase in profits occurs through – is it through firms investing in training, hiring more workers, getting more inventories, innovating more etc. – or a bunch of heterogeneity analysis that tries to understand what types of firms benefit more. Testing joint significance of the headline results with all this supplementary analysis for which I did not expect to see effects on everything is not the hypothesis of most interest.
- I’m reasonably confident that with individual-level randomization, so long as I control for randomization strata, I end up with the correct standard errors for the sample sizes I usually deal with – my paper with Miriam Bruhn has a lot of simulations in it that show this.
What should we do better going forward?
- I think this dovetails with the recent emphasis on pre-analysis plans. It is more convincing for me to argue that I had a couple of primary outcomes, and then the rest of the analysis is secondary outcomes designed to examine channels of influence if I set this out in advance. Otherwise skeptical readers will ask, for example, whether your paper showing a vocational training program had impacts on formal employment but not on overall employment, wages, or hours worked really had the goal of focusing on job quality to start with. The pre-analysis plan can set out a method for dealing with multiple testing.
- Be a lot more careful about standard errors in clustered experiments. I understand that Abadie, Athey, Imbens and Wooldridge also have a new paper on this point, and I think the guidance is still crystallizing around what the preferred default option should be here. But randomization inference seems a good option in this case.
- Use the omnibus test of overall experimental significance when you don’t have a sharp ordering of where you expect to see effects. This could be particularly important for programs like Community Driven Development programs and other programs where there are a whole host of potential outcomes in different domains and perhaps no natural sense of what are the primary outcomes and which are more secondary.
- Consider more widespread use of randomization inference. This approach is increasingly used in Political Science (it is discussed in Gerber and Green’s book and heavily used in the recent Imbens and Rubin book), but it is not often used by economists. There seem to be several reasons: many people equate it with rank-based permutation tests, which can have power concerns; it involves testing a sharp hypothesis (e.g. all individuals have zero treatment effect) rather than the usual hypothesis of an average treatment effect of zero; people feel it is more difficult to deal with incomplete compliance and control variables, as well as to do bounding exercises; and a lot of people are just unsure of how to code this properly in Stata. Papers and simulations which illustrate better when these concerns are valid and when they are not, and which provide code on how to do this, are likely to foster more adoption.
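On the coding point, the basic logic is short enough to sketch. The example below is in Python rather than Stata and is only an illustration under stated assumptions: a clustered experiment where a fixed number of clusters is assigned to treatment, a simple difference in means as the test statistic, and the sharp null of zero effect for every individual. The names `diff_in_means` and `ri_cluster_pvalue` and the simulated data are hypothetical, not taken from any paper or package.

```python
import random

def diff_in_means(y, treated):
    """Treated-minus-control difference in mean outcomes."""
    t = [v for v, d in zip(y, treated) if d]
    c = [v for v, d in zip(y, treated) if not d]
    return sum(t) / len(t) - sum(c) / len(c)

def ri_cluster_pvalue(y, cluster, treated_clusters, draws=2000, seed=0):
    """Two-sided randomization p-value, re-drawing treatment by cluster.

    Assumes the design assigned a fixed number of clusters to treatment,
    so each re-draw picks that many clusters at random.
    """
    rng = random.Random(seed)
    ids = sorted(set(cluster))
    m = len(treated_clusters)

    def stat(chosen):
        chosen = set(chosen)
        return diff_in_means(y, [c in chosen for c in cluster])

    observed = abs(stat(treated_clusters))
    hits = sum(abs(stat(rng.sample(ids, m))) >= observed for _ in range(draws))
    return (hits + 1) / (draws + 1)

# Simulated example (assumed data): 20 clusters of 10 individuals with a
# common cluster-level shock; clusters 0-9 are treated with a true effect
# of 3. In a real experiment the treated set comes from the actual draw.
rng = random.Random(7)
treated_clusters = set(range(10))
cluster, y = [], []
for c in range(20):
    shock = rng.gauss(0, 1)
    for _ in range(10):
        cluster.append(c)
        y.append(shock + rng.gauss(0, 1) + (3.0 if c in treated_clusters else 0.0))
p_value = ri_cluster_pvalue(y, cluster, treated_clusters)
```

The key design choice is that the re-draws happen at the level at which treatment was actually assigned (whole clusters here), which is what lets the procedure respect the clustering without relying on a large-sample standard error formula.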