One of my favorite bloggers, Andrew Gelman, has a piece in Slate.com in which he uses a psychology paper that purported to show women are more likely to wear red or pink when they are most fertile as an example of the ‘scientific mass production of spurious statistical significance.’ Here is an excerpt:
“And now the clincher, the aspect of the study that allowed the researchers to find patterns where none likely exist: "researcher degrees of freedom." That's a term used by psychologist Uri Simonsohn to describe researchers’ ability to look at many different aspects of their data in a search for statistical significance. This doesn't mean the researchers are dishonest; they can be sincerely looking for patterns in the data. But our brains are such that we can, and do, find patterns in noise. In this case, the researchers asked people "What color is the shirt you are currently wearing?" but they don't say what they did about respondents who were wearing a dress, nor do they say if they asked about any other clothing. They gave nine color options and then decided to lump red and pink into a single category. They could easily have chosen red or pink on its own, and of course they also could've chosen other possibilities (for example, lumping all dark colors together and looking for a negative effect). They report that other colors didn't yield statistically significant differences, but the point here is that these differences could have been notable. The researchers ran the comparisons and could have reported any other statistically significant outcome. They picked Days 0 to 5 and 15 to 28 as comparison points for the time of supposed peak fertility. There are lots of degrees of freedom in those choices. They excluded some respondents they could've included and included other people they could've excluded. They did another step of exclusion based on responses to a certainty question.”
“In one of his stories, science fiction writer and curmudgeon Thomas Disch wrote, "Creativeness is the ability to see relationships where none exist." We want our scientists to be creative, but we have to watch out for a system that allows any hunch to be ratcheted up to a level of statistical significance that is then taken as scientific proof.
Even if something is published in the flagship journal of the leading association of research psychologists, there's no reason to believe it. The system of scientific publication is set up to encourage publication of spurious findings.”
In economics, we used to have many more of the type of study described here, and we surely still do. However, I am going to claim that they are on the decline – at least in my field of development economics. This is not because of the emergence of RCTs: in fact, if anything, RCTs initially may have made this problem worse. When you design your own field experiment and collect your own data, it is as easy, if not easier, to (consciously or not) try different outcome variables, specifications, subgroup analyses, etc. until a neat result emerges. But, the decline is happening through an indirect effect: because RCTs brought on the discussion of standards of reporting (CONSORT guidelines), pre-analysis plans, multiple comparisons, ad hoc subgroup analysis, etc., two things are happening: (i) researchers are much more aware of these issues themselves; and (ii) referees and editors are less likely to let people get away with them.
I think that this is a welcome development and a worthwhile ongoing discussion in various fields of scientific inquiry, including ours. Some colleagues do worry that pre-analysis plans restrict creativity too much and that referees and editors will completely discount any secondary analysis (even if clearly outlined) for inclusion in good journals, and I sympathize with this worry. The balance between minimizing the possibility of spurious findings and still allowing researchers to examine their data creatively is a hard one to strike – but we have no choice. Otherwise, we’ll keep cycling through finding after finding that gets publicized for 15 minutes and gets debunked a few hours later.
This brings me to a question about pre-analysis plans, the answer to which I would like to try to crowd-source here. How hard should we be on ourselves when we’re writing a pre-analysis plan? Let me be more specific…
In our study of the effects of cash transfers on young women’s welfare in Malawi, we have now reached five years since baseline. Many of our originally never-married, adolescent female population are now married, have children, work, etc. Initially, we were interested in only a limited number of outcomes: school enrollment/test scores; teen pregnancy and early marriage; HIV, STIs, and mental health. But, now that many are married adults with children, we’re interested in much more: their labor market outcomes, their health, the quality of their marriages, the attitudes of their husbands, the cognitive development of their children, i.e. a lot of things that contribute to their overall welfare and to that of the larger community. We’re after that elusive concept of ‘empowerment.’ So, our surveys are full of questions on women’s freedom, decision-making, assets, bargaining power, happiness, etc. We have to be honest and admit that, when we say we’re looking to see whether the individuals in the treatment arm are more ‘empowered,’ we are not 100% sure what we mean…
In such a scenario, a pre-analysis plan is crucial and makes a lot of sense: if we started looking at all the questions we asked, soon enough we might be justifying to ourselves why we saw an effect in one ‘empowerment’ variable and not in another. That’s when the ‘stories’ begin and the credibility of findings starts to decline. However, reporting tens, if not hundreds, of impacts on each individual question is not very useful to anyone, either. So, what we decided to do is to create indices for each family of outcomes, such as self-efficacy, or financial decision-making, or divorce prospects, or freedom to use contraception, etc. In the pre-analysis plan, we propose to create each index by calculating the average of all standardized sub-questions that fall under that heading (which is usually a survey section or a sub-section). Then, the effects on all such indices would be reported so that readers can see if we’re getting only one or two statistically significant results from the 20 or so indices we created. More importantly, we are NOT allowed to examine any of the sub-questions of an index within the primary analysis – unless the program had a significant effect on the overarching index. For indices with significant program effects, we can then see which sub-questions may be driving that effect.
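To make the rule concrete, here is a minimal sketch of the procedure described above – standardize the sub-questions in a family, average them into an index, test the treatment effect on the index, and examine sub-questions only if the index effect is significant. All data, effect sizes, and variable names are simulated for illustration; none of it comes from the actual Malawi study.

```python
# A hypothetical sketch of the index-then-drill-down rule; data are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 1000
treat = rng.integers(0, 2, n)  # 1 = treatment arm, 0 = control (simulated)

# Four simulated sub-questions in one family (say, financial decision-making),
# each with an assumed treatment effect of roughly 0.2 standard deviations.
subqs = np.column_stack([0.2 * treat + rng.normal(size=n) for _ in range(4)])

# Standardize each sub-question, then average into a single family index.
z = (subqs - subqs.mean(axis=0)) / subqs.std(axis=0)
index = z.mean(axis=1)

# Treatment effect on the index: difference in means with a t-test.
effect = index[treat == 1].mean() - index[treat == 0].mean()
p = stats.ttest_ind(index[treat == 1], index[treat == 0]).pvalue

# Pre-specified rule: sub-questions may be examined only if the index
# itself shows a statistically significant effect.
if p < 0.05:
    for j in range(z.shape[1]):
        pj = stats.ttest_ind(z[treat == 1, j], z[treat == 0, j]).pvalue
        print(f"sub-question {j}: p = {pj:.3f}")
```

A real analysis would adjust for baseline covariates and standardize using the control-group mean and standard deviation; the sketch standardizes on the full sample only to keep the logic short.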
However, this strategy has some risk, especially for outcomes like empowerment, where we have little to go on in terms of predicting ex ante which particular outcomes in early adulthood would be affected by a two-year cash transfer program during adolescence. You will notice that our approach, similar to a ‘mean effects’ approach of analyzing multiple outcome measures, weighs each sub-question equally. If we were smart about the families of questions we put together – i.e., if the questions under each heading address the same underlying concept we’re after – this should work fine. But, if we screwed up and the sub-questions are about different concepts, then a few unrelated sub-questions can cause treatment effects to be null when the rest of the group of sub-questions, had they been correctly grouped, would have revealed a strong effect.
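A small simulation can show how the dilution works under equal weighting. The numbers below are entirely made up for illustration: three sub-questions carry a genuine treatment effect, nine unrelated ones carry none, and folding the unrelated ones into the index shrinks the estimated effect roughly in proportion to the share of misgrouped questions.

```python
# Illustrative simulation (hypothetical numbers) of the misgrouping risk:
# unrelated sub-questions dilute the measured effect of an equal-weight index.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 1000
treat = rng.integers(0, 2, n)

# Three sub-questions that genuinely respond to treatment (~0.3 SD each)...
related = np.column_stack([0.3 * treat + rng.normal(size=n) for _ in range(3)])
# ...and nine unrelated sub-questions with no treatment effect at all.
unrelated = rng.normal(size=(n, 9))

def index_stats(cols):
    """Equal-weight 'mean effects' index: average of standardized columns."""
    z = (cols - cols.mean(axis=0)) / cols.std(axis=0)
    idx = z.mean(axis=1)
    effect = idx[treat == 1].mean() - idx[treat == 0].mean()
    p = stats.ttest_ind(idx[treat == 1], idx[treat == 0]).pvalue
    return effect, p

effect_correct, p_correct = index_stats(related)  # well-grouped index
effect_diluted, p_diluted = index_stats(np.hstack([related, unrelated]))  # misgrouped

print(f"well-grouped index: effect = {effect_correct:.2f}, p = {p_correct:.4f}")
print(f"misgrouped index:   effect = {effect_diluted:.2f}, p = {p_diluted:.4f}")
```

With a smaller sample or a weaker true effect, the misgrouped index can easily lose statistical significance altogether, which is exactly the scenario the paragraph above worries about.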
Your answer to this dilemma may be: “Well, then, don’t be stupid! Create sensible indices.” In the end, that’s what we tried to do from the start: using indices other people have used before, carefully drafting the survey instruments, still spending a lot of time ex post going through each section and making sure earlier decisions still seem sensible (on second thought or in light of evidence from new papers since survey design), etc. Making this plan and sticking with it says that we’re confident in our research question, hypotheses, and outcome variables and we will live with the consequences.
However, as a colleague suggested at a Sunday brunch recently, we could have done something slightly different as an insurance policy: instead of calculating indices by averaging standardized sub-questions, we could have used ‘principal component analysis’ (PCA) to create them. This would be akin to constructing indices like the asset index proposed by Filmer and Pritchett (2001), in a paper cited more than 2,400 times, as a proxy for wealth: as they describe, this approach extracts the common information contained in a family of variables more successfully than an ad hoc linear combination (such as the simple average we proposed above).
We decided against this partly because the units of such indices are pretty meaningless – so we could establish a relationship between treatment and an index, but describing the size of the effect (especially for sub-questions) would be more problematic. But, it would be a safeguard against exactly the kind of scenario I described above: if some questions were unrelated to others in the family, the PCA would have created a more sensible (although less definable) index.
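For readers who want to see the mechanics, here is a sketch of the PCA alternative – take the first principal component of the standardized sub-questions as the index, in the spirit of Filmer and Pritchett's asset index. The data are simulated: five sub-questions share a common latent factor and one is pure noise, so the first component's loadings down-weight the unrelated question on their own, where a simple average would weight it equally.

```python
# Hypothetical sketch of a first-principal-component index; simulated data.
import numpy as np

rng = np.random.default_rng(2)
n = 800
# Five sub-questions driven by a common latent factor, plus one unrelated one.
latent = rng.normal(size=n)
subqs = np.column_stack(
    [latent + 0.8 * rng.normal(size=n) for _ in range(5)]
    + [rng.normal(size=n)]
)

# Standardize, then take the first principal component as the index:
# the eigenvector of the correlation matrix with the largest eigenvalue.
z = (subqs - subqs.mean(axis=0)) / subqs.std(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(z, rowvar=False))
w = eigvecs[:, -1]       # np.linalg.eigh sorts eigenvalues ascending
pca_index = z @ w

# The five related questions get loadings of similar magnitude; the unrelated
# sixth question gets a loading near zero (sign of w is arbitrary).
print(np.round(w, 2))
```

This also illustrates the drawback mentioned above: the index is measured in units of the loading-weighted combination, so a treatment effect on `pca_index` has no natural interpretation the way an effect in standard-deviation units of a simple average does.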
What do the readers think? Please comment here or feel free to send me an email, or a tweet (the pre-analysis plan is not yet final, so you can contribute to it by commenting before it is registered by the end of this week).