Creativity vs. fishing for results in scientific research



One of my favorite bloggers, Andrew Gelman, has a piece in which he uses a psychology paper, one that purported to show that women are more likely to wear red or pink when they are most fertile, as an example of the ‘scientific mass production of spurious statistical significance.’ Here is an excerpt:
“And now the clincher, the aspect of the study that allowed the researchers to find patterns where none likely exist: "researcher degrees of freedom." That's a term used by psychologist Uri Simonsohn to describe researchers’ ability to look at many different aspects of their data in a search for statistical significance. This doesn't mean the researchers are dishonest; they can be sincerely looking for patterns in the data. But our brains are such that we can, and do, find patterns in noise. In this case, the researchers asked people "What color is the shirt you are currently wearing?" but they don't say what they did about respondents who were wearing a dress, nor do they say if they asked about any other clothing. They gave nine color options and then decided to lump red and pink into a single category. They could easily have chosen red or pink on its own, and of course they also could've chosen other possibilities (for example, lumping all dark colors together and looking for a negative effect). They report that other colors didn't yield statistically significant differences, but the point here is that these differences could have been notable. The researchers ran the comparisons and could have reported any other statistically significant outcome. They picked Days 0 to 5 and 15 to 28 as comparison points for the time of supposed peak fertility. There are lots of degrees of freedom in those choices. They excluded some respondents they could've included and included other people they could've excluded. They did another step of exclusion based on responses to a certainty question.”
And another…
“In one of his stories, science fiction writer and curmudgeon Thomas Disch wrote, "Creativeness is the ability to see relationships where none exist." We want our scientists to be creative, but we have to watch out for a system that allows any hunch to be ratcheted up to a level of statistical significance that is then taken as scientific proof.
Even if something is published in the flagship journal of the leading association of research psychologists, there's no reason to believe it. The system of scientific publication is set up to encourage publication of spurious findings.”
In economics, we used to have many more of the type of study described here, and we surely still do. However, I am going to claim that they are on the decline – at least in my field of development economics. This is not because of the emergence of RCTs: in fact, if anything, RCTs initially may have made this problem worse. When you design your own field experiment and collect your own data, it is as easy, if not easier, to (consciously or not) try different outcome variables, specifications, subgroup analysis, etc. until a neat result emerges. But, the decline is happening through an indirect effect: because RCTs brought on the discussion of standards of reporting (CONSORT guidelines), pre-analysis plans, multiple comparisons, ad hoc subgroup analysis, etc., two things are happening: (i) researchers are much more aware of these issues themselves; and (ii) referees and editors are less likely to let people get away with these.
I think that this is a welcome development and a worthwhile ongoing discussion in various fields of scientific inquiry, including ours. Some colleagues do worry that pre-analysis plans restrict creativity too much, and that referees and editors will completely discount any secondary analysis (even if clearly outlined) when deciding what belongs in good journals; I sympathize with this worry. The balance between minimizing the possibility of spurious findings and still allowing researchers to creatively examine their data is a hard one to strike – but we have no choice. Otherwise, we’ll keep cycling through finding after finding that gets publicized for 15 minutes and debunked a few hours later.
This brings me to a question about pre-analysis plans, the answer to which I would like to try to crowd-source here. How hard should we be on ourselves when we’re writing a pre-analysis plan? Let me be more specific…
In our study of the effects of cash transfers on young women’s welfare in Malawi, we have now reached five years since baseline. Many of our originally never-married, adolescent female population are now married, have children, work, etc. Initially, we were interested in only a limited number of outcomes: school enrollment/test scores; teen pregnancy and early marriage; HIV, STIs, and mental health. But, now that many are married adults with children, we’re interested in much more: their labor market outcomes, their health, the quality of their marriages, the attitudes of their husbands, the cognitive development of their children, i.e. a lot of things that contribute to their overall welfare and to that of the larger community. We’re after that elusive concept of ‘empowerment.’ So, our surveys are full of questions on women’s freedom, decision-making, assets, bargaining power, happiness, etc. We have to be honest and admit that, when we say we’re looking to see whether the individuals in the treatment arm are more ‘empowered,’ we are not 100% sure what we mean…
In such a scenario, a pre-analysis plan is crucial and makes a lot of sense: if we started looking at all the questions we asked, soon enough we might be justifying to ourselves why we saw an effect in one ‘empowerment’ variable and not in another. That’s when the ‘stories’ begin and the credibility of the findings starts to decline. However, reporting tens, if not hundreds, of impacts on each individual question is not very useful to anyone, either. So, what we decided to do is to create indices for each family of outcomes, such as self-efficacy, financial decision-making, divorce prospects, or freedom to use contraception. In the pre-analysis plan, we propose to create each index by calculating the average of all standardized sub-questions that fall under that heading (which is usually a survey section or a sub-section). Then, the effects on all such indices would be reported so that readers can see whether we’re getting only one or two statistically significant results from the 20 or so indices we created. More importantly, we are NOT allowed to examine any of the sub-questions of an index within the primary analysis – unless the program had a significant effect on the overarching index. For indices with significant program effects, we can then see which sub-questions may be driving that effect.
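As a concrete sketch of this kind of index construction (simulated data and hypothetical variable names throughout; this is not our actual analysis code), standardizing each sub-question and averaging within a family might look like:

```python
# Sketch: build a family index as the mean of standardized sub-questions,
# then estimate the treatment effect on the index only (simulated data).
import numpy as np

rng = np.random.default_rng(0)
n = 200
treat = rng.integers(0, 2, n)  # hypothetical treatment indicator

# Three hypothetical sub-questions in one "empowerment" family,
# each with a small simulated treatment effect.
subqs = np.column_stack([rng.normal(0.2 * treat, 1.0) for _ in range(3)])

# Standardize each sub-question (z-score), then average into one index.
z = (subqs - subqs.mean(axis=0)) / subqs.std(axis=0)
index = z.mean(axis=1)

# Treatment effect on the index = simple difference in means (no covariates).
effect = index[treat == 1].mean() - index[treat == 0].mean()
print(f"Effect on family index: {effect:.2f}")
```

Under the plan described above, the sub-questions themselves would only be examined if the effect on the overarching index were significant.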
However, this strategy has some risk, especially for outcomes like empowerment where we have little to go on in terms of predicting ex ante what particular outcomes in early adulthood would be affected by a two-year cash transfer program during adolescence. You notice that our approach, similar to a ‘mean effects’ approach of analyzing multiple outcome measures, weighs each sub-question equally. If we were smart about the families of questions we put together, i.e. a lot of the questions under that heading address the same underlying concept we’re after, this should work fine. But, if we screwed up and the sub-questions are about different concepts, then a few unrelated sub-questions can cause treatment effects to be null when the rest of the group of sub-questions, had they been correctly grouped, would have revealed a strong effect.
Your answer to this dilemma may be: “Well, then, don’t be stupid! Create sensible indices.” In the end, that’s what we tried to do from the start: using indices other people have used before, carefully drafting the survey instruments, still spending a lot of time ex-post going through each section and making sure earlier decisions still seem sensible (on second thought or in light of evidence from new papers since survey design), etc. Making this plan and sticking with it says that we’re confident in our research question, hypotheses, and outcome variables and we will live with the consequences.
However, as a colleague suggested at a Sunday brunch recently, we could have done something slightly different as an insurance policy: instead of calculating indices by averaging standardized sub-questions, we could have used ‘principal component analysis’ (PCA) to create them. This would be akin to constructing indices like the asset index proposed by Filmer and Pritchett (2001), in a paper cited more than 2,400 times, as a proxy for wealth: as they describe, this approach extracts the common information contained in a family of variables more successfully than an ad hoc linear combination (such as the simple average we proposed above).
We decided against this partly because the units of such indices are pretty meaningless – so we could establish a relationship between treatment and an index, but describing the size of the effect (especially for sub-questions) would be more problematic. But, it would be a safeguard against exactly the kind of scenario I described above: if some questions were unrelated to others in the family, the PCA would have created a more sensible (although less definable) index.
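For readers curious how the PCA alternative behaves, here is a minimal sketch with simulated data (all names and numbers are hypothetical): when one sub-question turns out to be unrelated to the rest of its family, the first principal component gives it little weight, so it cannot dilute the index the way it would in a simple average.

```python
# Sketch of a PCA-based index in the spirit of the Filmer-Pritchett asset
# index: use the first principal component of the standardized sub-questions.
import numpy as np

rng = np.random.default_rng(1)
n = 500
latent = rng.normal(size=n)            # hypothetical common "empowerment" factor
noise = rng.normal(size=(n, 4))

# Three sub-questions load on the latent factor; the fourth is unrelated
# (the "screwed up the grouping" scenario from the post).
loadings = np.array([0.8, 0.7, 0.9, 0.0])
subqs = latent[:, None] * loadings + noise

# Standardize, then take the first principal component via the leading
# eigenvector of the covariance matrix (equivalent to an SVD of the z-scores).
z = (subqs - subqs.mean(axis=0)) / subqs.std(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(z, rowvar=False))
first_pc = eigvecs[:, -1]              # eigenvector of the largest eigenvalue
pca_index = z @ first_pc

# The unrelated fourth question receives much less weight than the others.
print(np.round(np.abs(first_pc), 2))
```

The trade-off noted above still applies: the resulting index is in uninterpretable units, whereas the simple average of z-scores has a natural standard-deviation interpretation.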
What do the readers think? Please comment here or feel free to send me an email or a tweet (the pre-analysis plan is not yet final, so you can contribute to it by commenting before it is registered at the end of this week).


Berk Ozler

Lead Economist, Development Research Group, World Bank

Join the Conversation

July 29, 2013

Hi Michael,
Thanks -- very thoughtful and helpful. Other than spending some time figuring out the optimal indices, what you propose sounds a lot like what we're about to register. I think my post did not make clear the big red line you're also alluding to: we fully intend to do analysis that departs from the pre-analysis plan, or that digs into insignificant overarching indices with plausible heterogeneity of impacts across sub-groups. It's just that we want to set such analysis apart, for ourselves and for readers.
We actually think that there will also be a fair amount of methodological secondary analysis -- e.g. analysis that has to do with extensive vs. intensive margin effects, with interpreting the findings, etc. These will be part of the paper (or 'science,' as you call it) and will distinguish the work from rote reporting of a pre-analysis plan -- which is useful, but in a limited and sometimes uninteresting way...
Thanks again for taking the time to comment.

Michael Clemens
July 29, 2013

My go-to reference for questions like this is Popper's Logic of Scientific Discovery, which has direct practical recommendations. Simplifying: there are two steps to the scientific method, hypothesis generation and hypothesis testing. Step 1 can come from anywhere, including a hallucination or a false belief. But maybe the most fertile ground for finding hypotheses to test experimentally is the *non-experimental* results of a prior study.
For your situation this means: 1) Lash your hands to one or two measures, for the purpose of hypothesis testing, labeled as such. Start with the one most influential in the literature, full stop. Don't agonize too much about whether it's the 'right' or 'best' index, because: 2) Also show the results with other indices/weightings as non-experimental results, labeled unmistakably as such, along with a theory about why results using different indices might differ. Step two is hypothesis generation, not testing, and it's every bit as important to the scientific method as hypothesis testing. Properly distinguished for readers, both are science. The hypotheses generated by ex-post subgroup analysis or alternative outcome indices in one paper can be tested in another paper. If it turns out to be an illusion, that becomes clear in the other paper.
In other words, paper A pre-commits to testing the effect of a pill on "well-being" as an average of pain-free days and self-reported happiness. Experimental result: nil. A clearly-labeled non-experimental section notes that there's a big positive effect on pain-free days in isolation, but not on self-reported happiness. Paper B then pre-commits to testing the effect of the same pill on pain-free days. All of this is science, including the non-experimental tests in Paper A that depart from the pre-analysis plan, because choosing the hypothesis for Paper B to test is part of hypothesis generation.
In short: Don't suppress analysis that departs from the pre-analysis plan, just make the departure fully transparent and set it apart in its own section. Then worry less about getting the pre-analysis plan perfect, trying to ignore the (real) pressure we all face for every paper to be a "home run". (Sorry Kiwis... er, a "sixer".)

Andrew B.
August 01, 2013

Thanks Michael for the dispensation of sound advice by way of a classic, but also for the gracious acknowledgment of the tyranny of baseball analogies. Having said that, I fear the ranks of us Kiwi cricket fans are getting thinner every year.

July 30, 2013

I'm very interested in the index vs. PCA question. Though I'm not sure why you are worried about the unit size for the PCA. Presumably your index will be in standard deviations, which will require some harder thinking as well for people who don't have an intuitive understanding of an SD. And you can always convert the predicted values of the PCA factors into SD, so you would have the same units as your current index. For illustration, you could show a kernel density plot for the two groups. But maybe I'm wrong.
Anyway, Kling et al. talk about this issue for a paragraph or so, I think in the appendix to the MTO paper. My interpretation (perhaps incorrect) was that PCA is appropriate when you are trying to get at an unknown common factor (as you seem to be), while a simple index is important when you care about all the individual outcomes (e.g., diabetes, hypertension, etc., which you may care about equally in a health index).

July 31, 2013

I like Michael's model. I wouldn't be bothered if both sets of results appeared in the same paper so long as there was a clear distinction between the paper A analysis and the paper B analysis. I think his point about using the most influential measure gets at something really important: pre-analysis plans can be viewed as a mechanism to commit not only the researcher but also one's audience. It's a contract. The researcher is committed to do things a certain way, and the audience is committed to accept the result as a valid judgment on a hypothesis. The pre-analysis plan should be workshopped with this kind of mutual commitment in mind---to carry the analogy, the contract requires input and negotiation from both parties. This should result in study designs and analysis plans that assess important hypotheses in ways convincing, ex ante, to the relevant community of researchers. Ideally the contract includes a conditional commitment to publish; this may be hard to do formally, so again this is why it is important to workshop the pre-analysis plan---to gain some kind of informal commitment. When a paper comes out having been preceded by such a contract, critiques such as Gelman's would have to contend with the fact that the study was executed according to standards considered compelling ex ante in the discipline.
On index construction, I have some notes in my Quant Field Methods course on this issue. PCA and things like mean effects (or, what I prefer, inverse covariance weighted averages) have different logics. Here is a simple example. Suppose you have three variables: math standardized test score, math grades, and verbal standardized test score. For the example, suppose the two math variables are highly correlated but the verbal score exhibits weak correlation with either of the math scores. If you took, say, the inverse covariance weighted average of the standardized scores, you'd get a score that gives about 50% weight to the verbal score and 25% weight to each of the math variables. This might be considered a measure of "aptitude", like the SAT. It is an optimal combination of three variables that are considered ex ante to contribute to a common concept (aptitude), even if they are not all strongly correlated with each other (aptitude consists of different things). If you used PCA, you'd recover two dimensions, one that consists solely of the math contributions, and another that consists solely of the verbal contribution. That buys you back a degree of freedom, which is good, but that's about all you've achieved. Which approach is appropriate depends on whether you have an ex ante reason to think there is a common construct to which all three variables contribute.
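A small numerical sketch of this example (simulated data, hypothetical variable names) shows the ICW weights behaving as described: the weakly correlated verbal score gets roughly half the weight, while the two highly correlated math variables split the rest.

```python
# Sketch of inverse covariance weighting (ICW) on the comment's three-variable
# example, with simulated data: two highly correlated math measures and one
# weakly correlated verbal measure.
import numpy as np

rng = np.random.default_rng(2)
n = 2000
math_ability = rng.normal(size=n)
math_test = math_ability + 0.3 * rng.normal(size=n)   # highly correlated pair
math_grades = math_ability + 0.3 * rng.normal(size=n)
verbal = rng.normal(size=n)                            # nearly uncorrelated

z = np.column_stack([math_test, math_grades, verbal])
z = (z - z.mean(axis=0)) / z.std(axis=0)

# ICW weights: row sums of the inverse covariance matrix, normalized to 1.
w = np.linalg.inv(np.cov(z, rowvar=False)).sum(axis=1)
w = w / w.sum()
print(np.round(w, 2))  # verbal gets ~50% weight; each math variable ~25%

icw_index = z @ w      # the resulting "aptitude" index
```

The redundant math pair is down-weighted precisely because the inverse covariance matrix penalizes variables that carry overlapping information.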