Published on Development Impact

A pre-analysis plan is the only way to take your p-value at face value

Andrew Gelman has a post from last week that discusses the value of preregistration of studies as being akin to the value of random sampling and RCTs, which allow you to make inferences without relying on untestable assumptions. His argument, which is nicely described in this paper, is that we don’t need to assume nefarious practices by study authors, such as specification searching or selective reporting, to worry about whether the p-value reported in the paper we’re reading is correct.

Instead, our worry can simply stem from the fact that the analysis that was performed, even if it was the only analysis of the data at hand, is likely a function of the data itself. Had you drawn another sample, you might have done a different analysis given those data. It’s not hard to come up with examples: two proxy variables for the same concept may have different levels of noise in different data sets, so the least noisy one (correctly or not) may get chosen for the analysis. Or, if you find a main effect, you may not look for heterogeneity, but if you don’t find one, you might. But that is exactly what violates the interpretation of the p-value of your test…
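To make the heterogeneity example concrete, here is a minimal simulation sketch. It is not from Gelman's post or paper; the sample size, the known-variance z-test, the single binary subgroup, and all function names are assumptions chosen for illustration. Every simulated study has no true effect. A pre-specified analyst tests only the main effect, while a data-dependent analyst looks at the two subgroups only when the main effect is not significant.

```python
import random
from statistics import NormalDist


def two_sided_p(treated, control, sd=1.0):
    """Two-sided z-test p-value for a difference in means (outcome SD assumed known)."""
    se = sd * (1 / len(treated) + 1 / len(control)) ** 0.5
    diff = sum(treated) / len(treated) - sum(control) / len(control)
    return 2 * (1 - NormalDist().cdf(abs(diff / se)))


def treatment_p(rows):
    """p-value for the treatment effect among (treated, subgroup, outcome) rows."""
    treated = [y for t, _, y in rows if t == 1]
    control = [y for t, _, y in rows if t == 0]
    return two_sided_p(treated, control)


def simulate_study(rng, n=200, alpha=0.05):
    """One study with NO true effect. Returns (prespecified, data_dependent) rejections."""
    # Balanced design: treatment alternates by unit, subgroup alternates by pair.
    rows = [(i % 2, (i // 2) % 2, rng.gauss(0.0, 1.0)) for i in range(n)]
    main_rejects = treatment_p(rows) < alpha
    if main_rejects:
        # Main effect "found": the analyst stops and reports it.
        return True, True
    # No main effect: the analyst goes looking for heterogeneity.
    subgroup_rejects = any(
        treatment_p([r for r in rows if r[1] == g]) < alpha for g in (0, 1)
    )
    return False, subgroup_rejects


def rejection_rates(n_sims=2000, seed=0):
    """Share of null studies reporting a 'significant' result under each approach."""
    rng = random.Random(seed)
    results = [simulate_study(rng) for _ in range(n_sims)]
    prespecified = sum(r[0] for r in results) / n_sims
    data_dependent = sum(r[1] for r in results) / n_sims
    return prespecified, data_dependent


if __name__ == "__main__":
    pre, dep = rejection_rates()
    print(f"pre-specified: {pre:.3f}, data-dependent: {dep:.3f}")
```

Each simulated analyst runs only a small number of tests and never hides anything, yet in this setup the data-dependent path should reject the true null noticeably more often than the nominal 5 percent, because which test gets reported depends on the data.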

Now to the analogy between random sampling, randomized controlled trials, and preregistration:

"Just as a serious social science journal—or even Psychological Science or PPNAS—would never accept a paper on sampling without some discussion of the representativeness of the sample, and just as they would never accept a causal inference based on a simple regression with no identification strategy and no discussion of imbalance between treatment and control groups, so should they not take seriously a p-value without a careful assessment of the assumptions underlying it.

Or you can have random sampling, or you can have a randomized experiment, or you can have preregistration. These are methods of design of a study that make analysis less sensitive to assumptions."

I agree with this. A few more thoughts follow, some on issues that come up in the rest of the post and in the (as usual) excellent comments section, and a couple that are my own contributions:
  • As Gelman notes, a pre-analysis plan (PAP) is no guarantee for internal validity: things can go wrong even after a pre-analysis plan, such as non-response, non-compliance, unanticipated patterns in the data, etc. What to do then?
    • As mentioned in the comments, depart from your PAP in a clear and transparent way.
    • Lin and Green (2016) describe the idea of standard operating procedures for situations that were not covered by the PAP. What that might look like is here (Lin, Green, and Coppock 2015). At the very least, if you find yourself in a situation that was unexpected or that you had not thought about, you can revert to these procedures if the situation is covered therein.
    • A pre-analysis plan should likely be the exact code that turns the raw data into clean data and then runs the exact analysis that you have pre-specified. This may be difficult to do ex ante, but departing from the PAP, as we said before, is not the end of the world – just do it transparently and explain the reason(s). Using baseline data to get an idea of the data and then writing the PAP for the follow-up analysis does not strike me as cheating. If you had not written a PAP beforehand, you could also have only a random sub-sample of your data revealed to you, examine those data to make your decisions about data cleaning and analysis, and then report the results on the rest of your data.
  • These days, it is not uncommon to receive referee requests from journal editors for studies whose pre-analysis plans were not followed in the submitted manuscript. Reading papers like that, which usually come with appendices running to triple-digit page counts, can be annoying, to say the least. But one can understand the authors’ motivations and reasoning: journal articles in economics do not have templates for such reporting. Rather, they tell a story with a narrative. Ben Olken has written about the difficulty of putting together a PAP here. I don’t have a solution, but if we are to have papers that adhere to the PAP, PAPs have to be shorter and economics papers need the equivalent of CONSORT guidelines. Then, you can have secondary papers, perhaps in another journal (or in the same journal, but as a separate paper clearly demarcated as secondary analysis), that accompany the straight-laced first one. That way, you can have a PAP but also do what Olken nicely describes in his paper: “Hypotheses are often themselves conditional on the realizations of other, previous hypothesis tests: the precise statistical question a paper might tackle in Table 4 depends on the answer that was found in Table 3; the question posed in Table 5 depends on the answer in Table 4, and so on.”
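
The split-sample idea from the list above (have only a random sub-sample of the data revealed to you, make your cleaning and analysis decisions on it, then report results on the rest) could be set up along the following lines. This is only a sketch under assumptions: the function name, the 30 percent exploration share, and the seed are hypothetical, and in practice the split would ideally be made and documented before anyone looks at the outcomes.

```python
import random


def split_for_exploration(unit_ids, explore_share=0.3, seed=12345):
    """Randomly split units into an exploration sample and a held-out confirmation sample.

    The seed and the 30 percent share are illustrative assumptions, not recommendations.
    """
    ids = sorted(unit_ids)      # fixed starting order so the split is reproducible
    rng = random.Random(seed)   # dedicated generator, independent of global state
    rng.shuffle(ids)
    n_explore = round(len(ids) * explore_share)
    return ids[:n_explore], ids[n_explore:]


if __name__ == "__main__":
    explore, confirm = split_for_exploration(range(1, 1001))
    # Decide on cleaning rules and specifications using `explore` only,
    # then run the frozen analysis once on `confirm` and report that.
    print(len(explore), len(confirm))
```

Fixing the seed and sorting the identifiers first means that anyone with the same list of units can reproduce the split, which helps when documenting what was and was not seen before the analysis was chosen.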

Let’s conclude with the last paragraph of Gelman’s post:

“To my mind, the analogy between random sampling, random assignment, and preregistration is excellent. These are three parallel ideas, and to me it seems like just an accident of history that the first two of these ideas are in every statistics textbook and are considered the default approach, whereas the third idea is only recently gaining popularity. Perhaps this has to do with analyses becoming open-ended. Perhaps fifty years ago there were fewer choices in data collection, processing, and analysis—fewer “researcher degrees of freedom”—so the implicit assumption underlying naively computed p-values was closer to actual practice.”

This is all fair, but what makes PAPs more needed than ever also makes them more difficult to write and puts a fair amount of burden on reviewers. Let us know next time you get a 40-page paper to review that comes with a 120-page appendix, including a PAP. If there are enough accounts of different issues that arise, different experiences, and thoughts, we can publish them here…


Berk Özler

Lead Economist, Development Research Group, World Bank
