The use of pre-analysis plans (PAPs) has grown rapidly over the past five or so years, with the aim of enhancing the transparency and credibility of research. In a new working paper, George Ofosu and Daniel Posner examine a large sample of these plans to see how well their actual use in practice matches up with these goals. This is a useful exercise, both for understanding where progress has been made and where improvements are still possible, and for identifying the outstanding issues that remain to be dealt with.
How are pre-analysis plans being written?
Their first exercise takes a representative sample of 195 pre-analysis plans drawn from the EGAP and AEA trial registries and written between the registries' inceptions and 2016 – almost half of which were written in 2016. 63% were for field experiments, 27% for survey experiments, 4% for lab experiments, and 4% for observational studies.
· Outcome variables: Most plans explain their key outcome variables in sufficient detail to reduce the chance of post-hoc adjustment or fishing – that is, so that two different programmers armed only with the PAP would construct the variable in the same way. This is the case for 77% of the primary outcome variables they consider. Examples that don't pass this test include plans specifying that the primary outcome will be “political participation” without saying how this variable is constructed, or an index without spelling out exactly how the index will be built.
· Control variables often are not well-specified: many PAPs say they will include baseline controls to improve precision, without specifying precisely which ones, so that control variable selection is unclear in 44% of plans.
· Only 68% spell out the precise statistical model they will use for estimation – this includes whether it is OLS or something else, how standard errors will be calculated, etc.
· Only a minority of plans specify how they will deal with outliers, attrition, and treatment imbalance: only 25% discuss how they will handle missing values and attrition, and only 8% how outliers will be treated.
· Hypotheses are clearly specified, but many plans specify a lot of them: 90% specified a clear hypothesis, but the number of hypotheses specified was often quite large: 34% specified 1-5 hypotheses, 18% 6-10, 18% 11-20, 21% 21-50, and 8% more than 50! Even when distinguishing primary from secondary hypotheses, only 42% specified five or fewer. (This resulted in some really long PAPs – the median was 11 single-spaced pages, and three were over 90 pages.) Among those specifying more than 5 hypotheses, only 28% committed to multiple testing adjustments.
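For plans with many hypotheses, committing to a specific multiple testing adjustment is cheap to do in the PAP itself. As a minimal sketch (with invented p-values, and Holm's step-down method as one reasonable choice a plan could pre-commit to), the adjustment can be spelled out in a few lines:

```python
def holm_adjust(pvals):
    """Holm step-down adjusted p-values (controls the family-wise error rate).

    Holm is uniformly more powerful than a plain Bonferroni correction,
    so it is an easy default to pre-specify for a family of hypotheses.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # ascending p-values
    adj = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        candidate = min(1.0, (m - rank) * pvals[i])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adj[i] = running_max
    return adj

# Illustrative only: six p-values from one pre-specified family of tests.
raw = [0.001, 0.012, 0.038, 0.049, 0.21, 0.64]
print(holm_adjust(raw))
```

Note how three of the four raw p-values below 0.05 no longer survive a 5% threshold once adjusted, which is exactly why committing to the adjustment ex ante matters.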
See my pre-analysis plan checklist (plus this addendum) and Chapter 6 of Christensen, Freese and Miguel’s excellent new book Transparent and Reproducible Social Science Research for guidance on writing clearer plans that address these issues. Something I did not discuss in my checklist was making very clear which are the primary and which the secondary hypotheses, and trying very hard to restrict the number of primary hypotheses – think hard about this: if you only had one table or figure with which to show the results of your study, what (ex ante) would you want that table or figure to report? A final point on choosing control variables: some researchers are now starting to say they will use machine learning methods like post-double-selection lasso to choose the controls. If you do this, you still need to specify which control variables you will select among, and how you will implement the method (e.g. how the tuning parameter will be chosen).
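To make the lasso point concrete, here is a hedged sketch of post-double-selection (in the spirit of Belloni, Chernozhukov and Hansen), with entirely synthetic data and cross-validation as one possible (and pre-specifiable) choice of tuning parameter; a real PAP would also name the candidate control set:

```python
# Sketch of post-double-selection: lasso the outcome on the candidate
# controls, lasso the treatment on the same controls, then run OLS of the
# outcome on the treatment plus the union of selected controls.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, k = 500, 30
X = rng.normal(size=(n, k))                        # candidate baseline controls
d = 0.5 * X[:, 0] + rng.normal(size=n)             # treatment depends on control 0
y = 1.0 * d + 2.0 * X[:, 1] + rng.normal(size=n)   # true treatment effect = 1

# Cross-validated penalty: the kind of detail a PAP should pin down in advance.
sel_y = np.flatnonzero(LassoCV(cv=5, random_state=0).fit(X, y).coef_)
sel_d = np.flatnonzero(LassoCV(cv=5, random_state=0).fit(X, d).coef_)
controls = np.union1d(sel_y, sel_d)

# Final step: OLS of y on a constant, the treatment, and the selected controls.
Z = np.column_stack([np.ones(n), d, X[:, controls]])
beta = np.linalg.lstsq(Z, y, rcond=None)[0]
print(f"selected controls: {controls}, estimated treatment effect: {beta[1]:.2f}")
```

The key feature is the union step: a control gets included if it predicts either the outcome or the treatment, which guards against omitting a confounder that the outcome lasso alone would have dropped.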
How does reporting match up with what was pre-specified?
For 95 of these PAPs, there is now a working paper or published paper available. The authors can therefore compare what is reported after the study with what was pre-specified.
· It is common for some pre-specified hypotheses not to be reported, and for hypotheses that were not pre-specified to be tested without any indication of this: the median paper did not report 25% of the hypotheses specified in the PAP, and 82% of papers reporting a hypothesis that was not pre-specified failed to mention this.
· One reason for the above is the concern that fidelity to the PAP, and marking of results as either pre-registered or exploratory, makes for boring papers. The authors conducted a survey of users of PAPs and note some responses: “Editors want a good story,” one PAP user lamented, “and the PAP nearly never delivers a good read—it only delivers a boring, mechanical read with no surprises or new insights.” Another researcher suggested that “papers without a strong coherent narrative are customarily rejected by journals, and a PAP nearly never produces a strong narrative.”
· Reviewers and editors don’t always pay attention to the PAP, and definitely don’t seem bound by it. The authors note from their survey responses that some users feel PAPs offer the worst of both worlds – authors feel constrained from investigating interesting threads that emerge in their analysis, while still being left open to demands from reviewers and editors for endless robustness tests. They note the need for norms to develop further around how PAPs are used by the research community.
These findings correspond with my own experiences. A few thoughts on this. First, the authors don’t discuss whether working papers and published papers differ in how they line up against the PAP. Typically, my initial versions of a working paper look much more like the PAP than the final published version does. Partly this comes from pre-specifying a number of intermediate mechanisms that end up not operating – these get moved first to appendices, and then may eventually get dropped from the final paper as the appendix gets bloated with more and more requested robustness checks. Second, this request for lots of extra checks is where we may see a big advantage of registered reports over PAPs. As a reviewer or editor of a paper that has a PAP, you can easily claim that it is not your fault the author didn’t think to look at X, which is obviously very important, and so feel justified in asking them to do it. In contrast, with results-free review, the reviewer and editor agree with the author ex ante on what is to be examined and reported, which should hopefully lead to far fewer robustness checks. However, I agree with the sentiment that complete adherence to the PAP can be boring – I want authors to bring in descriptive data, qualitative information, new thoughts, and their own exploratory hypotheses to help explain the results they got, just so long as they indicate that this is exploratory and post-hoc.
So are PAPs making research more transparent and credible?
The authors note the difficulty of answering this question, since we don’t have good counterfactuals for what the same research would have reported without a PAP. While the discussion above notes that many PAPs fall short of the ideal, most do take active steps to reduce the scope for fishing and post-hoc hypothesizing. And here’s a research idea for someone to play with to help provide more evidence on whether PAPs make research more credible:
1. Take an experiment which you have conducted but not yet analyzed, or a policy change that will occur (like Neumark’s original pre-specified work on minimum wage changes), and assemble a group of independent researchers who you will ask to analyze this data for you (e.g. an organization could independently contract a number of researchers as consultants, or a professor could recruit students in an online class where the students don’t all talk to one another). You want them to face incentives for generating interesting results as close as possible to those regular researchers face.
2. Randomly allocate these researchers into two or three groups:
a. No PAP group: you tell them you want them to analyze the impact of the minimum wage change on employment outcomes, or to see what the impact of your cash transfer program was, and then ask them to report their results to you.
b. External PAP group: you give this group a PAP you have written, and ask them to conduct the analysis needed.
c. Two-stage group (this would be nice to also have if you have enough sample): you first ask them to send you a PAP of what they will do, and then send them the data once you have their PAP.
3. Compare the results of the three groups using your favorite tools for detecting p-hacking, fishing, HARKing, etc (see Christensen, Freese and Miguel’s book for some of these).
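One simple version of step 3 would be a “caliper” test: under no fishing, reported p-values just below and just above the 0.05 threshold should be roughly equally common, so an excess just below the threshold is a red flag. A minimal sketch, with all p-values invented for illustration and hypothetical group names:

```python
# Caliper-style check for p-hacking: count p-values just below vs. just
# above the 0.05 threshold, and run a sign test on the two counts.
from math import erf, sqrt

def caliper_counts(pvals, threshold=0.05, width=0.01):
    """Counts of p-values in [threshold - width, threshold) and [threshold, threshold + width)."""
    below = sum(threshold - width <= p < threshold for p in pvals)
    above = sum(threshold <= p < threshold + width for p in pvals)
    return below, above

def binom_sign_test(below, above):
    """Two-sided normal approximation to a sign test of the two counts.

    With no p-hacking, below ~ Binomial(below + above, 0.5).
    """
    n = below + above
    if n == 0:
        return 1.0
    z = (below - n / 2) / sqrt(n / 4)
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# Invented p-values from the two (hypothetical) researcher groups.
no_pap = [0.041, 0.043, 0.044, 0.046, 0.048, 0.049, 0.052, 0.12]
with_pap = [0.008, 0.031, 0.047, 0.055, 0.058, 0.21, 0.44, 0.71]

for name, pvals in [("no PAP", no_pap), ("with PAP", with_pap)]:
    below, above = caliper_counts(pvals)
    print(f"{name}: {below} just below 0.05, {above} just above, "
          f"sign-test p = {binom_sign_test(below, above):.3f}")
```

This is only one of the detection tools on offer – p-curve analysis and specification-curve comparisons across the randomized groups would be natural complements, and Christensen, Freese and Miguel’s book covers several of them.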
Clearly these three steps themselves ironically need a lot more refinement before my pre-analysis plan for a study of the value of pre-analysis plans is itself credible, but maybe it is the seed of an idea for someone.