The demand for pre-analysis plans that are registered at a public site, prior to analysis and available for anyone to examine, has recently increased in the social sciences, leading to the establishment of several social science registries. David recently included a link to Ben Olken's JEP paper on pre-analysis plans in Economics. I recently came across a paper by Humphreys, de la Sierra, and van der Windt (HSW hereon) that proposes a comprehensive nonbinding registration of research. The authors end up agreeing with Ben on a number of issues, but still come out favoring a very detailed pre-analysis plan. As they also report on a mock reporting exercise, and as I am in the midst of writing a paper that utilized a pre-analysis plan and am struggling with some of the difficulties identified in this paper, I thought I'd link to it and quickly summarize it before ending the post with a few of my own thoughts.
HSW is not a new paper (gated) – it's from 2013, but I just came across it last week. HSW start by stating that even in studies that provide a general pre-analysis plan, which describes the primary and secondary outcomes, there remains a lot of latitude for "fishing" at the analysis stage: choice of covariates, heterogeneity analysis, dichotomizing the outcome variable(s), and choice of statistical model. They demonstrate which of these can be particularly important (covariates in studies with small sample sizes; heterogeneity analysis in all studies) and which are not (LPM vs. probit vs. logit). They also make the important distinction between fishing and a multiple comparisons problem: it might be that a researcher specifies (in the presence of true uncertainty about outcome indicators) multiple indicators ex ante and reports all of them in compliance with the pre-analysis plan. This analysis might have a multiple comparisons problem, which can be tackled in a variety of ways, but it is not fishing. Selecting only the statistically significant ones for reporting, however, is. Conversely, it is also possible to fish while being responsive to multiple hypothesis testing concerns if one corrects p-values but reports only the remaining statistically significant outcomes. This is an important distinction, not all of the implications of which are universally agreed upon among researchers.
After demonstrating the scope for fishing even in preregistered studies, HSW move on to a more extreme exercise. Subsequent to writing a pre-analysis plan that was more general than specific for an ongoing field study in DRC, i.e. more in line with existing pre-analysis plans in registries, the authors decided to write a mock report using real outcome data gathered during the first few months of data collection but scrambling the treatment indicator. They then shared this report with implementation partners to agree on a final design and to protect the research from biases (on either the researcher or the implementer side).
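The key mechanical step in HSW's exercise is scrambling the treatment indicator so that the mock report is written against placebo assignments while real outcome data stay intact. A minimal sketch of that step (my own, with hypothetical variable names – HSW do not publish their code here) might look like:

```python
import random

def scramble_treatment(records, seed=0):
    """Return a copy of the data with treatment labels reshuffled across
    units, so a mock report can be written against placebo assignments
    while the real outcome data remain untouched."""
    rng = random.Random(seed)
    labels = [r["treated"] for r in records]
    rng.shuffle(labels)
    return [dict(r, treated=t) for r, t in zip(records, labels)]

# hypothetical mini-dataset: real outcomes, real assignments
data = [{"id": i, "treated": i % 2, "outcome": 10 + i} for i in range(6)]
mock = scramble_treatment(data)

# the marginal distribution of treatment is preserved, outcomes untouched
assert sorted(r["treated"] for r in mock) == sorted(r["treated"] for r in data)
assert [r["outcome"] for r in mock] == [r["outcome"] for r in data]
```

Because only the labels move, any "effects" in the mock report are noise by construction, which is what lets the authors rehearse the write-up without peeking at real treatment contrasts.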
When the authors reflect on the mock reporting exercise, they find a couple of modest and not entirely surprising benefits: (i) locking in the outcomes and the definition of success, so that pressure from implementers cannot lead to changes ex post; and (ii) identifying gaps and flaws in survey design. But they also identify a number of difficulties:
Too many potential findings: The authors had a hard time writing (ex ante) results for all possible scenarios. One issue here relevant for pre-analysis plans is how much you simplify beforehand so that you can also speak to interpretation ahead of time: this is clearly a very unattractive aspect of very strict and binding pre-analysis plans when the subjects we study are complex, context-specific, and not always easily predicted ahead of time. The authors conclude:
The problem revealed here however is a deep one, but one we think is common in political analyses and may stem in part from the complexity of the subjects we study: even though we specify a set of individual hypotheses ex ante we nevertheless often engage in post hoc theorization to make sense of the whole collection of the findings. This problem is quite dramatically exposed in the Congo experiment by the difficulty of writing up an interpretation of a random set of findings. An advantage of the mock report in our case was that it clarified that whereas we could reasonably register how we would interpret results from tests of individual hypotheses our interpretation of overall patterns should be rightly thought of as ex post theorizing.
The loss of the option value of waiting: A preregistration process moves the hard work that normally takes place during the data analysis phase to before data collection. This can be unattractive if various types of uncertainty get resolved over time: methods improve (not a straw man – even the topic we're discussing right now is in complete flux); projects fail; some analysis is contingent on the findings of other analysis; etc. The authors argue that a strict commitment to the pre-analysis plan comes at the cost of giving up these options, but that the solution is simple: use the pre-analysis plan as a communication rather than a commitment device. In other words, do post-analysis hypothesizing (HARKing, as I linked to last Friday), but clearly delineate it from the analysis that sticks to the pre-analysis plan.
The difficulty of specifying analyses without access to the real data: As we found when we wrote our pre-analysis plan, it is hard to make certain decisions about construction of indices, pre-specification of baseline covariates, etc. – especially if the commitment has to be hard and binding. HSW’s response to this difficulty is again to make the preregistration process not prevent changes in plans in response to new information – as long as they are clearly identified and rigorously explained and defended.
I agree with all of these difficulties, but worry that the current lack of agreement in the field about these rules/guidelines means that the researcher faces a huge amount of uncertainty when writing the pre-analysis plan. When the time comes (after years of intense work) to submit the paper for publication, will she draw a referee who is such a hard-liner for pre-analysis plans that any deviation from it – regardless of how transparent – will cause the referee to go into a tizzy and recommend rejection? This uncertainty influences key decisions at every step of the way: whether to write a pre-analysis plan; what type of pre-analysis plan to write (short and more generic, or super duper detailed); how much to adhere to the pre-analysis plan when writing the paper; at the limit, even the question(s) asked in a research project. And that can be damaging to the very science whose integrity and reliability we're trying to protect and improve – by making us choose sub-optimal methods or, worse, causing some to opt out of asking certain questions or undertaking some studies altogether. The discipline to avoid fishing comes at a cost. Again the authors:
If there are no researchers who engage in fishing…, then results in the complier group are identical to what would arise under binding registration, but results in the non-complier group would be more reliable. This illustrates that binding registration comes at a cost: it prohibits modifications to the analysis plan that are justified for reasons exogenous to the results. The gains from the reduction in fishing that a binding registration would generate must compensate for the loss in research quality that it induces by imposing a constraint on reasonable modifications. (emphasis added)
HSW’s response to the difficulties encountered during their mock reporting trial is not to give up on comprehensive ex ante reporting standards, but rather to make the pre-analysis plan nonbinding, so that the document becomes a tool for transparency rather than one that ties the hands of the researchers completely. Hence, they advocate a system in which researchers planning observational or experimental studies to test hypotheses, using data not available to them at the outset, produce comprehensive analysis and reporting plans (including interpretations of patterns of findings, as far as is humanly possible), but are also allowed to deviate from these stated plans IF they make the deviations clear.
This, it turns out, is pretty similar to the broad outlines of what my co-authors and I decided to do after weeks-long discussions (although I’m still leaning towards shorter pre-analysis plans rather than detailed mock reports). We have a pre-analysis plan registered about 18 months ago, but for certain decisions we already know that tweaks to the stated plan would have been better. There is also additional analysis that is required, conditional on other findings (from pre-specified analysis), without which the paper would be much less informative. So, as the authors also suggest, there will be three types of evidence in our final paper describing the longer-term effects of a cluster-RCT: (i) completely experimental reporting of primary and secondary outcomes that adheres fully to the predetermined specifications in the registered pre-analysis plan; (ii) similar preregistered analysis that deviates from the pre-analysis plan (with deviations clearly outlined and reasons argued); and (iii) analysis that was not preregistered (mostly non-experimental intensive-margin analysis and/or heterogeneity analysis that delves into the mechanisms behind observed effects). It’s OK for a single article to contain these three components: “The critical thing is the development of signposting so that readers have clarity regarding the status of different types of claims.”
Let me pose a question of my own. One issue that is not entirely resolved in HSW is the issue of multiple comparisons. As discussed above, a pre-analysis plan that specifies multiple outcome indicators and reports all of them cannot be accused of fishing. However, there is still a question of how far one would go with multiple comparisons corrections in interpreting the findings. Suppose that an intervention for children could cause changes in a whole lot of adult outcomes – specifically in two distinct domains: A and B. If the intervention affects a under A, then it might have knock-on effects on x, y, and z. Similarly, if it affects b under B, then it might also alter t, u, and v. It’s pretty much impossible to know ahead of time which of these might happen, but it is important to find out if they do. Suppose that the intervention has no effect on a, and therefore causes no changes in x, y, and z. You find, however, that it did significantly affect b, but with a knock-on effect only on t and not on u or v. When you report a table with four outcomes under domain A and another table with four outcomes under domain B, do you correct for multiple hypothesis testing using (a) all eight variables; (b) the four variables in each table; (c) not at all; or (d) all of the above, discussing alternative interpretations under each? I feel that this question remains unresolved and is important in many cases of long-term studies of large interventions, where the number of socioeconomic domains that might be affected is large and increases over time. Under the strictest interpretations, most studies would have no statistically significant results if they were to correct for all hypothesis tests in the study – simply because we usually don’t have the sample sizes and the hugely effective programs with p-values < 0.0001 that would survive such corrections.
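To make the (a)-vs-(b) choice concrete, here is a small pure-Python sketch using the Holm step-down correction (one of several valid familywise corrections; the p-values are made up to match the story above, with domain A showing nothing and domain B showing a clear effect on b and a borderline one on t):

```python
def holm_adjust(pvals):
    """Holm step-down adjusted p-values (controls the familywise error rate).

    Sort p-values ascending, multiply the i-th smallest by (m - i),
    and enforce monotonicity with a running maximum."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * pvals[i]))
        adjusted[i] = running_max
    return adjusted

# hypothetical raw p-values: domain A (a, x, y, z) shows nothing,
# domain B (b, t, u, v) has b clearly significant and t borderline
domain_a = [0.30, 0.45, 0.60, 0.70]
domain_b = [0.004, 0.030, 0.20, 0.50]

within = holm_adjust(domain_b)             # option (b): per-table family
across = holm_adjust(domain_a + domain_b)  # option (a): all eight outcomes

print([round(p, 3) for p in within])  # b survives at 5%, t does not
print(round(across[4], 3))            # b still survives across all eight
```

In this toy example the headline effect on b survives either family definition, but the borderline knock-on effect on t is already gone under the per-table correction – which is why the choice of family, made ex ante, can matter so much for the story a paper gets to tell.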
Another interesting idea mentioned in the paper, which I see gaining steam, is the notion of partitioning your data set into training and testing parts, perhaps even writing your pre-analysis plan after exploring one half of your data. This is quite nice and rigorous but, obviously, much more expensive for a given level of statistical power. HSW also touch on the implications for the publication process – editors and reviewers potentially reviewing pre-analysis plans without data, and even giving provisional acceptances on the basis of such preregistered study designs (I know many people, including myself, remain skeptical about the feasibility of this idea in the near future, but we could be wrong). They conclude by suggesting a period of experimentation with registry structures.
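The split-sample idea boils down to one mechanical commitment: fix the partition (with a recorded seed) before any exploration, so the held-out half stays untouched until the pre-analysis plan is locked. A minimal sketch, with hypothetical names and a made-up sample of 1,000 unit IDs:

```python
import random

def split_for_exploration(ids, seed=123, train_share=0.5):
    """Partition unit IDs into an exploration half (used to develop the
    pre-analysis plan) and a confirmation half (held out for the final,
    preregistered tests). The seed makes the split reproducible/auditable."""
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_share)
    return sorted(shuffled[:cut]), sorted(shuffled[cut:])

explore, confirm = split_for_exploration(range(1000))
assert len(explore) == 500 and len(confirm) == 500
assert not set(explore) & set(confirm)  # no unit appears in both halves
```

The power cost mentioned above falls out directly: the confirmatory tests run on only half the sample, so standard errors are roughly √2 larger than they would be on the full data.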
We are in exactly that period whether you wish to be or not…