Pre-analysis plans increase the chances that published results are true by restricting researchers’ ability to data-mine. Unfortunately, writing a pre-analysis plan isn’t easy, nor is it without costs, as discussed in recent work by Olken and by Coffman and Niederle. Two recent working papers - “Split-Sample Strategies for Avoiding False Discoveries,” by Michael L. Anderson and Jeremy Magruder (ungated here) and “Using Split Samples to Improve Inference on Causal Effects,” by Marcel Fafchamps and Julien Labonne (ungated and updated here) - propose some very clever refinements to address some of the challenges inherent in pre-analysis plans.

Two of the big problems are that (a) it is hard to formulate the best way to test a hypothesis without looking at an associated dataset, and (b) even if one knew the best way to test a hypothesis, most papers perform a series of tests, each associated with the outcome of a previous test. Coffman and Niederle’s discussion of the first problem suggests that if an experiment is inexpensive enough that it can be replicated, then the first round of exploratory work can always be replicated by a second round of confirmatory work. Olken refers to the second problem as “pre-specifying the entire ‘analysis tree,’” the combinatorics of which quickly become intractably onerous without the ability to know some of the patterns of results in advance. The two new papers basically wed these insights, formalize them statistically, and present a solution: split your experimental dataset; use the first piece of the dataset for exploratory work, choosing which hypotheses to test and how to test them; refine a pre-analysis plan using this exploratory work; and, plan in hand, use the second piece to actually perform the tests, in essence performing a replication of the exploratory work.

But, having split the sample, a smaller testing dataset will surely reduce statistical power – the chance that you’ll actually detect an effect if it is there. Or will it? Both papers have a common departure point. Statistical power is the probability of detecting a nonzero effect, conditional on a coefficient truly being nonzero. However, the probability that you – the researcher – successfully detect such an effect also depends on something else: the probability that the coefficient you’ve decided to estimate (true beta, not beta-hat) is actually nonzero. Since that isn’t a sure thing, split-sample methods can increase the odds that you succeed at rejecting a false null hypothesis – and score points for science.

Bear with me, here comes the math. (Choose your own adventure: if you aren’t sufficiently caffeinated for the math right now, just skip down a few paragraphs.)

**Fafchamps and Labonne**

Fafchamps and Labonne take an approach involving a parameter, *psi* - the likelihood that, when writing a pre-analysis plan uninformed by actual experimental data, a researcher tests a hypothesis for which the null is indeed not true (i.e. where there is truly a non-zero coefficient).

Here’s the basic idea. Consider what would happen if a researcher correctly worked out that the statistical power for a test was 0.99. Very nice. However, what if there was only a 50 percent chance (*psi*) that the test was an interesting one to perform? The other half the time, the researcher tests a hypothesis for which the null is true and there is no effect to find. That means the probability of detecting a true effect is only 0.5*0.99 = 0.495. A great dataset with slim odds of a discovery: a loss for science.

Fafchamps and Labonne’s suggestion is to split the sample, and test the intended hypothesis, in all the forms one can think of, in the first half. Power for this search process is lower, since the sample is smaller: 0.86. But that means that if the researcher tries all relevant hypotheses, there is an 86 percent chance of detecting it, conditional on stumbling on the right one. Then, whatever hypothesis the researcher picks in the first round, she tests in the second round, having written a more-informed pre-analysis plan. Power? 0.86 again in the second half. Take the product of those two numbers to find the probability of detecting the right formulation in the first half and then having it pass the test in the second half: 0.74. Voila: 74 percent power instead of 49.5 percent power. Progress!
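For readers who want to check the arithmetic in this example, here is a minimal Python sketch. It assumes a two-sided z-test at the 5 percent level and a normal approximation, neither of which is stated in the post; the helper `reject_prob` and the variable names are my own, not the authors'.

```python
from statistics import NormalDist

Z = NormalDist()   # standard normal distribution
ALPHA = 0.05       # assumed two-sided test size (not stated in the post)


def reject_prob(delta, crit):
    """Chance that |z| exceeds crit when the z-statistic is centered on delta."""
    return Z.cdf(delta - crit) + Z.cdf(-delta - crit)


crit = Z.inv_cdf(1 - ALPHA / 2)

# Expected z-statistic implied by power 0.99 in the full sample
# (the far rejection tail is negligible and ignored here).
delta_full = crit + Z.inv_cdf(0.99)

# Halving the sample shrinks the expected z-statistic by a factor of sqrt(2).
delta_half = delta_full / 2 ** 0.5
power_half = reject_prob(delta_half, crit)

psi = 0.5                      # chance the pre-specified test is the right one
no_split = psi * 0.99          # plan written blind, tested on the full sample
split = power_half ** 2        # found in the first half AND confirmed in the second

print(f"half-sample power:        {power_half:.2f}")
print(f"blind plan, full sample:  {no_split:.3f}")
print(f"split-sample strategy:    {split:.2f}")
```

Note that the split-sample figure takes for granted that the first-half search includes the right formulation, exactly as the example above does.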

**Anderson and Magruder**

Anderson and Magruder operationalize Olken’s distinction between “primary” and “secondary” hypotheses: a primary hypothesis is about a “key variable of interest,” while a secondary hypothesis is of lesser (or perhaps conditional) importance. Anderson and Magruder consider two parameters: u_h, the importance associated with a hypothesis; and p_h, the prior probability that the null for that hypothesis is actually false. A hypothesis with large values of u_h and p_h is likely to be considered “primary.” If either of these is sufficiently small, however, it costs more in power than it yields in expectation to include the hypothesis in a pre-analysis plan.

Here’s the basic idea. Consider that there is a main hypothesis tested with power 0.8 in the full dataset. Consider a second hypothesis, with the same statistical power on its own, but for which an accurate prior is that there is only a 10 percent chance that the null is false – that the underlying coefficient is actually nonzero. If the researcher tests both hypotheses, she should adjust for multiple testing. The Bonferroni correction to the p-value means that there is now only power 0.71 on each hypothesis. But since the second null hypothesis had only a 10 percent chance of being false, this means we have sacrificed 9 percentage points of statistical power on the first hypothesis (0.80-0.71) while only gaining a 7 percent chance of an additional hypothesis being rejected (0.71*0.10). If the researcher’s objective function is expected total hypotheses rejected, this is a bad deal (0.71 + 0.07 = 0.78 instead of 0.80). A loss for science.
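The Bonferroni numbers in this example can be reproduced with a short Python sketch. As before, it assumes two-sided z-tests at an overall 5 percent level under a normal approximation; the function and variable names are illustrative only, and the expected count includes only rejections of nulls that are actually false.

```python
from statistics import NormalDist

Z = NormalDist()
ALPHA = 0.05   # assumed overall test size


def power_two_sided(delta, alpha):
    """Power of a two-sided z-test of size alpha when the expected z-statistic is delta."""
    crit = Z.inv_cdf(1 - alpha / 2)
    return Z.cdf(delta - crit) + Z.cdf(-delta - crit)


# Expected z-statistic that gives power 0.80 at the unadjusted 5 percent level.
delta = Z.inv_cdf(1 - ALPHA / 2) + Z.inv_cdf(0.80)

power_alone = power_two_sided(delta, ALPHA)      # main hypothesis tested alone
power_bonf = power_two_sided(delta, ALPHA / 2)   # Bonferroni with two tests

prior = 0.10   # chance the secondary null is actually false

expected_one_test = power_alone
expected_two_tests = power_bonf + prior * power_bonf

print(f"power with Bonferroni:          {power_bonf:.2f}")
print(f"expected rejections, one test:  {expected_one_test:.2f}")
print(f"expected rejections, two tests: {expected_two_tests:.2f}")
```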

Anderson and Magruder’s suggestion is to split the sample, but to then do something they call “hybrid.” Leave the main hypothesis alone: it will be tested in the full sample, regardless. It can be in the pre-analysis plan from the very beginning. But use a little bit of the data, perhaps 30 percent, to try out the second hypothesis. That’s a small sample, so be lenient: look for an absolute T-statistic of 1.2, for example. Conditional on the 10 percent chance that there is an effect to detect, there is a 63 percent chance of detecting the second effect in this 30-percent sample. (Of course, conditional on the 90 percent chance that there is really nothing to detect, there is also a good chance of a false positive: 23 percent, under the null.) Now, if the secondary hypothesis doesn’t pass the threshold, the researcher just gets to do the one main test; this happens 0.1*(1-0.63) + 0.9*(1-0.23) ≈ 72.9 percent of the time. So the nice feature of the hybrid approach is that, much of the time, the main hypothesis doesn’t need a multiple test correction. Its power ends up being 0.729*0.8 + (1-0.729)*0.71 ≈ 77.6 percent.
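Here is a rough Python sketch of the screening step, under the same assumptions as the other calculations in this post (normal approximation, two-sided tests at the 5 percent level, and an expected z-statistic that scales with the square root of the sample share). The names are mine, chosen for illustration.

```python
from statistics import NormalDist

Z = NormalDist()
ALPHA = 0.05   # assumed overall test size


def reject_prob(delta, crit):
    """Chance that |z| exceeds crit when the z-statistic is centered on delta."""
    return Z.cdf(delta - crit) + Z.cdf(-delta - crit)


# Expected z-statistic that gives power 0.80 in the full sample.
delta = Z.inv_cdf(1 - ALPHA / 2) + Z.inv_cdf(0.80)

share = 0.30       # fraction of the data used for screening
threshold = 1.2    # lenient |T| cutoff in the screening sample
prior = 0.10       # chance the secondary null is actually false

pass_if_real = reject_prob(delta * share ** 0.5, threshold)
pass_if_null = reject_prob(0.0, threshold)

# Probability that the secondary hypothesis is screened out, so the main
# hypothesis is tested without any multiple-testing adjustment.
p_screened_out = prior * (1 - pass_if_real) + (1 - prior) * (1 - pass_if_null)

power_plain = 0.80
power_bonf = reject_prob(delta, Z.inv_cdf(1 - ALPHA / 4))   # Bonferroni, two tests
main_power = p_screened_out * power_plain + (1 - p_screened_out) * power_bonf

print(f"pass screen if effect is real: {pass_if_real:.2f}")
print(f"pass screen under the null:    {pass_if_null:.2f}")
print(f"screened out:                  {p_screened_out:.3f}")
print(f"power on main hypothesis:      {main_power:.3f}")
```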

When the secondary hypothesis does pass the threshold, Anderson and Magruder have another suggestion: just do a one-sided test for it. After all, it is wildly unlikely that, if a real effect is at work, it would turn up with the right sign in one sample split but with the opposite sign in the other. So: test for only the sign that appeared in the first split of the data. (This is a clever way to use a little bit of information from the first split of the data to increase the power of your test in the second split.) With 70 percent of the data remaining, a one-sided test with Bonferroni adjustment (since it is the second hypothesis) has power 0.65. How many hypotheses will be rejected in expectation? 0.776 + 0.1*0.63*0.65 = 0.817. If the researcher’s objective function is expected total hypotheses rejected, this is a better deal (0.817 instead of 0.800). Progress!
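The one-sided step can be sketched the same way. This snippet takes the 0.776 and 0.63 figures from the text as inputs and assumes, as the example does, that the sign seen in the first split matches the true effect; the normal approximation and the variable names are my own assumptions.

```python
from statistics import NormalDist

Z = NormalDist()
ALPHA = 0.05   # assumed overall test size

# Expected z-statistic that gives power 0.80 in the full sample.
delta = Z.inv_cdf(1 - ALPHA / 2) + Z.inv_cdf(0.80)

# The confirmation test uses the remaining 70 percent of the data.
delta_confirm = delta * 0.70 ** 0.5

# One-sided test at the Bonferroni-adjusted level ALPHA / 2.
crit_one_sided = Z.inv_cdf(1 - ALPHA / 2)
power_one_sided = Z.cdf(delta_confirm - crit_one_sided)

# Inputs taken from the text above.
main_power = 0.776     # power on the main hypothesis under the hybrid plan
prior = 0.10           # chance the secondary null is actually false
pass_if_real = 0.63    # chance a real secondary effect passes the screen

expected_rejections = main_power + prior * pass_if_real * power_one_sided

print(f"one-sided power, 70% sample: {power_one_sided:.2f}")
print(f"expected rejections, hybrid: {expected_rejections:.3f}")
```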

The math is over. Now to wrap up.

There were three pretty innovative tricks in these papers. The first is splitting the sample. Though Anderson and Magruder point out that splitting the sample has been used for various purposes in statistics for more than 80 years, this application is a new one. Split-sample approaches help a pre-analysis plan when, *ex ante*, you can’t precisely characterize the hypotheses you would like to test, or the exact weights you attach to the importance of testing them. They provide more power than guessing the hypotheses, but less power than if you had been sure of the hypotheses from the get-go. The two other tricks? Using a hybrid pre-analysis plan approach; and the one-sided test in the second split. This last trick (the one-sided test in the second slice of the data using the sign from the first slice of the data) improves statistical power, and is one of the very few situations I can think of in which a one-sided test in a pre-analysis plan both legitimately preserves test size and doesn’t risk missing unanticipated negative results – after all, the impacts of new interventions may surprise you! (Examples come to mind in education, cash transfers, and public works programs, to name a few.)

My discussion vastly oversimplifies both papers. I used the Bonferroni correction, but both papers consider a variety of multiple-testing adjustments, including those that, like Bonferroni, control the family-wise error rate (FWER: the probability of getting at least one false rejection), as well as those that control the false discovery rate (FDR: the fraction of rejections that are incorrect). The methods work, whichever approach you take.

The Fafchamps and Labonne paper goes on to discuss how this approach might reorganize other aspects of the research process: data management might be divided between the portion of a research team that controls and anonymizes the whole dataset and a separate group that formulates and tests hypotheses in the split-sample while writing the pre-analysis plan; journals might accept papers based only on the pre-analysis plan and the analysis in the first half of the dataset, without knowing what remains significant in the second half.

The Anderson and Magruder paper goes on to show how their approach could have changed the conclusions of the Casey, Glennerster, and Miguel paper that brought pre-analysis plans to prominence in the context of field experiments in development economics. Anderson and Magruder’s finding serves as a warning: a pre-analysis plan does bind researchers’ hands against data mining and p-hacking, but may also bind them against some important discoveries.

A caveat.

There is a looming problem, hinted at by both papers. Lunch (or, in this case, a pre-analysis plan with lots of hypotheses) still isn’t free. Anderson and Magruder report two statistics: among recently-published field experiments, the median T-statistic is 2.6; among recently-filed pre-analysis plans, the median number of tests is 128. The contradiction here is that if your expected T-statistic is 2.6, your unadjusted power is 74 percent. If you adjust the FWER for 128 tests, your power is down to 17 percent. How do we reconcile this? Perhaps field data collection will have to be on a larger scale than before, or only some coefficients require multiple test corrections. Fafchamps and Labonne’s proposed division of labor also appears to necessitate a larger research team than has previously been typical. This trend may place some types of research out of reach for graduate students, or for researchers who are “only” able to secure a few hundred thousand dollars in research funding. No matter how you slice the data, multiple test correction and pre-analysis plans combine to drive the required sample sizes up considerably. If these requirements are disproportionately applied to field experiments, they may be raising the bar in precisely the wrong places: “specification searching and publication biases are quite small in randomized controlled trials,” as Vivalt (2016) and the amazingly-titled Brodeur, et al. (2016) (ungated here) conclude.
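To see where the 74 and 17 percent figures come from, here is a small Python sketch. It assumes the FWER adjustment is a Bonferroni correction on two-sided z-tests at the 5 percent level, in line with the rest of this post; the names are illustrative.

```python
from statistics import NormalDist

Z = NormalDist()
ALPHA = 0.05          # assumed overall test size

expected_t = 2.6      # median T-statistic reported by Anderson and Magruder
n_tests = 128         # median number of tests in recent pre-analysis plans


def power_two_sided(delta, alpha):
    """Power of a two-sided z-test of size alpha when the expected z-statistic is delta."""
    crit = Z.inv_cdf(1 - alpha / 2)
    return Z.cdf(delta - crit) + Z.cdf(-delta - crit)


unadjusted = power_two_sided(expected_t, ALPHA)
bonferroni = power_two_sided(expected_t, ALPHA / n_tests)

print(f"unadjusted power:           {unadjusted:.2f}")
print(f"power after 128-test FWER:  {bonferroni:.2f}")
```

Changing `n_tests` shows how quickly power erodes as a plan accumulates hypotheses.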

All is not lost. With the rise of “big data” comes massive sample size, and thus the required statistical power. If they arrive sequentially, early waves of “big data” can act as the first split that helps write the pre-analysis plan for later waves. (This only helps, of course, if “big data” somehow obviates the need for the kind of bespoke data collection that is common in current field experiments.) Finally, if you are still having a hard time writing your pre-analysis plan, or you worry that your pre-analysis plans won’t pan out, just do as Anderson, Magruder, Fafchamps, and Labonne have done: write papers *about* writing pre-analysis plans instead.

PS – here is a short piece of Stata code that produces all the calculations above.

## Comments

## In regards to your final

In regards to your final thought "Finally, if you are still having a hard time writing your pre-analysis plan, or you worry that your pre-analysis plans won’t pan out..", Registered Reports are a way to address both points (https://cos.io/rr). In this publishing format, peer review occurs once the pre-analysis plan is written, and the decision to publish is made before results are known. That way, the focus of that peer review process is on the theoretical interest of the research questions and of the ability of the proposed methods to address those questions. About 62 journals currently accept RRs. If your preferred journal does not, then go ahead and ask them to (https://osf.io/3wct2/). Finally, you can submit your pre-analysis plan (even if it is not part of one of those Registered Reports) to the Preregistration Challenge (https://cos.io/prereg) for a chance at a $1000 prize for publishing the results of preregistered work.

## Dear David,

Thanks for that comment! In fact, I had thought of linking to Registered Reports, but cut it in the interest of brevity. I'm glad you have linked it here.

Since you are here, could you also comment on two things?

1) How does your initiative grapple with the asymmetry that, for some research questions, a significant finding is of greater importance than an insignificant finding (even a precise zero)? Journals with limited space (and busy readers with limited attention) might need to condition on the actual findings, not just the analytical plan? In other words, how many headlines begin with the phrase, “researchers FAIL TO discover new…” ? It seems to me that the answer could lie in whether a journal’s commitment, based on the pre-analysis plan, precludes submission to other outlets. (In economics, the norm is that a manuscript cannot simultaneously be under consideration at multiple journals.) If, for these journals, there was a multilateral agreement that this was allowed, you could get a pre-commitment that this journal will publish your study based on the plan, and then you are allowed to submit to top-tier outlets if you think you have a shot at one – i.e. if the finding is exciting, overturning priors, etc. If it doesn’t pan out, this journal will publish whatever short report you want to submit with minimal editorial cost. Is this the direction to be headed?

2) How do we think about imprecise zeros? A lot of the file drawer problem may be that the zero is imprecise: the survey instrument ended up very noisy, sample size ended up smaller than anticipated, attrition is high, compliance is low, the partner organization didn’t implement well, etc. A journal editor might think that either a precise zero or a rejected null is interesting, but a poorly-implemented program or a data collection disaster isn’t. (And isn’t that informative for meta-analyses either.)

Best,

Owen

## The author and I have been

The author and I have been discussing his post via email, and I wanted to include one statement about a common question in regards to Registered Reports: not all null results are of interest to the wider research community, how will RRs not just be a dumping ground of uninteresting research? (not a direct quote)

The answer to that question is that the editorial and peer review of Registered Reports can address that question before results are known. So, that review can assess if the proposed research question addresses a question that is of sufficient importance to the research community, such that a null result would be informative.

Not all research questions meet this test. "I tried this thing that was not likely to have an effect, and lo and behold, nothing happened..." would not be a question that would likely be accepted before results are known.

For many other research questions, however, true nulls are both important and difficult to disseminate (because nulls are easy to pick apart post-hoc). A question such as "Intervention X worked here, but others have had difficulty with it, how can we settle this dispute?" is a question whose results are informative regardless of outcome. By specifying the conditions ahead of knowing the results, strong biases are removed from the process.

Obviously, work that is done poorly should not be published. Initial review will almost always look for various quality checks (manipulation checks, positive controls, etc), and those can be used to reject a paper after seeing their results, but only based on criteria that were established before the study was conducted.

See more, especially the FAQ, on this page: https://cos.io/rr
