How scientific are scientific replications? A response by Annette N. Brown and Benjamin D.K. Wood

A few months ago, Berk Ozler wrote an impressive blog post about 3ie’s replication program that posed the question “how scientific are scientific replications?” As the folks at 3ie who oversee the replication program, we want to take the opportunity to answer that question. Our simple answer is that they are not meant to be.

Before folks quote us on that out of context, let us explain. Of course we expect the replication researchers we fund, as well as any who independently submit papers to our series, to practice good science—to understand relevant theories, handle data carefully, apply methods appropriately, and so on. But the purpose of the 3ie replication program is not a grand exercise in proving that any one study is “right” or “wrong”. In fact, we discourage replication researchers from pronouncing that a replication study has “succeeded” or “failed”. Here’s an example of why.

Brian Nosek showed a fascinating slide (from his forthcoming work) at the recent BITSS session at the ASSA meetings. It had a forest plot of the effect sizes and confidence intervals from 29 studies of racial discrimination in soccer. The branches of the plot were all different and included a broad range of effect sizes and confidence intervals, with effect sizes ranging from zero to highly positive and only about two-thirds depicting statistically significant effects. As our brains were humming away conducting mental meta-regression to decide whether the average effect is statistically significant, Nosek shocked us by explaining that all 29 analyses were conducted asking the same question with the same data set.

That example provides a rather sobering justification for internal replication, but it also, we think, demonstrates that an exercise in determining which one of those analyses is “right” would not be productive. What we would hope is that folks who are interested in understanding whether there is discrimination in soccer would look at the assumptions made, how concepts are measured, what specifications are estimated, and what theories are implied or proposed in order to decide which of the analyses provide credible and relevant information for their purposes.

The purpose of internal replication, from the standpoint of the 3ie program, as explained in our article on replication here, is to validate impact evaluation findings for policymaking and program design. So we wholeheartedly welcome an exploration like Berk’s blog post that dissects the analysis presented in both studies. We do not see the Iverson and Palmer-Jones replication study as proving or disproving Jensen and Oster’s original study, but rather as shedding light on some of the analytical decisions that Jensen and Oster made that are important for understanding policy implications, with Berk’s post shedding even more light on those.

Ironically, we have been striving to eliminate the use of the term “scientific replication” by our replication researchers. In the early days of the program, we issued a typology document that used that term, but when we went back to the famous Hamermesh article where he defines “statistical replication” and “scientific replication”, we realized that we were using the term incorrectly. In our article on replication, we set out a new typology for replication analysis that includes pure replication, measurement and estimation analysis, and theory of change analysis. Not surprisingly, the replication researchers so far have wanted to use the term “scientific replication” instead. Who doesn’t want to be seen as doing something scientific? Berk has given us even more reason to avoid this term.

Berk usefully suggests some takeaway points for the 3ie replication program, and we would like to comment briefly on some of those. As Berk alludes to in his blog post, the Iverson and Palmer-Jones replication study was funded as a pilot before we had the 3ie replication program policies and procedures written or enacted. Much of the debate that erupted during the conduct of that pilot replication study (with several players in addition to the replication researchers and original authors getting involved) indeed informed the policies and procedures of the program. With the exception of that replication study and one other pilot, all 3ie-funded replication studies have published replication plans. These are not exactly the same as pre-analysis plans, but that is a subject for another blog post.

Berk suggests that the review process for replication studies should mimic that for the original papers. One important clarification here is that our paper series is a working paper series, not a journal. Nonetheless, we do have a fairly involved review process, which we describe in detail in this recent blog post. The process includes both known and anonymous external referees as well as internal reviewers, and a requirement that the pure replication component be shared with the original authors in advance of any public release of the replication study. We feel quite strongly that there are incentive compatibility problems with the system Berk suggests, in which the original journal makes the call and the original referees serve as the referees for the replication study.

We have already mentioned how we feel about using the term “failure” for replication. We agree fully, however, that a strong culture of replication should eventually reduce the incentives both for the original authors to play up statistical significance and for replication researchers to find specifications without statistical significance.

We look forward to Berk’s next blog post and further discussion.
