In their now famous replication study of Reinhart and Rogoff’s (R&R) seminal article on public debt and economic growth, Thomas Herndon, Michael Ash, and Robert Pollin (HAP) use the word “error” 45 times. The study sparked a tense debate between HAP and R&R, summarized by the Financial Times (FT), about which differences in HAP’s analysis really point to errors in R&R’s original work. At 3ie, we are more than a year into our replication programme, and we are seeing a similar propensity for replication researchers to use the word “error” (or “mistake” or “wrong”) and for this language to cause contentious discussions between the original authors and replication researchers. The lesson we are learning is:
To err is human, but to use the word “error” in a replication study is usually not divine.
Some would ask, isn’t that the point of internal replication? Yes. As we argue in our forthcoming paper, one of the four reasons why internal replication is important for validating evidence is because “to err is human”. Original authors do occasionally make mistakes, and correcting them is a major benefit of replication.
So what’s the problem? The problem is that pure replication of an original author’s empirical analysis is often really complicated, not to mention time consuming. And what we’re seeing is that even relatively successful pure replications end up with many estimates that are just not quite the same as in the original article. Replication researchers are often quick to call these “errors”. But if two people conduct the same analysis on the same data, and they each get similar but not identical estimates, who is to say what is right and what is wrong?
Not surprisingly, the word “error” makes original authors defensive and leads to debate. But two sides arguing about a small difference in a point estimate does not help us achieve the objective of finding the best evidence for policy making and program design. To suggest that a small difference that happens to be around an arbitrary cut-off should change policy conclusions is to fall prey to the “cult of statistical significance”. Whether in the original paper or in the replication study, we should focus instead on what is relevant and robust. As Pollin concedes in the FT interview, the real question is whether a conclusion is robust.
So when is an error truly an error? We submit that the word “error” should be used in replication studies only when the replication researcher can identify the source of the mistake. The HAP replication study does point to some clear errors. For example, the original authors missed five rows of data in the estimations run from their Excel file. That was an error that was acknowledged by the original authors here and here.
When there are discrepancies in the estimates that cannot be explained, we recommend that replication researchers use the words “discrepancy” or “inconsistency”. We are not suggesting that discrepancies are unimportant. They are important. A large number of discrepancies in the pure replication that cannot be explained by the original authors or by the replication researchers may call into question how well the underlying datasets are coded, labeled, documented, and stored. And that should call into question the quality of the analysis that can be conducted with those data. One objective of the 3ie replication programme is to motivate authors to document and maintain their data more carefully. But unexplained discrepancies are not necessarily errors.
An error is also not an error if it results from a different decision made in the measurement or estimation analyses. Many researchers hold strong beliefs about which methods are appropriate and how they should be used. Sometimes what is right is pretty cut and dried. You need to use clustered standard errors when you have a cluster design. But often those choices are more discretionary. Jed Friedman’s blog post on linear probability models (LPM) versus probits and logits describes his debate with a referee about whether it is “wrong” to use LPM in the case of binary responses. Friedman quotes Jörn-Steffen Pischke on the matter: “the fact that we have a probit, a logit, and the LPM is just a statement to the fact that we don’t know what the ‘right’ model is.”
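Friedman’s point is easy to see in practice. The sketch below is our own illustration, not code from any of the studies discussed; the data are simulated and the variable names (employed, treated, cluster_id) are hypothetical. It fits a linear probability model with cluster-robust standard errors and a logit on the same binary outcome, then compares the LPM coefficient with the logit’s average marginal effect. The two estimates will typically be close but not identical, and neither one is the “error”.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data with a hypothetical binary outcome and cluster structure.
# Variable names are illustrative only, not taken from R&R or HAP.
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),
    "cluster_id": rng.integers(0, 50, n),
})
df["employed"] = (0.2 + 0.15 * df["treated"] + rng.normal(0, 0.5, n) > 0.5).astype(int)

# Linear probability model with cluster-robust standard errors.
lpm = smf.ols("employed ~ treated", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["cluster_id"]}
)

# Logit on the same data; compare its average marginal effect to the LPM coefficient.
logit = smf.logit("employed ~ treated", data=df).fit(disp=0)
ame = logit.get_margeff(at="overall").summary_frame()

print("LPM estimate of the treatment effect:", lpm.params["treated"])
print("Logit average marginal effect:       ", ame.loc["treated", "dy/dx"])
```

Both are defensible answers to the same question; a replication study should report the difference and discuss what drives it, not label one of them a mistake.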
Certainly a replication researcher should critically examine the methodological choices made by the original authors. The existence of multiple possible models should motivate a careful discussion of the underlying assumptions as well as provide an opportunity to test the original paper’s result for robustness to model choice. Arguments about measurement and estimation are particularly important when the main conclusions of the study hinge on those choices. In the Financial Times interview, Pollin makes the more relevant critique of R&R that “their results are entirely dependent on using that particular methodology.” This statement, and not the 45 uses of the word “error”, is a more divine approach to replication research.