
The infinite loop failure of replication in economics

In case you missed it, there was quite a brouhaha about worms and the replication of one particular set of results this summer (see Dave's anthology here). I am definitely not going to wade into that debate, but a recent paper by Andrew Chang and Phillip Li gives us one take on the larger issue involved: the replication of published results. Their conclusion is nicely captured in the title: "Is Economics Research Replicable? Sixty Published Papers from Thirteen Journals Say 'Usually Not.'"
Now, this is a borderline "usually not" (from a 51% failure rate), so it's worth unpacking Chang and Li's numbers. They start with a sample of papers from top macroeconomic and general-interest journals published between 2008 and 2013. They use only papers that are empirical, estimate a model with US data, and have results that involve the use of GDP. This gives them a sample of 67 papers. Right away, 6 are not going to be replicable since they use proprietary data, so let's use 61 as the denominator going forward.
Now journals come in two main flavors: those that require you to submit your code and data, and those that don't. Chang and Li have 35 papers from journals that require submission of code and data. First, compliance is not complete -- only 80% of those papers have both the code and the data in the journal archives. They then email the authors of the papers with missing data and/or code and are able to get what they need for only one additional paper. Over in the realm of journals with no requirements, the hit rate is obviously lower -- they net data and code for 15 of 26 papers.
And then they try to run the code. What they're looking for is results that are substantively, if not exactly, the same. They succeed in getting this to happen for 23 of the 35 papers in the mandatory data/code journals, and 6 of the 26 in the journals without a policy. Chang and Li artfully observe that there may be some self-selection in the journals authors choose for submission, so this difference shouldn't at all imply a causal relationship.
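The idea of results being "substantively, if not exactly, the same" can be made concrete with a simple tolerance check. This is only an illustrative sketch -- the function, the 1% tolerance, and the coefficient values below are made up, not Chang and Li's actual criterion:

```python
# Hypothetical check: are the re-run estimates within a relative
# tolerance of the published ones? Illustration only; this is not
# the criterion Chang and Li actually apply.

def substantively_same(published, replicated, rel_tol=0.01):
    """Return True if every replicated estimate lies within
    rel_tol (here 1%) of the corresponding published value."""
    return all(
        abs(r - p) <= rel_tol * abs(p)
        for p, r in zip(published, replicated)
    )

published_coefs = [0.52, -1.10, 2.30]     # made-up published estimates
replicated_coefs = [0.521, -1.105, 2.29]  # made-up re-run estimates

print(substantively_same(published_coefs, replicated_coefs))  # True
```

In practice a replicator would also have to decide how to handle sign flips and changes in statistical significance, which a bare tolerance check doesn't capture.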
Why do they fail so often? Well, to start with, despite the formidable array of software packages available to them through the US government, Chang and Li lose at least 2 studies because they don't have the right package. And that may be a lower bound, since 9 additional papers fail because of incorrect data or code. But the big reason for failure -- for 21 papers, or 51% -- is straight-up missing data or code. These are cases where the provided code doesn't cover one or more of the key results and/or at least one variable is missing.
However you slice this, the results aren't great. So Chang and Li have a set of recommendations. The first is obvious: all journals should require data and code for publication (and should enforce this). There may be cases where an exemption is warranted, and this should be noted clearly to help would-be replicators. Authors can also do a couple of things to make replication easier. First, they can specify the software and version in the readme file. Chang and Li also suggest, for more complicated routines, that authors specify the run time. Their pain here is clear: "we encountered a few instances where we believed an estimation was executing, only to find out weeks later that the programs were stuck in an infinite loop and were supposed to run in much less time." In addition, for results that require a number of different programs, Chang and Li recommend that authors specify the order in which they need to be run. Finally, they note that a significant number of the programs they ran didn't estimate the models that produce the results in the paper's figures and tables -- flagging this, too, would be helpful.
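Those last two recommendations -- document the run order and the expected run time -- can be baked into a single driver script shipped with the replication package. A minimal sketch, assuming a Python package; the script names and run times below are hypothetical, not from Chang and Li:

```python
# Hypothetical replication driver: runs the package's programs in
# the documented order, printing the expected run time for each
# step and timing the actual run, so a replicator can spot an
# infinite loop long before "weeks later".
import subprocess
import sys
import time

def run_steps(steps):
    """Run each (command, expected_minutes) step in order.
    Raises CalledProcessError if a step fails, so later steps
    never run against half-built inputs. Returns actual timings."""
    timings = {}
    for cmd, minutes in steps:
        print(f"Running {' '.join(cmd)} (expect roughly {minutes} min)")
        start = time.time()
        subprocess.run(cmd, check=True)
        timings[cmd[-1]] = (time.time() - start) / 60.0
    return timings

if __name__ == "__main__":
    # Illustrative ordering; in a real package these would be the
    # scripts named in the readme, in the order they must run.
    steps = [
        ([sys.executable, "01_clean_data.py"], 2),
        ([sys.executable, "02_estimate_model.py"], 45),
        ([sys.executable, "03_tables_figures.py"], 5),
    ]
    run_steps(steps)
```

The same information belongs in the readme in prose form too, since replicators may be working in Stata, MATLAB, or another package where a driver script looks different.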
So these numbers give us something to think about. While Chang and Li are macroeconomists and thus focus on the macro literature, they do include journals that microeconomists would be happy to publish in. And this makes me wonder what the analogous statistics would look like for a set of micro or even micro development papers. Speculation won't get us far, but Chang and Li do note (in footnote 2) that a similar recent effort in psychology got results similar to theirs.


Submitted by Ben Wood on

Thanks for highlighting the importance of replication in the sciences, Markus. Regarding your request for comparison micro development statistics, 3ie is trying to fill that gap with our "push button replication" (PBR) project.

Our PBRs will attempt to reproduce recently published development-related impact evaluations using data and code provided by the original authors (or publicly available). Similar to the Chang and Li paper, our PBR researchers won't be recoding these papers or examining the robustness of the results to alternative approaches. The PBR project is designed to test the replicability of development impact evaluations, using whatever data, code, and instructions the original authors provide.

We'll be sure to keep you in the loop when we have some results!

Thanks Ben. Let me know when you have something -- it would be good to highlight it here. One thing to keep in mind: from reading Chang and Li's paper, it looks like it took quite some time to get these things done (maybe 2 years...)
