Published on Development Impact

The infinite loop failure of replication in economics

In case you missed it, there was quite a brouhaha about worms and the replication of one particular set of results this summer (see Dave's anthology here). I am definitely not going to wade into that debate, but there is a recent paper by Andrew Chang and Phillip Li which gives us one take on the larger issue involved: the replication of published results. Their conclusion is nicely captured in the title: "Is Economics Research Replicable? Sixty Published Papers from Thirteen Journals Say 'Usually Not.'"
Now, this is a borderline "usually not" (from a 51% failure rate), so it's worth unpacking Chang and Li's numbers. They start with a sample of papers from top macroeconomic and general interest journals published between 2008 and 2013. They use only papers that are empirical, estimate a model with US data, and have results that involve the use of GDP. This gives them a sample of 67 papers. Right away, 6 are not going to be replicable since they use proprietary data. So let's use 61 as the denominator going forward.
Now journals come in 2 main flavors:  those that require you to submit your code and data, and those that don't.  Chang and Li have 35 papers from those journals which require submission of code and data.   First, compliance is not complete -- only 80% of those papers have both the code and data in the journal archives.   They then email the authors of the papers with missing data and/or code and are only able to get what they need for one additional paper.  Over in the realm of journals with no requirements, the hit rate is obviously lower -- they net data and code for 15 of 26 papers.  
And then they try to run the code. What they're looking for is results which are substantively, if not exactly, the same. They succeed in getting this to happen for 23 of 35 papers in the mandatory data/code journals, and 6 of 26 in the non-data/code journals. Chang and Li artfully observe that there may be some self-selection at play in which journals authors choose to submit to, so this difference shouldn't be read as a causal effect of the data/code requirement.
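For readers who want to check the arithmetic, here is a minimal sketch that turns the counts quoted above into replication rates. It uses only the numbers in this post (not anything re-derived from the paper), and the group labels and function names are my own.

```python
# Counts as quoted in this post; labels are illustrative, not the paper's.
counts = {
    "mandatory data/code journals": {"replicated": 23, "attempted": 35},
    "no-requirement journals": {"replicated": 6, "attempted": 26},
}


def summarize(counts):
    """Return the share of papers replicated per group, plus a pooled share."""
    shares = {
        name: c["replicated"] / c["attempted"] for name, c in counts.items()
    }
    total_replicated = sum(c["replicated"] for c in counts.values())
    total_attempted = sum(c["attempted"] for c in counts.values())
    shares["all journals"] = total_replicated / total_attempted
    return shares


if __name__ == "__main__":
    for name, share in summarize(counts).items():
        print(f"{name}: {share:.1%} replicated, {1 - share:.1%} not")
```

With the denominator of 61 used here, the pooled failure share comes out a little above one half, in the neighborhood of the headline figure quoted above.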
Why do they fail so often? Well, to start with, despite the formidable array of software packages available through the US government, Chang and Li lose at least 2 studies because they don't have the right package. And that may be a lower bound, since 9 additional papers fail because of incorrect data or code. But the big reason for failure -- for 21 of the 32 papers that fail, or about two-thirds -- is straight up missing data or code. These are cases where the provided code doesn't cover one or more of the key results and/or at least one variable is missing.
However you slice this, the results aren't great. So Chang and Li have a set of recommendations. The first is obvious: all journals should require data and code for publication (and should enforce this). There may be cases where an exemption is warranted, and this should be noted clearly to help would-be replicators. Authors can also do a couple of things to make their work easier to replicate. First, they can specify the software and version in the readme file. Chang and Li also suggest, for those more complicated routines, that authors specify the run time. Their pain here is clear: "we encountered a few instances where we believed an estimation was executing, only to find out weeks later that the programs were stuck in an infinite loop and were supposed to run in much less time." In addition, for results that are going to use a number of different programs, Chang and Li recommend that authors specify in what order they need to be run. Finally, they note that a significant number of the programs they ran didn't estimate the models that produce the results shown in the paper's figures and tables -- providing code that does so would be helpful too.
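To make the run-order and run-time suggestions concrete, here is a rough sketch of a master script an author might ship alongside the readme. This is my own illustration, not something from Chang and Li: the step names, commands, and time limits are all hypothetical placeholders, and a real replication package would call its actual estimation programs instead.

```python
import subprocess
import sys


def run_in_order(steps):
    """Run (label, command, timeout_seconds) steps one after another.

    Stops at the first step that fails or exceeds its stated run time,
    and returns a list of (label, status) pairs for the steps attempted.
    """
    results = []
    for label, command, timeout_seconds in steps:
        try:
            # subprocess.run kills the child and raises if the limit is hit.
            completed = subprocess.run(command, timeout=timeout_seconds)
        except subprocess.TimeoutExpired:
            results.append((label, "timed out"))
            break
        status = "ok" if completed.returncode == 0 else "failed"
        results.append((label, status))
        if status != "ok":
            break
    return results


if __name__ == "__main__":
    # Hypothetical steps: each stands in for one of the author's programs,
    # listed in the order they must run, with a generous expected run time.
    steps = [
        ("01_clean_data", [sys.executable, "-c", "print('cleaning')"], 60),
        ("02_estimate", [sys.executable, "-c", "print('estimating')"], 3600),
    ]
    print("Python version:", sys.version.split()[0])
    for label, status in run_in_order(steps):
        print(f"{label}: {status}")
```

The point of the time limit is simply that a program stuck in an infinite loop gets flagged after its stated run time rather than weeks later.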
So these numbers give us something to think about.   While Chang and Li are macroeconomists and thus focusing on the macro literature, they do include journals that microeconomists would be happy to publish in.   And this makes me wonder what the analogous statistics would look like for a set of micro or even micro development papers.    Speculation won't get us far, but Chang and Li do note (in footnote 2) that a similar recent effort in psychology got results similar to theirs.  


Markus Goldstein

Lead Economist, Africa Gender Innovation Lab and Chief Economists Office
