In their working paper “The Sources of Researcher Variation in Economics,” Huntington-Klein, Portner, McCarthy, and The Many Economists Collaborative on Researcher Variation present results from a study in which 146 research teams explore the same data to answer the same question. Do these teams come to the same conclusions? No. Or yes, sort of: most of them are consistent, if not identical.
This study seeks to understand how the choices researchers face, even when using the same data and addressing the same study question, affect findings. In this “many-analysts” design, 146 research teams independently conduct research: each team is asked to estimate the impact of the U.S. Deferred Action for Childhood Arrivals (DACA) program on the probability that eligible individuals work full-time. Teams use the same underlying data from the American Community Survey (ACS). In a first stage, the researchers start with broad instructions to generate estimates of impact. In a second stage, they are given a more detailed specification of the research design. And in a third stage, they are also provided with a pre-cleaned data set.
Headline findings: the interquartile range (IQR) of estimated policy impacts in the first stage was 3.1 percentage points (around a mean effect of 5.4 percentage points), but with “substantial” outlier estimates. In the second stage, the IQR was larger, at 4.0 percentage points (surprising, since one would assume a specified design would narrow the range of choices researchers make). This was in part because some researchers did not follow the specified research design, especially the instructions on identifying the treatment group. In the third stage, the IQR fell to 2.4 percentage points: using the same cleaned data helped reduce variation in findings. Bottom line conclusion: the reported policy effects were “relatively similar,” despite the variation in data preparation/construction and research methods, albeit with some notable outliers.
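To make the summary statistic concrete: the IQR reported above is just the gap between the 25th and 75th percentiles of the point estimates the teams submitted. Here is a small, purely illustrative Python sketch; the numbers are made up for illustration and are not the study’s estimates.

```python
import numpy as np

# Made-up point estimates (in percentage points) standing in for the
# teams' submissions -- NOT the study's actual estimates.
estimates_pp = np.array([2.0, 3.5, 4.8, 5.1, 5.6, 6.2, 7.0, 9.5, 14.0])

q25, q75 = np.percentile(estimates_pp, [25, 75])
print(f"mean effect = {estimates_pp.mean():.1f} pp")
print(f"IQR of estimates = {q75 - q25:.1f} pp")
```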
Read the paper for the many important details that I am not covering in this blog. Instead, I will share some of my evolving thoughts on whether this paper is: presenting a case for nihilism (no); uncovering big problems (no); showing surprising findings (a smidge); in some ways (and with much respect to all the researchers involved) somewhat of a straw man (maybe); a useful reminder of some key things that make for ‘better’ research (yes).
Getting to your final estimates is not quick or smooth. One aspect of this work is that it focuses on only a portion of the full process of producing research output in the form of a research paper. Participating in this specific research activity is obviously not the same thing as pursuing research in one’s own portfolio, where researchers (normally) dig into the topic, the data, and the analysis. The paper is a stark reminder to me of the value of the often-slow research paper process: know your data (should I use weights?), know your policy (who is DACA-eligible in the sample?), know whether your results are stable (do controls matter?), and discuss your findings before sending them off to a journal. Part of digging in is presenting results to colleagues (at seminars, at conferences) to solicit feedback and vet drafts. [This study did include a peer-review process (with non-mandatory revisions) for two-thirds of the papers, with basically no effect on reducing variation in findings.] Of course, these aspects of the research process do not guarantee convergence to “the right” answer, but one hopes they move us toward better analytical work.
Do robustness checks. The study shows a lot of variation in choices across research teams in terms of the set of controls, the use of weights (nb: 75% did not use weights, despite weighting being recommended for the ACS data), and, to a modest extent, the choice of functional form. In this specific study, not all of this variation drives the variation in effect sizes. But, generally, these findings show that within a qualified set of researchers, our peers may (reasonably) choose different covariates, adopt a different specification, or opt to use or not use weights. There is typically not a single right or best way, and it is on individual researchers to assess how these different choices do or do not matter, and to show these findings in our work.
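As a toy illustration of what such a specification comparison can look like (this is my sketch, not the paper’s or any team’s code, and it uses fabricated data with hypothetical variable names in place of the ACS extract), the snippet below re-estimates the same interaction coefficient with and without person weights and with and without covariates, and collects the results in a small table:

```python
# A minimal sketch: compare the estimated "effect" across a few
# reasonable specification choices -- with/without survey weights and
# with/without covariates -- on synthetic data (not the ACS).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 20_000
df = pd.DataFrame({
    "eligible": rng.integers(0, 2, n),   # hypothetical 1 = DACA-eligible flag
    "post": rng.integers(0, 2, n),       # hypothetical 1 = after DACA implementation
    "age": rng.integers(18, 31, n),
    "female": rng.integers(0, 2, n),
    "perwt": rng.uniform(0.5, 3.0, n),   # stand-in for ACS person weights
})
# Synthetic outcome with a built-in ~5 pp effect for eligible x post.
df["fulltime"] = (
    rng.uniform(size=n) < 0.55 + 0.05 * df["eligible"] * df["post"]
).astype(int)

specs = {
    "no controls, unweighted": ("fulltime ~ eligible * post", None),
    "controls, unweighted":    ("fulltime ~ eligible * post + age + female", None),
    "no controls, weighted":   ("fulltime ~ eligible * post", "perwt"),
    "controls, weighted":      ("fulltime ~ eligible * post + age + female", "perwt"),
}

rows = []
for label, (formula, wcol) in specs.items():
    if wcol is None:
        res = smf.ols(formula, data=df).fit(cov_type="HC1")
    else:
        res = smf.wls(formula, data=df, weights=df[wcol]).fit(cov_type="HC1")
    rows.append({
        "spec": label,
        "effect (pp)": 100 * res.params["eligible:post"],
        "s.e. (pp)": 100 * res.bse["eligible:post"],
    })

print(pd.DataFrame(rows).to_string(index=False))
```

The point is not the particular estimator but the habit: put the alternative, defensible choices side by side and report how much (or how little) the headline number moves.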
Before getting to your point estimates, look at sample traits. I was struck (as I think the authors are too) by the degree of variation in sample size. It is largest in Task 1, where the 25th and 75th sample-size percentiles ranged from 61,600 to 356,787 observations. In this task, the researchers were not given a specified control group, but rather were left to figure it out. Some teams used almost the entire ACS sample, meaning they included people who are not like the DACA-eligible (treatment) group. Basic reporting on sample traits, especially across ‘treated’ and ‘untreated’ samples, could have reduced differences across research teams. Before jumping to regressions, take the time to look at the sample in a way that makes sense for the research question – in this case, in terms of the profile of non-eligibles.
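For instance, a quick balance table along these lines (again a sketch with synthetic data and hypothetical column names, not the ACS file) would immediately flag a comparison group that is much older or otherwise unlike the DACA-eligible sample:

```python
# A minimal sketch: report basic sample traits by treatment status
# before running any regressions, so an implausible comparison group
# is caught early. Data and variable names are fabricated.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 10_000
df = pd.DataFrame({
    "eligible": rng.integers(0, 2, n),     # hypothetical 1 = DACA-eligible (treated)
    "age": rng.integers(18, 65, n),
    "female": rng.integers(0, 2, n),
    "years_in_us": rng.integers(0, 40, n),
    "fulltime": rng.integers(0, 2, n),
})

balance = (
    df.groupby("eligible")[["age", "female", "years_in_us", "fulltime"]]
      .mean()
      .T
      .rename(columns={0: "untreated mean", 1: "treated mean"})
)
balance["difference"] = balance["treated mean"] - balance["untreated mean"]
print(balance.round(3))
print("\nGroup sizes:", df["eligible"].value_counts().to_dict())
```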
Document. The authors note that “The optimal level of researcher variation is not zero,” but that there should be “good” reasons why two researchers would reach different findings with the same data and the same question. Identifying how we get to our findings means both making our work replicable and giving sufficient detail in the paper for a reader. This likely means a dreaded 40-page appendix, lengthy footnotes, and/or extra sections with robustness checks. This paper shows the value, in terms of research quality, of those extra pages and of avoiding one-and-done estimations.