Last week, I promised to summarize the results of a paper by Eble and Boone on the performance of economics articles reporting results from prospective RCT studies in top 50 journals over the past decade (HT to Chris Blattman, who blogged about it earlier). Here it goes…
The idea of the paper is that researchers in the medical field have been doing RCTs for a good while and have gone through many of the debates we, in economics, are having now. They have come up with reporting guidelines (CONSORT), which aim to reduce the risk of bias in the reported results stemming from a variety of issues. The authors, correctly in my opinion, see this as a $100 bill lying on the ground: it’s likely to serve us well if we pay attention to things that they have learned and adopt them for our circumstances – taking into account the fact that economics trials are not the same as biomedical ones.
The authors search the EconLit database for the words ‘randomized’ or ‘randomization’ in the abstract or the title and end up with 54 prospectively randomized studies published in top 50 economics journals between 2000 and 2011. (As a side note, because economics does NOT follow the CONSORT guidelines, the titles of randomized controlled trials are not required to include those words, meaning that the authors missed at least one RCT published in these journals during this period. I recommend that the authors revise their search terms to include ‘experiment’, ‘experimental’, etc.) They also search for articles reporting phase III trials in top-tier medical journals, but I will not focus on these here. (As another aside, I think that the proper comparison in medicine for economics trials is phase II trials, not phase III. It is possible that if we had phase III-like trials in economics, the reporting would be much better.)
The authors assess the performance of these articles in six domains: selection, performance, detection, attrition, reporting, and imprecision (or power). I will explain each of these as I summarize their evidence below, followed by my comments in italics:
Selection: The authors are worried about the sample drawn not being representative of the intended population and also about differences between treatment and control due to imperfections in the randomized assignment. They seem particularly worried about the latter (example: if a courthouse assigns cases randomly to judges but the sequence is known/deterministic, then clerks or lawyers can game the system by filing cases at the right moment to ensure assignment to a friendly judge). The authors find that less than a quarter of the econ papers get a passing grade in describing recruitment for the study and the randomization procedure. While it can feel like the CONSORT guidelines go overboard here (we had to report the version of Stata used for the randomization for a paper in The Lancet), I agree that econ papers can do better by including flowcharts describing the recruitment and attrition at each stage, and by describing how the randomization was done.
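To make "describe how the randomization was done" concrete, here is a minimal sketch of an assignment that can be reported and re-created exactly. It is my own illustration, not something from the paper: the function name `assign_treatment`, the unit IDs, and the seed value are all hypothetical, and a real study would typically also stratify. The seed should be recorded but kept private until assignment is complete, so that the sequence cannot be gamed as in the courthouse example.

```python
import random


def assign_treatment(unit_ids, seed, n_arms=2):
    """Assign units to study arms with a recorded seed (illustrative sketch).

    Sorting the IDs first means the result depends only on the set of
    units and the seed, not on the order in which units were listed.
    """
    rng = random.Random(seed)      # private generator; no global state
    shuffled = sorted(unit_ids)    # canonical order before shuffling
    rng.shuffle(shuffled)
    # Deal shuffled units round-robin so arm sizes differ by at most one.
    return {uid: position % n_arms for position, uid in enumerate(shuffled)}


# Hypothetical usage: 10 villages, two arms (0 = control, 1 = treatment).
villages = [f"village_{i:02d}" for i in range(10)]
assignment = assign_treatment(villages, seed=20130101)
```

Reporting the software, the seed, and the unit list is then enough for a reader to reproduce the assignment.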
Performance Bias: The possibility for ‘Hawthorne’ and ‘John Henry’ effects is well known and acknowledged in economics. This is exacerbated by the fact that often times (almost always?) it is not possible to blind subjects to their treatment status or to provide placebo treatments in economics. It is also not helped by primary outcomes that are self-reported (although this is not unique to economics: for example, many public health researchers rely on self-reported sexual behavior data as a proxy for HIV/STI risk). The authors find that about a third of econ studies report inadequately on these issues and an equal number have a high risk of bias. I am surprised that things are not worse here: many papers I referee are silent about these issues or the fact that they use self-reported data. Faithful readers of Development Impact will know how I feel about measurement issues in experiments. We need to do much better here by (a) at least discussing the possibility of bias due to these design issues; and (b) by using objective primary outcome measures. We also need more experiments that explicitly try to identify ‘Hawthorne’ and ‘John Henry’ effects.
Detection Bias: This arises when the data collectors are not blinded to the treatment status of the subject. It is especially problematic if the enumerators have a stake in the outcome of the evaluation. The authors find: “Seventeen of the 54 economics articles failed on reporting and 19 on risk of this bias. Many of these trials collected data with individuals who may have had incentive to skew the data in favor of the intervention. Two articles explicitly mentioned using data collectors who were employed by the same company, which administered the intervention. Several others neglected to say who collected the data, leaving doubt as to whether a similar conflict of interest could have biased the results.” Organizations should have independent researchers evaluate their interventions; such researchers should sign ‘conflict of interest’ statements; they should make sure that enumerators are blinded to treatment status (as much as possible); and they should report on these issues adequately in articles. It’s all pretty straightforward.
Attrition bias: This is self-explanatory. The authors are worried about attrition from the sample during the course of the study that would cause biased treatment effects. They are also worried about whether the researchers report intent to treat effects or toss out observations in an ad hoc manner (or don’t describe adequately what they did). They find: “Only 17 of the 54 economics articles passed this criterion. More than 20 did not discuss exclusion of participants in the final analysis and almost all of these had widely varying numbers of observations in different versions of the same analysis, suggesting that selective exclusion of observations did in fact take place. Less than half of the articles we collected mentioned the intent to treat principle by name and, among those that did, several neglected to follow it. Many of these articles excluded groups of participants because they did not follow the protocol, and one paper threw out the second of two years of data collected because of contamination.” All of my pet peeves about attrition are on display here: sample sizes bouncing around; attrition issues being relegated to appendices and downplayed; only showing balance on the treatment dummy and not on whether attrition is correlated with different baseline characteristics in each study arm; running the analysis on a subset by excluding some observations and not conducting any sensitivity analysis, etc. Some papers, such as this one on which I reported here, will exclude as much as 40% of the individuals who did not take up the assigned treatment!
Reporting bias: Here, we’re talking about specifying your analysis preferably before you collect (but at least before you analyze) your data and, hence, avoiding the possibility of fishing for results. Not surprisingly, none of the 54 papers did this. Corrections for multiple comparisons are very rare. Furthermore, economics papers are not good at being honest about reporting their limitations. I think that there is good news and bad news here. On the plus side, pre-analysis plans, including good descriptions of interventions in appendices, etc., are becoming more common as I write these lines. Registries by EGAP and AEA, and efforts by the likes of J-PAL and CEGA, are most welcome in these respects (check out CEGA’s initiative for transparency in the social sciences, BITSS, here). However, even without pre-analysis plans (or, recently, with them) there are still too many stories told in economics papers. It would be better if the authors (a) simply reported their primary results without commentary; and (b) included a section on the limitations of their studies that was honest and transparent. Some interpretation, using secondary analysis as necessary, can follow but only after the simple and consistent reporting of the primary results – without trying to infect the reader with the authors’ views of the results rather than a balanced reporting.
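For readers who have not applied a multiple-comparisons correction before, the following is a minimal sketch of one standard option, the Holm step-down adjustment. The choice of Holm is mine; the paper does not prescribe a particular method, and the p-values below are made up purely for illustration.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values, in the original order.

    The smallest p-value is multiplied by m, the next by m - 1, and so
    on; a running maximum keeps the adjusted values monotone, and each
    is capped at 1.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical p-values for three outcomes tested in the same study.
raw = [0.01, 0.04, 0.03]
adjusted = holm_adjust(raw)
significant = [p <= 0.05 for p in adjusted]
```

An outcome is then declared significant only if its adjusted p-value clears the chosen threshold, which is what guards against picking the one result out of many that happened to cross 5%.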
Imprecision: This is about ex ante power calculations to make sure that the study is powered to find the effects it is trying to identify. The authors find: “Only two economics papers attested to perform an ad ante sample size calculation.” This does not surprise me, although, like the authors, I believe that more studies actually did these calculations than the articles reflect. Many funders request these calculations explicitly before funding field (or lab) experiments, so it is a matter of getting these calculations into the articles. Minimum detectable effects (MDE) are always useful to report, although readers can also calculate them from the standard errors of the treatment effects.
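As a rough sketch of that last point: under a normal approximation, the MDE implied by a reported standard error is the standard error times the sum of two critical values, one for the test size and one for the desired power. The defaults below (two-sided 5% test, 80% power) are conventional assumptions of mine, not values taken from the paper, and the standard error in the usage line is invented.

```python
from statistics import NormalDist


def mde_from_se(se, alpha=0.05, power=0.80):
    """Approximate minimum detectable effect implied by a standard error.

    Uses the normal approximation for a two-sided test:
    MDE = (z_{1 - alpha/2} + z_{power}) * SE.
    """
    z = NormalDist()  # standard normal
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return multiplier * se


# Hypothetical: a treatment effect reported with a standard error of 0.05.
implied_mde = mde_from_se(0.05)
```

With these defaults the multiplier is roughly 2.8, so a quick reading of any results table is that effects smaller than about 2.8 standard errors were unlikely to be detected.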
I was surprised to see that the authors’ subgroup analysis (not pre-specified) did not find that recent papers or papers in top 4 journals (why 4?) fared better. My guess is that things will get better, and in an accelerated manner, over the next few years. Over the past two years, I have reviewed articles for journals and funding proposals that included pre-analysis plans, very detailed descriptions of interventions, etc. This will be a welcome change. I also suspect, and hope, that many of the same practices will apply not just to RCTs but to other studies in economics. If you have ever tried to decipher what an intervention is (say, in a quasi-experimental study of a government intervention), you know what I mean. Ditto for issues of data, measurement, attrition, etc.
One thing we’ll need to watch out for is the space limitations that medical journals impose. As long as journals continue to impose space limitations (many econ journals, such as AER and JHR, do the same), reporting on the details of the randomization and the fact that you did it using Stata 13.0 will eat up space that could be used to explain the findings, do secondary analysis, describe a model or a conceptual framework, etc. And such losses are important. As much as I want straightforward reporting, I also (like QJE editor Larry Katz here) would like to see secondary analysis, interesting economic analysis, explanations of puzzles, and things that simply could not have been foreseen ahead of time in these articles. We need to have a balance.
Finally, while CONSORT guidelines may be useful, they have not come close to ensuring high quality evidence in biomedical studies. For example, some studies will dutifully report high rates of attrition but will not do anything to address the potential biases. They will publish articles where intervention implementation has been terrible (deviations from original design, contamination, etc. that make distinctions between the study arms obsolete) and report the results according to the templates, but the caveats are completely lost. When treatment 1 and treatment 2 were supposed to be distinct de jure but ended up being identical de facto, and your abstract says you found no differences between them, following templates is not good enough. In fact, many economists (perhaps somewhat unfairly) consider the quality of articles in top medical journals to be inferior to that in top econ journals, with the former having been the butt of jokes among economists.
So, yes, by all means, we should report much better. But, we should not expect that to solve our problems. Good finished articles will always need good and honest researchers and excellent, tough, and thorough feedback (whether via traditional peer review or via crowd-sourcing).