
How scientific are scientific replications?


3ie now has two completed replications on its website: one fully completed for Jensen and Oster's 'Cable TV and Women's Status' study and one partially completed for Miguel and Kremer's 'Worms' study.

Looking at the replication reports and the authors' responses (posted simultaneously), some disagreement between the replicators and the original authors is apparent – in both cases the authors disagree with at least some of the conclusions drawn by the replicators. [I use the terms authors and replicators as shorthand for original authors and replication researchers – the respective terms used by 3ie.] As the sample size at the moment is very small (two), and the completed replications are not of studies that had pre-analysis plans, initial impressions can be misleading. So please take what I write here as an attempt to start a discussion on the topic, as I cover the various issues raised by this process over the upcoming weeks, and not as an implicit criticism of problems innate to such an exercise.

Rereading the original studies, the replication studies, and the authors' responses – all of which are lengthy, and some of which are extremely detailed and nitty-gritty – is like doing three referee reports per study. So, with apologies, this will take a bit of time and multiple posts over the upcoming weeks, even months…

The replicators of the Jensen and Oster study (JO from hereon), Iversen and Palmer-Jones (IPJ from hereon), do two things, among others, that can be described as robustness and heterogeneity analysis. The main robustness analysis by IPJ involves taking one of JO's main outcome indicators – an index of women's attitudes towards spousal beating consisting of six component variables – and running fast and loose with it. Readers should note that JO's original reporting is on the overall indices, which show significant effects of the village having access to cable TV. Because of strong and significant results on the aggregate index, there is no need for 'p-hacking,' i.e. finding some components that yield significant results. If there had been a pre-analysis plan, the authors might have committed to examining components only if there was a significant effect on the overall index – to explore which component of the index is most responsible for the observed effect. There was, in fact, some of this in JO's earlier working-paper version, but it was cut from the final journal publication – as this is exactly the kind of thing that editors recommend cutting from final publications, unless such analysis leads to important insights.

Anyway, IPJ decompose the index and report that the 'treatment' only had an effect on three of the six components. Had they stopped there, this would perhaps have spurred a nice little discussion and made a modest contribution. However, the replicators then decide to exclude the component with the strongest impact from the index (for 'external validity' concerns: I won't bore you with the details). Then, they introduce a concept of components that are easy vs. hard to change. These analyses look arbitrary at best – even though they don't knock out the original findings – and there is really no other way to describe this than 'data mining.' What they found, which is not much different from the original study, really doesn't matter: the replicators should not have been allowed to go down this road to begin with – I don't think anyone who defined scientific replication had this in mind.

IPJ then take a more legitimate approach to robustness checks, which is to question the construction of the index. JO created binary variables out of discrete outcomes with more than two response options, and then averaged them. The replicators argue that one could use the original scores in the components (instead of converting them into binary variables), and then alternatively use principal components or multiple correspondence analysis. This all sounds eminently reasonable. But even here, IPJ cannot help themselves and decide to construct versions of these by excluding the component they took issue with earlier. If I am still able to read tables, these variations seem to make little difference to the original findings – in effect size or statistical significance. But the replicators' conclusion, from the ill-conceived analysis above and the more legitimate robustness analysis that follows, is that the one or two reductions in effect size, or a loss of statistical significance, raise doubts about the robustness of one of JO's main findings. JO, in their response, have this to say:
 

“…of 18 possible permutations in constructing the index (a. averaging the variables vs. MCA vs. PCA, b. binary vs. averaging, c. including all 6 questions vs. excluding question iii vs. excluding question iv), just one of them is no longer statistically significant. Counter to the claims of lack of robustness, this seems to suggest to us that in fact the results are extremely robust to alternative constructions. We imagine that for almost any variable one might use in any empirical analysis, there exists some combination of changes along various dimensions that would yield a result that is no longer statistically significant. Calling this a lack of robustness seems like a very strict standard to hold any study to.”



I have no choice but to agree with JO’s response: the point of robustness checks in such a replication exercise is not to rerun regressions until you convert one statistically significant result to insignificant and highlight that.
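JO's count of 18 permutations is just the Cartesian product of the three construction choices they list. A quick sketch (pure Python; the labels are mine, paraphrasing their response):

```python
from itertools import product

# The three construction choices JO enumerate in their response:
aggregation = ["simple average", "MCA", "PCA"]       # a.
coding = ["binary", "original scores"]               # b.
components = ["all 6", "drop q-iii", "drop q-iv"]    # c.

permutations = list(product(aggregation, coding, components))
print(len(permutations))          # 18 index variants in total

# If exactly one of the 18 variants loses significance, then
# 17/18 of reasonable constructions agree with the original finding.
robust_share = 17 / len(permutations)
print(f"{robust_share:.0%}")      # 94%
```

Seen this way, a single insignificant cell out of 18 looks more like evidence for robustness than against it.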

The most glaring issue (or concern, and a question for 3ie) comes in Table 6 in IPJ, where the replicators conduct heterogeneity analysis on the effect of access to cable TV on spousal beatings, autonomy, and son preference by female illiteracy for the following sub-groups: has TV; watches TV; SC/ST status; and above 35 years of age. If you're counting, that's 54 hypothesis tests. As if that were not enough, the replicators attach a footnote to one of the 50-plus parameter estimates, which discusses further interaction effects within that cell (the effect they're discussing is that of access to cable on son preference for women who are literate, non-SC/ST, and who do/don't watch TV! That's three interactions in one parameter estimate).

The aim here seems to be explaining the pathways/mechanisms for the observed effects. In particular, IPJ want to provide evidence that the effects are not due to TV viewing. Setting aside the important and non-negligible issue of running interactions with an endogenous variable such as watching TV (also mentioned by JO in their response), my main issue here is with multiple hypothesis testing. If we were to apply the simplest (Bonferroni) correction for the fact that IPJ run 54 tests and interpret some of them individually, the p-value we'd require would be around 0.001 (0.05/54) instead of 0.05, a bar which none of the cable TV x illiterate x TV interaction effects would clear. But even if we think that this is overly restrictive and we instead use a False Discovery Rate correction (please see my post on this topic here), still none of these interactions would be significant for spousal beating or son preference, with perhaps one surviving for female autonomy. Even this assumes that IPJ tried no other specifications, which turns out not to be true, as they report more tests in the main text (see the bottom of page 18) that are not presented in Table 6. Of course, more tests with imprecise estimates would only reduce our confidence further.
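For readers unfamiliar with the two corrections mentioned above, here is a minimal pure-Python sketch of Bonferroni and Benjamini-Hochberg (FDR) thresholds for 54 simultaneous tests. The p-values are invented for illustration; they are not IPJ's actual estimates:

```python
m = 54        # number of tests, as in IPJ's Table 6
alpha = 0.05

# Hypothetical p-values: a few smallish ones plus many clear nulls.
pvals = [0.0008, 0.0015, 0.02, 0.045] + [0.5] * 50

# Bonferroni: reject only if p < alpha / m (~0.00093 here).
bonferroni_cut = alpha / m
bonf_hits = [p for p in pvals if p < bonferroni_cut]

# Benjamini-Hochberg (FDR): sort the p-values, find the largest k
# with p_(k) <= (k / m) * alpha, and reject hypotheses 1..k.
ranked = sorted(pvals)
k = max((i + 1 for i, p in enumerate(ranked)
         if p <= (i + 1) / m * alpha), default=0)
bh_hits = ranked[:k]

print(len(bonf_hits), len(bh_hits))   # 1 2
```

With these made-up numbers, Bonferroni keeps one result and the less conservative FDR correction keeps two, while a naive 0.05 threshold would have kept four – illustrating how much the correction matters when dozens of tests are run.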

In any case, the point here is not whether some of the effects in Table 6 would or would not survive proper multiple hypothesis testing corrections. Rather, the question is why such corrections were not conducted or required to begin with. If the replicators were interested in some exploratory analysis to provide alternative interpretations of the results, they could have presented a careful plan identifying some time-invariant baseline characteristics along which heterogeneity analysis would be conducted, along with a plan for statistical corrections in said analysis. Based on theory, pre-specified, and statistically rigorous, such analysis could be a welcome and interesting endeavor (even though I am still not sure that it should be part of a 'replication' exercise: as JO point out in their response, such baseline characteristics can be correlated with many other attributes, making interpretation difficult). But the way it is done in IPJ amounts to reading tea leaves. It seems to me that 3ie, through its review process, should have had tighter controls over the types of analyses allowed under the title of 'replication.' A big part of the point of replication is to reduce p-hacking, not to proliferate it.

As I said above, I will continue blogging about the replication studies over the upcoming weeks, but here are some initial takeaways that I have so far (I am thankful to David McKenzie for a discussion of some of these issues, without implicating him in any way):
 
  1. Pre-analysis plans: My understanding is that 3ie initially did not insist on pre-analysis plans for replication study proposals but now it does. This is helpful but will not be perfect. On the one hand, it seems desirable for 3ie to be more prescriptive about the kinds of analysis allowed in such proposals. Providing a different interpretation of robust results may be valuable, but it is not something we want to fund under this window. On the other hand, it will sometimes be necessary for replicators to alter the pre-analysis plan after obtaining the data – there should also be a process for this.
  2. Reducing the time sink: Both replications on the 3ie website and the authors' responses are really long (as long as the original studies themselves). I assume that both sides spent a lot of time on these. Part of this could be avoided by the use of pre-analysis plans and a judicious definition of what constitutes a replication under this window. But part of it also clearly has to do with the fact that the two original studies themselves did not have pre-analysis plans. For replications of original RCTs that belong to this new age we live in, the PIs will be much better prepared, with pre-analysis plans and replication data sets – substantially reducing the time spent on replications and on responding to them. Of course, in such studies, pure replication will take a more central role, because in an RCT that is designed and executed well and has a pre-analysis plan, there won't be as much need for scientific replication: adding some controls should not make much of a difference; attrition, baseline balance, etc. should have been dealt with in the original study. Of course, we could have journal editors assign pure replications to graduate students (perhaps as part of the data submission requirements for publication), which would further reduce the need for the 3ie replication window. That's a good thing, as we can then focus our energy and resources on more important replications: similar studies using other data sets and experiments, which would build our knowledge on an important topic or a particular intervention.
  3. Introduce a page limit: One of the problems with IPJ (and, from a quick read, with the replication of the worms study) is the interpretation of the results provided by the replicators. Judging from the authors' responses, this seemed to be the main cause of disagreement, rather than the analysis actually undertaken. Replicators interpreted a piece of evidence as overturning (or raising serious doubts about) the results, while the authors disagreed: perfectly, and unsurprisingly, aligned with the incentives currently set up by this process. There is a simple way to fix this: introduce a template (as in medical journals) that does not allow any interpretation of the results. The document should read as follows: the pre-analysis plan, the empirical models to carry out that plan, tables of the findings (with explanations as necessary) – side by side with the original findings, clearly delineating any differences (in variable construction, specification, clustering, sampling weights, etc.) between the original and the replication analyses. That's it: no need for meandering interpretations of one of the coefficients in one of the tables, which both color independent readers' interpretations and judgments and unnecessarily upset the authors of the original study.
  4. Introduce a similar review process for replications: The quality of the replication studies should be as high as that of the original studies. To ensure this, one might consider coming to an agreement with the editors of the journals in which the original studies appeared. Under such an agreement, the journal could essentially also review the replication proposal (after an initial filter by 3ie) and reject it outright (or ask for it to be revised and resubmitted). If moving forward, the actual replication study would go through the same process as the original study for publication, perhaps even sent to the same referees who reviewed the original study and preferably handled by the same editor. At this stage, the draft replication should probably also be sent to the authors of the original study, who can send comments and corrections to the editor. If journals are not happy to go as far as publishing the replication studies (and the authors' responses) as rejoinders to the original study, then 3ie can publish them on its site. I believe a rigorous peer review process is key to reducing many researchers' hesitations about the current replication process, to reducing the time sink, and, most importantly, to raising the quality of the final product.
  5. Define (ex ante) what constitutes a failure to replicate: Uri Simonsohn has a paper on the evaluation of replication results that I will discuss in more detail in the upcoming weeks, in the context of further suggestions on how to report findings and how to commonly evaluate them. But, for now, let's start by saying that we should make much less of a song and dance in economics about statistical significance thresholds and present confidence intervals instead. Suppose that an original study presented an effect size and 95% CI of (d=10; 95% CI: 1 to 19) and a replication study introduced a tweak and reported (d=9; 95% CI: -1 to 18). Has your view of the program effects changed dramatically? Currently, original researchers have too much incentive to play up statistical significance at a given level, and replicators have an incentive to find specifications that produce CIs containing zero. Let's stop this…
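To make the "has your view changed?" question concrete: assuming symmetric, normal-approximation CIs, one can back out the implied standard errors and test whether the two hypothetical estimates above actually differ (a rough sketch that also treats the two studies as independent):

```python
import math

def se_from_ci(lo, hi, z=1.96):
    """Standard error implied by a 95% CI, assuming symmetry
    and a normal approximation."""
    return (hi - lo) / (2 * z)

# The hypothetical numbers from the text.
d1, se1 = 10, se_from_ci(1, 19)     # original: CI excludes zero
d2, se2 = 9, se_from_ci(-1, 18)     # replication: CI crosses zero

# z-statistic for the difference between the two estimates.
z_diff = (d1 - d2) / math.sqrt(se1**2 + se2**2)
print(round(z_diff, 2))   # 0.15
```

A z-statistic of about 0.15 says the two estimates are statistically indistinguishable, even though one is "significant" and the other is not – which is exactly why the significance flip alone should not change your view of the program.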
Well, time flew, I wrote 2,400 words, and I did not even get to start discussing the newly completed replication of Miguel and Kremer’s worms study. That’s for my next blog in a couple of weeks…
 

Comments

Submitted by Fernando Martel García on

Berk:

I see two fundamental problems.

First, many scientists do not know what replication is, how it's done, and what it is for. Most people think replication is "obvious" but it ain't. Ditto for "robustness". For example, a recent review of the social science literature finds 18 replication typologies, spanning 79 replication types, most of which lack theoretical foundations. Indeed, even the notion of a "failed" or "successful" replication seems completely misplaced.

Second, many of the problems you identify would be avoided if editors stopped asking that replications contribute new findings. Rather, replications should be seen mostly as a method to evaluate and improve research practice.

I have discussed these topics in two manuscripts, both available at SSRN:

1. Replication and the Manufacture of Scientific Inferences: A Formal Approach

2. Scientific Progress in the Absence of New Data: A Procedural Replication of Ross (2006).

I think the first one is the most relevant. It argues that we should model not only the data generating process but also the measurement process and other interventions used to study the DGP. I show that study findings can be informative about both the substantive hypothesis and the method used to investigate it. This is in line with the Bayesian approach of Jaynes, which I cite. A procedural replication helps us discern between these two possible inferences.

Submitted by Berk Ozler on
Hi Fernando,

Nice. The curious case of Claude Bernard reminds me of RCTs of HSV-2 suppression that do not result in reductions in HIV transmission, and the researchers who persist with more trials because of the strong prior that the biological mechanism is there.

Berk.

Submitted by Amy on

Speaking from the vantage point of psychology, the question is: where are the adults in all of this? Replication bullying will become the norm rather than the exception. To your sensible list I would add:

1. Upload data and code with all accepted journal articles, including reports and working papers.

2. Just as important, upload the replication files used by the replicators. Why is this not open to scrutiny?

3. Why just one replication? Support multiple teams for the same study.

Submitted by Annie on

I totally agree! Bad replications and replication bullying will kill science, not enhance it. Good researchers will all go back to doing theory. Who on earth wants to spend months replying to silly and, sometimes, bad-intentioned replication reports?

This is a comment from Benjamin Wood, of 3ie:

Thank you for your thoughtful comments on replication research and 3ie's replication program. I wanted to quickly comment on a few of your points:

- You correctly note that IPJ did not have a replication plan, but your similar assertion in regards to the "Worms" replication research is not correct. Their replication plan is publicly posted on their replication study's webpage: http://www.3ieimpact.org/en/evaluation/impact-evaluation-replication-pro.... All 3ie replication grantees, after the initial two pilot studies, are required to publicly post their replication plans (which, I would argue, differ from pre-analysis plans, but that is for a larger discussion). IPJ was one of the pilots.

- The 3ie replication program does include both internal and external review processes. The replication window proposals are scored by both internal and external scorers. Each replication study (after the pilots) has an external project advisor who, in addition to one or more internal reviewers, reviews and comments on the replication plan and the final study. For the final study (including the pilots) there is also an additional external referee and 2-4 additional internal reviewers. For the worms study, for example, we had a very well-known economist and a very well-known epidemiologist as outside referees. The replication researchers received many pages of detailed comments. Your suggestion about calling upon the journal and referees for the original publication is interesting, but I think there are serious conflict of interest concerns.

- As you can see from our program policies, publicly available on our website http://www.3ieimpact.org/en/evaluation/impact-evaluation-replication-pro..., we do require the replication researchers to send their pure replication studies to the original authors for comment before proceeding with their measurement and estimation analysis and theory of change analysis.

You can also find more information about the thinking behind our replication program in our published paper here: http://www.tandfonline.com/doi/full/10.1080/19439342.2014.944555#.VD-Ucl... We are working on our own lessons-learned analysis about replication and the attempt to "mediate" replication through a program such as 3ie's, which will be based on a larger (but still small) group of the grant-funded and in-house replication studies in process and soon to be completed.

Submitted by Berk Ozler on
Hi Ben,

Thanks for this. I don't think I have made any assertions about the worms replication plan being non-existent. I made an assertion that the worms study itself did not have one.

It's a relief to hear that IPJ was one of two pilots and that there are better procedures in place for the review of the replication studies.

We'll look forward to more studies and 3ie's assessment of lessons learned, but it may be time to start a broader discussion around replication that involves more members of the research community rather than a 3ie-driven effort on certain types of replication.

Sincerely,

Berk.

Submitted by Berlin on

In my mind applied economics is similar in some sense to medical science, in its relation to theory for example, and even more after the RCT revolution. I personally think replications and systematic re-analyses of empirical results are needed, though perhaps more in the spirit of the Cochrane Collaboration. Is there nowadays any outlet for good-intentioned and serious replications in development economics, except 3ie?
Thanks for an illuminating debate!
/M

Submitted by Berk Ozler on
If, by replication, you mean replicating the results of one intervention elsewhere, there are plenty of outlets for that. Not so much for the kind of replications discussed above. Berk.
