Published on Development Impact

How scientific are scientific replications?

Berk Özler

October 15, 2014

3ie now has two completed replications on its website – one fully completed for Jensen and Oster’s ‘Cable TV and Women’s status’ study and one partially completed for Miguel and Kremer’s ‘Worms’ study.

Looking at the replication reports and authors’ responses (posted simultaneously), some disagreement between the replicators and original authors is apparent – in both cases the authors disagree with at least some of the conclusions drawn by the replicators [I use the terms authors and replicators as shorthand for original authors and replication researchers – the respective terms used by 3ie]. The sample size at the moment is very small (2), and the completed replications are not of studies that had pre-analysis plans, so initial impressions can be misleading. As I cover various issues raised by this process over the upcoming weeks, please take what I write here as an attempt to start a discussion on the topic and not as an implicit criticism of problems innate to such an exercise.

Rereading the original studies, the replication studies, and the authors’ responses – all of which are lengthy and some of which are extremely detailed and nitty-gritty – is like doing three referee reports per study, so with apologies this will take a bit of time and multiple posts over the upcoming weeks, even months…

The replicators of the Jensen and Oster study (JO from hereon), Iversen and Palmer-Jones (IPJ from hereon), do two things, among others, that can be described as robustness and heterogeneity analysis. The main robustness analysis by IPJ involves taking one of JO’s main outcome indicators – an index of women’s attitudes towards spousal beating consisting of six component variables – and playing fast and loose with it. Readers should note that JO’s original reporting is on the overall indices, which show significant effects of the village having access to cable TV. Because of strong and significant results on the aggregate index, there is no need for ‘p-hacking,’ i.e. to find some components that yield significant results. Had there been a pre-analysis plan, the authors might have committed to only examine components if there was a significant effect on the overall index – to explore which component of the index is most responsible for the observed effect. There was, in fact, some of this in JO’s earlier WP version, but it was cut from the final journal publication – as this is exactly the kind of thing that editors recommend cutting from final publications – unless such analysis leads to important insights.

Anyway, IPJ decompose the index and report that the ‘treatment’ only had an effect on three of the six components. Had they stopped there, this would perhaps spur a nice little discussion and make a modest contribution. However, the replicators then decide to exclude the component with the strongest impact from the index (for ‘external validity’ concerns: I won’t bore you with the details). Then, they introduce a concept of components that are easy vs. hard to change. These analyses look arbitrary at best – even though they don’t knock out the original findings – and there is really no other way to describe this than ‘data mining.’ What they found, which is not much different from the original study, really doesn’t matter: the replicators should not have been allowed to go down this road to begin with – I don’t think anyone who defined scientific replication had this in mind.

IPJ then take a more legitimate approach to robustness checks, which is to question the construction of the index. JO created binary variables out of discrete outcomes with more than two response options, and then averaged them. The replicators argue that you could use the original scores in the components (instead of converting them into binary variables), and then alternatively use principal components or multiple correspondence analyses. This all sounds eminently reasonable. But even here, IPJ cannot help themselves and decide to construct versions of these by excluding the component they took issue with earlier. If I am still able to read tables, these variations seem to make little difference to the original findings – in effect size or statistical significance. But IPJ’s conclusion from the ill-conceived analysis above and the more legitimate robustness analysis that follows is that the one or two reductions in effect size or losses of statistical significance raise doubts about the robustness of one of JO’s main findings. JO, in their response, have this to say:

“…of 18 possible permutations in constructing the index (a. averaging the variables vs. MCA vs. PCA, b. binary vs. averaging, c. including all 6 questions vs. excluding question iii vs. excluding question iv), just one of them is no longer statistically significant. Counter to the claims of lack of robustness, this seems to suggest to us that in fact the results are extremely robust to alternative constructions. We imagine that for almost any variable one might use in any empirical analysis, there exists some combination of changes along various dimensions that would yield a result that is no longer statistically significant. Calling this a lack of robustness seems like a very strict standard to hold any study to.”

I have no choice but to agree with JO’s response: the point of robustness checks in such a replication exercise is not to rerun regressions until you convert one statistically significant result to insignificant and highlight that.
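
To make the arithmetic in JO’s response concrete, here is a minimal sketch that enumerates the grid of index constructions they describe (3 aggregation methods x 2 codings x 3 component sets). The labels are my own shorthand for the options listed in the quote, not names used by JO or IPJ, and the count of one insignificant specification is taken from the quote rather than recomputed from any data.

```python
from itertools import product

# Hypothetical labels for the three dimensions listed in JO's response.
aggregation = ["average", "MCA", "PCA"]
coding = ["binary", "original scores"]
components = ["all 6 questions", "excluding question iii", "excluding question iv"]

# Every combination of one choice per dimension is one way to build the index.
specifications = list(product(aggregation, coding, components))
n_specifications = len(specifications)

# Taken from the quote: just one permutation loses statistical significance.
n_insignificant = 1
share_significant = (n_specifications - n_insignificant) / n_specifications

print(f"{n_specifications} specifications, "
      f"{share_significant:.0%} remain statistically significant")
```

Seen this way, a single insignificant cell out of the whole grid is roughly what one would expect from chance variation across specifications, which is the point JO make.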

The most glaring issue (or concern and a question for 3ie) comes in Table 6 in IPJ, where the replicators conduct heterogeneity analysis on the effect of access to cable TV on spousal beatings, autonomy, and son preference by female illiteracy for the following sub-groups: has TV; watches TV; SC/ST status; and above 35 years of age. If you’re counting, that’s 54 hypothesis tests. As if that were not enough, the replicators have attached a footnote to one of the 50-plus parameter estimates, which discusses further interaction effects within that cell (the effect that they’re discussing is that of access to cable on son preference for women who are literate, non-SC/ST, and who do/don’t watch TV! That’s three interactions in one parameter estimate).

The aim here seems to be explaining the pathways/mechanisms for the observed effects. In particular, IPJ want to provide evidence that the effects are not due to TV viewing. Setting aside the important and non-negligible issue of running interactions with an endogenous variable such as watching TV (also mentioned by JO in their response), my main issue here is with multiple hypothesis testing. If we were to apply the simplest correction (Bonferroni) for the fact that IPJ run 54 tests to interpret some of them individually, the p-values we’d require would be around 0.001 (0.05/54) instead of 0.05, a bar which none of the cable TV x illiterate x TV interaction effects would clear. But, even if we think that this is overly restrictive and we use a False Discovery Rate correction (please see my post on this topic here), still none of these interactions would be significant for spousal beating or son preference, with perhaps one surviving for female autonomy. Even this assumes that IPJ tried no other specifications, which turns out not to be true as they report more tests in the main text (see the bottom of page 18) that are not presented in Table 6. Of course, more tests with imprecise estimates would only reduce our confidence further.
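
For readers who want to see how the two corrections mentioned above differ, here is a minimal sketch. It assumes the simplest correction is Bonferroni (divide the significance level by the number of tests) and uses the Benjamini-Hochberg step-up procedure as one common False Discovery Rate correction. The p-values are made up purely for illustration and are not taken from Table 6 of IPJ; the function names are mine.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value cutoff that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the test is rejected at FDR level q."""
    m = len(p_values)
    # Indices of the tests, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value is at or below (k / m) * q.
    largest_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            largest_rank = rank
    # Reject every test ranked at or below that k.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= largest_rank:
            rejected[idx] = True
    return rejected


n_tests = 54
threshold = bonferroni_threshold(0.05, n_tests)

# HYPOTHETICAL p-values: four smallish ones and fifty clearly null results.
hypothetical_p = [0.0004, 0.0015, 0.0025, 0.03] + [0.5] * 50

bonferroni_rejections = sum(p <= threshold for p in hypothetical_p)
bh_rejections = sum(benjamini_hochberg(hypothetical_p, q=0.05))

print(f"Bonferroni cutoff with {n_tests} tests: {threshold:.5f}")
print(f"Rejected under Bonferroni: {bonferroni_rejections}")
print(f"Rejected under Benjamini-Hochberg: {bh_rejections}")
```

In this made-up example the False Discovery Rate correction is more forgiving than Bonferroni, but a p-value of 0.03 that looks significant on its own survives neither once it is one of 54 tests.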

In any case, the point here is not whether some of the effects in Table 6 would or would not survive proper multiple hypothesis testing corrections. Rather, the question is why such corrections were not conducted or required to begin with. If the authors were interested in some exploratory analysis to provide alternative interpretations of the results, they could have presented a careful plan where they identified some time-invariant baseline characteristics along which heterogeneity analysis would be conducted and presented a plan for statistical corrections in said analysis. Based on a theory, pre-specified, and statistically rigorous, such analysis could be a welcome and interesting endeavor (even though I am still not sure that it should be part of a ‘replication’ exercise: as JO point out in their response, such baseline characteristics can be correlated with many other attributes, making interpretation difficult). But, the way it is done in IPJ amounts to reading tea leaves. It seems to me that 3ie, through its review process, should have had tighter controls over the types of analyses allowed under the title of ‘replication.’ A big part of the point of replication is to reduce p-hacking, not to proliferate it.

As I said above, I will continue blogging about the replication studies over the upcoming weeks, but here are some initial takeaways that I have so far (I am thankful to David McKenzie for a discussion of some of these issues, without implicating him in any way):

Pre-analysis plans: My understanding is that 3ie initially did not insist on pre-analysis plans for replication study proposals but now it does. This is helpful but will not be perfect. On the one hand, it seems desirable for 3ie to be more prescriptive on the kinds of analysis allowed for such proposals. Providing a different interpretation of robust results may be valuable but not something we want to fund under this window. On the other hand, it will sometimes be necessary for replicators to alter the pre-analysis plan after obtaining the data – there should also be a process for this.
Reducing the time sink: Both replications on the 3ie website and the authors’ responses are really long (as long as the original studies themselves). I assume that both sides spent a lot of time on these. Part of this could be avoided by the use of pre-analysis plans and a judicious definition of what constitutes a replication under this window. But, part of it also clearly has to do with the fact that both studies themselves did not have pre-analysis plans. For replications of original RCTs that belong to this new age we live in, the PIs will be much more prepared with pre-analysis plans and replication data sets -- thus substantially reducing the time spent on replications and responding to them. Of course, in such studies, pure replication will take a more central role, because in an RCT that is designed and executed well and has a pre-analysis plan, there won’t be as much need for scientific replication: adding some controls should not make much of a difference, and attrition, baseline balance, and the like should have been dealt with in the original study. Of course, we could have journal editors assign pure replications to graduate students (perhaps as part of the data submission requirements for publication), which would further reduce the need for the 3ie replication window. That’s a good thing, as we can then focus our energy and resources on more important replications: similar studies using other data sets, experiments, which would build our knowledge on an important topic or a particular intervention.
Introduce a page limit: One of the problems with IPJ (and, from a quick read, with the replication of the worms study) is with the interpretation of the results that is provided by the replicators. Judging from the authors’ responses, this seemed to be the main cause of disagreement rather than the analysis actually undertaken. Replicators interpreted a piece of evidence as overturning (or raising serious doubts about) the results, while the authors disagreed: perfectly, and unsurprisingly, aligned with the incentives currently set up by this process. There is a simple way to fix this: introduce a template (like medical journals), which does not allow any interpretation of the results. The document should read as follows: the pre-analysis plan, the empirical models to carry out that plan, tables of the findings (with explanations as necessary) – side by side with the original findings, clearly delineating any differences (in variable construction, specification, clustering, sampling weights, etc.) between the original and the replication analyses. That’s it: no need for meandering interpretations of one of the coefficients in one of the tables, which both color the independent readers’ interpretations and judgments and unnecessarily upset the authors of the original study.
Introduce a similar review process for replications: The quality of the replication studies should be as high as the original studies. To ensure this, one might consider coming to an agreement with the editors of the journals in which the original studies appeared. In such an agreement, that journal could essentially also review the replication proposal (after an initial filter by 3ie) and reject it outright (or ask for it to be revised and resubmitted). If it moves forward, the actual replication study itself would go through the same process as the original study for publication, perhaps even sent to the same referees that reviewed the original study and preferably handled by the same editor. At this stage, the draft replication should probably also be sent to the authors of the original study, who can send comments and corrections to the editor. If journals are not happy to go as far as publishing the replication studies (and the authors’ responses) as rejoinders to the original study, then 3ie can publish them on its site. I believe a rigorous peer review process is key to reducing many researchers’ hesitations about the current replication process, the time sink, and, most importantly, the quality of the final product.
Define (ex ante) what constitutes a failure to replicate: Uri Simonsohn has a paper on the evaluation of replication results that I will discuss in more detail in the upcoming weeks in the context of further suggestions on how to report findings and how to commonly evaluate them. But, for now, let’s start by saying that we should do much less of a song and dance in economics about statistical significance thresholds and present confidence intervals instead. Suppose that an original study presented an effect size and 95% CI as being (d=10; 95% CI: 1 - 19) and a replication study introduced a tweak and reported (d=9; 95% CI: -1 - 18). Has your view of the program effects changed dramatically? Currently, original researchers have too much incentive to play up statistical significance at a given level and replicators have an incentive to find specifications that produce CIs containing zero. Let’s stop this…
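
To illustrate the last takeaway, here is a minimal sketch that compares the two hypothetical estimates from the example above as intervals instead of through a significance threshold. The `Estimate` class and the `overlap_share` helper are illustrative names of my own, and interval overlap is only a rough descriptive summary, not a formal test of replication.

```python
from dataclasses import dataclass


@dataclass
class Estimate:
    d: float      # point estimate of the effect size
    lower: float  # lower bound of the 95% CI
    upper: float  # upper bound of the 95% CI

    def contains(self, value):
        """True if value lies inside the confidence interval."""
        return self.lower <= value <= self.upper


def overlap_share(a, b):
    """Length of the intersection of two CIs divided by the length of their union."""
    intersection = max(0.0, min(a.upper, b.upper) - max(a.lower, b.lower))
    union = max(a.upper, b.upper) - min(a.lower, b.lower)
    return intersection / union


# The hypothetical numbers from the text.
original = Estimate(d=10, lower=1, upper=19)
replication = Estimate(d=9, lower=-1, upper=18)

print("Original CI contains zero:", original.contains(0))
print("Replication CI contains zero:", replication.contains(0))
print("Replication estimate inside original CI:", original.contains(replication.d))
print("Original estimate inside replication CI:", replication.contains(original.d))
print(f"Share of the combined range covered by both CIs: "
      f"{overlap_share(original, replication):.2f}")
```

A significance threshold puts these two results on opposite sides of a line, yet each point estimate sits comfortably inside the other study’s interval and the two intervals cover almost the same range.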

Well, time flew, I wrote 2,400 words, and I did not even get to start discussing the newly completed replication of Miguel and Kremer’s worms study. That’s for my next blog in a couple of weeks…

Authors

Berk Özler

Lead Economist, Development Research Group, World Bank
