Pre-analysis plans and Registered Reports: What the new opinion piece does and doesn’t imply


When editors from the AER, QJE, AER-Insights and AEJ Applied are all authors on an opinion paper calling for moderation in how people use pre-analysis plans (PAPs), researchers are bound to listen. This is especially the case because a lot of what they say is exactly what many researchers would love to hear – that “trying to write a detailed PAP that covers all contingencies, especially the ones that are ex ante unlikely, becomes an extraordinarily costly enterprise” with instead advice to “Keep the PAP short. The information contained in the AEA RCT Registry is often sufficient.” No wonder, then, that Markus summarized this in his blogpost last week as “A saner approach to pre-analysis plans.”

The paper (hereafter Duflo et al.) makes a lot of good points and will indeed be welcomed by most researchers. However, we note that there are some contrasts with one of the other recent innovations occurring, which is registered reports (RRs), pioneered in development economics by the Journal of Development Economics. As both authors and reviewers of RRs, we wanted to highlight a few points in terms of how we think about PAPs and RRs – using our takeaways from the new Duflo et al. PAP paper and our experiences with the RR process.

  1. One size does not fit all – different levels of detail will be useful for different purposes

Duflo et al. write that “An examination of 25 randomly selected PAPs on the registry underscored such ambiguity: page lengths vary between one and thirty pages.” We see this as partially a feature, and indeed we have ourselves written PAPs of very different levels of detail and length, with different purposes in mind. For large, multi-year, expensive projects, in which we clearly understand the intervention, are explicitly testing a specific theory, and have a range of well-defined outcomes we care about, it has made sense to be as specific as possible in our PAPs. But in other cases, the intervention is a bit of a black box to us until it actually occurs and we are unclear exactly what outcomes make sense to measure, what types of data we will be able to collect, or even what funding we will have down the line to collect these data. In such cases, putting down a very simple and short document so that the “PAP serves as a record of initial intentions” seems useful. We don’t think that a one-size-fits-all standard-length PAP is likely to be the optimal approach.

  2. If you had only one table and/or figure to show the impacts of your study, what would it show?

Many development experiments have a large range of potential outcomes over multiple domains, with multiple causal mechanisms potentially at play, and often researchers have several different treatments occurring. This is part of what leads to fat PAPs, as researchers try to pre-specify all these different things they could possibly look at. A test that we try to ask ourselves, our co-authors, and partners is what we call the 2-page policy brief or Science paper test: if you could show people only one table and one figure to illustrate the results of your study, what would you want to show? This helps prioritize what the key outcomes are, helps deal with multiple testing concerns, and fits well with the suggestion of Duflo et al. that “Highlighting core hypotheses by putting them at the center of a short PAP enhances credibility in the public eye, especially given the limited ability to process complex evidence.” This can be simply included in the AEA registry under primary outcomes.

  3. Be precise on how you will measure outcomes when you can

Duflo et al. note that there are times where it is difficult to be precise about how the outcome variable will be measured. But our experience is also that there are many times where it is possible to be precise and researchers are not. We see this particularly when refereeing registered reports. For example, researchers might specify “income” as a key outcome. But we want to know a lot more details, such as over what period will this be measured? Will it include income from all sources (including home production) or just monetary wages? Will it be used in logs, levels, inverse hyperbolic sines, or some other transformation? Will it be winsorized, or how will outliers be dealt with? If comparisons are being made over time and space, will nominal or real income be measured? Etc. The point is that even for a relatively simple outcome like income, there are many decisions to be made, and ex ante thinking through these is useful.
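To make the income example concrete, here is a minimal sketch of what writing those choices down might look like. Everything in it is an illustrative assumption rather than a recommendation: the function names, the choice to deflate by a price index, the top-tail winsorization at the 99th percentile (using a nearest-rank percentile), and the inverse hyperbolic sine transformation are just one possible set of answers to the questions above.

```python
import math

def winsorize_top(values, pct=99.0):
    """Cap values above the pct-th percentile (nearest-rank definition)."""
    ordered = sorted(values)
    # Nearest-rank percentile: the value at position ceil(pct/100 * n), 1-indexed.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    cap = ordered[rank - 1]
    return [min(v, cap) for v in values]

def ihs(x):
    """Inverse hyperbolic sine, log(x + sqrt(x^2 + 1)); defined at zero."""
    return math.asinh(x)

def build_income_outcome(monthly_nominal, price_index, pct=99.0):
    """Hypothetical pre-specified outcome: real monthly income,
    winsorized at the top, then IHS-transformed."""
    # Step 1: convert nominal to real income using a per-observation deflator.
    real = [y / p for y, p in zip(monthly_nominal, price_index)]
    # Step 2: cap the upper tail to limit the influence of outliers.
    capped = winsorize_top(real, pct)
    # Step 3: apply the transformation named in the plan.
    return [ihs(v) for v in capped]
```

Writing the outcome as a single function like this forces each decision (period, deflator, outlier rule, transformation) to be stated explicitly, which is the point of the exercise whichever choices you actually make.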

A related issue is when the researcher wants to (or has to) deviate from the PAP for one reason or another. It is useful then to have some “standard operating procedures” to fall back on, which gives some discipline to the exercise and serves as a “safety net for PAPs” as suggested by Lin and Green. See the SOP at Don Green’s lab at Columbia University here, which can be a live document but with time-stamped versions.

  4. The potential value of a PAP and especially a RR is not just transparency

Registering a simple PAP or set of fields in the AEA registry is an important step towards alleviating file drawer problems (where some results are never published) and transparency/robustness concerns (by making clear the key outcomes researchers had in mind). But the potential value of a PAP and especially going through a process like RR is so much more than this – ideally the process should not just document what is done, but also be an avenue to ex-ante improve the study design.

  • Potentially improving the interventions themselves: clearly documenting what the setting is for the study (e.g. age, gender, poverty levels, context of the subjects) and the precise details of the intervention (e.g. training materials, how grants will be disbursed, dates, conditions, costs, etc.) makes the researcher pay attention to the plumbing of the program, and identify early on implementation details that may make it less successful than intended. If a registered report is shared with external reviewers at this stage before the intervention starts, outside experts may also be able to raise concerns that are not constructive ex-post (“why on earth did you do it that way?”) but that might be helpful ex ante. Cyrus Samii has a nice thread on twitter about how EGAP has workshops to do this, saying that PAPs “should be a vehicle for having an ex ante *conversation* about what you are doing, as a way to refine your study and also get buy-in and agreement.” Colin Camerer notes this role too in cognitive neuroscience, where researchers must present planned research design before using fMRI, since, as in RCTs, the cost of poor design choice is really high.
  • Improving what is measured: the process of carefully documenting the intervention and main intended outcomes can help the researcher designing survey questionnaires to make sure they don’t inadvertently fail to measure something – this is where explicit mapping of key outcomes to survey questions is needed. Again, when registered reports are submitted before follow-up data collection takes place, reviewers also have the chance to offer advice on how to make sure a critical outcome is measured, or to measure something that will help publication. For example, for this recent prospective article in JDE RR Stage 1, the reviewers and the editor had very helpful suggestions on measuring mediating economic outcomes to get at the mechanisms through which the intervention may have an impact on the final outcome. Because the review was after baseline but prior to any follow-up data collection, the authors were able to add several measures to the follow-up survey and include them in the RR.
  • Improving the estimation techniques: while this is useful for transparency, we see this as less crucial for research improvement in most cases – since there is not that much difference between reviewers asking you to do a robustness check ex ante, and them asking it ex post. It is only when different measurement is needed that ex ante feedback on estimation becomes as important.
  • To allow for a thorough discussion of any ethics issues that might be associated with the intervention or data collection methods: just as researchers may not always have the perfect foresight or the cognitive bandwidth, as Duflo et al. suggest, with respect to theory, methods, and data, they may similarly have blind spots on the potential harms the proposed intervention or data collection methods might cause to subjects in the study or others in study communities. Institutional Review Boards are not always the best arbiters of such discussions, either. An ex ante discussion of intervention design and privacy concerns (such as using administrative data from a partner organization to contact subjects) can go a long way in avoiding problems ex post.
  5. Thinking through treatment heterogeneity and the role of exploration

We’ve all come across papers in which the average effect is zero, but then the authors have a lovely model and story about why impact should only be there for some subgroup and emphasize treatment heterogeneity. Duflo et al. note that “Readers of the research paper should treat those results exactly as they would any study on secondary data without a pre-analysis plan that is based on credible causal inference.” In many cases, this will involve some skepticism, so we agree with their recommendation that “If the researchers know that they are interested in specific subgroups, it makes sense to specify them in advance; this is in fact a case where pre-specification can be very valuable for increasing the credibility of the findings.” But this is also where ex ante outside review can be helpful – we have had situations where a reviewer has requested us to examine heterogeneity in a dimension that was not pre-specified, and doing so has greatly helped in understanding impacts of the intervention – but where we think the exploration would have been less credible if we had simply presented this (not pre-specified) heterogeneity as part of the paper. And, as above, some standard operating procedures to fall back on for not pre-specified heterogeneity analysis might include endogenous stratification in studies with smaller samples and the machine learning techniques cited in Duflo et al. in larger ones.

So, what should you do as a researcher?

As we note, we don’t think one size fits all, and different researchers, and indeed the same researcher on different projects, may want more or less detailed plans. We think of this as a multi-step process, where researchers can decide how many of the following steps they want to take (and these steps need not be linear):

Step 1: register the barebones design and intentions in the AEA registry

Step 2: write a first version of the PAP, at least to cover what you will measure in a first follow-up survey or in the short-term, once you understand more what the intervention is doing.

Step 3: get external feedback on the basic design and short PAP through working seminars, sharing with partners, etc.

Step 4: Write a RR and get more formal refereeing and advice

Step 5: potentially go through iterations of the PAP, or rounds of revisions on RR

While the profession now has a decade of PAP experience, RRs are much newer, and the interplay between the two is still evolving. Likewise, there is a need to monitor how short PAPs and deviations from them are used over time, and how this varies for researchers at different career stages and from different backgrounds. For example, will shorter PAPs and more tolerance of deviations from them particularly help less experienced researchers, or will editors and referees be less trusting of less precision and of deviations if they come from people without research track-records? Likewise, the role of more detailed RRs in improving research designs will likely vary with researcher background. It is therefore key that we continue to share lessons, and it is great to see how researchers’ views evolve – indeed we appreciate Duflo et al. noting that their own views have evolved over time: “We recognize that some of our own, early experimentation with PAPs do not adhere to this principle, and regret if these growing pains have had the unfortunate consequence of unintentionally signaling that extremely detailed pre-analysis should be considered de rigueur. That is not our view.”

Authors

David McKenzie

Lead Economist, Development Research Group, World Bank

Berk Özler

Lead Economist, Development Research Group, World Bank
