The journey of a matching paper



I wrote this post on Friday, October 8, a few days before the announcement of the 2021 laureates in economic sciences: Angrist, Card, and Imbens. Sort of apt, I think – I can only hope that we did their immense work justice, while plying the trade they built…

Earlier this year, I wrote a blog post about a working paper with a number of co-authors on the impact of a large cash transfer program for refugees in Turkey. Now that paper is out in the Journal of Development Economics, an outcome that is both very pleasing and, to be perfectly honest, a little bit surprising for me. Papers using inverse probability weighting have a hard time cracking into top field journals these days, and JDE is one for our field. I have been thinking about what might have contributed to this outcome and I decided to share some of my thoughts, in the hope that they might be helpful to some researchers.


In defense of peer review (sort of) …


David wrote here about “what you need to do to make a matching estimator convincing,” which included an introductory discussion on the difficulty of getting econ audiences to buy your findings that are based on propensity score matching. While covering the steps to take to improve statistical and rhetorical plausibility, he used our paper in question as an example of the latter:

“Berk’s recent paper gives another type of example in this category – bureaucratic errors might determine who gets selected and who doesn’t. For example, they note that spelling differences, the case backlog, and issues linking registrations of different members could explain why one household gets selected and another household with similar observables did not.”

This was our best rhetorical evidence on why there were all these households that looked identical on baseline characteristics but had discordant treatment statuses at the outset of Turkey’s ESSN program:

“While this system was best practice and hard to game, there could be discrepancies between a household's demographic characteristics at baseline and what was registered in the underlying databases (PDMM, MERNIS). As described in Section 2, such discrepancies could exist due to ... This means that households that appear de facto eligible at baseline could be deemed de jure ineligible by the program administration and vice versa. We exploit these discrepancies to match households that appear very similar on observables but have discordant treatment status.”

That was our original submission. However, this is where good refereeing comes in. One of our reviewers had the following comment, which the editor wanted us to explicitly address in our revisions:

“The authors describe the reasons for discrepancies between these measures on pp. 14. To me, there are three main reasons why records could differ and thus allow the matching strategy (later points notwithstanding) to be possible. (1) Exogenous measurement error, (2) real changes in composition/eligibility between registration and PAB; (3) Different definitions (measurement) of the household unit between the two data sources.

“If discrepancies are largely a result of (1) or (2), which are the reasons the authors highlight in pp. 14 ("absence of documentation, spelling differences in existing documents, a backlog of cases to be processed, or difficulties in linking registrations of household members"), I would be rather convinced of the empirical strategy…

But what is left unmentioned here is that the main reason "how similar households ended up with different beneficiary status at baseline" is because of measurement. The definition of the "applying unit" in the ISAS system is different from the survey definition of "household" in a way that might hamper achieving the independence condition obtainable by PSM. From my experience with the context, this is the case: see here, "What is considered a family?": "A family is any combination of the following members: mother and/or father, their unmarried children, grandparents and other unmarried family members. All family members on the application form must be registered under the same DGMM family number (Family ID number granted by DGMM)." - with this definition not involving shared resources or physical proximity, as by definition (or implied) by the survey question. The inherent assumption in the strategy is that these two measures (case unit structure for eligibility, and household structure in PAB) are the same, when by design they are not and there are various unobservables that cause them to differ. If this is not an assumption being made in the strategy, then the text should make that clear.”

You can see that this is where, instead of a knee-jerk reaction to the identification strategy (“It’s not an RCT: next!”), we carefully parse out the reasons behind why we could successfully match people with discordant treatment statuses in our baseline data. Some of those scenarios, like the ones we originally argued, favor the ID strategy. Others, like the third explanation that Reviewer #3 proposed, are clear threats to identification. The job of editors is to arbitrate the relative likelihood of those alternatives, weigh them against other study limitations and against the novelty or potential contribution of the paper, and deliver a verdict: reviewers and authors debate in front of the judge, with the reviewers asking for more evidence in favor of the authors’ arguments or a concession of their point.

As you can see in our response to the reviewer below, which I am pasting fully, we went with the concession approach:

a.      We thank the reviewer for this comment. This is indeed a threat to identification. It is similar to one that we had thought about but the reviewer’s example is much clearer. We elaborate below in response to the reviewer and also insert text into the revised manuscript to clarify that this threat to identification exists (and is not addressed by the IPW approach we adopt):

b.     Imagine that there are two households of identical composition: Mother, father, three unmarried children under the age of 17 (C1-C3), and a grandmother. In HH1 everyone is registered under the same DGMM family number. In HH2, everyone is registered under the same DGMM for that household, except for C3. Let’s say, hypothetically, that HH1 is eligible by the official criteria due to the fact that their dependency ratio is equal to 2 (4 dependents divided by 2 able-bodied adults >1.5), while HH2 is deemed ineligible because their dependency ratio (in their registration and, hence, ESSN application) is equal to 1.5: so, HH2 just misses the cutoff…As the reviewer pointed out, there could be multiple reasons for this:

                                               i.     C3 in HH2 was left out of the registration by clerical error. The HHs are identical. IPW works fine.

                                              ii.     C3 in HH2 is not yet registered, because her national ID card was lost en route to Turkey, but she is just like C3 in HH1 otherwise. The HHs are identical. IPW works fine.

                                             iii.     C3 in HH2 is not yet registered because she arrived later from Syria with her cousins, with whom she was staying before leaving Syria. HHs are not identical – potential threat to IPW.

                                             iv.     C3 in HH2 is not registered because she is not related to the family registered under the DGMM. She is a distant relative (say, not eligible to be registered under the same family number) or a friend of the family. HHs are not identical – potential threat to IPW.

c.      The reviewer laid out scenarios 1, 2, and 4 in their comments. We had discussed scenario 3 among the co-authors, following a comment by a seminar participant: “If two HHs with the same demographics are different because some of them arrived later and, hence, not registered with the family, are they really the same as families who arrived altogether and managed to register everyone?” The reviewer’s scenario is starker: it is in fact possible that two households that match on all eligibility criteria (per the baseline survey) and have discordant treatment statuses are in fact different types of families, with the difference not being orthogonal to the primary outcomes in our study. For example, C3 could be a biological daughter in HH1 vs. a niece in HH2, which could affect their probabilities of being in school, working for money, getting married early, etc., with knock-on effects on food consumption, school attendance, and so on. Unregistered members may be more likely to be a certain sex or older or younger within an age band, etc. These would all violate the conditional independence assumption.

d.     We have added a long footnote 22 into Section 4.1 (pasted below), warning the readers of this possibility. As we do not have information on exact ages of HH members, relationships between members, or who is registered under the same DGMM and who is not, there is no way for us to check how serious this threat is. However, perhaps, this is not a major concern when it comes to the main finding of the paper, i.e., the endogenous HH reorganization. Regardless of whether such differences exist between households that look very similar on observables, it would be hard to come up with plausible scenarios that also explain such a large net movement of children from the control to the treatment group within a short period of time – instead of this being a causal effect of ESSN eligibility. It may be a more serious concern for the VT estimates on school attendance, per capita consumption, and so on.

                                               i.     “The registration system had problems, particularly during the initial stages, which included registration of split families (see, e.g., Oxford Policy Management 2018). The FAQ page for Kızılay Card includes a question on what applicant households should do if they have some family members registered under a different DGMM family number. Hence, the possibility that discrepancies existed for many households between household composition at baseline and what is registered at DGMM, i.e. what is in the ESSN application, is not in question. However, eligibility status could also be discordant for two households with the same observable composition if differences in registration reflect real differences in family structure: for example, everyone could be registered in one household because they are a nuclear family, while another household with exactly the same observable eligibility criteria might have one member not registered because she is a distant relative, a married child, or a friend of the family, who could have been deemed ineligible to register with the household under the program rules of what is considered a family. We cannot rule out the existence of a non-negligible number of such cases in our study and our identification strategy does not protect against such unobservable differences. If, after matching eligible and ineligible households on a rich set of baseline characteristics, such differences remain and they are predictive of the outcomes of interest, they could bias our findings. Such bias is less likely to be relevant for the main finding of endogenous household reorganization, but more so for outcomes such as school attendance, food consumption, etc.” (Footnote 22 on page 16 of the resubmitted manuscript)
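
To make the arithmetic in the HH1/HH2 example concrete, here is a minimal Python sketch. The function names, the way households are represented, and the strict "greater than 1.5" cutoff are assumptions lifted from the hypothetical example in the response; this is not the program's actual eligibility code, which involved other criteria as well.

```python
from fractions import Fraction

# Cutoff taken from the hypothetical example (the ratio must exceed 1.5).
# The real eligibility rules had further criteria that are not modeled here.
CUTOFF = Fraction(3, 2)

def dependency_ratio(dependents: int, able_bodied_adults: int) -> Fraction:
    """Dependents per able-bodied adult, kept as an exact fraction."""
    return Fraction(dependents, able_bodied_adults)

def eligible(dependents: int, able_bodied_adults: int) -> bool:
    """True if the dependency ratio strictly exceeds the cutoff."""
    return dependency_ratio(dependents, able_bodied_adults) > CUTOFF

# Both households look the same in the baseline survey:
# 3 children + 1 grandmother = 4 dependents, 2 able-bodied adults.
survey = {"HH1": (4, 2), "HH2": (4, 2)}

# In the registration data, C3 is missing from HH2's family number.
registered = {"HH1": (4, 2), "HH2": (3, 2)}

for hh in survey:
    de_facto = eligible(*survey[hh])      # what the survey composition implies
    de_jure = eligible(*registered[hh])   # what the registration implies
    print(hh, "de facto:", de_facto, "de jure:", de_jure)
```

Both households come out as de facto eligible, but only HH1 is de jure eligible. The arithmetic is identical under all four scenarios in the response; what differs is why C3 is missing from the registration, and that is exactly what neither this calculation nor the matching can see.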

So, your infrequent reminder that not all reviewers are asinine. Many papers do get better through the much less than perfect peer review system that we have and not all of them cause psychological damage in the process.


Factors that may have led to a higher chance of acceptance …


This is speculation, obviously, but here are a few things that I think might have been in our favor. Some of these are under the researchers’ control while some of them are pure luck:

·       Pre-registration of the non-experimental analysis: This is becoming a bit more common now and I do think that it helped us a bit. When my operational colleagues asked me to get involved with this possible evaluation, one of the things we asked them was to give us the baseline data and hold back any follow-up data that came (or would come) in – until we conducted our matching exercise, finalized the study sample, and registered it along with the plan on how to analyze which outcome variables. I suspect that fixing the matched and trimmed sample and registering it with EGAP helped alleviate concerns of tinkering with different matching approaches for p-hacking. Registering a small number of outcome indicators helped prevent us from going to look at other variables that might have been affected by the program, and so on. As an aside, you could not do much better than reading Imbens (2015) on matching methods, while writing your PAP for a study like ours…

·       Being very forthright with the study limitations: Many fields, such as biomedical sciences, have a section at the end devoted to study limitations. We felt that our study had so many of these issues, each of which was important (and potentially fatal) by itself, that we needed to have a large section in the middle of the paper – before presenting the study findings – to discuss them in detail. With a sub-section devoted to each limitation and what we did to address it, we were able to say essentially that “even taking these into account, and being conservative with the interpretation of the findings, a robust picture of defensible conclusions emerges nonetheless.” I have no way to prove this, but I do think that it is better for authors to point to problems with the study themselves rather than sweeping them under the carpet by having a throwaway sentence or two that dismiss them. I take this approach when giving seminars as well as while writing papers.

·       Interesting findings that are relevant for policy design (i.e., luck): If our main findings did not survive all the bounding exercises, etc., discussed above, the paper would have been much less convincing. Similarly, if we did not stumble upon the interesting finding on household reorganization in response to the introduction of the cash transfer program, it would also have been a less interesting paper – more like a plain vanilla impact evaluation, which cannot credibly estimate an intent-to-treat effect. That the program potentially reduced poverty and inequality substantially among the entire refugee population, and not just in the treatment group, directly as a result of the churn in household composition, is interesting and important for policymakers to know about: there are possible design lessons to take from these findings.

  •         I can almost hear some of you saying “interesting findings should have nothing to do with it…”: I agree with you and also disagree at the same time. On the one hand, sure, the file drawer problem is real and null findings should get published. I think that this is true for registered studies (especially very important RCTs for policy). I also think that such publications do not have to be in our flagship journals (unless the null finding is itself interesting because it debunks a popular view about the effectiveness of something). They need to be searchable, so that they can be included in reviews and meta-analyses. Our journals are like our profession’s newspapers, devoted mainly to the headlines. Many of these are boring – even when they are relevant for policy: we don’t need to insist that every evaluation of every cash transfer program, regardless of what it finds, be published in our flagship journals.
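
Returning to the first bullet above, the idea of fixing the trimmed sample before any follow-up data arrive can be sketched in a few lines of Python. Everything here is hypothetical: the household IDs, the propensity scores (assumed to have been estimated already from baseline covariates alone), the 0.1 to 0.9 trimming bounds, the ATE-style weights, and the idea of recording a hash of the retained IDs are illustrative assumptions, not the specification registered for the paper.

```python
import hashlib

# Hypothetical baseline records: (household_id, treated, propensity_score).
# No outcome data appear anywhere in this step.
baseline = [
    ("hh01", 1, 0.95),
    ("hh02", 1, 0.60),
    ("hh03", 0, 0.40),
    ("hh04", 0, 0.05),
    ("hh05", 1, 0.25),
    ("hh06", 0, 0.80),
]

LOW, HIGH = 0.1, 0.9  # assumed trimming bounds, chosen for illustration

def trim(records, low=LOW, high=HIGH):
    """Keep households whose propensity score lies within [low, high]."""
    return [r for r in records if low <= r[2] <= high]

def ipw_weight(treated, score):
    """Inverse probability weight: 1/p if treated, 1/(1-p) otherwise."""
    return 1.0 / score if treated else 1.0 / (1.0 - score)

def fingerprint(records):
    """SHA-256 digest of the sorted IDs, to file alongside the analysis plan."""
    ids = ",".join(sorted(r[0] for r in records))
    return hashlib.sha256(ids.encode("utf-8")).hexdigest()

sample = trim(baseline)
weights = {hh: ipw_weight(t, p) for hh, t, p in sample}
digest = fingerprint(sample)

print("retained households:", sorted(weights))
print("sample fingerprint:", digest[:12], "...")
```

The point of the design is the ordering: the sample and the weights are pinned down using baseline information only, so that later choices cannot be tuned to the outcomes. The fingerprint is just one hypothetical way to let a reader verify that the analyzed sample is the registered one.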


Berk Özler

Lead Economist, Development Research Group, World Bank
