Here is a familiar scenario for those running field experiments: You’re conducting a study with a treatment and a comparison arm and measuring your main outcomes with surveys and/or biomarker data collection, meaning that you need to contact the subjects (unlike, say, using administrative data tied to their national identity numbers) – preferably in person. You know that you will, inevitably, lose some subjects from both groups to follow-up: they will have moved, be temporarily away, refuse to answer, have died, etc. In some of these cases there is nothing more you can do, but in others you can try harder: you can wait for them to come back and revisit; you can try to track them to their new location, etc. You can do this at different intensities (try really hard or not so much), within different boundaries (for everyone in the study district, region, or country, but not for those farther away), and for different samples (for everyone or for a random sub-sample).
Question: suppose that you decide that you have the budget to do everything you can to find those not interviewed during the first pass through the study areas (it doesn’t matter whether you have enough budget for a randomly chosen sub-sample or for everyone), i.e. an intense tracking exercise to reduce the rate of attrition. In addition to everything else you can do to track subjects from both groups, you have a tool that you can use only for those in the treatment arm (say, your treatment was group-based therapy for teen mums, and you think that the mentors for these groups may have key contact information for treatment-group subjects who moved; there were no placebo groups in control, i.e. no counterpart mentors). Do you use this source to track subjects – even though it is only available for the treatment group?
Now, I know that you’re expecting an answer, as most of the time our blogs are about a new paper that is addressing such a question. However, in this case, while I have my ideas, I can’t claim to have the answer. In fact, a short search that I conducted has not produced an answer to this question. But, it did lead me to read the “attrition” chapter of the “Field Experiments” book by Gerber and Green, because I thought that Don Green might have mentioned something about this in his 2012 book. The reason is that this question was actually put to me by one of our field workers in a current experiment, where we are currently, and intensively, tracking all respondents who can be tracked. When I answered that we are tracking as intensively as we can for everyone, meaning that we use all sources of information available on the possible whereabouts of our missing subjects, said field worker told me that Don Green, in the class that she took from him at Columbia, told the class that it was better to leave such information on the table, lest it cause differential attrition. So, off I went to revisit chapter 7 in his book…
For those who want a primer on the topic, this chapter is a good read, as it goes into things in a bit more detail than the randomization toolkit by Duflo, Glennerster, and Kremer (2006) (I don’t have the more recent Glennerster and Takavarasha book at my fingertips). Of course, you should also check out our “tools of the trade” posts on attrition. While the chapter generally goes through the familiar topics that surround the handling of attrition, I did learn a couple of things that I had not thought about as carefully previously. I also liked the use of the potential outcomes framework, which makes things easier to understand (and more contemporary). Here is a quick summary of the chapter, in bullet point form:
- Attrition is a scourge in all kinds of studies, but the sting is felt most acutely by those who set up RCTs, because of the threat of bias in an otherwise clean design. In field experiments with survey or biomarker data collection, there will always be loss to follow-up and without some assumptions about the form of that attrition, it may be impossible to make any causal inferences about intention to treat or average treatment effects.
- If attrition (missingness in Gerber and Green, 2012) is independent of potential outcomes (MIPO), then we have unbiased estimates but reduced power. Of course, MIPO is an assumption that you cannot really confirm or deny, but you can investigate it with good statistical detective work in the usual ways that we increasingly all do in economics and political science. Sometimes, MIPO may be satisfied only conditional on certain baseline covariates X, say, age, sex, location, etc., which the book calls MIPO | X.
- Again, MIPO | X is simply a conditional on observables assumption, but one of the things that I took away from reading the chapter was that you may examine this assumption by looking at baseline variables that are prognostic of the follow-up outcome within these cells. Suppose that you have differential missingness by age and sex, but you think that things are orthogonal once we condition on these covariates: if your outcome is a test score and you have a baseline score, you could simply look at the baseline scores within each of these cells and confirm that they don’t predict attrition. Of course, if you could do that, why not further your MIPO | X assumption, where X also includes that (those) prognostic variable(s)? Just as in the examination of baseline balance, you don’t have to control for things that are unbalanced: you should control for things that are prognostic of the outcome of interest.
- MIPO | X also gives you the commonly known and sometimes used reweighted ITT using cell proportions, also known as inverse probability weighting (IPW).
- Generally, however, it may simply be impossible to convince yourselves, your referees, or your editor that missingness is not a problem. And, given that missingness is a fact of life, you’ll have to deal with it. As mentioned above, IPW is one way of dealing with it, but increasingly one that does not cut it with referees, especially in RCTs, because you’re invoking a “conditional on observables” assumption – something from which you were desperately trying to get away when setting up your study. I personally still like to see it reported (it can simply be in the form of an F-test from a regression of missingness on a bunch of prognostic baseline covariates, and on those covariates interacted with treatment status): it gives me a data point that I can evaluate.
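To make the reweighting concrete, here is a minimal Python sketch of cell-based IPW on simulated data (all names and numbers below are made up for illustration, not from any study): each respondent is weighted by the inverse of the estimated response rate in its covariate-by-treatment cell.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated experiment: treatment t, a binary covariate x defining cells,
# outcome y observed only for respondents. The true ITT is 1.0.
n = 2000
t = rng.integers(0, 2, n)
x = rng.integers(0, 2, n)
y = 1.0 * t + 2.0 * x + rng.normal(0, 1, n)

# Missingness depends on x only, so MIPO | X holds by construction.
responded = rng.random(n) < np.where(x == 1, 0.9, 0.6)

# Weight each respondent by the inverse of its cell's response rate.
w = np.zeros(n)
for xv in (0, 1):
    for tv in (0, 1):
        cell = (x == xv) & (t == tv)
        w[cell] = 1.0 / responded[cell].mean()

treated = responded & (t == 1)
control = responded & (t == 0)
itt_ipw = (np.average(y[treated], weights=w[treated])
           - np.average(y[control], weights=w[control]))
print(f"IPW ITT estimate: {itt_ipw:.2f}")
```

The weights simply restore the original cell proportions among respondents, which is exactly the reweighted ITT using cell proportions that the book describes.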
What Gerber and Green (2012) define as “extreme value bounds” and “trimming bounds” correspond to what economists usually call “Manski” and “Lee” bounds:
- Manski bounds are assumption-free and bracket the true ITT or ATE. However, in practice, they can be so wide as to be meaningless. This can happen either because attrition is large (so there are too many missing values to fill with extreme values) or because the range of plausible values of the outcome variable is large. So, if you have a binary or discrete outcome and low attrition, Manski bounds can work for you. Otherwise, you are out of luck here: this approach will suggest that the effect of your program could be anything…
- What about Lee bounds? Well, despite their popularity, I never liked this approach over the alternatives… It introduces a monotonicity assumption that excludes the existence of “if-untreated-reporters” (if attrition is higher in control) or of “if-treated-reporters” (if attrition is higher in treatment). Once you do this, you can no longer recover an estimate of the ITT for your original sample, but only an estimate of program effects for “always-reporters”. Rank preservation restrictions, which are discussed in the Duflo et al. (2006) toolkit (which refers to the Angrist, Bettinger, and Kremer, 2006 paper that is also used as an example in Gerber and Green, 2012) and in the Behaghel et al. (2015) paper, discussed by David here, are versions of the same assumption. The method is popular because it gives tighter bounds than the Manski approach (even tighter with the improvements proposed by Behaghel et al., 2015), but it comes at the cost of making another assumption and giving up the ITT/ATE for the original random sample, settling for them among the always-reporters. I don’t understand why many reviewers frown upon IPW but are much more accepting of Lee bounds: in RCTs, oftentimes the ITT on the original random sample is of key importance – it’s the population of interest. Sure, the ITT for the always-reporters may be informative in cases where we’re looking for a test of a theoretical prediction, but such subtlety is hardly present in papers or referee reports.
- OK, so none of the ways to deal with attrition are sufficiently attractive, and even a mild case can torpedo your study. So, the best offense is a good defense: prevent large amounts of attrition to begin with. One can limit attrition by devoting more funds to finding subjects at follow-up, but, of course, those funds come at the cost of something else: a larger sample size, better measurement of outcomes, etc. The book is convincing in guiding researchers towards selecting a random sample of those lost to follow-up and intensively going after them. It does so by showing the differences in bias in the ATE and in the extreme value bounds, theoretically and through simulation exercises. In cases where the attrition problem is non-negligible and “regular tracking” is unlikely to be successful enough to make “Manski bounds” meaningfully tight, a plan to select a sub-sample (perhaps using block randomization) and to try really hard to find them, in ways that are more expensive than regular tracking, may end up being more cost-effective. The simulations (presented in Table 7.6 of the book) are useful because selecting a sub-sample is risky business – it will produce noisier estimates that will get assigned higher weights in the final tally. Intuitively, when the intensive second-round subsample is large enough and successful in finding most people, and therefore more likely to be MIPO, the benefits from this approach are shown to be the highest, especially if the first-round attrition was not MIPO. And, hoping for success in such an intensive tracking exercise does not have to be a pipe dream: in this paper, we randomly sampled one in three children who did not have assessments after the first pass through their schools and villages for further tracking, and were able to find 37 of the 42 children (88%) randomly assigned to second-round tracking.
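The two kinds of bounds discussed above are easy to compute by hand. The sketch below (made-up numbers, not from any paper's replication files) implements each in a few lines; the trimming version is deliberately simplified, assuming equal-sized arms and lower attrition in treatment.

```python
import numpy as np

def manski_bounds(y_t, y_c, miss_t, miss_c, lo=0.0, hi=1.0):
    """Extreme value ('Manski') bounds on the ITT: fill every missing
    outcome with the worst and best values the outcome can take."""
    def arm(y, n_miss):
        n = len(y) + n_miss
        return (sum(y) + n_miss * lo) / n, (sum(y) + n_miss * hi) / n
    t_lo, t_hi = arm(y_t, miss_t)
    c_lo, c_hi = arm(y_c, miss_c)
    return t_lo - c_hi, t_hi - c_lo

def lee_bounds(y_t, y_c):
    """Trimming ('Lee') bounds on the effect for always-reporters,
    assuming equal-sized arms and lower attrition in treatment
    (monotonicity: no 'if-treated-reporters')."""
    y_t = np.sort(np.asarray(y_t, float))
    y_c = np.asarray(y_c, float)
    excess = len(y_t) - len(y_c)  # extra reporters in treatment
    if excess <= 0:
        raise ValueError("this sketch assumes more respondents in treatment")
    lower = y_t[:len(y_t) - excess].mean() - y_c.mean()  # trim top of T
    upper = y_t[excess:].mean() - y_c.mean()             # trim bottom of T
    return lower, upper

# Binary outcome, arms of 100 with 5 missing each: bounds stay informative.
mb = manski_bounds([1] * 60 + [0] * 35, [1] * 50 + [0] * 45, 5, 5)
print("Manski bounds:", mb)

# Trimming example: 6 treatment reporters vs 4 control reporters.
lb = lee_bounds([1, 2, 3, 4, 5, 6], [1, 2, 3, 4])
print("Lee bounds:", lb)
```

Note how the Manski bounds widen mechanically with the missing share, while the Lee bounds stay tighter by trimming the treatment distribution instead – at the price of the monotonicity assumption and of answering a question about always-reporters only.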
As Gerber and Green point out, you can then conduct Manski bounds on a much smaller share of the sample: suppose you were missing 25% of your sample in the first round and found 90% of the random sub-sample in the second round: you now need to fill in only 2.5% of the sample (0.25 x 0.10) to calculate extreme value lower and upper bounds…
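That arithmetic is worth writing out (the 25% and 90% figures are just the hypothetical numbers from the example above):

```python
# First-round attrition of 25%; intensive second-round tracking then
# finds 90% of a random sub-sample of the missing subjects.
first_round_missing = 0.25
share_found_in_round_two = 0.90

# Share of the sample still needing extreme value fill-in.
still_missing = first_round_missing * (1 - share_found_in_round_two)
print(f"share to fill with extreme values: {still_missing:.1%}")  # 2.5%
```

For an outcome bounded in [0, 1], the width of the extreme value bounds shrinks in the same proportion as the filled-in share, so here the bounds tighten tenfold.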
So, back to the question at the top of the post. First, it seems to me that it is paramount to minimize the number of subjects lost to follow-up: this makes the bounds estimated later tighter – I am willing to accept possibly a bit more bias in my ITT/ATE for tighter bounds. Second, it is not clear to me that using an additional source to find someone is all that different from other things our experienced enumerators might be doing to locate everyone they can: perhaps there was also a more prevalent source in control clusters that is not observable to me as the PI. Third, it is also possible that, ex post, the treatment group is much less likely to be found in their original villages because, perhaps, the treatment caused more of them to seek new opportunities in other (urban) areas (see, for example, this paper by Markus and colleagues). In such cases, an advantage in finding treatment subjects in the second, intensive tracking round may actually restore balance in missingness by closing the gap in attrition from the first round. Finally, if you define “trying really hard to obtain outcome measures for missing subjects” as doing everything that you can possibly do given your budget (and not intentionally pursuing different methods by study arm, but simply using all sources of information to locate missing people), then I am not sure that this clearly constitutes a new source of bias. To me, that cat is already out of the bag as soon as we were unable to find some people from either group – as we simply don’t know what factors are causing whom to be lost to follow-up in either group. Once we’re agnostic about the bias and trying to provide bounds for the ITT as tightly as we can, I’d rather minimize the number of subjects missing than leave sample on the table…
But, like I said, this is more of a bleg and I am happy to hear counterarguments – they might, after all, save my next field experiment. Please comment below, especially if you’re Don Green or you work in his lab…
Update (9/25/2017; 8:00 AM): Within an hour, Alex Coppock (@aecoppock) responded with a link for "Double Sampling for Attrition." Worth your while to check it out...
We have a whole section on practical ways to reduce attrition in Chapter 7 of Glennerster and Takavarasha. Also long section in my chapter in Handbook of RCTs. http://www.sciencedirect.com/science/article/pii/S2214658X16300150
My top tips are following, some are not very expensive:
1. plan tracking from beginning: eg ask at baseline who locally will know where you are if you move. Collect multiple phone numbers.
2. timing of survey (if program successful in getting people in work, need to do some follow up after work hours. Also to trace adolescents who moved, we found interviewing at the parents' home during holidays when they typically return to parents--equivalent of thanksgiving--very effective)
3. Ask peers or others. Dupas collected data on girls dropping out from other girls at school (cheap and effective). We get main outcomes from parents who don't move. Cheaper and low attrition. Only follow up a subsample to check validity
On top of what Rachel Glennerster says, it is important to have a "tracking module" at the end of the baseline survey to:
1. Collect addresses of: where the person lives, where the interview was done and where the person works (if, for instance you want to track a business person).
2. Extra contact information: neighbours, relatives, etc. – their phone numbers, addresses, and also how they generally know the respondents (maybe they know a nickname and not the real full name).
3. Pictures of the respondent (very useful when tracking very mobile respondents - the picture can be shown in the market/neighborhood).
Also, two important roles to distinguish and hire in the field team:
4. Mobilizers (who go ex ante, 3 days in advance, to schedule the interview) and Trackers (who go ex post, if the respondent did not show up).
5. Collect all the tracking information per respondent on one page tracking form so that mobilizers/trackers can have access to every detail in an easy format.
Thanks for these tips - all things we regularly do in our field work. My question (and the impetus behind the bleg) was much more specific, however...
Reading Gerber and Green's chapter encouraged me to write this manuscript (https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2302735) that you may find useful. Therein I prove when and why attrition is a problem, how to diagnose it, and what to do about it -- including 2 stage sampling. One key aspect is that attrition can be:
(1) fully problematic, e.g. cannot conclude anything at all;
(2) partially problematic e.g. can at least infer impact for those units reporting outcomes; or
(3) non-problematic (e.g. can infer SATE for all units, including those missing outcome)
Reading Gerber & Green I was dissatisfied with definitions like "Attrition is not a problem if MIPO|X". To me that sounds like an accounting identity. Like saying "Cancer is not a problem if you don't have cancer".
At a minimum I wanted to (1) fully understand why cancer is a problem, and what causes it; then (2) how to diagnose it even in situations where I cannot observe cancer cells directly (the diagnostic technique in the manuscript make minimal assumptions as to the data generating process).
More generally, what I try to show in the manuscript is that missingness is best understood as a causal question, not a statistical one per se. That is, it is best discussed in terms of hypotheses about the underlying structural model generating the data. This is the sort of framework needed to *justify* MCAR, MAR, and MNAR. Indeed, at times causal estimates can be recovered even under MNAR conditions (see also the related work of Pearl et al (e.g. http://papers.nips.cc/paper/5575-graphical-models-for-recovering-probab…)).
Great - thanks. I'll download and add to my reading list - as you know, I do updates from time to time, based on the references that kind and thoughtful readers like yourself send in response to our posts...
Thanks for your post, which I read with interest. It does a nice job of summarizing the key issues in both theory and practice, and it seems as though your own foray into “double sampling” to address missingness makes for a more interesting example than the one we use in the Coppock et al. (2017) paper. If you could post a small replication dataset with an outcome variable from the first round (including a missingness code), that same outcome variable for those measured in the second round, a treatment indicator, and an indicator for the random sample of subjects with first-round missingness, I’d like to use it as a teaching example.
You make an interesting point about discarding symmetry in favor of obtaining as many non-missing responses as possible: “Once we’re agnostic about the bias and trying to provide bounds for the ITT as tightly as we can, I’d rather minimize the number of subjects missing rather than leaving sample on the table.” Ordinarily, I imagine a scenario in which a researcher clings to the hope of ignorable missingness. In that situation, the researcher is not completely agnostic, and it makes sense to undertake data collection in ways that preserve symmetry between treatment and control so as to get a point estimate that will be accorded weight proportional to the plausibility of the benign missingness assumption. If data collection were costless, one could take the best from each of our approaches – get all the data you can by any means necessary and use extreme-value bounds – while keeping track of which responses were obtained in a symmetrical fashion and which were not so that point estimates could be obtained under optimistic assumptions using the symmetrical subset.
Thanks. On the replication dataset, no problem - will put it together and post. Please let me know if it is for a course you're teaching this semester and I'll try to do it faster...
On the point you're making about ignorable missingness and a defensible point estimate vs. agnostic bounds, I have had a similar conversation with my co-blogger David McKenzie, who suggested the same distinction. I guess that a way to have your cake (point estimate) and eat it too (bounds) is to first use all symmetrically available information to track all study arms - setting aside the one key source of tracking information for the treatment arm (say, their trainers in a job training program, who have special information on their whereabouts). Then, in what we might call a "third stage", you deploy this asymmetrical tool to track the missing treatment subjects. You could then compare the point estimate (and the baseline characteristics) from the symmetrical data set to the full asymmetrical set. This is somewhat akin to those papers that use the number of calls required to reach people to trim the dataset, but here it would have been a bit more deliberate ex ante as to what information is being used to track everyone vs. a specific study arm or sub-group...
Thanks for commenting. Cheers,
I have a question. I have conducted a randomized experiment; however, the attrition rate is on average 35%. The means between control and treatment are not different. In this case, do I need to drop the attrited observations? Can I use a Heckman model to deal with the attrition?
You can check whether the characteristics of those lost to follow-up differ between treatment and control. If not, attrition bias (in the sense people doing field experiments think about it) can be presumed to be absent (no differential attrition in levels or characteristics): your T and C are still similar to each other. However, you have (a) lost a lot of power (reduction in sample size), and (b) if those lost to follow-up have different characteristics than those still in your sample, external validity is now reduced.
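As a sketch of that check (on simulated data; in practice you would run it on your own baseline covariates), compare both the attrition rates and the characteristics of attriters across arms:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated study: treatment t, a baseline covariate (age), and an
# attrition flag at 35%, unrelated to the arm by construction.
n = 1000
t = rng.integers(0, 2, n)
age = rng.normal(30, 5, n)
attrit = rng.random(n) < 0.35

# (1) Is the attrition *rate* different across arms?
rate_gap = attrit[t == 1].mean() - attrit[t == 0].mean()
print(f"attrition-rate gap (T - C): {rate_gap:.3f}")

# (2) Do the *attriters* look different across arms on baseline covariates?
age_gap = age[attrit & (t == 1)].mean() - age[attrit & (t == 0)].mean()
print(f"mean age gap among attriters (T - C): {age_gap:.2f}")
```

Small gaps on both checks are consistent with non-differential attrition, though, as the post discusses, they cannot prove that attrition is ignorable with respect to unobserved potential outcomes.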