Syndicate content


Weekly links July 31: Household surveys in crisis, Doing Business, machine learning, and more…

David McKenzie's picture
  • Household surveys in crisis – Bruce Meyer and co-authors in a new NBER working paper highlight the many issues due to declining cooperation of respondents in the U.S.:
    • Unit nonresponse: Households have become increasingly less likely to answer surveys at all: nonresponse rates for major U.S. surveys like the CPS, SIPP, and GSS now exceed 20 percent

Worm Wars: A Review of the Reanalysis of Miguel and Kremer’s Deworming Study

Berk Ozler's picture
This post was updated on July 24, 2015 in reponse to the increased traffic to the site from Twitter upon the publication of the replication discussed below and the authors' response in the International Journal of Epidemiology. I took a day to re-review the papers in question and not surprisingly, what I said below remains as valid as 6 months ago because all that happened is that the papers got published without much change since they first appeared on the 3ie website. However, I do have a few new thoughts (and one new table from Hamory Hicks, Kremer, and Miguel), which I discuss below. My original post remains unedited. I'd also like to thank Stefane Helleringer for a nice response he wrote about the definion of ITT in public health: see the back and forth here.

Despite the differences in various methodological and data handling choices, which I discussed below in my original post, it is clear that the interpretation of whether one believes the results of Miguel and Kremer are robust really rests on whether one splits the data or not. Therefore it is important to solely focus on this point and think about which choice is more justified and whether the issue can be dealt with another way. A good starting point is the explanation of DAHH in their pre-analysis plan as to why they decided to split the data into years and analyze it cross-sectionally rather than the difference-in-difference method in the original MK (2004):
The data from a stepped wedge trial can be thought of as a one-way cross-over, and treated as such, by comparing before and after in the cross-over schools (group 2) and accounting for the secular trend using the non-crossing schools (groups 1 and 3). However, such an approach requires assumptions about the uniformity of the trend and the ability of the model to capture the secular change, and as such loses the advantage of randomization.
This explanation seems confused to me: the common trend assumptions are something that need to be established in observational studies when we're using Diff-in-Diff as an identification strategy, but in a cluster-randomized trial like this one, we have it by design: Groups 2 & 3 are perfect counterfactuals for each other if the randomization has been done correctlty. Surely, if we look at a bunch of outcomes, we might find random differences in the changes from 1998-1999 between the two groups, but that's no reason to assume that there is something wrong with this approach or that it takes us away from the advantage of randomization. Analyzing everything cross-sectionally by year and not controlling for the lagged value of the outcome variable is costing DAHH some statistical power instead.

However, let's accept for a second DAHH's argument that there's something strange with Group 2 and we're wary of it. Them it seems to me that the solution is simple: why not look at the two clean groups that never change treatment status the whole study period of 1998-1999. In other words, exclude Group 2, pool all the data for 1998 and 1999 and compare the effects between Group 1 and Group 3. Sure, we lose power from throwing out a whole study arm, but if the results stand we're done! Thankfully, Joan Hamory Hicks was able to run this analysis and send me the table below, which is akin to their Table 3 in their original response:

As you can see, all effect sizes on school participation are about 6 percentage points (pp), which is remarkably close to the effect size of 7 pp in the original study. The p-values went up from <0.01 to <0.05, but that is fully expected having shed a third of the sample. So, even if you think that there is something strange going on with Group 2, for which the visual inspection presented by DAHH in Figure 3 is really not sufficient, you still have similarly-sized and statistically significant effects when making the cleaner comparison of Groups 1 & 3. Problem solved?

I want to conclude by making a bigger picture point about replications. They are really a really expanded version of robustness checks that are conducted for almost any paper. It's just that the incentives are different: authors want robustness and replicators might be tempted to find a hole or two to poke in the evidence and "debunk" the paper (if I had a dime yesterday for every deworming debunked tweet...). But, when that happens, I start worrying about multiple hypothesis testing. We now know and have tools for how to deal with multiple inference corrections, when the worry is Type I errors (false rejections of a correct null). But, what about Type 2 errors? After all this is exactly what a replicator would be after: finding a manner of handling the data/analysis that makes the results go away. But, how do we know whether that is a true "failure to reject" or a Type 2 error? Even in studies with 80% power, there is a 20%chance that each independent test will fail to reject under the null of a positive effect. The more of these you try, the more likely you'll come across one or two estimates that are insignificant. What to do about that?

To be fair to the authors, they were at least aware of this issue, mentioned on page 7 of the PAP:

We aim to deal with this problem by making a small number of analyses using as much of the original data as possible at each stage and concentrating initially on the direct intervention effects on the major study outcomes.

But, then this is where it would have been really important to have a very clear PAP, describing only a very few, carefully methodologically justified, analyses proposed and sticking very strictly to it. But, every step of the way when the authors decide to weight or not weight the data (cluster summaries), splitting the data by year, adjusted/unadjusted estimates, alternative treatment definitions dropping large numbers of observations, etc. there is a fork and the fork opens up more roads to Type 2 errors. We need replications of studies that are decently powered themselves, where the replicators are careful to hoard all the power that is there and not scatter it along the way.

I hope that this update has brought some clarity to the key issues that are surrounding the debate about the publication of the replication results and the accompanying flurry of articles. I was an unwitting and unwilling participant of the Twitter storm that ensued, only because many of you were responsible for repeatedly pointing out the fact that I had written the blog post below six months ago and linking to it incessantly throughout the day. I remain indebted to our readers who are a smart and thoughtful bunch...


This post follows directly from the previous one, which is my response to Brown and Wood’s (B&W) response to “How Scientific Are Scientific Replications?” It will likely be easier for you to digest what follows if you have at least read B&W’s post and my response to it. The title of this post refers to this tweet by @brettkeller, the responses to which kindly demanded that I follow through with my promise of reviewing this replication when it got published online.

Pollution, worker efficiency and the role of management: Evidence from India

Markus Goldstein's picture
In a nice, recent paper Achyuta Adhvaryu, Namrata Kala, and Anant Nyshadham take a look at how air pollution hurts productivity and what effect, if any, managers can have in mitigating these effects.   The short answer is yes, pollution hurts worker productivity, and yes, managers with certain qualities do a better job of mitigating these effects as they manage their workers.   That’s the short story, but how they get there is quite fascinating.  

From my mailbox: should I work with only a subsample of my control group if I have big take-up problems?

David McKenzie's picture
Over the past month I’ve received several versions of the same question, so thought it might be useful to post about it.
Here’s one version:
I have a question about an experiment in which we had a very big problem getting the individuals in the treatment group to take-up the treatment. Therefore we now have a treatment much smaller than the control. For efficiency reasons does it still make sense to survey all the control group, or should we take a random draw in order to have an equal number of treated and control?
And another version

Increasing Prosperity for the Poorest using Social Networks: Guest post by Kathryn Vasilaky

Social networks affect all of our lives; the people we know influence what we're exposed to and the actions we take. Gaining weight? Blame your network. Got a job promotion? Thank your network. In the developed world, the term "social networks" often illicits thoughts of Facebook, LinkedIn, Twitter, Instagram and many, many more. In the rural developing world, networks still depend, for the most part, on offline interactions. Access to information is by word of mouth, and a few central individuals often disseminate information from the top down to the remainder of a village.

Allocating Treatment and Control with Multiple Applications per Applicant and Ranked Choices

David McKenzie's picture
This came up in the context of work with Ganesh Seshan designing an evaluation for a computer training program for migrants. The training program was to be taught in one 3 hour class per week for several months. Classes were taught Sunday, Tuesday and Thursday evenings from 5-8 pm, and then there were four separate slots on Friday, the first day of the weekend. So in total there were 7 possible sessions people could potentially attend. However, most migrants would prefer to go on the weekend, and many would not be able to attend on particular days of the week.

Weekly links: preventing coups through free trade?, why don’t Indians eat more? And more…

David McKenzie's picture

Power calculations: what software should I use?

Berk Ozler's picture

In my experimental work, I almost always do cluster-randomized field experiments (CRTs – T for trials), and therefore I always used the Optimal Design software (OD for short), which is freely available and fairly easy to use with menu based dialogue boxes, graphs, etc. However, preparing some materials for a course with a couple of colleagues, I came to realize that it has some strange basic limitations. That led me to invest some time into finding out about my alternatives in Stata. I thought I’d share a couple of things I learned here.