Published on Development Impact

Designing experiments to measure spillover effects

Many programs affect those who were not directly targeted by the intervention. We know this for medical interventions (e.g. deworming: Kremer and Miguel 2004); cash transfer programs (e.g. PROGRESA: Angelucci and de Giorgi 2009); and now voter awareness programs (Giné and Mansuri 2011). While it is not hard to detect the existence of spillovers by comparing outcomes among (exogenously) untreated subjects in treatment areas to subjects in control areas in cluster randomized trials, it is harder to make statements of the following kind: “a 10 percentage point (pp) increase in the share of people treated would lead to a 3 pp improvement in the primary outcome of interest.”

I want to use the recent paper by Giné and Mansuri (2011) to discuss this issue in some detail. For the main messages of the paper, which Chris Blattman called “one of the nicest field experiments I have seen,” I urge you to read the paper yourself. There are lots of interesting things about this paper – especially if you’re interested in theories of voter turnout, and gender aspects of voter awareness campaigns – but I will ignore most of them to focus on the topic at hand. The salient facts for us are that a voter awareness campaign was randomized across geographic clusters in Pakistan and effects on female turnout and candidate choice were measured. The authors found impacts on both, but I will focus here on the turnout results.

The sampling and the randomization were done as follows. One out of every four households in each cluster was selected to be in the study sample. Then, all sampled households in treatment clusters, with the exception of every fifth household, were treated; there was no treatment in the control clusters. Notice then that the treatment intensity in treatment clusters is constant at 20% (0.25 x 0.80). The authors find (see Table 4) a 12 pp increase in female voter turnout among both treated and untreated HHs in treatment clusters, relative to a mean turnout of 52% in control clusters.

Let’s convert these impacts into absolute numbers. Suppose, for a second, that two clusters have 500 women each (for the sake of argument, I will use women instead of HHs; it makes no difference here). One of the clusters is randomly assigned to the control group and the other to treatment. In each one, the authors picked 125 women (one in four) for the study sample. In the treatment cluster, they treated 100 of these women (four out of five); in the other cluster no one received the voter awareness campaign. In the control cluster, we find that 65 women in the sample voted (52%), and if this sample is truly random, we expect to find 260 female voters in the whole cluster. In the treatment cluster, we find 12 more voters among the 100 treated women, which brings the sample count from 65 to 77, PLUS 3 more among the 25 untreated women (due to the spillover effect of the same size mentioned above), for a total of 80 female voters in the sample. As the 400 women left untreated in the cluster experience an identical effect (12 additional voters per hundred individuals, based on observing the outcomes for the 25 untreated women in the sample), we expect 60 (5x12) additional voters in the treatment cluster, for a total of 320 voters.
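The arithmetic in this example can be checked with a short sketch. The function and key names below are illustrative and not from the paper; the default inputs are the hypothetical numbers used above (500 women per cluster, one-in-four sampling, four-in-five of the sample treated, 52% control turnout, and a 12 pp effect assumed to apply equally to treated and untreated women).

```python
from fractions import Fraction


def worked_example(population=500,
                   sample_share=Fraction(1, 4),
                   treated_share_of_sample=Fraction(4, 5),
                   control_turnout=Fraction(52, 100),
                   effect=Fraction(12, 100)):
    """Back-of-the-envelope voter counts for one control and one treatment
    cluster. All defaults are the illustrative numbers from the text."""
    sample = population * sample_share
    treated = sample * treated_share_of_sample
    untreated_in_sample = sample - treated

    control_sample_voters = sample * control_turnout
    control_cluster_voters = population * control_turnout

    # Assumption: the same effect applies to treated and untreated women.
    treatment_sample_voters = (control_sample_voters
                               + treated * effect
                               + untreated_in_sample * effect)
    extra_cluster_voters = population * effect
    treatment_cluster_voters = control_cluster_voters + extra_cluster_voters

    return {
        "sample": int(sample),
        "treated": int(treated),
        "control_sample_voters": int(control_sample_voters),
        "control_cluster_voters": int(control_cluster_voters),
        "treatment_sample_voters": int(treatment_sample_voters),
        "extra_cluster_voters": int(extra_cluster_voters),
        "treatment_cluster_voters": int(treatment_cluster_voters),
        "treatment_intensity": float(treated / population),
        "extra_voters_per_10_treated": float(10 * extra_cluster_voters / treated),
    }


if __name__ == "__main__":
    for name, value in worked_example().items():
        print(f"{name}: {value}")
```

Exact fractions are used so that the counts come out as whole numbers rather than floating-point approximations.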

Notice here a couple of things. First, we detected the existence of spillovers, which is important. Sometimes, the entire aim could be showing the (non-)existence of spillovers. For example, you could have a program that reduces pregnancies among a certain targeted group, with theorized offsetting spillover effects among untreated women in treatment areas, leaving your cluster level impact at zero. Showing that no such perverse effects happened would be important in that case. Second, the authors can actually make a statement of the following kind based on the analysis above: “For every 10 people treated in treatment villages, there were 6 additional voters.” This is simply a function of the treatment impact, taking into account the intensity of the treatment and the size of the spillovers. This is also important, as it tells us that, without the spillover design (i.e. excluding every fifth HH in the sample from treatment in treatment clusters), we would have counted only the additional voters among the treated, and the cost of an additional vote garnered would have been estimated at 5 times the number reported on the basis of the calculations above.
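The factor of 5 follows directly from the hypothetical cluster of 500 women: 100 treated women yield 12 additional voters if only the treated are counted, but 60 once the untreated are included. A minimal sketch, in which the unit cost of 1.0 per woman treated is a placeholder and not a figure from the paper:

```python
def cost_per_additional_vote(cost_per_woman_treated, women_treated, additional_voters):
    """Total program cost divided by the number of additional voters."""
    return cost_per_woman_treated * women_treated / additional_voters


UNIT_COST = 1.0  # hypothetical cost of treating one woman

# Counting only the 12 additional voters among the 100 treated women.
direct_only = cost_per_additional_vote(UNIT_COST, 100, 12)

# Counting all 60 additional voters in the cluster (treated plus spillovers).
with_spillovers = cost_per_additional_vote(UNIT_COST, 100, 60)

ratio = direct_only / with_spillovers
print(f"cost ratio, direct-only vs. with spillovers: {ratio:.1f}")
```

The ratio does not depend on the unit cost chosen, only on the two voter counts.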

So we know there are spillovers and we know that they were positive. But we still don’t know how cost-effective this program really is. This is because the treatment intensity was fixed at 20%. For example, had the authors excluded 3 out of every 5 households rather than just one (yielding a treatment intensity of 10%), would they have gotten the same effect? How about 30%? We know, for example, that if this were a medical intervention with a biological spillover mechanism (take your pick of deworming pills, vaccinations, or male circumcisions), we would get threshold effects: by the time I treated a large enough percentage of my target population, I wouldn't need to treat anyone else because the marginal gains would approach zero. With voter information provided in close quarters (the study clusters seem to be densely populated), would we get a similar effect? This design does not tell us the answer.

The ideal way to answer this question would have been to randomize the intensity of the treatment in treatment clusters. This way, the control group, at zero treatment intensity, would have been on the continuum of treatment intensities that ranged, say, from 5% to 35%. The figures could have been optimized based on the context, prior knowledge of what is feasible, power calculations, etc. That way, we could get much closer to finding the exact point where the marginal gains from the additional person treated would equal the marginal cost. There may have been a good reason why the authors did not design the study this way, but I did not see it in the paper. It would be helpful to discuss.
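To make the idea concrete, here is a minimal sketch of a two-stage assignment: clusters are first randomized to an intensity arm, and then the corresponding share of each cluster's population is drawn for treatment. The arm values, the 20 clusters, and the 200 women per cluster are all hypothetical; in practice they would come from power calculations and knowledge of the context.

```python
import random


def assign_cluster_intensities(cluster_ids, intensities, seed=0):
    """Stage 1: shuffle the clusters and deal them out evenly across arms."""
    rng = random.Random(seed)
    shuffled = list(cluster_ids)
    rng.shuffle(shuffled)
    return {cid: intensities[i % len(intensities)]
            for i, cid in enumerate(shuffled)}


def draw_treated(member_ids, intensity, rng):
    """Stage 2: treat a random `intensity` share of the cluster's population."""
    members = list(member_ids)
    n_treated = round(intensity * len(members))
    return set(rng.sample(members, n_treated))


# Hypothetical arms: pure control (0%) plus intensities from 5% to 35%.
INTENSITIES = (0.0, 0.05, 0.15, 0.25, 0.35)

arms = assign_cluster_intensities(range(20), INTENSITIES, seed=42)
rng = random.Random(7)
treated_by_cluster = {cid: draw_treated(range(200), share, rng)
                      for cid, share in arms.items()}

for cid in sorted(arms):
    print(cid, arms[cid], len(treated_by_cluster[cid]))
```

Because the control clusters are simply the zero-intensity arm, the outcome can then be related to intensity across all clusters on a single continuum.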

But, fortunately for all of us, the authors had more tricks up their sleeves. First, using GPS data for each household (a la the “Worms” paper of Kremer and Miguel 2004), they confirm statistically significant spillovers for people living up to a kilometer away. Second, they can use real voting data (rather than the ink on the finger from the household visits to confirm voting) from 21 polling stations. There happen to be different shares of treated females per polling station by chance because different numbers of clusters fall within the catchment areas of these polling stations (and treatment was randomized at the cluster level). While it is hard to think of a way these shares would be endogenous (though we would have to know more about how these stations are located, how they relate to the clusters, etc.), and despite the fact that the authors use polling station controls as robustness checks, I worry about two things in relying on this variation ex-ante for identification.

First, the authors seem to have gotten lucky. The average treatment intensity at each polling station should be 20% and they could have easily gotten a small variation around this mean rather than a larger, more meaningful one to identify effects. Second, with only 21 polling stations, the probability of chance bias, i.e. of some unobserved polling station characteristics being correlated with the treatment share (or intensity), is high. (It would be useful if the authors provided a scatterplot of the share treated vs. share voted data for the 21 polling stations, so that the reader can see this almost one-to-one relationship visually.) You would ideally not want to rely on this as the identification strategy to answer this particular question. This is particularly the case in cluster RCTs, where the number of clusters is barely enough to detect moderate-sized impacts: we cannot rely on random chance variation at a coarser geographic level to give us the answers with sufficient confidence.

The good news from this suggestive polling station evidence is that the program is even more cost-effective than it would appear to be with the constant intensity spillover design: for every 10 women treated, 9 more women would vote (and these marginal voters would vote more diversely!). But there would be no changes in men’s voting share or patterns, and the authors speculate about what the reasons might be. Like I said above, this is a paper well worth your read.

In the meantime, for those of you who plan to incorporate randomized treatment intensity into your programs, a pointer/warning: worry about sampling weights. The intensity of treatment is not the share you’re treating in your sample, but the share in the population. So, if your sampling percentages across strata, clusters, etc. were not random, but rather correlated with baseline characteristics, then your treatment intensity would not be random, even if you randomly assign treatment within the sample. This could happen if you sampled fewer people from larger clusters, oversampled a particular group, etc. For example, you might be conducting a lottery within your clusters to assign treatment and you might fix the absolute number of winning tickets in each cluster. If you do this, the share treated will be a direct function of cluster size, i.e. not random.
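The lottery example is easy to see with a few made-up numbers. In the sketch below, the cluster sizes, the 50 winning tickets, and the 10% target share are all hypothetical.

```python
def share_treated_fixed_winners(cluster_sizes, winners):
    """Population share treated when every cluster gets the same number of winners."""
    return [winners / size for size in cluster_sizes]


def share_treated_fixed_share(cluster_sizes, share):
    """Population share treated when the number of winners scales with cluster size."""
    return [round(share * size) / size for size in cluster_sizes]


sizes = [250, 500, 1000]  # hypothetical cluster populations

fixed_count = share_treated_fixed_winners(sizes, winners=50)
fixed_share = share_treated_fixed_share(sizes, share=0.10)

for size, by_count, by_share in zip(sizes, fixed_count, fixed_share):
    print(f"size={size}: fixed winners -> {by_count:.2f}, fixed share -> {by_share:.2f}")
```

With a fixed number of winners, intensity falls mechanically as cluster size rises, so any outcome correlated with cluster size is confounded with intensity; fixing the share instead keeps intensity constant across cluster sizes.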

(There is also a nice caveat here that I have not seen discussed much elsewhere. The intensity of the treatment may not only determine the size of the spillover effect: it may also change the size of the treatment effect on the treated. Just like we can have spillovers of treatment on untreated, we can also have them on the treated.)


Berk Özler

Lead Economist, Development Research Group, World Bank
