Paul Gertler recently asked on Twitter “Does anyone have thoughts on pooling data from identical experiments from multiple countries in the case in which sample sizes are under-powered in any given one?”. This is a situation I’ve experienced a few times, and seen other teams/projects discussing, so I thought it might be worth a post. I think that multi-country experiments are also becoming much more common as more experiments move online with the pandemic – I currently have one 6-country experiment in progress, and another 2-country one, for example.
An obvious question is whether pooling data from the same treatment in multiple countries is any different from pooling data from the same treatment in multiple villages, cities, or regions within a country, which is extremely commonplace. Thinking about what makes it the same and what makes it different is useful for understanding when and how you might want to pool data.
Let me start by giving a couple of examples of multi-country experiments that we can use to make some of the issues in the discussion concrete. A first example is a five-country experiment in the Western Balkans (ungated version) that I conducted with Ana Cusolito and Ernest Dautovic. Here we wanted to test an investment readiness program, which was open to firms from Croatia, Kosovo, North Macedonia, Montenegro, and Serbia. The program was launched at the same time in all five countries, with centralized applications and 346 firms chosen. The number of firms ranged from 159 in Serbia down to only 4 in Montenegro. The program was then run across the five countries by the same implementing team, and treatment involved a mix of online training, individual mentoring, and in-person masterclasses that rotated around the different locations. A second example comes from Vega-Redondo et al. (2019), who ran an RCT with almost 5,000 entrepreneurs from 49 African countries. The whole sample took an online business course, and treatment consisted of creating peer groups to interact with one another – either from the same country, or across countries. In both cases, the experiment was designed as a multi-country experiment in the first place, with the same treatment implemented by the same organization in different countries. I will contrast these with the case of pooling together estimates from different microfinance experiments implemented by different organizations in several countries, as in the seven experiments that Rachael Meager studied.
1. What is the estimand we care about/what decision do we want to take?
Lihua Lei pointed to this useful paper by Miratrix et al. (2020) on estimating effects from multi-site individually randomized trials, which notes that there are several effects that could be of potential interest. Two key ones are:
· The effect for the average individual in the sample: for example, what is the impact of the investment readiness program for the 346 entrepreneurs that took part in the Western Balkans experiment? This was our object of interest in our study, since we wanted to test whether such a program could help the types of individuals that would apply for such a program in the region. Often this is also the object of interest in multi-site studies within a single country.
· The effect for the average site/country in the sample: for example, what is the impact of the investment readiness program for the average country among the five in the Western Balkans? This was not our object of interest, but it could be in other settings. Likewise, within a country, we might want to know what the effect of a program is in the typical village it is offered in.
If the treatment effects are homogeneous, then the average treatment effect will be the same in the two settings. If treatment effects are heterogeneous, then these two estimands could differ, and different weighting schemes will lead to different estimates. For example, for the average effect across individuals, my treatment effect will mostly capture the impact on entrepreneurs in Serbia and Croatia, since this is where most of the sample comes from. In contrast, if I am interested in the effect for the average country, I could construct an estimate separately for each country, and then average those, which would end up putting relatively more weight on observations from North Macedonia, Kosovo, and Montenegro. An issue here is that if sample sizes are small for some of the countries (which was the whole reason for combining countries in the first place), then the country-specific estimates one gets will be extremely noisy (e.g. I would try to estimate an impact off of only 4 observations in Montenegro, or only 31 observations in North Macedonia, and then average these). A further issue to note is that if the proportion allocated to treatment varies across countries, then the regression estimator with country fixed effects may not deliver an unbiased estimate of either the person-weighted or site-weighted average, trading off some bias for added precision (see Miratrix et al.).
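To make the weighting point concrete, here is a minimal sketch of how the same site-level effects produce different answers under different weights. Everything in it is an assumption for illustration: the site labels, the treated shares, and the effect sizes are invented, and the sample sizes only mimic the imbalance described above (one site with 159 units, one with 4). The fixed-effects line uses the standard approximation that a regression with site fixed effects weights each site by n × p × (1 − p), where p is the share treated in that site; it is not the estimator from our paper.

```python
from typing import Callable, List, NamedTuple


class Site(NamedTuple):
    name: str        # hypothetical site label
    n: int           # number of units in the site
    p_treat: float   # share of units assigned to treatment (assumed)
    effect: float    # site-specific treatment effect (invented)


# Invented inputs: sizes mimic a very unbalanced five-site sample.
SITES = [
    Site("A", 159, 0.5, 0.10),
    Site("B", 100, 0.5, 0.20),
    Site("C", 52, 0.5, 0.00),
    Site("D", 31, 0.5, 0.40),
    Site("E", 4, 0.5, 1.00),
]


def weighted_average(sites: List[Site], weight: Callable[[Site], float]) -> float:
    """Average the site effects using an arbitrary site-level weight."""
    total = sum(weight(s) for s in sites)
    return sum(weight(s) * s.effect for s in sites) / total


def person_weighted(sites: List[Site]) -> float:
    """Effect for the average individual: weight sites by sample size."""
    return weighted_average(sites, lambda s: s.n)


def site_weighted(sites: List[Site]) -> float:
    """Effect for the average site: every site counts equally."""
    return weighted_average(sites, lambda s: 1.0)


def fixed_effects_weighted(sites: List[Site]) -> float:
    """Approximate target of a site fixed-effects regression: n * p * (1 - p)."""
    return weighted_average(sites, lambda s: s.n * s.p_treat * (1 - s.p_treat))


print("person-weighted:", round(person_weighted(SITES), 3))
print("site-weighted:  ", round(site_weighted(SITES), 3))
print("fixed effects:  ", round(fixed_effects_weighted(SITES), 3))
```

With equal treated shares in every site, the fixed-effects weights are proportional to sample size, so that line matches the person-weighted average. Once the treated share differs across sites, the fixed-effects average drifts toward the sites with shares closest to one half, which is the source of the bias noted above.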
In addition to being interested in the average treatment effect in the sample, we might also consider as our estimand the average treatment effect in some broader population. This is useful if our question of interest is not “did the treatment work for the sample we had” but instead a question like “what should we expect if we expanded the treatment to more individuals” or “should we expand to another country” (one that is similar to the ones our experiment was done in). Changing to one of these population estimands affects how much uncertainty there is in our estimate, and hence the standard errors. This is the concept used in Meager’s work on microfinance, where she does not just want to say “what is the impact of microfinance for the sample in these experiments” but rather uses these estimates in a Bayesian hierarchical model that partially pools locations to understand what we should infer about average effects in a broader population. A further issue with the population perspective is that it seems rare that an implementor would randomly choose which countries to implement a program in from a broader pool of countries, and so the “super population” that is being considered is not easily defined in most empirical contexts.
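For readers who have not seen partial pooling before, the intuition can be sketched in a few lines. This is not Meager’s model, which is a full Bayesian hierarchical model; it is a simplified normal-normal shrinkage calculation in which the between-site standard deviation `tau` is simply assumed to be known. The function name `partial_pool` and all input numbers are hypothetical.

```python
from typing import List, Tuple


def partial_pool(
    estimates: List[float], std_errors: List[float], tau: float
) -> Tuple[float, List[float]]:
    """Shrink noisy site estimates toward a precision-weighted overall mean.

    estimates  : site-specific treatment effect estimates
    std_errors : their standard errors (must be positive)
    tau        : assumed between-site standard deviation of true effects;
                 tau = 0 gives complete pooling, large tau gives almost none
    """
    # Overall mean: each site weighted by the inverse of its total variance.
    weights = [1.0 / (se**2 + tau**2) for se in std_errors]
    mu = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)

    shrunk = []
    for y, se in zip(estimates, std_errors):
        # Share of weight kept on the site's own estimate: high when the
        # site is precisely estimated relative to the between-site spread.
        own_weight = tau**2 / (tau**2 + se**2)
        shrunk.append(own_weight * y + (1.0 - own_weight) * mu)
    return mu, shrunk


# Invented example: one precisely estimated site and one very noisy site.
mu, shrunk = partial_pool(estimates=[0.1, 0.5], std_errors=[0.05, 0.5], tau=0.1)
print("overall mean:", round(mu, 3))
print("shrunk site estimates:", [round(s, 3) for s in shrunk])
```

The noisy site is pulled most of the way toward the overall mean, while the precisely estimated site barely moves. In a full hierarchical model `tau` would itself be estimated, with its uncertainty carried through to the population-level effect.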
2. Does it make sense to talk about “the treatment”?
Even when we do a single experiment at the same time within a country, there are likely to be differences in how it is implemented in different places. E.g. different teachers might implement an education intervention slightly differently in different schools, or take-up rates might differ between a large city and a smaller one, or a training program may use different providers in different areas. But still, since we are rarely interested in the specific impact of a specific teacher or trainer, we are usually content to average across these and think it makes sense to still discuss it as a single treatment.
When it comes to thinking about doing an experiment in multiple countries, these issues could be compounded. In online treatments, where everyone in different countries is getting exactly the same treatment, it is not an issue. In something like our Western Balkans project, where the same intervention team was implementing in different countries, the treatment is still basically the same across countries. But this differs from the microcredit example, where different providers are selecting participants in different ways and offering products with different features, and so there is a lot more heterogeneity in what the treatment is, and thus it is less clear what we are getting from the average – or the treatment becomes a vaguer concept (e.g. the average impact of “microcredit of some form” rather than of “a 2-year group-based loan offered to poor female borrowers at an 18% interest rate with no collateral requirements and subject to BRAC’s loan enforcement procedures”).
Of course, even if the treatment is the same, its impact may differ in different places because of differences in the characteristics of the individuals participating, or because of differences in the markets, institutions, and other context that affect how successful the intervention is. For example, a job training program may have different impacts in countries with high unemployment and reliance on informal referrals than in countries with tight labor markets where employers typically hire based on CVs and job platforms. If sample sizes are small, we might not have enough power to separately estimate impacts in each of these different settings, and there can be debate about how interesting the average effect is if impacts vary so much from one place to another. My feeling is that this is still one of the potential advantages of doing the experiment in multiple places, since the treatment estimate is much less likely to reflect special circumstances in just one location, and so external validity will be greater.
My takeaways
My view is that when the same experiment is being run in multiple countries, it can make sense to pool the data to boost power. This is best if specified ex ante, rather than being an ex post decision, and you need to be clear what estimand you are interested in, and to make sure this is clearly described in your write-up. What is clear is that there are a number of plausible estimands and estimators that could be used, and it is not that one is always right and the others are always wrong, but just that they help answer slightly different questions. For my work, in most cases I think the finite sample approach rather than the super population approach makes more sense for countries, and that weighting each individual equally, rather than weighting estimates from each country equally, will be of most interest. I’m glad that this concurs with the concluding advice of Miratrix et al., who write “It is difficult to go wrong starting a presentation of findings with the finite-sample person estimand. Even if the ultimate goal is understanding a program’s effects in a super population, it is difficult to imagine situations where it is of no interest to know the extent that a program was effective for the study participants at the study sites.”