Why I am now more cautious about using or recommending matched pair randomization and like matched quadruplets instead


In the 2009 paper I wrote with Miriam Bruhn, we examined different approaches to improving balance in randomized experiments. In our simulation results, we found pairwise matching to perform best in achieving balance in small samples, provided that the variables used in forming pairs have good predictive performance for future outcomes.

Pairwise matching was championed as a way of improving covariate balance for many variables at the same time, as opposed to stratification (sometimes called blocking), which typically only allows for a couple of variables. With pairwise matching, pairs are formed so as to minimize the Mahalanobis distance between the values of all the selected covariates within pairs, and then one unit in each pair is randomly assigned to treatment and the other to control. Imai et al. (2009) argued that “from the perspective of bias, efficiency, power, robustness or research costs, and in large or small samples, pairing should be used in cluster-randomized experiments whenever feasible; failing to do so is equivalent to discarding a considerable fraction of one’s data”. Thomas Barrios (2014) blogged about an optimal way to select these matched pairs as part of our job market blog series. More recently, Bai (2020) provides “an econometric framework in which a certain form of matched-pair design emerges as optimal among all stratified randomization procedures”.
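To make the mechanics concrete, here is a minimal Python sketch of forming pairs on Mahalanobis distance and then randomizing within each pair. It is an illustration under stated assumptions, not the procedure from any of the papers cited: the function name `greedy_mahalanobis_pairs` is hypothetical, the data are simulated, and the matching is a simple greedy rule (repeatedly pair the two closest unmatched units) rather than an optimal matching algorithm.

```python
import numpy as np

def greedy_mahalanobis_pairs(X, rng):
    """Greedily pair rows of X on Mahalanobis distance, then randomize within pairs.

    Greedy matching is an assumption made for brevity; it does not minimize
    the total within-pair distance the way an optimal matching would.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    if n % 2 != 0:
        raise ValueError("need an even number of units to form pairs")
    # Inverse covariance matrix of the covariates (pseudo-inverse in case of collinearity).
    cov_inv = np.linalg.pinv(np.atleast_2d(np.cov(X, rowvar=False)))
    diff = X[:, None, :] - X[None, :, :]
    # Squared Mahalanobis distance between every pair of units.
    d2 = np.einsum("ijk,kl,ijl->ij", diff, cov_inv, diff)
    np.fill_diagonal(d2, np.inf)  # a unit cannot be matched to itself

    unmatched = set(range(n))
    pairs = []
    while unmatched:
        idx = sorted(unmatched)
        sub = d2[np.ix_(idx, idx)]
        a, b = np.unravel_index(np.argmin(sub), sub.shape)
        i, j = idx[a], idx[b]
        pairs.append((i, j))
        unmatched -= {i, j}

    # Within each pair, one unit is randomly treated and the other is control.
    treat = np.zeros(n, dtype=int)
    for i, j in pairs:
        treat[i if rng.random() < 0.5 else j] = 1
    return pairs, treat

# Illustrative data: 20 units with 3 simulated baseline covariates.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
pairs, treat = greedy_mahalanobis_pairs(X, rng)
```

By construction, every pair contains exactly one treated and one control unit, so treatment is assigned to exactly half the sample.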

So given all these benefits, why have I not used matched pairs for a while in my own studies, and why am I cautious about recommending it to others? Since I had a friend write to me with the request “I have a student who was considering doing pairwise randomization. I feel like we've had a previous discussion where your opinion was that this was not wise and can lead to problems.”, I thought I’d set out my current thinking on reasons to be cautious, even though I haven’t had time to work completely through a couple of the issues as noted below.

Reason 1: The Imbens argument on variance estimation

In his 3ie lectures in 2011, and in Chapter 10 of his 2015 textbook with Rubin, Guido Imbens notes that the obvious estimator for the average treatment effect in a pair is the difference in means between the treatment and control unit in that pair. But he notes this creates a complication for the estimation of the sampling variance of this effect, which requires at least two treated and two control units to calculate. A solution is to assume constant treatment effects and then take the sample variance of the pair-level treatment effects. But this can be conservative with treatment effect heterogeneity, negating some of the gains in power that come from pairing.

This variance issue has not been a strong reason for me to avoid matched pairs. Imbens notes that randomization-inference-based approaches will still work fine here (provided you restrict the set of permutations appropriately), and in a regression-based approach to experimental analysis, our simulations in Bruhn and McKenzie showed that with individual-level randomization and matched pairs, you could get the correct size and strong power, provided you include pair dummies in the regression. Moreover, recent work on inference in matched pairs experiments by Bai et al. (2021) shows that while a standard t-test can be conservative, an adjusted t-test can work well, where an adjustment is made to the variance estimator by averaging across “pairs of pairs” the product of outcomes corresponding to a treated and untreated observation in adjacent pairs. However, the use of these “pair of pair” tests depends in part on assumptions on the sampling framework – Bai assumes units are drawn from a superpopulation and potential outcomes and covariates are random. In a finite sample setting, these pair of pair tests may not have correct size. A new paper by de Chaisemartin and Ramirez-Cuellar (discussed below) discusses this in more detail, along with some simulation evidence. In feedback to me on this post, Yuehao Bai notes “We also find that as long as we don't pair on too many variables, e.g., when pairing only on the baseline outcome, the test performs very well across the simulation studies in our JASA paper and in my job market paper. However, I have found in simulation studies with small samples and where I pair on many covariates, the tests based on the variance estimator in Bai, Romano, and Shaikh (2021) may not perform as expected, and in such cases there certainly may be a benefit to using matched quadruplets.”
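As a rough illustration of the two simplest options mentioned above, the following Python sketch computes the pair-difference estimate, the standard error based on the sample variance of the pair-level differences, and a randomization-inference p-value that only permutes treatment within pairs (which amounts to flipping the sign of each pair's difference). The function name `paired_analysis` and the outcome values are made up for illustration, the p-value is for the sharp null of no effect for any unit, and this is not the adjusted estimator of Bai, Romano, and Shaikh.

```python
import random
import statistics

def paired_analysis(y_treated, y_control, n_draws=5000, seed=0):
    """Pair-difference estimate, conservative SE, and within-pair randomization p-value."""
    diffs = [t - c for t, c in zip(y_treated, y_control)]
    n = len(diffs)
    ate = statistics.fmean(diffs)
    # Sample variance of pair-level differences; can be conservative when
    # treatment effects are heterogeneous.
    se = (statistics.variance(diffs) / n) ** 0.5

    # Restricted permutations: swapping treatment within a pair flips the
    # sign of that pair's difference and leaves the other pairs untouched.
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_draws):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(statistics.fmean(flipped)) >= abs(ate) - 1e-12:
            extreme += 1
    p_value = (extreme + 1) / (n_draws + 1)  # add-one correction for Monte Carlo draws
    return ate, se, p_value

# Hypothetical outcomes for six pairs (treated unit, control unit).
y_t = [5.0, 7.0, 6.0, 9.0, 8.0, 7.5]
y_c = [4.0, 6.5, 6.0, 7.0, 7.0, 7.0]
ate, se, p_value = paired_analysis(y_t, y_c)
```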

Reason 2: optimality results depend on there being a single outcome, and gains in balance can be largely attained by matched quadruplets or stratification.

The optimality results of Barrios and Bai take as a starting point that there is a single outcome of interest, and given this, then matching on covariates which best predict that outcome of interest will be optimal. However, in practice there are usually several outcomes we care about, and different users of the data, different policymakers etc. may have different weights on how much they care about these different outcomes.  Likewise, we might care about the outcome at different time horizons – so not just employment after three months, but also employment and wages after two years, job quality, etc. By stratifying on a couple of variables, and then forming matched quadruplets based on other variables, one can still improve balance and get most of the power gains across multiple outcomes of interest, without having to take an explicit stand on there being a single outcome (or well-defined weighted average of several outcomes that collapses to a single index).

Indeed, Bai conducts simulations for a jobs experiment where he considers a secondary outcome (number of applications) after matching pairs on the baseline value of the primary outcome (search hours), and finds that the mean-squared error from matching is almost the same as if no stratification was done at all – when there are multiple outcomes and they are not strong predictors of one another, then optimal balance on one outcome may not help in getting balance for looking at other outcomes. When there are multiple outcomes or covariates of interest, and samples are small, Bai notes there is a curse of dimensionality, where the units matched with one another are not close enough in terms of their covariates, and so his asymptotic results may not hold. This problem is mitigated by using quadruplets, formed as pairs of matched pairs.

Reason 3: the main reason for being cautious about matched pairs is the likelihood of attrition

King et al. (2007) argue that a potential advantage of pairwise matched clusters is that they are more “politically robust”, in that if “we lose a cluster for a reason related to one or more of the variables we matched on, such as low-income areas or clusters within cities, then no bias would be induced for the remaining clusters” – the pair unit can just be dropped, and the set of remaining pairs will still be as balanced as the original dataset, whereas in a pure randomized experiment, if even one unit drops out, it is no longer guaranteed that the treatment and control groups are balanced on average.

However, there are two problems that can arise with attrition. The first is that if units drop out at random, then the matched pair design will throw out the paired unit as well as the attriting unit, leading to a reduction in sample size and a potential loss in power relative to an unmatched randomization. For example, suppose you have 100 matched pairs, and in 10 of them both units attrit, in 20 one unit attrits, and in 70 pairs no unit attrits. Then attrition is 20% (40/200). But dropping all pairs with at least one unit attriting means dropping 30% of the units (60/200), a big increase in effective attrition.
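The arithmetic in this example can be checked with a few lines of Python. The helper name `attrition_rates` is hypothetical; it simply counts units lost to attrition versus units lost once every incomplete pair is dropped.

```python
def attrition_rates(pairs_both, pairs_one, pairs_none):
    """Return (raw attrition rate, effective rate after dropping incomplete pairs)."""
    total_units = 2 * (pairs_both + pairs_one + pairs_none)
    # Units that actually attrit: two per "both" pair, one per "one" pair.
    units_attrited = 2 * pairs_both + pairs_one
    # Units lost when every pair with at least one attritor is dropped.
    units_dropped = 2 * (pairs_both + pairs_one)
    return units_attrited / total_units, units_dropped / total_units

raw, effective = attrition_rates(pairs_both=10, pairs_one=20, pairs_none=70)
print(raw, effective)  # 0.2 0.3
```

The gap between the two rates comes entirely from the pairs in which only one unit attrits, so it widens when attrition is spread thinly across many pairs.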

The second issue is that the King et al. argument only works if attrition is random or for reasons perfectly predicted by the variables you have matched on. For example, suppose that there are heterogeneous treatment effects, and control units who would have had large treatment effects get disappointed with being assigned to the control, and drop out. If these treatment effects are not perfectly predicted by the variables matched on, then dropping the pair of the control unit will show balance on observables for the remaining sample, but won’t deal with the endogenous attrition and lack of balance on unobservables in the non-attritors.

Since in most cases I tend to think that attrition is either as good as random (people dropping out for all sorts of reasons in their lives unrelated to treatment) or else systematically related to treatment effects but difficult to predict, I think we are rarely in the “goldilocks zone” where the attrition is non-random, but can be completely predicted by the variables we match on. As a result, I see the downside of losing observations by dropping the unit in a missing pair as being more important than the potential gains to having formed pairs – especially when the alternative is not pure randomization, but stratification and matched quadruplets for example. It is much less common to have both treatment units or both control units attrit in a quadruplet, and if high attrition is expected, this would be an argument to use even bigger strata.

Note that if attrition is completely at random, then there are methods that can avoid dropping the other observation in a pair. For example, one could not include matched pair dummies in the regression, effectively just comparing the mean for all non-attriting treated units to that of all non-attriting control units. But this then requires us to use the conservative variance estimator, leading to larger standard errors and more difficulty detecting treatment effects.

A final point to note on this is that attrition within a matched pair is much more likely to be a problem for individual-level randomizations than cluster-level randomizations – it is much rarer to have a whole school or whole village attrit than to have individual units attrit.

Reason 4: with clustered paired experiments, another issue to consider

A new paper by Clément de Chaisemartin and Jaime Ramirez-Cuellar raises a further consideration when it comes to cluster-level randomization with matched pair experiments. For example, consider an experiment which randomizes schools or villages, in which outcomes are measured for children within the schools, or households within the village. In a matched pair design, one would match pairs of schools, or pairs of villages. The standard regression approach would then be to regress the outcome (e.g. test scores for children) on an indicator for treatment and a set of matched pair fixed effects, and then cluster at the school level. However, they show this variance estimator can be severely downward biased, and that one should instead cluster at the pair level. (This issue does not apply to the case of individual-level randomization, since the degrees of freedom adjustment effectively equates to halving the number of units.)

You might ask (as I did) how to reconcile this with the Abadie et al. paper on when to cluster standard errors, which I thought had finally resolved the issue of when to cluster with the answer of cluster at the level of randomization. The issue here is that in a matched pair experiment, assignment to treatment and control within a pair is perfectly negatively correlated, since once you have selected one unit within the matched pair to be treated, the other unit’s treatment assignment automatically becomes control – and so you are effectively only doing N/2 random assignments, where N is the number of clusters, and N/2 the number of matched pairs. With individual-level randomization, the degrees of freedom adjustment takes care of this, but with clustered randomization with 10 or more units per cluster, the degrees of freedom adjustment is not sufficient, and one needs to cluster at the pair level. They are currently revising their paper, which will also provide recommendations for applied researchers on what to do with clustered randomization when you have only a few units per cluster.

There are also some open questions about how to best do inference in paired clustered experiments when the number of pairs is small (e.g. under 40 pairs) – related to some of the more standard issues one faces with clustered errors with few clusters. Randomization inference approaches that take account of the paired structure may be the most reliable approach in those cases.

The de Chaisemartin and Ramirez-Cuellar paper also notes that this issue of the correlation of the treatment status of units within a pair extends to the case of clustered stratified random assignment, if there are a small number of units within each stratum. For example, they note that with 5 units per stratum, a nominal 5% t-test rejects 8% of the time, whereas with 10 or more units per stratum, this becomes less of an issue. So matched quadruplets with clustered experiments are not immune to this issue.

I haven’t run simulations yet to see how statistical power varies in practice with the need to cluster at the pair level, but if you are planning matched pair clustered experiments, you should make sure to account for this pair clustering in your power simulations.

Bottom line

There are many advantages of stratifying on a few key covariates to improve baseline balance and improve power. But in many cases, these benefits seem like they can be achieved by using stratification or matched quadruplets, and some additional issues appear to arise when we go to the level of matched pairs. Unless you have an experiment with an extremely small number of units, it seems unlikely that the gains from going to the pair level will be worth it, and my default is therefore to typically use quite a few strata, or matched quadruplets. However, most of my experiments involve individual-level randomization, and the de Chaisemartin and Ramirez-Cuellar paper requires thinking through these issues further in terms of how much power is gained from matched pairs or small strata in clustered experiments once one properly adjusts the standard errors.

Here is my current recommended approach:

Step 1: Decide if there are one or two key variables that you want to stratify on. This could be for reasons other than maximizing balance on the main outcome of interest, such as thinking that there could be treatment heterogeneity by gender or geographic location. Use this to form a relatively small number of strata.

Step 2a) If there are only a couple of main outcomes of interest, and they are in the same family (e.g. profits and sales for a firm experiment, or math test score and English test score in an education experiment), then form an aggregate index of this outcome and sort units according to this index. Then form quadruplets within each stratum. For example, in an ongoing experiment on improving exporting in Colombia, we stratified by firm size and whether firms were exporting at baseline, and then within each of these strata, formed matched quadruplets according to an index of export management practices.

Step 2b) If there are many different outcomes of interest in multiple domains, or if the outcome of interest is not available, or if it is the same for everyone (e.g. everyone in a training program may be unemployed at the start of the study), then construct the Mahalanobis distance between units on a set of covariates that you think are likely to predict the main outcomes well. Then form matched quadruplets. One way to do this is to form matched pairs, and then match pairs of pairs on the basis of the median of the covariates of each pair.

Step 2c) Often we may have more than one treatment. Then I form matched triplets, matched quadruplets, matched quintuplets, etc. as in 2a), and then assign one unit within each triplet or quadruplet etc. to each treatment. For example, in a recent experiment on irregular migration in the Gambia, we had 4 treatment groups – we stratified by geography, and within each region, formed quadruplets of villages based on an index of migration intentions and migration experience, with one village within each quadruplet allocated to each treatment.
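Steps 1, 2a and 2c can be sketched in a few lines of Python. This is an illustrative sketch rather than the code used in the experiments mentioned: the function name `assign_within_groups`, the dictionary keys `id`, `stratum` and `index`, and the example data are all assumptions. Units are sorted on the baseline index within each stratum, cut into consecutive groups of a fixed size, and the arms are shuffled within each group.

```python
import random
from collections import defaultdict

def assign_within_groups(units, arms, group_size, seed=0):
    """Stratify, sort on a baseline index, form groups, and randomize within groups.

    units: list of dicts with (hypothetical) keys 'id', 'stratum', 'index'.
    arms: list of arm labels; group_size must be a multiple of len(arms).
    Returns {unit id: (stratum, group number within stratum, arm)}.
    """
    if group_size % len(arms) != 0:
        raise ValueError("group_size must be a multiple of the number of arms")
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for u in units:
        by_stratum[u["stratum"]].append(u)

    assignment = {}
    for stratum in sorted(by_stratum):
        # Sort on the index so that neighbours in the ordering are similar units.
        members = sorted(by_stratum[stratum], key=lambda u: u["index"])
        for start in range(0, len(members), group_size):
            group = members[start:start + group_size]
            labels = arms * (group_size // len(arms))
            rng.shuffle(labels)
            # A leftover group smaller than group_size gets a random subset of
            # labels and so may not be balanced across arms.
            for u, arm in zip(group, labels):
                assignment[u["id"]] = (stratum, start // group_size, arm)
    return assignment

# Two strata of 8 units each, with a made-up baseline index.
data_rng = random.Random(1)
units = [{"id": i, "stratum": "small" if i < 8 else "large", "index": data_rng.random()}
         for i in range(16)]

# Matched quadruplets with two treated and two control units each (Step 2a).
quads = assign_within_groups(units, arms=["T", "C"], group_size=4)
# Four arms, one unit per arm within each quadruplet (Step 2c).
multi = assign_within_groups(units, arms=["A", "B", "C", "D"], group_size=4)
```

For Step 2b, the sorting step would be replaced by matching on Mahalanobis distance, which this sketch does not attempt.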

Note: thanks to both Yuehao Bai and Clément de Chaisemartin for interesting discussions around this blogpost that helped clear up several issues I was confused about. Both authors are currently revising their papers, so I encourage you to check back with their webpages in a few weeks or so and get more details on these issues.


David McKenzie

Lead Economist, Development Research Group, World Bank

Join the Conversation

Jason Kerwin
April 19, 2022

Fantastic post, this is very helpful. I wasn’t totally clear on what the endogenous attrition problem could be if we drop entire strata in response to units attriting. Shouldn’t we still have a valid experiment within the remaining strata? An obvious issue is that treatment effects may vary so the TOT estimate may not match the population ATE, but that doesn’t seem to be what you’re saying.

April 19, 2022

Hi Jason. I didn't want to write equations in the post, but here is a toy example. Suppose our matched pairs consist of a boy and a girl, but we don't observe gender. Our pairs of (T, C) are then either (B, G) or (G, B). Our full sample has equal proportions of boys and girls in treatment and control and is balanced. But suppose girls attrit if they get put in control. Then we drop all (B,G) pairs, and are only left with (G, B) pairs - so the remaining pairs now have gender perfectly correlated with treatment. If gender also affects the outcome, then we no longer have a valid experiment. We would still be in trouble with a non-matched design, but would at least observe boys in both treatment and control statuses and could use bounds to deal with G attrition.
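The toy example can also be written as a short simulation. The sketch below is illustrative only, with a hypothetical function name and made-up parameter values: every pair contains one boy and one girl, girls attrit whenever they land in control, and pairs with an attritor are dropped. The true treatment effect is set to zero, yet the estimate from the surviving pairs equals the gender effect.

```python
import random

def simulate_gender_attrition(n_pairs=1000, gender_effect=1.0,
                              treatment_effect=0.0, seed=0):
    """Estimate the effect using only pairs that survive girls-in-control attrition."""
    rng = random.Random(seed)
    kept_diffs = []
    for _ in range(n_pairs):
        treated_is_girl = rng.random() < 0.5
        if not treated_is_girl:
            # (B, G) pair: the girl is in control, attrits, and the pair is dropped.
            continue
        # (G, B) pair: a treated girl is compared with a control boy.
        y_treated = gender_effect + treatment_effect
        y_control = 0.0
        kept_diffs.append(y_treated - y_control)
    estimate = sum(kept_diffs) / len(kept_diffs)
    return len(kept_diffs), estimate

n_kept, estimate = simulate_gender_attrition()
print(estimate)  # 1.0
```

Roughly half the pairs survive, and within them gender is perfectly correlated with treatment, so the pair-difference estimate recovers the gender effect rather than the treatment effect.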