## Tools of the trade: when to use those sample weights

In numerous discussions with colleagues I am struck by the varied views and confusion around whether to use sample weights in regression analysis (a confusion that I share at times). A recent working paper [1] by Gary Solon, Steven Haider, and Jeffrey Wooldridge aims at the heart of this topic. It is short and comprehensive, and I recommend it to all practitioners confronted by this question. Gary taught me graduate micro-econometrics, and I am happy to say that his writing mirrors the clarity of his lectures.

As is true for much in life and in research, there is no simple prescriptive rule for the use of sample weights. The authors’ main point is that we need to clearly understand why we want to weight (hence their paper title “What are we weighting for?”). The answer to this question, along with some exploratory diagnostics, should determine the best empirical strategy.

Allow me to quickly review why we calculate and use sampling weights in the first place. Quite often we work with surveys that sample different segments of the population with different probabilities. Surveys are designed this way to obtain more precise information on smaller subgroups in the population. Sampling weights (the inverse of each observation’s probability of selection) allow us to reconfigure the sample as if it were a simple random draw from the total population, and hence yield accurate population estimates for the main parameters of interest. This could be something like the population poverty rate, or the average number of children born to mothers of a given age.
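
To make the reweighting concrete, here is a minimal sketch with made-up numbers: a hypothetical population of 800 urban and 200 rural households, from which the survey draws 100 households per stratum. The strata, counts, and poverty figures are all assumptions for illustration.

```python
# Hypothetical population: 800 urban and 200 rural households.
# The survey oversamples rural households: 100 are drawn from each stratum.
strata = {
    "urban": {"population": 800, "sampled": 100, "poor_in_sample": 10},
    "rural": {"population": 200, "sampled": 100, "poor_in_sample": 50},
}


def poverty_rates(strata):
    """Return (unweighted, weighted) sample poverty rates."""
    poor = sum(s["poor_in_sample"] for s in strata.values())
    n = sum(s["sampled"] for s in strata.values())
    unweighted = poor / n

    weighted_poor = 0.0
    weighted_n = 0.0
    for s in strata.values():
        # Sampling weight = inverse probability of selection.
        weight = s["population"] / s["sampled"]
        weighted_poor += weight * s["poor_in_sample"]
        weighted_n += weight * s["sampled"]
    return unweighted, weighted_poor / weighted_n


unweighted, weighted = poverty_rates(strata)
print(f"unweighted: {unweighted:.2f}, weighted: {weighted:.2f}")
# prints "unweighted: 0.30, weighted: 0.18"
```

The unweighted rate overstates poverty because the poorer rural stratum is oversampled; the weighted rate recovers the 18 percent implied by the population shares.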

If we wish to use our sample to calculate a descriptive statistic that accurately measures the true value in the population, then we need to weight. After all, this is the original purpose of sampling weights – to reverse the distortion imposed by the differential sampling probabilities. However, most analysis, and virtually all analysis related to impact evaluation, is concerned not with the accurate measurement of population parameters but with the estimation of causal effects.

Yet if our goal is to estimate causal effects then the question of whether to use weights is not a straightforward one – there are situations that call for the use of weights and situations that don’t. The authors review three motivations for the possible use of weights.

__To correct for heteroskedasticity__. All of us have likely been taught that weighted least squares (WLS) will correct for heteroskedasticity in the error terms and thus improve efficiency. In practice this is a common strategy in policy evaluations that exploit spatio-temporal variation in the introduction of new laws or regulations, such as the numerous difference-in-differences evaluations of policy change in the US that leverage cross-state variation in the timing of legislative change (here is one example [2] that looks at divorce laws).

However, there is no automatic gain in efficiency from using weights. The authors summarize an example from a paper by Lee and Solon [3] where weighting actually *reduces* the efficiency of the estimates – the exact opposite of the intended effect. It turns out that weighting can reduce precision when the individual-level error terms are clustered within a group (such as a state), so that each error has a component shared by everyone in the group. If this group-level component is relatively large (as may often be the case), then the error in the group average is close to homoskedastic regardless of group size, and weighting will actually impose heteroskedasticity and unnecessarily increase the standard errors (this consequence was first pointed out by Dickens [4]).
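
A small calculation illustrates the mechanism. The sketch below assumes group-average data in which the error for a group of size n has variance `var_group + var_indiv / n`, and uses the textbook formula for the exact sampling variance of a weighted slope estimate with fixed regressors. The group sizes, the regressor values, and the variance components are hypothetical; this is not the Lee and Solon example itself.

```python
def slope_variance(x, weights, error_var):
    """Exact sampling variance of the weighted least squares slope.

    Assumes fixed regressors and independent errors with known variances.
    """
    total_w = sum(weights)
    x_bar = sum(w * xi for w, xi in zip(weights, x)) / total_w
    dev = [xi - x_bar for xi in x]
    denom = sum(w * d * d for w, d in zip(weights, dev))
    numer = sum((w * d) ** 2 * v for w, d, v in zip(weights, dev, error_var))
    return numer / denom ** 2


# Hypothetical state-level data: 12 groups whose sizes differ widely.
sizes = [10 * 2 ** k for k in range(12)]  # 10, 20, ..., 20480
x = [(3 * k) % 5 for k in range(12)]      # arbitrary policy variable


def compare(var_group, var_indiv):
    """Return (OLS, WLS) slope variances for given variance components."""
    # Group-mean error variance: shared component + averaged individual noise.
    error_var = [var_group + var_indiv / n for n in sizes]
    ols = slope_variance(x, [1.0] * len(sizes), error_var)
    wls = slope_variance(x, [float(n) for n in sizes], error_var)
    return ols, wls


ols_no_group, wls_no_group = compare(var_group=0.0, var_indiv=1.0)
ols_group, wls_group = compare(var_group=1.0, var_indiv=1.0)
print(f"no group effect:    OLS {ols_no_group:.5f}  WLS {wls_no_group:.5f}")
print(f"large group effect: OLS {ols_group:.5f}  WLS {wls_group:.5f}")
```

With no group-level component, weighting by group size gives the smaller variance, as the textbook argument promises. Once the group-level component dominates, the ranking flips and the weighted estimate is the noisier one.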

So what can a practitioner do? The authors recommend (a) testing for heteroskedasticity (such as with the Breusch-Pagan test [5]) rather than simply assuming it exists, (b) continuing to do what many of us already do, namely reporting heteroskedasticity-robust standard errors, and (c) reporting both weighted and unweighted results (more on this below).
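
Recommendations (a) and (b) can be sketched in a few lines. The code below is a hand-rolled illustration for a single regressor, using the studentized (Koenker) variant of the Breusch-Pagan statistic and the HC0 robust standard error; the simulated data, the seed, and the function names are assumptions, not anything taken from the paper.

```python
import math
import random


def simple_ols(x, y):
    """Return intercept, slope and residuals of y on a constant and x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    return intercept, slope, resid


def breusch_pagan(x, resid):
    """Studentized Breusch-Pagan LM statistic and p-value (one regressor)."""
    sq = [e * e for e in resid]
    _, _, aux_resid = simple_ols(x, sq)     # regress squared residuals on x
    sq_bar = sum(sq) / len(sq)
    tss = sum((s - sq_bar) ** 2 for s in sq)
    rss = sum(e * e for e in aux_resid)
    lm = len(sq) * (1 - rss / tss)          # n * R^2 of auxiliary regression
    p_value = math.erfc(math.sqrt(lm / 2))  # chi-square(1) upper tail
    return lm, p_value


def slope_standard_errors(x, resid):
    """Classical and heteroskedasticity-robust (HC0) slope standard errors."""
    n = len(x)
    x_bar = sum(x) / n
    dev = [xi - x_bar for xi in x]
    sxx = sum(d * d for d in dev)
    classical = math.sqrt(sum(e * e for e in resid) / (n - 2) / sxx)
    robust = math.sqrt(sum((d * e) ** 2 for d, e in zip(dev, resid))) / sxx
    return classical, robust


# Simulated data (an assumption for illustration): error spread grows with x.
rng = random.Random(0)
x = [rng.uniform(1, 10) for _ in range(500)]
y = [2.0 + 0.5 * xi + rng.gauss(0, xi) for xi in x]

_, slope, resid = simple_ols(x, y)
lm, p_value = breusch_pagan(x, resid)
classical_se, robust_se = slope_standard_errors(x, resid)
print(f"slope {slope:.3f}, BP LM {lm:.1f}, p-value {p_value:.4g}")
print(f"classical SE {classical_se:.3f}, robust SE {robust_se:.3f}")
```

In applied work one would normally rely on a tested implementation instead, such as `het_breuschpagan` in `statsmodels.stats.diagnostic`; the point here is only to show what the test and the robust standard error compute.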

__Endogenous sampling__. Another reason to consider weights is to obtain the correct parameter estimates in the presence of endogenous sampling. We can think of endogenous sampling as cases where the regression error term is related to the sampling criteria. For many IE settings this probably won’t apply, but sometimes when we study hard-to-reach populations such as illegal drug users we rely on convenience samples or techniques such as snowball sampling [6]. If we then mix these samples with a sample from the general population to estimate a treatment effect, we will have to deal with complications from endogenous sampling.

In the presence of endogenous sampling, unweighted estimates may well be biased, but weighting by the inverse probability of selection corrects the bias. On the other hand, if the sampling probability is known to vary only across certain strata and those strata indicators are included in the estimating equation, then the probability of selection should no longer be related to the error term, and weighting is not necessary (and may indeed reduce precision if the error term is mostly homoskedastic, as in the example above).
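
A short simulation shows the first case. It assumes a deliberately stark design in which the chance of being sampled depends directly on the outcome (90 percent if the outcome is above 1, 10 percent otherwise); the data-generating process, the seed, and the selection rule are all hypothetical.

```python
import random


def weighted_slope(x, y, w):
    """Slope from a weighted regression of y on a constant and x."""
    total = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total
    num = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    den = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    return num / den


rng = random.Random(1)
TRUE_SLOPE = 1.0
xs, ys, probs = [], [], []
for _ in range(50_000):
    x = rng.gauss(0, 1)
    y = 1.0 + TRUE_SLOPE * x + rng.gauss(0, 1)
    # Endogenous sampling: selection probability depends on the outcome itself.
    p = 0.9 if y > 1.0 else 0.1
    if rng.random() < p:
        xs.append(x)
        ys.append(y)
        probs.append(p)

unweighted = weighted_slope(xs, ys, [1.0] * len(xs))
weighted = weighted_slope(xs, ys, [1.0 / p for p in probs])
print(f"true slope {TRUE_SLOPE}, unweighted {unweighted:.3f}, "
      f"weighted {weighted:.3f}")
```

Because selection depends on the outcome, the unweighted slope is pulled toward zero, while the inverse-probability-weighted slope lands close to the true value.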

__Identifying average partial effects__. If the impact of treatment is heterogeneous – if it interacts with characteristics of the treated population – then linear regression (OLS) and WLS will identify different averages of the heterogeneous treatment effects. So should we weight or not? Unfortunately the best answer appears to be: it depends.

It’s natural to think that weighting regressions to reflect population shares of different sub-groups will yield a consistent estimate of the population averaged treatment effect when the treatment effect varies by sub-group. But this is not necessarily correct if the variance in the characteristics differs across sub-groups. For example, if the variance of a characteristic that affects the magnitude of treatment differs in urban and rural areas, simply weighting by the population shares of urban and rural will not yield the population averaged treatment effect. In fact the authors demonstrate that OLS and WLS will both be inconsistent, and one does not necessarily dominate the other.

So what can be done? The authors suggest that instead of trying to average out the heterogeneity through weighting, the heterogeneity should be explored to understand better how to account for it. One convenient specification that can help in this exploration, when feasible, is a fully saturated model. A fully saturated regression model includes both dummies for each characteristic and a full set of interaction terms with treatment. This specification also has the benefit that it should yield consistent estimates of the population averaged treatment effect.
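
The contrast between pooled and saturated specifications can be worked through with a noise-free toy sample. The sketch assumes two groups whose treatment effects differ (1 in urban, 3 in rural) and whose share of treated units differs (50 percent versus 10 percent), so the within-group variance of treatment differs too. All numbers are hypothetical.

```python
# Hypothetical, noise-free sample. Rural households are oversampled.
groups = {
    "urban": {"pop_share": 0.8, "n": 100, "treated": 50, "effect": 1.0},
    "rural": {"pop_share": 0.2, "n": 100, "treated": 10, "effect": 3.0},
}
total_n = sum(g["n"] for g in groups.values())

rows = []  # (group, treatment dummy, outcome, sampling weight)
for name, g in groups.items():
    weight = g["pop_share"] / (g["n"] / total_n)  # population / sample share
    for i in range(g["n"]):
        d = 1 if i < g["treated"] else 0
        rows.append((name, d, g["effect"] * d, weight))


def pooled_effect(rows, use_weights):
    """Treatment coefficient from regressing y on treatment and group dummies.

    Uses the Frisch-Waugh result: demean within group, then pool.
    """
    num = den = 0.0
    for name in groups:
        sub = [(d, y, w if use_weights else 1.0)
               for g, d, y, w in rows if g == name]
        d_bar = sum(d for d, _, _ in sub) / len(sub)
        y_bar = sum(y for _, y, _ in sub) / len(sub)
        num += sum(w * (d - d_bar) * (y - y_bar) for d, y, w in sub)
        den += sum(w * (d - d_bar) ** 2 for d, _, w in sub)
    return num / den


def saturated_effect(rows):
    """Group-specific effects (full interactions), averaged by population share."""
    total = 0.0
    for name, g in groups.items():
        treated = [y for grp, d, y, _ in rows if grp == name and d == 1]
        control = [y for grp, d, y, _ in rows if grp == name and d == 0]
        effect = sum(treated) / len(treated) - sum(control) / len(control)
        total += g["pop_share"] * effect
    return total


population_effect = sum(g["pop_share"] * g["effect"] for g in groups.values())
ols = pooled_effect(rows, use_weights=False)
wls = pooled_effect(rows, use_weights=True)
saturated = saturated_effect(rows)
print(f"population average effect: {population_effect:.3f}")
print(f"pooled OLS: {ols:.3f}, pooled WLS: {wls:.3f}, saturated: {saturated:.3f}")
```

The pooled OLS estimate overshoots the population average of 1.4 and the pooled WLS estimate undershoots it, because both weight each group's effect by its within-group treatment variance. Estimating the effect separately by group and then averaging with population shares recovers 1.4.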

Remember I mentioned above that the authors recommend comparing the results from OLS and WLS. It’s important to bear in mind that in analytic practice our adopted parametric models are almost certainly misspecified to some degree. But we want them, as the authors write, to be “a good enough approximation to enable nearly unbiased and consistent estimation of the causal effects of interest.” Under either endogenous sampling or misspecification that fails to capture heterogeneous effects, the OLS and WLS results should diverge – in the theoretical parlance they have different probability limits. Thus a marked difference between the OLS and WLS results can be used as a diagnostic for these problems.
