In the design and analysis of many experiments, we measure the key outcome (income, profits, test scores) at baseline, do an intervention, and then measure this same outcome at follow-up. Analysis then proceeds by running an ANCOVA regression, in which the follow-up outcome is regressed on treatment, any randomization strata, and the lagged (baseline) value of the outcome. Where possible, we often try to stratify the experiment on this baseline outcome, e.g. by forming groups of individuals or firms with similar outcome values. Using this approach can greatly improve power over using just the follow-up data (and over using difference-in-differences).
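(As a concrete illustration, here is a minimal sketch of that ANCOVA specification on simulated data, using Python’s statsmodels; the variable names y_baseline, y_followup, treat, and stratum are placeholders rather than anything from an actual study.)

```python
# Minimal ANCOVA sketch on simulated data (illustration only).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "stratum": rng.integers(0, 10, n),     # randomization strata
    "treat": rng.integers(0, 2, n),        # random treatment assignment
    "y_baseline": rng.normal(size=n),      # outcome measured at baseline
})
# Follow-up outcome depends on the baseline outcome plus a treatment effect.
df["y_followup"] = 0.5 * df["y_baseline"] + 0.2 * df["treat"] + rng.normal(size=n)

# ANCOVA: regress the follow-up outcome on treatment, strata dummies, and the lagged outcome.
ancova = smf.ols("y_followup ~ treat + C(stratum) + y_baseline", data=df).fit(cov_type="HC1")
print(ancova.params["treat"], ancova.bse["treat"])
```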
In the last couple of weeks I’ve received several questions about how we should adapt this approach when the key outcome is the same for all individuals at baseline. This is actually pretty common:
- E.g. 1: programs that work with youth or the unemployed are interested in impacts on employment and wages, but at baseline no one is working (and everyone has zero income).
- E.g. 2: programs that try to help people start businesses are interested in impacts on business ownership and profitability, but at baseline no one runs a firm (and thus all have zero profits).
- E.g. 3: an intervention is aiming to change migration behavior, starting with a group of individuals who are all currently non-migrants; it might be trying to dissuade risky migration, or facilitate migration through safe channels.
The key question in such settings is then: what can researchers do that is better than simply measuring the outcome at follow-up and running a simple regression of the outcome on treatment using only the follow-up data?
What can you do at the experimental design stage?
The key here is, first, to measure what you think are likely to be the key predictors of the future outcomes of interest, and second, to potentially use these predictors in how you randomly allocate subjects.
To make this concrete, consider a vocational training program for the unemployed in Turkey. There we had a sample of over 5,000 unemployed people who applied for a wide variety of courses, lived in different places all over the country, and varied by gender and age. All of them were unemployed at the time of applying. But we know that labor market outcomes are likely to vary considerably depending on age, gender, local labor market conditions, and the type of occupation someone is likely to be looking for. We therefore stratified by geographic location/course, gender, and age (above or below 25). This gave 457 different strata, with assignment to treatment then made within each stratum. Sure enough, the control group employment rates at follow-up differed greatly across these strata: averaging 62% for men over 25 vs 49% for men under 25, 40% for women under 25, and only 30% for women over 25. Stratifying and controlling for these strata then effectively amounts to stratifying on the unobserved baseline propensity to find work over the next year. Note that this was a setting in which the randomization had to take place using just the data collected on an initial application form, so we could not also ask about and measure other likely correlates of the likelihood of finding a job, such as measures of skills, labor market histories, and job-finding expectations.
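(For illustration, a minimal sketch of this kind of stratified assignment in Python, on made-up data; the column names and the rule of treating half of each stratum are hypothetical simplifications, not our actual assignment code.)

```python
# Stratified random assignment: form strata from baseline predictors of the
# future outcome and randomize to treatment within each stratum.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 5000
df = pd.DataFrame({
    "course_location": rng.integers(0, 100, n),  # proxy for geographic location/course
    "female": rng.integers(0, 2, n),
    "age": rng.integers(18, 45, n),
})
df["over_25"] = (df["age"] >= 25).astype(int)

# Each unique combination of the predictors defines a stratum.
df["stratum"] = df.groupby(["course_location", "female", "over_25"]).ngroup()

# Within each stratum, randomly order units and assign (roughly) half to treatment.
df["treat"] = (
    df.groupby("stratum")["stratum"]
      .transform(lambda s: rng.permutation(len(s)) < len(s) // 2)
      .astype(int)
)
```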
The advantage of the above approach is that you do not need any data beyond your baseline, and you can stratify on variables that you think are likely to matter for a number of your key outcomes, without specifying in advance how they matter for each outcome. You can also think carefully through theory and use local knowledge to try to measure what you think will be the key predictors, irrespective of whether previous surveys have ever tried to measure these aspects.
An alternative, when there is only a single outcome you care about, and when you have some panel data pre-treatment, is to formally predict the follow-up outcome, given the set of baseline covariates available. Thomas Barrios’ job market paper explains how to do this. So in Turkey, for example, we might have used exactly how age, gender, subject specialization, and city were historically associated with the likelihood of exiting unemployment, and formed matched pairs based on this predicted outcome. But, as in many cases, we did not have such data, and moreover, we were interested in more than one outcome, so that approach would have required further adaptation to trade off how much we valued different outcomes.
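(A rough sketch of the logic of pairing on a predicted outcome, under the assumption that a historical dataset with the same covariates and realized employment outcomes is available; all names and the simple logistic prediction model are hypothetical, and this is the basic idea rather than the exact Barrios procedure.)

```python
# Sketch: predict the follow-up outcome from baseline covariates using historical
# data, then form matched pairs on the prediction and randomize within pairs.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

def fake_data(n, seed):
    r = np.random.default_rng(seed)
    d = pd.DataFrame({
        "age": r.integers(18, 45, n),
        "female": r.integers(0, 2, n),
        "city": r.integers(0, 20, n),
    })
    # Employment outcome, loosely related to the covariates (for illustration).
    d["employed"] = (r.random(n) < 0.4 + 0.005 * (d["age"] - 18) - 0.1 * d["female"]).astype(int)
    return d

historical = fake_data(3000, seed=10)   # past data with outcomes observed
sample = fake_data(500, seed=11)        # new experimental sample (outcome not yet observed)

X_cols = ["age", "female", "city"]
model = LogisticRegression(max_iter=1000).fit(historical[X_cols], historical["employed"])
sample["p_employed"] = model.predict_proba(sample[X_cols])[:, 1]

# Sort by the predicted outcome, pair adjacent units, and treat one unit per pair.
sample = sample.sort_values("p_employed").reset_index(drop=True)
sample["pair"] = sample.index // 2
sample["treat"] = (
    sample.groupby("pair")["pair"].transform(lambda s: rng.permutation(len(s)) == 0).astype(int)
)
```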
Note that the focus here is on measuring the best predictors of future outcomes, regardless of whether or not there is a causal relationship (so long as we can assume the prediction relationship is reasonably stable).
What can you do at the analysis stage?
Of course, there are limits on how many variables you can stratify on, and often we may not have much prior data with which to predict the future outcome; or, as in the Turkey case above, your random assignment may have to take place before all the baseline data are collected. What can you do in these cases?
In these cases, with no baseline outcome and few strata to control for, we may instead want to control for covariates that are predictive of our outcome in our treatment regression. This ex-post covariate adjustment can improve precision, but the concern is that it introduces additional researcher degrees of freedom, and that post-hoc selection of a subset of covariates out of a larger set of potential controls can lead to biased estimates of the treatment effect.
Two approaches can then be used:
- Ex-ante specification and registration of exactly which variables you will control for: so in my vocational training example, the pre-analysis plan could have specified that the outcome equation for employment would control for 10 variables that capture things I think might matter for finding a job, where I would specify in advance exactly how these variables are to be defined and coded. This can be a useful approach if you don’t have a lot of potential controls.
- A pre-specified machine learning approach: a more recent approach that offers more flexibility is to use machine learning for variable selection, with the double-lasso regression of Belloni et al. as the leading candidate. Nice descriptions are provided by Urminsky, Hansen and Chernozhukov, and in Esther Duflo’s NBER masterclass from last summer. Suppose you have a whole set of potential baseline variables to condition on, W1 to W100. Then to decide which controls to include, you i) use lasso to regress the outcome Y on these W’s and select which variables are most useful in predicting the outcome; ii) use lasso to regress treatment T on these W’s and select which variables are most useful in predicting treatment assignment (this helps correct for chance imbalances); and iii) finally regress Y on T and the set of W’s selected in either of steps i) and ii). If you have stratified, the strata dummies should be partialled out before the lasso estimation.
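(A minimal sketch of this post-double-selection logic on simulated data is below. It illustrates steps i)-iii) only and is not the pdslasso implementation; for convenience it uses cross-validated lasso penalties, whereas Belloni et al. derive a plug-in penalty, and as discussed next the penalty choice should itself be pre-specified.)

```python
# Post-double-selection (double lasso) sketch: select controls that predict the
# outcome, select controls that predict treatment, then run OLS of the outcome
# on treatment plus the union of selected controls.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(3)
n, k = 1000, 100
W = rng.normal(size=(n, k))              # candidate controls W1..W100
strata = rng.integers(0, 10, n)          # randomization strata
treat = rng.integers(0, 2, n)
y = 1.0 * W[:, 0] + 0.5 * W[:, 1] + 0.2 * treat + rng.normal(size=n)

# Partial out the strata dummies from Y, T, and each W before the lasso steps.
D = pd.get_dummies(strata, dtype=float).to_numpy()
def partial_out(v, D):
    beta, *_ = np.linalg.lstsq(D, v, rcond=None)
    return v - D @ beta
y_t = partial_out(y, D)
t_t = partial_out(treat.astype(float), D)
W_t = np.column_stack([partial_out(W[:, j], D) for j in range(k)])

# Step i): lasso of the outcome on the W's; step ii): lasso of treatment on the W's.
sel_y = np.flatnonzero(LassoCV(cv=5).fit(W_t, y_t).coef_ != 0)
sel_t = np.flatnonzero(LassoCV(cv=5).fit(W_t, t_t).coef_ != 0)
selected = np.union1d(sel_y, sel_t)

# Step iii): OLS of the outcome on treatment plus the union of selected controls.
X = sm.add_constant(np.column_stack([t_t, W_t[:, selected]]))
fit = sm.OLS(y_t, X).fit(cov_type="HC1")
print("treatment effect:", fit.params[1], "se:", fit.bse[1])
```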
This machine-learning approach sounds attractive, and there is Stata code (pdslasso) to enable you to implement it. The description above makes it seem like it takes away a lot of concerns about ad hoc choices of control variables. But this is not the case unless further steps are also taken. First, the researcher needs to specify in advance exactly which variables (W1-W100 here) they will consider. These can include quadratics, interactions, and other transforms of variables. Of course, this also requires making sure you measure these variables at baseline. Second, the choice of the lasso tuning parameter can greatly affect which variables the method selects and is important for avoiding overfitting, so the precise details of how the tuning parameter will be chosen also need to be pre-specified. A final point to note is that this method will likely choose different control variables for each outcome variable, just as the lagged outcome differs across outcomes in standard ANCOVA.
These approaches have the potential to greatly increase power in situations where everyone has the same baseline outcome value, but where future outcomes are reasonably predictable. In contrast, if you are in a setting where it is really difficult to predict the future outcome, the gains may be very modest and you may not be able to do much better than simply regressing follow-up outcomes on treatment.
Isn’t it a bit dodgy that people can effectively choose their standard errors?
In discussing adding controls with a colleague, they noted that it somehow seems fishy when people present results controlling only for randomization strata, getting one set of standard errors and significance levels, and then show results that also control for a bunch of other baseline variables, with different standard errors and sometimes more significance. This suggests there are two sets of possible standard errors and that researchers can choose amongst them. Even when p-hacking concerns are reduced by specifying the controls in advance, as above, how can two sets of standard errors both be right? The colleague went on to note that if we want to compare to what randomization inference would give, shouldn’t there be “a” meaningful concept of the standard error, one determined by the experimental design rather than by the researcher’s specification choice?
My answer to the above is that you are effectively considering treatment effects on two different outcomes. Just as I can choose to specify ex ante that I will look at treatment effects on income or on log income, I can also specify that I am interested in the effect on income (the standard specification without lots of controls), or in the treatment effect on the part of income that I can’t predict from baseline variables (the specification using double-lasso, for example). The key here is that the decision should not be made ex post, and especially should not be made based on which gives the smallest p-values! But so long as I specify in advance which I will focus on, then it is fine.