In the design and analysis of many experiments, we measure the key outcome (income, profits, test scores) at baseline, do an intervention, and then measure this same outcome at follow-up. Analysis then proceeds by running an Ancova regression, in which the outcome is regressed on treatment, any randomization strata, and this lagged value of the outcome. Where possible, we often try to stratify the experiment on this baseline outcome, e.g. by forming groups of individuals or firms with similar outcome values. This approach can greatly improve power over using just the follow-up data (and over using difference-in-differences).
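As a minimal sketch of what this Ancova regression looks like in practice, here is a simulated example in Python (the coefficients, sample size, and data-generating process are all hypothetical, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical simulated data: a baseline outcome, one strata dummy,
# random treatment assignment, and a follow-up outcome that is
# correlated with its baseline value.
y0 = rng.normal(size=n)                  # baseline outcome
stratum = rng.integers(0, 2, size=n)     # randomization stratum
treat = rng.integers(0, 2, size=n)       # random assignment
y1 = 0.5 + 1.0 * treat + 0.7 * y0 + 0.3 * stratum + rng.normal(size=n)

# Ancova: regress the follow-up outcome on treatment, the strata
# dummy, and the lagged (baseline) outcome.
X = np.column_stack([np.ones(n), treat, stratum, y0])
beta, *_ = np.linalg.lstsq(X, y1, rcond=None)
print(f"estimated treatment effect: {beta[1]:.2f}")  # close to the true 1.0
```

Controlling for the lagged outcome soaks up the predictable part of the follow-up outcome, which is exactly where the power gain comes from.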
In the last couple of weeks I’ve received several questions about how we should adapt this approach when the key outcome is the same for all individuals at baseline. This is actually pretty common:

E.g. 1: programs that work with youth or the unemployed are interested in impacts on employment and wages, but at baseline no one is working (and everyone has zero income).

E.g. 2: programs that try to help people start businesses are interested in impacts on business ownership and profitability, but at baseline no one runs a firm (and thus all have zero profits).

E.g. 3: an intervention is aiming to change migration behavior, starting with a group of individuals who are all currently non-migrants – it might be trying to dissuade risky migration, or facilitate migration through safe channels.
The key question in such settings is then: what can researchers do that is better than simply measuring the outcome at follow-up, and running a simple regression of the outcome on treatment using only the follow-up data?
What can you do at the experimental design stage?
The key thing here is to first measure what you think are likely to be the key predictors of the future outcomes of interest, and then, secondly, to potentially use these predictors in how you randomly allocate subjects.
To make this concrete, consider a vocational training program for the unemployed in Turkey. There we had a sample of over 5,000 unemployed people who applied for a wide variety of courses, lived in different places all over the country, and varied by gender and age. All of them were unemployed at the time of applying. But we know that labor market outcomes are likely to vary considerably depending on your age, gender, local labor market conditions, and the type of occupation you are likely to be looking for. We therefore stratified by geographic location/course, gender, and age (above or below 25). This gave 457 different strata, with assignment to treatment then within strata. Sure enough, control group employment rates at follow-up differed greatly across these strata: averaging 62% for men over 25 vs 49% for men under 25, 40% for women under 25, and only 30% for women over 25. Stratifying and controlling for these strata then effectively amounts to stratifying on the unobserved baseline propensity to find work over the next year. Note that this was a setting in which the randomization had to take place using just data collected on an initial application form, so we could not also ask about and measure other likely correlates of the likelihood of finding a job, such as measures of skills, labor market histories, and job-finding expectations.
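The mechanics of this kind of stratified assignment can be sketched as follows, with made-up applicant records standing in for the application-form variables (the city names and sample size are purely illustrative, not the actual Turkey data):

```python
import random
from collections import defaultdict

# Hypothetical applicant records: (id, gender, over_25, city) -- stand-ins
# for the variables available on the application form before randomization.
random.seed(1)
applicants = [(i,
               random.choice(["M", "F"]),
               random.choice([True, False]),
               random.choice(["Ankara", "Izmir", "Istanbul"]))
              for i in range(200)]

# Form strata from the predictors of future employment...
strata = defaultdict(list)
for pid, gender, over_25, city in applicants:
    strata[(gender, over_25, city)].append(pid)

# ...then randomize to treatment within each stratum (half treated,
# rounding down when a stratum has an odd number of people).
assignment = {}
for key, ids in strata.items():
    random.shuffle(ids)
    half = len(ids) // 2
    for pid in ids[:half]:
        assignment[pid] = 1  # treatment
    for pid in ids[half:]:
        assignment[pid] = 0  # control

print(len(strata), "strata;", sum(assignment.values()), "treated")
```

Because assignment happens within strata, treatment and control are balanced (to within one person) on every combination of these predictors by construction.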
The advantage of the above approach is that you do not need more data than your baseline, and you can stratify on variables that you think are likely to matter for a number of your key outcomes, without specifying in advance how they matter. You can also think carefully through theory and use local knowledge to try to measure what you think will be the key predictors, regardless of whether previous surveys have ever tried to measure these aspects.
An alternative, when there is only a single outcome you care about and you have some pre-treatment panel data, is to formally predict the follow-up outcome given the set of baseline covariates available. Thomas Barrios’ job market paper explains how to do this. So in Turkey, for example, we might have used exactly how age, gender, subject specialization, and city were historically associated with the likelihood of exiting unemployment, and formed matched pairs based on this predicted outcome. But, as in many cases, we did not have such data, and moreover, we were interested in more than one outcome, so that approach would have required further adaptation to trade off how much we valued different outcomes.
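If such historical data had existed, the procedure might look roughly like this sketch: fit a predictor on the historical panel, predict the follow-up outcome for the experimental sample, and pair units with adjacent predictions (the covariates, coefficients, and sample sizes here are entirely hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
n_hist, n_new = 500, 100

# Hypothetical historical panel: covariates and an observed employment
# outcome, used only to fit the prediction model.
X_hist = rng.normal(size=(n_hist, 3))  # e.g. age, gender, local job rate
y_hist = X_hist @ np.array([0.4, -0.3, 0.6]) + rng.normal(size=n_hist)

# Fit a simple linear predictor on the historical data.
coef, *_ = np.linalg.lstsq(
    np.column_stack([np.ones(n_hist), X_hist]), y_hist, rcond=None)

# Predict the follow-up outcome for the new experimental sample.
X_new = rng.normal(size=(n_new, 3))
y_pred = np.column_stack([np.ones(n_new), X_new]) @ coef

# Sort by predicted outcome, pair adjacent units, and randomize one of
# each matched pair into treatment.
order = np.argsort(y_pred)
treat = np.zeros(n_new, dtype=int)
for i in range(0, n_new, 2):
    pair = order[i:i + 2]
    treat[rng.permutation(pair)[0]] = 1
print("treated:", treat.sum(), "of", n_new)
```

Each matched pair then contains one treated and one control unit with nearly identical predicted outcomes.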
Note that the focus here is on measuring the best predictors of future outcomes, regardless of whether or not there is a causal relationship (so long as we can assume the prediction relationship is reasonably stable).
What can you do at the analysis stage?
Of course, there are limits on how many variables you can stratify on, and often we do not have much prior data with which to predict the future outcome – or, as in the Turkey case above, random assignment may have to take place before all the baseline data are collected. What can you do in these cases?
In these cases, with no baseline outcome and few strata to control for, we may instead want to control for covariates that are predictive of our outcome in our treatment regression. This ex-post covariate adjustment can improve precision, but the concern is that it introduces additional researcher degrees of freedom, and that post-hoc selection of a subset of covariates out of a larger set of potential controls can lead to biased estimates of the treatment effect.
Two approaches can then be used:

Ex-ante specification and registration of exactly which variables you will control for: so in my vocational training example, the pre-analysis plan could have specified that the outcome equation for employment would control for 10 variables that capture things I think might matter for finding a job, where I would specify in advance exactly how these variables are to be defined and coded. This can be a useful approach if you don’t have a lot of potential controls.

A pre-specified machine learning approach: a recent approach that offers more flexibility is to use a machine-learning approach to variable selection, with the double-lasso regression of Belloni et al. the leading candidate. Nice descriptions are provided by Urminsky, Hansen and Chernozhukov, and in Esther Duflo’s NBER masterclass from last summer. Suppose you have a whole set of potential baseline variables to condition on, W1 to W100. Then to decide which controls to include, you i) use lasso to regress the outcome Y on these W’s and select which variables are most useful in predicting the outcome; ii) use lasso to regress treatment T on these W’s and select which variables are most useful in predicting treatment assignment (this helps correct for chance imbalances); and iii) finally, regress Y on T and the set of W’s selected in either of steps i) and ii). If you have stratified, the strata dummies should be partialled out before the lasso estimation.
This machine-learning approach sounds attractive, and there is Stata code (pdslasso) to enable you to do it. This description makes it seem like it takes away a lot of concerns about ad hoc choices of control variables. But this is not the case unless further steps are also taken. First, the researcher needs to specify in advance exactly which variables (W1 to W100 here) they will consider. This can include quadratics, interactions, and other transforms of variables; of course, it also requires making sure you measure these variables at baseline. Second, the choice of the lasso tuning parameter can greatly affect which variables the method selects and is important for avoiding overfitting, so the precise details of how the tuning parameter will be chosen also need to be pre-specified. A final point to note is that this method will likely choose different control variables for each outcome variable – just as the lagged outcome is different for each outcome in standard Ancova.
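To make the three steps concrete, here is a rough Python sketch of the double-lasso procedure using scikit-learn’s cross-validated lasso on simulated data (no strata here, so the partialling-out step is omitted; the variable names, coefficients, and choice of cross-validation for the tuning parameter are illustrative assumptions, not the pdslasso defaults):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(3)
n, p = 500, 100

# Hypothetical baseline covariates W1..W100; only two actually matter.
W = rng.normal(size=(n, p))
T = rng.integers(0, 2, size=n)  # random treatment assignment
Y = 1.0 * T + 2.0 * W[:, 0] - 1.5 * W[:, 5] + rng.normal(size=n)

# Step i): lasso of the outcome Y on the W's (penalty chosen by CV).
sel_y = np.abs(LassoCV(cv=5, random_state=0).fit(W, Y).coef_) > 1e-8

# Step ii): lasso of treatment T on the W's (catches chance imbalances).
sel_t = np.abs(LassoCV(cv=5, random_state=0).fit(W, T).coef_) > 1e-8

# Step iii): OLS of Y on T plus the union of controls selected above.
keep = sel_y | sel_t
X = np.column_stack([np.ones(n), T, W[:, keep]])
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(f"{keep.sum()} controls selected; "
      f"treatment effect estimate: {beta[1]:.2f}")
```

Note how the selection in steps i) and ii) depends on the cross-validated penalty – exactly the tuning-parameter choice the text argues must be pre-specified.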
These approaches have the potential to greatly increase power in situations where everyone has the same baseline outcome value but future outcomes are reasonably predictable. In contrast, if you are in a setting where it is really difficult to predict the future outcome, the gains may be very modest and you may not be able to do much better than simply regressing follow-up outcomes on treatment.
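A quick simulation illustrates how much the gains depend on predictability: when a baseline covariate explains a large share of the outcome variance, adding it as a control shrinks the spread of the treatment estimate considerably (the data-generating process here is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps = 400, 2000

# Compare the sampling spread of the treatment estimate with and
# without a baseline covariate X that strongly predicts the outcome.
est_plain, est_adj = [], []
for _ in range(reps):
    T = rng.integers(0, 2, size=n)
    X = rng.normal(size=n)
    Y = 1.0 * T + 2.0 * X + rng.normal(size=n)  # X highly predictive
    # follow-up-only regression of Y on T
    A = np.column_stack([np.ones(n), T])
    est_plain.append(np.linalg.lstsq(A, Y, rcond=None)[0][1])
    # the same regression, adding the predictive covariate
    B = np.column_stack([np.ones(n), T, X])
    est_adj.append(np.linalg.lstsq(B, Y, rcond=None)[0][1])

print(f"sd without control: {np.std(est_plain):.2f}; "
      f"with control: {np.std(est_adj):.2f}")
```

If X were pure noise instead, the two standard deviations would be essentially identical – which is the "modest gains" case described above.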
Isn’t it a bit dodgy that people can effectively choose their standard errors?
In discussing adding controls with a colleague, it was noted that it somehow seems fishy when people present results controlling only for randomization strata, get one set of standard errors and significance, and then show results controlling for a bunch of other baseline variables, with different standard errors and sometimes more significance – suggesting that there are two sets of possible standard errors and researchers can choose among them. Even when p-hacking concerns are reduced by advance specification as above, how can two sets of standard errors both be right? The colleague went on to note that if we want to compare to what randomization inference would give, shouldn’t there be “a” meaningful concept of standard errors that is determined by the experimental design, not the researcher’s specification choice?
My answer to the above is that you are effectively considering treatment effects on two different outcomes. Just as I can choose to specify ex ante that I will look at treatment effects on income or log income, I can also specify that I am interested in the effect on income (the standard specification without lots of controls), or the treatment effect on the part of income that I can’t predict from baseline variables (the specification after using double-lasso, for example). The key here is that the decision should not be made ex post, and especially should not be made based on which gives the smallest p-values! But so long as I specify in advance which I will focus on, then it is ok.