# Are we over-investing in baselines?

When I was in second grade, I was in a Catholic school, and we had to buy the pencils and pens that we used at school from a supply closet. One day I felt like getting new pencils, so I stood in line when the supply closet was open and asked for two. Before reaching for the pencils, the person who operated the supply closet, Sister Evangelista, told me a story about her time volunteering in Haiti, how the children she taught there used to scramble about in garbage heaps looking for discarded pieces of wood, charcoal, and wire so that they could make their own pencils. I left the closet that day without any pencils and with a permanent sense of guilt when buying new school supplies.

I now feel the same way about baseline data. Most of the variables I have ever collected – maybe even 80 percent – sit unused, while only a small minority make it to any tables or graphs. Given the length of most surveys in low- and middle-income countries, I suspect that I am not alone in this. I know that baselines can be useful for evaluations and beyond (see this blog by David McKenzie on whether balance tests are necessary for evaluations and this one by Dave Evans for suggestions and examples of how baseline data can be better used). But do we really need to spend so much time and resources on them?

This baseline guilt perhaps explains why two recent perspectives on baseline data have caught my eye, one that deals with the extensive margin (should we do a baseline or not) and one that focuses more on the intensive margin (how long should our baseline be). In a chapter in the Handbook of Economic Field Experiments, Karthik Muralidharan recently suggested that we should consider skipping the baseline survey, particularly when conducting an RCT of a government program, and proposed that we instead improve statistical power by increasing the endline sample. Karthik argues that skipping the baseline can be good risk management, as we can focus our evaluation efforts on programs that get implemented and on randomized assignments that are not compromised. Otherwise, we can get stuck with six-figure baseline surveys for interventions that never happen or for experimental designs that are so compromised they go straight to the file drawer.

Whenever I mention similar arguments to my colleagues working in operations, I’m typically greeted by slightly hopeful but very skeptical looks, as though I just declared that kidneys were overrated. When Karthik expressed this argument during a recent RISE conference, he unleashed a small torrent of baseline love on Twitter (which Dave Evans links to here), where the baseline’s role in policy dialog, checking balance, addressing attrition, and increasing power through covariates were all mentioned as good reasons to continue doing baseline surveys. Don’t get me wrong – I think these all are good reasons; I just think we overestimate the frequency and extent to which baseline surveys play some of these roles. I can point to some examples where findings from a baseline survey led to policy changes (in Niger or Kenya, for example), but I think we can all admit that this is still rare. While our survey data might be the only way to check balance across treatment and control groups in some evaluations, there are certainly situations in which aggregated administrative or census data will do just fine. Karthik also argues that we can spend resources (resources that might have gone towards a baseline) to make sure that the randomization is not compromised, and the same can be said for documenting and preventing attrition.

The argument about covariates and their impact on statistical power is harder to deal with, as the statistical power of a test statistic (or the precision of a treatment estimator) can be improved both by collecting data on more individuals and by collecting more covariates that predict our outcome of interest. But can you know in advance how much power your particular covariates are going to buy you? David McKenzie has already demonstrated that for some outcomes – particularly those that are noisy and less autocorrelated – it is better to skip the baseline and instead maximize the number of follow-up surveys.

Of course, to increase precision, we want to include covariates that predict our outcome variable, but how do we decide which set would do this optimally given that we also want to keep our budget down? A recent paper by Pedro Carneiro, Sokbae Lee, and Daniel Wilhelm proposes a strategy for determining which covariates, if any, to include in a baseline survey, providing both the intuition and the Matlab code for doing this. To determine if a covariate adds enough power to justify the costs of baseline data collection, think of a simple case where the per-interview cost does not vary across survey rounds and we’re considering two options: (i) running a baseline and endline with just one covariate (for example, a baseline test score) and (ii) running just an endline with double the sample. It makes sense to collect the baseline test scores if the asymptotic variance of the treatment estimator in the baseline + endline option is less than the asymptotic variance of the endline-only option. In this simple case, this turns out to be equivalent to comparing the ratio of the sample sizes for the baseline + endline and endline-only options (or 50%) to 1 minus the R-squared that would result from regressing the outcome of interest on the baseline test score. The baseline pays off only if 1 minus the R-squared is the smaller of the two, that is, only if the baseline test score explains more than half of the variation in the outcome.
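To make this rule of thumb concrete, here is a minimal Python sketch of the comparison in the simple two-option case. The function name and the `sample_ratio` parameter are my own illustrative choices, not anything taken from the paper or its Matlab code, and the check assumes exactly the setup described above: one covariate, equal per-interview costs across rounds, and an endline-only sample twice as large.

```python
def baseline_worth_collecting(r_squared: float, sample_ratio: float = 0.5) -> bool:
    """Return True if the baseline + endline design is expected to be more precise.

    r_squared    : R-squared from regressing the outcome on the baseline covariate
                   (an assumed input, e.g. taken from comparable existing data).
    sample_ratio : size of the baseline + endline sample relative to the
                   endline-only sample (0.5 in the simple case in the text).
    """
    if not 0.0 <= r_squared <= 1.0:
        raise ValueError("r_squared must lie between 0 and 1")
    # The covariate shrinks the variance by a factor of (1 - R^2); the smaller
    # sample inflates it by 1 / sample_ratio. The baseline wins only if the
    # shrinkage outweighs the loss in sample size.
    return (1.0 - r_squared) < sample_ratio


# Illustrative values only: a weak, a borderline, and a strong predictor.
for r2 in (0.2, 0.5, 0.7):
    print(r2, baseline_worth_collecting(r2))
```

In this stylized setting, a baseline test score would have to explain more than half of the variance in the endline outcome before it beats simply doubling the endline sample.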

Now, how can you make such a comparison when you have a whole set of covariates you want to include that all have different costs of data collection? For this more general case, Carneiro et al. propose using a modification of the orthogonal greedy algorithm (a method that makes locally optimal choices at each stage) that iteratively chooses covariates until the budget is exhausted. For a set sample size, it first finds the covariate with the highest correlation with the outcome of interest. It then regresses the outcome on that covariate and takes the residual. Next, among the remaining set of covariates, it finds the covariate that has the highest correlation with this residual, regresses the outcome on both covariates selected so far, and takes the residual. This continues until the budget constraint is binding. Next the algorithm repeats the process with different sample sizes to find the sample size and set of covariates that minimize the residual variance of the treatment estimator. Carneiro et al. use this procedure on data from an evaluation of access to free day care in Brazil and find that researchers could have increased the precision of their treatment estimator *and* cut their survey budget by 45 percent by collecting only one covariate at baseline (whether or not the respondent finished secondary school) and essentially doubling their sample size at endline.
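The steps above can be sketched in a few lines of Python with NumPy. This is a simplified illustration of the greedy idea as described here, not a port of Carneiro et al.'s Matlab code: the function names, the cost model (a fixed `base_cost` per interview plus a per-covariate cost), and the use of residual variance divided by sample size as a stand-in for the variance of the treatment estimator are all assumptions made for the example. It also assumes you already have pilot or comparable data (`X`, `y`) with no constant columns.

```python
import numpy as np


def greedy_select(X, y, costs, budget_per_interview):
    """Add covariates one at a time by correlation with the current residual."""
    Xc = X - X.mean(axis=0)      # center covariates so no intercept is needed
    yc = y - y.mean()
    resid = yc.copy()
    selected, spent = [], 0.0
    while True:
        # Covariates not yet chosen that still fit in the per-interview budget
        candidates = [j for j in range(X.shape[1])
                      if j not in selected
                      and spent + costs[j] <= budget_per_interview]
        if not candidates:
            break
        # Pick the candidate most correlated (in absolute value) with the residual
        corrs = [abs(np.corrcoef(Xc[:, j], resid)[0, 1]) for j in candidates]
        best = candidates[int(np.argmax(corrs))]
        selected.append(best)
        spent += costs[best]
        # Regress the outcome on all covariates selected so far; update residual
        beta, *_ = np.linalg.lstsq(Xc[:, selected], yc, rcond=None)
        resid = yc - Xc[:, selected] @ beta
    return selected, float(resid.var())


def choose_design(X, y, costs, total_budget, base_cost, sample_sizes):
    """Try each candidate sample size and keep the lowest-variance design."""
    best = None
    for n in sample_sizes:
        # Money left per interview for covariates once n interviews are paid for
        per_interview = total_budget / n - base_cost
        if per_interview < 0:
            continue             # this sample size is not affordable at all
        selected, resid_var = greedy_select(X, y, costs, per_interview)
        score = resid_var / n    # assumed proxy for the estimator's variance
        if best is None or score < best[0]:
            best = (score, n, selected)
    return best                  # (score, sample size, chosen covariate indices)


# Synthetic "pilot" data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500)
print(choose_design(X, y, costs=[1.0, 1.0, 1.0], total_budget=3000.0,
                    base_cost=2.0, sample_sizes=[500, 750, 1000, 1500]))
```

The trade-off is visible in `choose_design`: a larger sample leaves less money per interview for covariates, so the chosen design depends on how much each covariate shrinks the residual variance relative to what it costs.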

But how are we to know any of this in advance, before we have collected data, particularly when there isn’t any data on our outcome of interest within a 1000-mile radius? This problem is similar to what we face when we need to do power calculations. Wouldn’t it be great if someone could take all the data that is now publicly accessible (thank you, research transparency), collect additional data on the costs of data collection for each data set, and then for commonly used outcomes (test scores, early childhood development scores, stunting, take-up of various health products and services, labor force participation) run Carneiro et al.’s procedure on each data set? Identifying the set of covariates (if any) and sample sizes that maximize precision for a range of budgets and for a range of contexts would be a valuable public good that could help guide evaluation design and help target our scarce data collection budgets (not to mention our time and the time of respondents) to where there will be the highest payoff. Will someone please do this? That way, I can cross baselines off my list of things to fret about (of course, with a pencil that I didn’t buy).
