Doing a baseline when you have very little time before the program starts or before people know their program status



I received an interesting question last week from Andrea Guariso, that I think speaks to a more general issue that I have faced in several evaluations, and so I thought I would share his question, my responses, and see whether others have experience or advice to share on this type of problem.

Andrea is working with two colleagues (Tara Mitchell and Carol Newman) at Trinity College Dublin, and three researchers (Marcus Holmlund, Chloe Fernandez, and Serge Adjognon) in the DIME group at the World Bank on an evaluation of an entrepreneurship support program for refugees and host populations in Niger. The AEA registry entry provides more details on the trial. The program will give treated individuals a cash grant and some business training, and they wish to measure both economic outcomes and whether the program affects social outcomes such as social networks and social interactions.


The problem: not being able to survey everyone before they know whether they are in the program

Here is how Andrea described the set-up and main issue:

“The design is such that within each treatment village there will be an assessment (made by two village committees) to identify the eligible households in the village. Then a lottery will determine who will benefit from the program among the eligible people. From the pilot that is currently ongoing we are seeing that roughly 20% of the eligible individuals are going to receive the program.

The big question we are currently struggling with is how and when to run our baseline survey. Ideally, we would like to survey individuals after eligibility is determined, but before the lottery, to avoid that the outcome of the lottery itself could influence their answers (which is unclear whether it should be the case, but perhaps knowing for certainty about participation in the program might influence some decisions, or there could also be some psychological influence playing a role). But we do not have the resources to survey everyone and with just 20% of eligible households being selected we would have power issues if we were to simply survey a representative sample of eligible households”.

The team was considering a two-stage approach, where they draw a random sample of households after eligibility is determined (which will oversample the control group), and then a booster sample of households after the lottery (oversampling the treatment group), and then use this to test whether responses differ with knowledge of program status. But they wanted to see how others had dealt with these issues.
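As a rough illustration, here is a minimal simulation of the test this two-stage design allows: comparing baseline responses of otherwise-comparable households surveyed before versus after the lottery. All numbers and variable names here are hypothetical, and the data are simulated under the null that knowing program status changes nothing.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000  # hypothetical number of eligible households

# Stage 1: a random subsample is surveyed before the lottery;
# Stage 2: a booster sample is surveyed after status is known.
surveyed_after = rng.random(n) < 0.5

# Simulated baseline response; `bias` is the reporting shift for
# households that already know their program status (zero under the null).
bias = 0.0
y = rng.normal(0, 1, n) + bias * surveyed_after

# Difference in mean responses by survey timing, with a simple
# large-sample z-test for whether knowledge of status matters.
diff = y[surveyed_after].mean() - y[~surveyed_after].mean()
se = np.sqrt(y[surveyed_after].var(ddof=1) / surveyed_after.sum()
             + y[~surveyed_after].var(ddof=1) / (~surveyed_after).sum())
z = diff / se
print(round(diff, 3), round(z, 2))
```

In practice one would run this comparison within the control group (or adjusting for treatment status), since treated and control households are surveyed at different rates in the two stages.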

Potential solutions

This issue of trying to interview everyone after selection but before status is known or implementation starts is a reasonably common one. Here are some possibilities I suggested.

Approach one: skip the full baseline and just use data from the program application form. Often you require people to apply to a program by filling out some quite basic application form. This normally contains some basic demographics and questions used to determine eligibility status, but it may be possible to add another page or two to this to collect a little more data (and importantly to make sure you ask for as much contact information as possible to help in re-interviewing them). This was the approach I used in the Nigeria business plan competition I worked on. There we had almost 24,000 people apply, and almost 6,000 get through to a second phase where they submitted business plans. All of this was done online, and it would have been very expensive to interview them in person. At this second stage, I added a short baseline datasheet that had them fill in a little more background information themselves online. This approach may be particularly useful for evaluations in which the key outcome of interest might be similar for everyone to begin with – for example, employment status (not-working) and income (zero) might be pretty much the same for a sample of unemployed young job-seekers, or business start-up status (not started) and profits (zero) might be pretty much the same for a program designed to help people set up new businesses. As a result, the baseline won’t be so informative for predicting future outcomes. The first follow-up survey can then be used to collect more time-invariant characteristics of interest, and perhaps some retrospective variables.

Approach two: do the baseline once you know treatment status (and perhaps even as the intervention is starting), but focus on variables that are time-invariant, retrospective over a longer period, or slow moving. I’ve used this approach in several studies in which random selection occurs in batches on a rolling basis, with interventions starting very quickly thereafter. For example, in this evaluation of a vocational training program for the unemployed in Turkey, individuals applied to different courses and we would then randomize if the courses were oversubscribed. We ended up having 130 different courses throughout the country, and once the application deadline was reached, random selection occurred immediately and then courses started quickly thereafter. As a result, only one-third of the sample was able to be interviewed before the courses started, and the remainder during the first weeks of the course. Our baseline balance table then focuses on demographics and employment history (e.g. ever employed), rather than on current employment status, which could change in a few weeks once people know whether they have the course or not. Similarly, in this non-experimental evaluation of a seasonal worker program, workers were recruited in small batches and then migrated quite quickly after selection. Our surveys then asked remaining family members about demographics, work status in the previous year, and slow-moving variables like housing infrastructure that we thought would not change quickly.

Approach three: non-public lottery and notification with delay. Another possibility which I thought might work in the Niger case could be to just do the lottery privately or semi-publicly (e.g. with a government official or village leader as a witness), and then delay telling people the results for a week or so, doing the baseline survey in between. This approach could be helpful if it is really important to get baseline information on something you think could change with knowledge of treatment status.

Approach four: disassociate the survey from the program. We might be concerned about two issues in collecting data after treatment status is known, or training has just begun. The first is a reporting effect/experimenter demand effect – where people just change how they report to you because of the outcome of randomization. E.g. if I am in the control I might claim to be poorer than I am with the hope of being moved to treatment, or if treated I might want to make you happy by telling you what I think you want to hear. Disassociating the survey from the program (by having it done by a separate organization, with it being framed for a different purpose) can help here. Alternatively one could bound the size of the effects as discussed in this post. But a second concern is genuine short-term changes – e.g. if I am not selected for the training program, I might start working, whereas I stop working and wait for training to start if I am selected for treatment. Or I might genuinely have an increase in happiness from being chosen for the program, even if it hasn’t started yet. Then it will not be enough to disassociate the survey from the program, and you either need to focus again on variables that don’t change so quickly, or use an alternative approach.


The best approach depends on why you need a baseline at all

We’ve previously blogged about several related issues here (see Alaka Holla’s post on whether we over-invest in baselines; and my posts on what to do when everyone has the same baseline value, or when measurement of the outcome changes between baseline and follow-up), and Jeffery McManus has a nice post on when to collect baseline data on the IDInsight blog. But assuming you are in a situation where you at least know who is in the experimental sample and have some very basic information on them, then we can consider several key reasons for the baseline that matter here:

·       Improving power by controlling for baseline covariates: this reason is most compelling when your baseline variables (including the baseline of the main outcome) are strongly predictive of the future outcome of interest (see here). As noted in approach one, it is less likely to be a key factor in employment generation or business start-up programs. However, it could matter for social networks, since we would expect the contacts you have today to be reasonably predictive of who you will talk with a year from now. (Note I’m abstracting here from the additional power gains possible from stratifying on baseline covariates, since this isn’t possible in the Niger case).

·       Collecting enough baseline controls for a balance table and to test and correct for possible attrition: for the balance table, it may be enough to use the administrative data and some basic information collected in my approach one or two above. See here for discussion on use of such a table. But if you are concerned that attrition may be high, having more variables that are closely related to the outcome of interest can be helpful for convincing readers that attrition is not causing bias, and for possibly reweighting estimates.

·       Heterogeneity analysis: I’m usually struggling enough for power in looking at main effects that I try to be pretty parsimonious in heterogeneity analysis, focusing on relatively simple-to-collect variables like gender, education level, or firm sector. But as people increasingly turn to data-hungry machine-learning heterogeneity approaches, they may want to collect more baseline variables for this purpose. In the Niger case, the team is interested in heterogeneity by how much people already interact with others, and by some measure of baseline psychological wellbeing. You could ask about interactions with others last month/before they applied for the program to get a pre-program measure that is perhaps ok for the former; the bigger concern is that simply learning treatment status might have at least short-term effects on psychological wellbeing.
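To make the power point in the first bullet concrete, here is a minimal sketch (simulated data, all parameters hypothetical) of how controlling for a baseline outcome correlated with the follow-up outcome tightens standard errors, roughly by a factor of sqrt(1 - rho^2):

```python
import numpy as np

rng = np.random.default_rng(1)
n, rho, effect = 500, 0.6, 0.2  # hypothetical sample size, autocorrelation, effect

treat = rng.random(n) < 0.5
y0 = rng.normal(0, 1, n)  # baseline outcome
# Follow-up outcome: correlated with baseline, plus the treatment effect.
y1 = rho * y0 + np.sqrt(1 - rho**2) * rng.normal(0, 1, n) + effect * treat

# A simple difference in means ignores the baseline...
se_post = np.sqrt(y1[treat].var(ddof=1) / treat.sum()
                  + y1[~treat].var(ddof=1) / (~treat).sum())

# ...while residualizing on the baseline (the ANCOVA idea) shrinks the
# residual variance by roughly a factor of (1 - rho**2).
resid = y1 - np.polyval(np.polyfit(y0, y1, 1), y0)
se_ancova = np.sqrt(resid[treat].var(ddof=1) / treat.sum()
                    + resid[~treat].var(ddof=1) / (~treat).sum())

print(se_ancova < se_post)  # True: tighter standard errors, hence more power
```

With a weakly predictive baseline (rho near zero, as in the employment and start-up examples above), the gain largely disappears, which is the sense in which the baseline "won't be so informative."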


Each of the approaches discussed above involves different trade-offs in terms of logistics, the types of data one can collect, and what it allows you to do. But starting off by being very clear why you want a baseline and which variables you care about most might help in making these trade-offs.

Readers – any other creative approaches you have used, or suggestions for alternatives that could work in these settings?



David McKenzie

Lead Economist, Development Research Group, World Bank

Join the Conversation

Thomas Escande
October 06, 2021

Hi David,

Thanks for the interesting read. We also had a very similar question for a program that was supposed to take place but was later cancelled, so I don't know of a good example of our solution being implemented in the field, and am not sure about its practicality.

The idea was simply to have something very similar to the two-stage approach. Basically, it would mean first dividing all the eligible households into two groups, a small one and a larger one. The probability of being assigned to treatment in the small group would be 50%, and the probability of being assigned to treatment in the large group would be just whatever completes the treatment group. You could then survey everyone in the small group, at lower cost, and it should be representative of both the control and the treatment group, as selection is random. From an ethics standpoint, at the beginning everyone has the same probability of being assigned to the treatment or control group.
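A back-of-the-envelope sketch of this arithmetic (all numbers hypothetical): with 20% treated overall, treating 50% of a small, fully-surveyed group pins down the treatment probability needed in the large group.

```python
# Hypothetical numbers: 1000 eligible households, 20% treated overall.
n, overall_p = 1000, 0.20

# Put a small group of size m into a 50/50 lottery and survey all of them;
# the large group's treatment probability is whatever completes the 20%.
m = 200
p_small = 0.5
p_large = (overall_p * n - p_small * m) / (n - m)

print(p_large)  # 0.125
```

Because assignment to the small versus large group is itself random, every household's unconditional probability of treatment is still 20%, which is the sense in which the design stays fair ex ante even though the conditional probabilities (50% vs 12.5%) differ.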

Now the difficulty is as always in the implementation, either coming up with a good story for explaining the two groups, or having the randomization done privately...

I definitely would be interested in knowing how it goes and which solution is picked in the end.


Juliane Zenker
October 11, 2021

Hi David,

we faced a similar problem in Afghanistan, where the lottery for a CfW program had to be public and VERY transparent, but the baseline needed to happen before the lottery (i.e. without knowing who is in the treatment group) so as not to bias baseline responses. I have spent quite a while optimizing the sampling for the project (which is currently on hold because of recent events). To avoid problems at the analysis stage down the line, I found that in our situation it was best to sample/interview all households in each included community before the lottery.
To make the decision of sampling all households in the communities affordable, I ran simulations and found it helpful to particularly adjust the sampling from two angles: (1) average size of communities to include in the community sampling frame and (2) cluster size vs number of clusters in power calculations:
Decreasing the average size of communities in the sampling frame helps to increase the probability of treatment households in the baseline sample when every household is sampled. Of course, external validity is limited for larger communities in that case. However, I suspect that it would be possible to get an idea of whether treatment effects vary with community size as long as there is some variation in size of communities included in the sample. One could then extrapolate treatment effects for larger communities that are not in the sample.
With respect to cluster size in power calculations, I found that in our case, within a certain range of cluster sizes, the power trade-off of increasing cluster size while decreasing the number of clusters is relatively low. We decreased the number of sampled clusters (communities) by roughly 30 percent from our initially targeted number. By doing that, the required cluster size (i.e. households interviewed within a community) became larger to achieve the same power level, i.e. closer to the value of (the small) average community size included in the sampling frame.
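For intuition, a minimal sketch of this trade-off (all numbers hypothetical), using the standard cluster-randomized design-effect formula, where the variance of the difference in means scales with (1 + (m - 1) * ICC) / (k * m):

```python
import math

def mde(k, m, icc, sigma=1.0, alpha_z=1.96, power_z=0.84):
    # Minimum detectable effect for a cluster-randomized comparison:
    # the design effect 1 + (m - 1) * icc inflates household-level variance.
    deff = 1 + (m - 1) * icc
    se = sigma * math.sqrt(2 * deff / (k * m))  # k clusters per arm, m per cluster
    return (alpha_z + power_z) * se  # 80% power, 5% two-sided test

# Hypothetical numbers: cutting clusters by ~30% while enlarging them
# costs relatively little power when the ICC is modest.
base = mde(k=100, m=10, icc=0.05)
alt = mde(k=70, m=16, icc=0.05)
print(round(base, 3), round(alt, 3))  # 0.151 0.157
```

With a modest intra-cluster correlation, the minimum detectable effect barely moves despite 30% fewer communities, which is the range in which interviewing every household in a smaller set of small communities becomes affordable.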
We still had to make some other trade-offs, for instance making the questionnaire relatively short and dropping a few research questions, to make it work. But eventually, I think it was a pretty good solution for a very tricky problem.

Hope this helps