Collecting more rounds of data to boost power – the new stuff



My paper “Beyond Baseline and Follow-up: the case for more T in experiments” was recently accepted at the JDE. As with most papers that go through review, the accepted version is a definite improvement on the working paper version. So I thought I’d pay heed to Berk’s lament that we focus too much on working papers and sometimes ignore the changes that occur at publication, by briefly summarizing the key changes made.

First, to recap, the main points of the paper are as follows:

·         Most experiments in economics rely on just a single baseline and follow-up. This design is suitable for highly autocorrelated outcomes like test scores, or BMI.

·         However, for outcomes which are noisy and much less autocorrelated, like business profits, household consumption and income, and episodic health outcomes, there are big gains in power to be had from considering multiple measurements at relatively short intervals.

·         With a fixed budget, it can sometimes be optimal to do no baseline at all and 2 follow-ups (although the paper discusses lots of caveats on this), or to do 1 baseline and 2 follow-ups on a smaller cross-section rather than 1 baseline and 1 follow-up on a larger cross-sectional sample.

·         With a fixed cross-sectional sample, adding more waves of surveys can lead to a big improvement in power.

·         When autocorrelations are low, there are large improvements in power to be had from using ANCOVA instead of difference-in-differences in analysis.
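To make the ANCOVA versus difference-in-differences comparison concrete, here is a minimal sketch of the standard large-sample variance formulas under the equal-correlation assumption. It assumes n units per arm, a common outcome variance, m baseline rounds, r follow-up rounds, and a single autocorrelation rho between any two rounds. The function name `variance_factor` is hypothetical, and this is an illustration rather than the paper's own code.

```python
def variance_factor(rho, m, r, method):
    """Variance of the treatment-effect estimator, in units of 2*sigma^2/n.

    Assumes equal-sized arms and the same correlation rho between any two
    rounds (an illustrative simplification, not the paper's code).
    m = number of baseline rounds, r = number of follow-up rounds.
    """
    # Variance of the mean of r equally correlated follow-up rounds.
    post = (1 + (r - 1) * rho) / r
    if method == "post":
        return post
    if m < 1:
        raise ValueError("DiD and ANCOVA need at least one baseline round")
    if method == "did":
        # Follow-up mean minus baseline mean.
        return (1 - rho) * (m + r) / (m * r)
    if method == "ancova":
        # Follow-up mean, conditioning on the baseline mean.
        return post - m * rho**2 / (1 + (m - 1) * rho)
    raise ValueError("method must be 'post', 'did' or 'ancova'")


if __name__ == "__main__":
    for rho in (0.2, 0.5, 0.8):
        did = variance_factor(rho, 1, 1, "did")
        ancova = variance_factor(rho, 1, 1, "ancova")
        print(f"rho={rho}: DiD/ANCOVA variance ratio = {did / ancova:.2f}")
```

With one baseline and one follow-up and rho = 0.25, these formulas give a factor of 1.5 for difference-in-differences and 0.9375 for ANCOVA, so difference-in-differences would need a 60 percent larger sample to match ANCOVA's precision. The gap shrinks as rho approaches 1.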

So what’s new in the accepted version?

·         Section 2.3 - The derivations of results (and Stata’s sampsi power calculations) are based on an assumption that autocorrelations are equal over all points of time in the study, and are equal between the treatment and control group. Section 2.3 now discusses the consequences of deviations from these assumptions:

o   If the treatment changes the autocorrelation structure for the treatment group compared to the control group, then optimally one would choose different-sized treatment and control groups, with the planned method of analysis determining whether the larger sample should come from the group with more or less autocorrelation. However, I show that in practice these deviations from equal-sized treatment and control groups are likely to be slight.

o   If the autocorrelation is not constant over time, then I show one should just use the average autocorrelation in power calculations.

·         Section 2.4. Where do the gains from more T come from? One comment that has come up in discussing this paper with people is that if more measurement is good, then what is to stop you from moving from monthly to weekly to daily to minute-by-minute data, and does this really help? This section shows that the gains come from assuming that the definition of the outcome measure doesn’t change with how frequently it is measured.


What the paper does not imply is any preference for using, for example, three survey waves of monthly profit data versus one survey round in which profits are measured over the last three months. Assuming a constant treatment effect and no measurement error, the power to detect this treatment effect will be the same whether we aggregate up the three months of profit and estimate the impact on three-month profits, or use the three rounds of profit data as panel data. Likewise, we can obtain a much more accurate measure of the impact of an intervention on daily profits by measuring 30 days of profits and using them in the estimation than if we just looked at the impact on the last day’s profits, but we don’t do better by measuring the impact on daily profits from 30 days of measurement than by looking at the impact on the last month’s profits.


Where the gains from multiple measurements come from with flow variables is therefore in the ability to either extend the time horizon over which measurement takes place, or to reduce measurement error within this horizon (there are further gains if one wants to measure the trajectory of impacts). It is common practice for surveys to only ask about the last month’s profits or the last week’s food expenditure, because microenterprise owners and households may not be able to recall data accurately for longer periods. So having multiple survey rounds, each of which asks about the flow variable over this pre-specified period, is better than having just one round which does the same. Alternatively, one might choose to push firm owners or households to recall data over longer horizons, but there is likely to be more measurement error in doing so, and in such cases multiple measurements can help improve power by reducing the noise in this aggregate.
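The equivalence between aggregating and using the rounds as a panel can be checked with a short calculation. The sketch below assumes a follow-up-only comparison of means, n units per arm, a constant per-period treatment effect `delta`, per-period standard deviation `sigma`, equal correlation `rho` between periods, and no recall or measurement error. The function names are hypothetical and chosen only for illustration.

```python
import math


def z_panel(delta, sigma, rho, n, r):
    """Signal-to-noise ratio when r follow-up rounds are used as panel data.

    The estimand is the per-period effect delta; each unit's outcome is
    effectively the mean of its r equally correlated rounds.
    """
    se = math.sqrt(2 * sigma**2 * (1 + (r - 1) * rho) / (r * n))
    return delta / se


def z_aggregate(delta, sigma, rho, n, r):
    """Signal-to-noise ratio when one round measures the r-period total.

    The estimand is r*delta; the variance of a sum of r equally correlated
    periods is sigma^2 * (r + r*(r-1)*rho). Assumes no recall error.
    """
    var_total = sigma**2 * (r + r * (r - 1) * rho)
    se = math.sqrt(2 * var_total / n)
    return r * delta / se


if __name__ == "__main__":
    args = dict(delta=10.0, sigma=100.0, rho=0.3, n=500)
    print("three monthly rounds as panel:", z_panel(r=3, **args))
    print("one three-month aggregate:   ", z_aggregate(r=3, **args))
    print("single monthly round:        ", z_panel(r=1, **args))
```

Under these assumptions the first two ratios coincide, and both exceed the single-round ratio whenever rho is below 1. The advantage of separate rounds therefore comes from the longer horizon covered or from recall error in the aggregate, which this sketch deliberately leaves out.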


·         A final addition from the revision that might be of use to readers is that, at the request of the journal, I wrote some Stata code that does the budget-constrained choice of n (number of cross-sectional units) and T (number of survey rounds). This code is here.
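For readers who want to see the logic of a budget-constrained search without Stata, the following is a rough Python sketch. It is not the Stata code mentioned above. The cost model (a per-interview cost plus an optional fixed cost per unit), the grid limit `max_rounds`, and the names `variance_factor` and `best_design` are all assumptions made for illustration. It uses the equal-correlation ANCOVA variance, which reduces to a follow-up-only comparison when there is no baseline.

```python
def variance_factor(rho, m, r):
    """ANCOVA variance in units of 2*sigma^2/n (equal-correlation assumption).

    m = baseline rounds, r = follow-up rounds. With m == 0 this is the
    variance of a simple follow-up comparison of means.
    """
    post = (1 + (r - 1) * rho) / r
    if m == 0:
        return post
    return post - m * rho**2 / (1 + (m - 1) * rho)


def best_design(budget, cost_per_interview, rho,
                fixed_cost_per_unit=0.0, max_rounds=12):
    """Grid search over (m, r) for the lowest-variance design within budget.

    Hypothetical cost model: each unit costs fixed_cost_per_unit plus
    cost_per_interview for each of its m + r interviews, and there are two
    equal-sized arms of n units each.
    """
    best = None
    for m in range(0, max_rounds + 1):
        for r in range(1, max_rounds + 1):
            cost_per_unit = fixed_cost_per_unit + cost_per_interview * (m + r)
            n = int(budget // (2 * cost_per_unit))  # units per arm
            if n < 2:
                continue
            variance = variance_factor(rho, m, r) / n
            if best is None or variance < best["variance"]:
                best = {"variance": variance, "n": n, "m": m, "r": r}
    return best


if __name__ == "__main__":
    for rho in (0.2, 0.5, 0.8):
        print(rho, best_design(budget=2400, cost_per_interview=1.0, rho=rho))
```

The answer is sensitive to the cost model. With no fixed cost per unit and low autocorrelation, this sketch tends to favor a large cross-section with few rounds, while adding a fixed cost per unit makes extra rounds relatively cheaper and shifts the choice toward more T. Holding n fixed and allowing only two rounds, `variance_factor(rho, 0, 2)` is smaller than `variance_factor(rho, 1, 1)` whenever rho is below 0.5, which is the sense in which skipping the baseline in favor of a second follow-up can pay off.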


David McKenzie

Lead Economist, Development Research Group, World Bank
