There is much demand from practitioners for “shoestring methods” of impact evaluation—sometimes called “quick and dirty methods.” These methods try to bypass some costly element in the typical impact evaluation. Probably the thing that practitioners would most like to avoid is the need for baseline data collected prior to the intervention. Imagine how much more we could learn about development impact if we did not need baseline data! Even with the expansion in impact evaluations over the last 10 years or so, I doubt if any form of contemporaneous baseline data are available for more than 10% of current development projects. (Less than 10% of the World Bank’s lending operations have impact evaluations, and I doubt if the Bank evaluates less than average.) Should we just give up on the 90%?
Another thing we might try to avoid is the need for objectively assessed outcome measures. Development outcomes such as consumption or income require relatively complex and costly surveys.
There is a potential shoestring solution for avoiding both: We can ask post-intervention qualitative/subjective questions of both the treated and untreated groups on how much their welfare has improved since the intervention began. This could dramatically lower the costs of impact evaluations. And it would open up many new opportunities for learning about policy effectiveness. It could be especially helpful in addressing a common problem in impact evaluations of development projects, namely that the evaluation period is often constrained to fall short of the period in which the full impact is to be expected.
But how confident can we be about this shoestring method? Arguments can be made for and against. We need evidence.
In a new paper I report on an experiment designed to test the idea of using retrospective subjective data as a substitute for baseline data from a contemporaneous socio-economic survey. After baseline and post-intervention data had been collected for treatment and comparison units to allow estimation of a standard double-difference (DD), a series of subjective recall questions was asked on how various dimensions of welfare had changed since the project was introduced. This allows what I will call the “shoestring double difference” (SDD) estimator. I studied two versions of the SDD estimator:
SDD1: This assumes that no baseline data are available. Only an ex-post survey can be done. Thus no adjustments can be made for selection bias into the treatment based on contemporaneously observed pre-intervention differences that might influence subsequent trajectories.
SDD2: This assumes that only the data on outcomes are missing. Thus standard corrections can be made for selection bias based on other observables at the baseline.
Note that the difference is in whether an allowance is made for selection on observables. If the recall of changes since the introduction of the project works well then both SDD1 and SDD2 will be able to address selection based on (time-invariant) unobserved factors.
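To make the comparison concrete, here is a minimal sketch in Python of how a standard DD and an SDD1-style estimate might be computed. Everything in it is an assumption for illustration: the function names, the numbers, and the recall scale running from -2 (much worse off) to +2 (much better off) are made up and are not taken from the study. SDD2 would additionally balance baseline observables between treated and comparison units, which the sketch omits.

```python
from statistics import mean


def double_difference(treated_pre, treated_post, comparison_pre, comparison_post):
    """Standard DD: mean change for treated units minus mean change for comparison units."""
    treated_change = mean(treated_post) - mean(treated_pre)
    comparison_change = mean(comparison_post) - mean(comparison_pre)
    return treated_change - comparison_change


def shoestring_double_difference(treated_recall, comparison_recall):
    """SDD1-style estimate: difference in mean recalled change, using no baseline data."""
    return mean(treated_recall) - mean(comparison_recall)


# Hypothetical consumption per person from baseline and follow-up surveys.
treated_pre = [100.0, 120.0, 90.0]
treated_post = [130.0, 150.0, 110.0]
comparison_pre = [110.0, 95.0, 105.0]
comparison_post = [125.0, 115.0, 120.0]

# Hypothetical recall answers from the follow-up survey only, on an assumed
# scale from -2 (much worse off) to +2 (much better off).
treated_recall = [1, 2, 1]
comparison_recall = [1, 0, 1]

dd = double_difference(treated_pre, treated_post, comparison_pre, comparison_post)
sdd1 = shoestring_double_difference(treated_recall, comparison_recall)

print(f"DD (consumption units): {dd:.2f}")
print(f"SDD1 (recall-scale points): {sdd1:.2f}")
```

The two estimates are on different scales, so in a sketch like this only their sign and statistical significance could sensibly be compared, not their magnitudes.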
Importantly, the shoestring evaluations were tacked onto the last stage of a full-scale evaluation. This was for a large antipoverty program in poor areas of rural China. The study was thus able to compare SDD1 and SDD2 to the “actual” DD, as estimated from high-quality, comprehensive and contemporaneous baseline and follow-up surveys. The implications for the structure of recall errors were also examined.
What do I find? Neither the “expensive” nor the “shoestring” double-difference estimates suggest that the poor-area development program had a significant long-term impact on living standards in poor areas of rural China (though the full-scale evaluation found short-term impacts during the disbursement stage, and longer-term impacts for certain sub-groups; full details can be found here).
But the fact that DD and SDD agreed on average impacts was not because the retrospective subjective assessments provided good proxies for the changes in consumption derived from high-quality contemporaneous surveys. Indeed, my analysis suggests that long-term subjective recall of the household’s overall standard of living contained only a weak and biased signal of changes in consumption. Controlling for the actual change in consumption, the recalled improvement in living standards tended to be higher for initially richer households. There were clear signs of telescoping in the recall responses, but the bulk of the benefits occurred in the earlier half of the recall period, which was given too little weight by respondents in treatment villages. Recall was clearly also affected by many idiosyncratic factors not accounted for by consumption.
Furthermore, I found signs that the shoestring method can be deceptive. By not being able to effectively address the problem of selection bias based on the unobserved factors that determined which villages got selected for the program, the method proved to be vulnerable to spurious impact signals. In this particular case, the SDD2 method suggested more positive impacts. The paper argues that the most likely reason is that the selection bias based on observables is working in the opposite direction to that based on unobserved factors. Thus, only reducing the former bias (by balancing the distribution of observables between treated and comparison units) makes matters worse.
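The arithmetic behind this argument can be sketched with made-up numbers. The snippet below assumes, purely for illustration, that the two selection biases add linearly to the true impact. The function name and all values are hypothetical and are not estimates from the paper.

```python
def estimated_impact(true_impact, bias_observables, bias_unobservables, balance_observables):
    """Return the estimate left after optionally removing the bias due to observables.

    Assumes (for illustration only) that the biases are additive.
    """
    remaining_bias = bias_unobservables
    if not balance_observables:
        remaining_bias += bias_observables
    return true_impact + remaining_bias


# Hypothetical values: no true impact, and two biases of opposite sign.
true_impact = 0.0
bias_observables = -0.2    # assumed bias from observed baseline differences
bias_unobservables = 0.3   # assumed bias from unobserved factors

unadjusted = estimated_impact(true_impact, bias_observables, bias_unobservables, False)
adjusted = estimated_impact(true_impact, bias_observables, bias_unobservables, True)

print(f"No adjustment for observables: {unadjusted:.2f}")
print(f"Observables balanced:          {adjusted:.2f}")
```

With these assumed values the two biases partly cancel when neither is corrected, so removing only the bias from observables moves the estimate further from the true impact.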
So (alas) this case study does not offer much encouragement on the reliability of this shoestring method. Of course, this is just one study, and (to my knowledge) the only one to date in the context of policy or program evaluation. Further tests are needed. Thankfully, the marginal cost of doing such tests in the context of a full-scale evaluation is not too high. Maybe we can learn more about shoestring methods and when they might be reliable.
If you would like to read more about the experiment in testing this shoestring method for the poor-area development program in China, see my paper, “Can we trust shoestring evaluations?” Policy Research Working Paper 5983, World Bank.