
Fundamentally unknowable? Can we learn whether our firm policies in Africa are working?

David McKenzie

Millions of dollars are spent each year trying to improve the productivity of firms in Africa (and in other developing countries), yet we have very little rigorous evidence as to what works. In a new working paper I ask whether it is even possible to learn whether such policies work, and what can be done to make progress.

Small number of firms + Large heterogeneity = Not much power

Firm census data reveal that once one looks at SMEs and large firms, the entire populations of interest in many African countries are relatively small: approximately 1000 to 2000 firms with 10 or more workers, and only 100 to 200 with 100 or more workers. Moreover, the World Bank Enterprise Surveys data reveal these firms to be very heterogeneous in terms of performance: the average cross-sectional coefficient of variation in firm sales is 3 to 4, i.e., the standard deviation of sales for these firms is typically three to four times the mean.

A typical World Bank project designed to improve the productivity of these firms involves between 100 and 500 firms over the course of five years. Table 1 shows the power of a randomized experiment with 300 treated and 300 control firms to detect different outcomes of interest, under the wildly optimistic assumption of 100% compliance with the treatment.

Typically one aims for 80 to 90% power. With a single follow-up survey, however, one would have only 12.9% power to detect a 20% increase in sales, and 53.2% power to detect a 50% increase in sales. Increases of 10-20% in sales are often the targets set in developing such projects. As such, our power to detect whether a World Bank firm project meets its targeted goals is very low when only a single follow-up survey is done after the project is implemented.
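To make the arithmetic behind these numbers concrete, here is a minimal sketch of a two-arm power calculation using a normal approximation. The coefficient of variation of 3 (the low end of the range quoted above), the 5% two-sided test size, and the function name `power_two_arm` are assumptions made for illustration; the post does not state the exact inputs behind Table 1. Under these assumptions the output comes out close to the 12.9% and 53.2% figures quoted above.

```python
from statistics import NormalDist


def power_two_arm(effect_frac, cv, n_per_arm, alpha=0.05):
    """Approximate two-sided power of a difference-in-means test.

    effect_frac: treatment effect as a fraction of mean sales (0.20 = 20%).
    cv:          coefficient of variation of sales (sd / mean). ASSUMED value.
    n_per_arm:   number of firms in each of the treatment and control groups.
    """
    z = NormalDist()
    # Standard error of the difference in means, in units of the mean.
    se = cv * (2.0 / n_per_arm) ** 0.5
    shift = effect_frac / se
    crit = z.inv_cdf(1 - alpha / 2)
    # Probability of rejecting in either tail.
    return z.cdf(shift - crit) + z.cdf(-shift - crit)


if __name__ == "__main__":
    for effect in (0.20, 0.50):
        p = power_two_arm(effect, cv=3.0, n_per_arm=300)
        print(f"{effect:.0%} increase in sales: power = {p:.1%}")
```

Because the standard error scales with the coefficient of variation, raising `cv` to 4 pushes power lower still.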

Taking more rounds of surveys can help a bit – although how much depends on the autocorrelation of the outcome of interest (denoted ρ in the table). Typically for something like firm sales this autocorrelation is 0.5 or lower, in which case even a baseline and 4 rounds of follow-up surveys still only gives us 26% power for detecting a 20% sales increase.
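The effect of extra survey rounds can be sketched in the same way. The variance factor below is a commonly used expression for an ANCOVA estimator with `m` pre-treatment and `r` post-treatment rounds when every pair of rounds has the same autocorrelation ρ; it is an assumption here, not a formula taken from this post, as are the coefficient of variation of 3 and the function names. With one baseline, four follow-ups, and ρ = 0.5, it gives power in the mid-20s percent for a 20% sales increase, close to the figure quoted above.

```python
from statistics import NormalDist


def ancova_variance_factor(m, r, rho):
    """Variance of the ANCOVA estimate relative to a single follow-up
    difference in means, with m baseline and r follow-up rounds.
    ASSUMED formula: equal autocorrelation rho between all rounds."""
    follow_up_term = (1 + (r - 1) * rho) / r
    baseline_gain = (m * rho ** 2) / (1 + (m - 1) * rho)
    return follow_up_term - baseline_gain


def power_multi_round(effect_frac, cv, n_per_arm, m, r, rho, alpha=0.05):
    """Approximate two-sided power with repeated survey rounds."""
    z = NormalDist()
    factor = ancova_variance_factor(m, r, rho)
    se = cv * (2.0 / n_per_arm * factor) ** 0.5
    shift = effect_frac / se
    crit = z.inv_cdf(1 - alpha / 2)
    return z.cdf(shift - crit) + z.cdf(-shift - crit)


if __name__ == "__main__":
    # (baseline rounds, follow-up rounds); cv = 3 is an assumed value.
    for m, r in [(0, 1), (1, 1), (1, 2), (1, 4)]:
        p = power_multi_round(0.20, cv=3.0, n_per_arm=300, m=m, r=r, rho=0.5)
        print(f"baseline rounds={m}, follow-up rounds={r}: power = {p:.1%}")
```

The gain from each additional round shrinks as ρ rises, since highly autocorrelated rounds add little independent information.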

What can be done?

The paper discusses several implications of these facts for designing firm experiments, and for what one can learn. The main ideas are:

· Focus on a smaller number of more homogeneous firms: there is little to gain and much to lose in terms of power by trying to estimate an average effect in a sample that includes both lots of medium-sized firms and a couple of really large ones. Throw out the large firms.

· Collect a lot more data on these firms: another recent paper of mine focuses on the role of repeated measurement in improving power. The gain in power from measuring the same noisy outcome at frequent intervals and averaging out noise can be large when outcomes are not that highly autocorrelated. With really large firms it may even be possible to get daily or weekly production data, as we do in a recent experiment on textile firms in India.
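The case for a more homogeneous sample can be illustrated by inverting the power calculation: how many firms per arm would be needed for 80% power to detect a 20% sales increase at different levels of heterogeneity? This is a rough sketch using a normal approximation with a single follow-up. The function name and the lower coefficients of variation (2, 1, and 0.5) are hypothetical values chosen for illustration, not estimates from the paper.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm_required(effect_frac, cv, power=0.80, alpha=0.05):
    """Firms needed in EACH arm for the given two-sided power,
    with a single follow-up survey and full compliance."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) * cv / effect_frac) ** 2)


if __name__ == "__main__":
    # cv = 3 matches the low end of the range in the text; the rest are
    # hypothetical values for a trimmed, more homogeneous sample.
    for cv in (3.0, 2.0, 1.0, 0.5):
        n = n_per_arm_required(0.20, cv)
        print(f"CV = {cv}: about {n} firms per arm")
```

Required sample size grows with the square of the coefficient of variation, so at a CV of 3 the requirement per arm exceeds the entire population of 1000 to 2000 firms with 10 or more workers, while halving the CV cuts the requirement by a factor of four.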

What does this mean for our efforts to learn about firm policies? The overall impact of some of our policies will be fundamentally unknowable without imposing additional structural assumptions. However, by collecting a lot more data and focusing on the largest subset of firms that is reasonably homogeneous, experimental or non-experimental impact evaluations should have enough power to detect average impacts of the program for a large group of firms that matters.

This paper was the basis for my contribution to an interesting session last month at the Oxford CSAE conference entitled “Experiments or Structural Methods (or neither or both)?” Video presentations are up here.

Comments

First, I cannot access the linked working paper. Second, I wonder to what degree a different test statistic could improve power. At a minimum, a one-sided test could improve power versus an otherwise comparable two-sided version. We could also ask if looking at a difference of means is always the most efficient statistic. Would a rank sum or a statistic that considered the variances of the treatment and control groups give us better power against the type of effects we expect from our treatments? Rosenbaum (2010, ch. 14): "The t-test has slightly better power than the Wilcoxon test for data from a Normal distribution, but substantially inferior power for distributions with longer tails, which is a basic reason that the Wilcoxon test is preferred; its power is robust." Rosenbaum is a strong advocate of distribution-free statistics for a variety of reasons, but even if rank-based stats aren't your thing, the point remains: the choice of test statistic can make a big difference if we entertain the possibility that the data are not Normal.
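The commenter's point about long tails can be explored with a small Monte Carlo sketch. Everything about the data-generating process here is an assumption made for illustration: sales are drawn from a lognormal distribution with a coefficient of variation of 3, and treatment multiplies every firm's sales by 1.2. Both tests use large-sample normal approximations, and the function names are hypothetical. Note that the rank-sum test addresses a different null hypothesis (identical distributions) than a test of equal means, so the comparison is only meaningful under a shift of this kind.

```python
import math
import random
from statistics import NormalDist


def mean_and_var(values):
    """Sample mean and unbiased sample variance."""
    n = len(values)
    mu = sum(values) / n
    var = sum((v - mu) ** 2 for v in values) / (n - 1)
    return mu, var


def diff_in_means_rejects(x, y, crit):
    """Unequal-variance difference-in-means test (normal approximation)."""
    mx, vx = mean_and_var(x)
    my, vy = mean_and_var(y)
    se = (vx / len(x) + vy / len(y)) ** 0.5
    return abs(mx - my) / se > crit


def rank_sum_rejects(x, y, crit):
    """Wilcoxon rank-sum test via the normal approximation.
    Ties are ignored, which is harmless for continuous draws."""
    n, m = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    w = sum(rank for rank, (_, grp) in enumerate(pooled, start=1) if grp == 0)
    expected = n * (n + m + 1) / 2
    sd = (n * m * (n + m + 1) / 12) ** 0.5
    return abs(w - expected) / sd > crit


def simulate_power(effect_frac=0.20, cv=3.0, n_per_arm=300,
                   reps=400, alpha=0.05, seed=0):
    """Share of simulated experiments in which each test rejects."""
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Lognormal shape parameter implied by the ASSUMED coefficient of variation.
    sigma = math.sqrt(math.log(1 + cv ** 2))
    t_hits = w_hits = 0
    for _ in range(reps):
        control = [rng.lognormvariate(0.0, sigma) for _ in range(n_per_arm)]
        # ASSUMED multiplicative treatment effect on every firm.
        treated = [(1 + effect_frac) * rng.lognormvariate(0.0, sigma)
                   for _ in range(n_per_arm)]
        t_hits += diff_in_means_rejects(treated, control, crit)
        w_hits += rank_sum_rejects(treated, control, crit)
    return t_hits / reps, w_hits / reps


if __name__ == "__main__":
    t_power, w_power = simulate_power()
    print(f"difference in means: {t_power:.1%}, rank sum: {w_power:.1%}")
```

Under these assumptions the rank-sum test should reject noticeably more often than the difference in means, though the size of the gap depends heavily on the assumed distribution and on the treatment effect being a uniform proportional shift.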