Empirical evidence on the effectiveness of productivity incentives in the public sector is sparse. However, donor enthusiasm for this general approach is growing, and certain lessons are emerging. In the spirit of the “learning from failure” movement now emphasized in the World Bank, two recent studies present lessons from relatively unsuccessful public sector incentive pilots. These lessons, while they may appear commonsensical, are often overlooked in the frenzied design and implementation stage of a new project.
The first study takes place in the UK, which, starting in the late 1990s, piloted the use of financial incentives for public sector workers in numerous contexts. One such effort introduced team-based incentives in the large UK public agency tasked with finding jobs for the unemployed.
Several output indicators were chosen on which to base incentives, but only one was an easily quantified measure of output – the number of job placements in a given period. This target was precisely measured for each office, and each office was then rewarded according to its achievement.
An additional three targets attempted to capture various aspects of service quality. These indicators were assessed less regularly and, since their measurement entailed the use of “mystery shoppers” and provider questionnaires, they were presumably measured with less accuracy. Even more important for interpreting the results, the quality metrics were estimated only at the district level, so the entire district received one incentive payment for the aggregate quality indicators, shared equally across all offices.
From a theoretical standpoint, rewarding workers on the basis of aggregate team performance rather than individual production, even though the individual worker largely determines her own output, weakens the incentive. If individual contributions to office output are not separately observable and/or rewarded, a free-rider problem arises, especially in larger offices, where any individual contribution is a smaller share of total output. The same logic holds for the number of offices in a district: an office in a large district has a lower expected return to effort, since it receives the same bonus value as an office in a small district but has a smaller influence on whether the district meets its target.
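To make the dilution logic concrete, here is a minimal sketch of a stylized model. It is my own illustration, not the model used in either study. The assumptions: each worker's effort raises team output one-for-one, the team bonus is a fixed rate per unit of output split equally among team members, and effort has a quadratic cost. The names `bonus_rate`, `team_size`, and `effort_cost` are hypothetical. The pilot paid on hitting targets rather than on a linear share, so treat this only as an illustration of why the private return to effort shrinks with team size.

```python
def optimal_effort(bonus_rate: float, team_size: int, effort_cost: float = 1.0) -> float:
    """Privately optimal effort under an equally shared team bonus.

    Stylized assumption: a worker choosing effort e keeps a 1/team_size
    share of the bonus her effort generates and pays a quadratic cost,
    so she maximizes (bonus_rate / team_size) * e - effort_cost * e**2 / 2.
    Setting the derivative to zero gives e = bonus_rate / (team_size * effort_cost).
    """
    if team_size < 1:
        raise ValueError("team_size must be at least 1")
    if effort_cost <= 0:
        raise ValueError("effort_cost must be positive")
    return bonus_rate / (team_size * effort_cost)


if __name__ == "__main__":
    # The same bonus rate buys less effort per worker as the team grows.
    for n in (1, 5, 20, 100):
        print(f"team size {n:>3}: effort per worker = {optimal_effort(1.0, n):.3f}")
```

Under these assumptions, effort per worker falls in proportion to team size, which is the pattern the free-rider argument predicts for large offices and large districts.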
So the free-rider challenge is likely “baked in” to this incentive design. Another key interpretive aspect relates to the degree of measurement precision in the incentivized indicator. As Holmstrom and Milgrom worked out over 20 years ago, the power of the incentive is lower when output is measured with error. Where workers choose how to allocate effort across incentivized domains, the measurement accuracy of the target variables will influence how much effort they devote to each target. Workers will optimally allocate more effort to those indicators that more accurately reflect their achievements.
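The measurement-error point can be sketched with the textbook single-task linear-contract model in the Holmstrom-Milgrom tradition. The assumptions here are mine, chosen for illustration: measured output equals effort plus normally distributed noise, the worker has constant absolute risk aversion and a quadratic effort cost, and pay is linear in measured output. Under those assumptions the optimal share of measured output paid to the worker is 1 / (1 + risk_aversion * effort_cost * noise_variance). The function and parameter names are hypothetical.

```python
def optimal_incentive_slope(noise_variance: float,
                            risk_aversion: float = 1.0,
                            effort_cost: float = 1.0) -> float:
    """Optimal piece rate in the textbook linear-contract model.

    Assumes measured output = effort + noise, a worker with constant
    absolute risk aversion, and effort cost effort_cost * e**2 / 2.
    Noisier measurement (larger noise_variance) lowers the optimal slope.
    """
    if noise_variance < 0:
        raise ValueError("noise_variance must be non-negative")
    return 1.0 / (1.0 + risk_aversion * effort_cost * noise_variance)


def induced_effort(slope: float, effort_cost: float = 1.0) -> float:
    """Effort the worker chooses when paid `slope` per unit of measured output."""
    return slope / effort_cost


if __name__ == "__main__":
    # Compare a precisely measured indicator with increasingly noisy ones.
    for variance in (0.0, 1.0, 3.0, 9.0):
        slope = optimal_incentive_slope(variance)
        print(f"noise variance {variance:>3}: slope = {slope:.2f}, "
              f"effort = {induced_effort(slope):.2f}")
```

In this sketch, a precisely measured indicator (such as job placements) supports a strong incentive, while a noisy one (such as the quality metrics) supports only a weak one and so draws less effort.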
Ok, so given this background, Burgess, Propper, Ratto, and Tominey attempt to assess the effects of the UK public sector pilot reform. The pilot ran for one year, from April 2002 to March 2003, with very little pre-announcement. As we might expect, this lack of pre-announcement made it impossible to field a baseline survey. Further challenging the evaluation, the reform was implemented in purposively selected districts (17 in total) that were chosen to reflect “a cross-section of different communities and customer-bases”. The authors argue that in practice this boiled down to a stratified random selection. Nevertheless, they adopt a matching model at the facility level to better balance selected covariates – most importantly, since the main outcome is job placement, they attempt to ensure balance in local labor market conditions across treatment and control areas.
Burgess et al. find significant heterogeneity in the office response to the new incentives. An increase in the quantitative target was observed only in offices with relatively few staff, indicating strong free-rider effects in the larger offices. The incentive program resulted in an additional 230 unemployed people placed in jobs for each of the smaller offices – but this effect falls to zero as the size of the office increases.
While the quantity target galvanized behavior (at least in the smaller offices), the incentive scheme had no effect on any of the quality measures. This lack of impact holds even among the smallest study districts (in terms of number of offices), suggesting that measurement imprecision, as well as the greater dilution effect of paying incentives at the district level, played a key role in this failure.
The same message on measurement accuracy is found in the second recent study, from Uganda. This study takes a qualitative look at a health sector incentive pilot that previous work determined had no effect on the incentivized health indicators.
The qualitative study, by Ssengooba, McPake, and Palmer, ascribes part of the failure to the low accuracy of indicator measurement. It turns out that the data extraction process at the facilities was more time-consuming than expected, and the additional resources and time spent by the data teams at each facility apparently led facility directors to doubt the accuracy of the data. This loss of faith in the accuracy of outcome measurement, as argued by the authors on the basis of their qualitative interviews, reduced the effort the directors expended towards indicator achievement.
Two other implementation problems contributed to the no-impact finding:
1. The program allowed facility directors to choose their preferred incentivized indicators from a broader indicator list. However, directors were given only a short period of time to select their targets, yet their choices were locked in for the entire two-year study period.
2. The pilot activities stalled for an entire year soon after the facility directors signed the facility performance contracts. So facility directors had to be reminded of the program after one year, few had made preparations to take advantage of the incentive program, and some directors had moved on, leaving the new directors uninformed of the performance contract. The researchers conclude that the memory loss and lack of purposeful action to achieve pilot targets are key reasons for failure.
As I said, these design problems are fairly straightforward and would be predicted by incentive theory, but these considerations can get lost in the design phase, especially if the process is rushed. Clearly, incentives for performance should be based on accurate indicators that are clearly ascribable to the actions of the targeted individual or (small) group.