
Sifting through data to detect deliberate misreporting in pay-for-performance schemes

By Jed Friedman

As empiricists, we spend a lot of time worrying about the accuracy of economic and socio-behavioral measurement. We want our data to reflect the targeted underlying truth. Unfortunately, misreporting by study subjects, whether accidental or deliberate, is a constant risk. The deliberate kind is much more difficult to deal with because it is driven by complicated and unobserved respondent intentions – either to hide sensitive information or to please the perceived preferences of the interviewer. Respondents who misreport information for their own benefit are said to be “gaming”, and the challenge of gaming extends beyond research activities to development programs whose success depends on the accuracy of self-reported information.

Examples of such programs include pay-for-performance schemes that reward agents on the basis of self-reported indicators. The classic way to assess and deter this misreporting is through audits. Previously I’ve reviewed studies that assess the accuracy of reporting either with or without the risk of third-party audit. It appears that an unpredictable audit likelihood or fuzzy audit trigger thresholds are more effective than standard auditing approaches.

Without the possibility of audits, it is far more difficult to detect gaming behavior. However, through creative analysis of the data, patterns of respondent gaming can still sometimes be identified. Take this 2010 paper by Gravelle, Sutton, and Ma that assesses the reporting of British doctors under the comprehensive pay-for-performance reforms of the UK’s National Health Service. These massive reforms, begun in 2004, reimburse General Practitioners according to 146 self-reported quality indicators. The introduction of pay-for-performance increased average GP income by 25% and, perhaps, also increased the quality of care along select dimensions.

The pay-for-performance tool, known as the Quality and Outcomes Framework (QOF), determines the bonus payment through a bevy of ratio indicators reported by the clinic. One such indicator is the ratio of heart disease patients whose blood pressure has recently been measured at less than 150/90 to all heart disease patients registered with the practice. As the clinical quality indicators are ratios, the incentives aim to nudge practitioners to increase the numerator. However, the QOF also gives doctors some leeway over the denominator: they can declare certain patients ineligible and except them from the calculation. Ineligibility is largely a doctor’s judgment call, based on grounds such as the patient having refused treatment or failed to attend scheduled appointments. Because of this leeway, practitioners can increase their payment by excepting patients who really shouldn’t be excepted.
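To make the mechanics concrete, here is a minimal sketch in Python of how excepting patients from the denominator raises a ratio indicator. The function name and all numbers are hypothetical illustrations; the QOF's actual point calculation is more involved.

```python
# A QOF-style ratio indicator is (patients meeting the clinical target) /
# (registered patients minus those "excepted" by the practice).
# Hypothetical sketch: names and figures are illustrative, not from the paper.

def indicator_ratio(num_meeting_target, num_registered, num_excepted):
    """Achievement ratio for one clinical indicator."""
    eligible = num_registered - num_excepted
    return num_meeting_target / eligible

# A practice with 60 of 100 heart-disease patients at target:
base = indicator_ratio(60, 100, 0)    # 0.60

# Excepting 10 patients who did not meet the target lifts the ratio
# to 60/90 without any change in the care actually delivered:
gamed = indicator_ratio(60, 100, 10)  # ~0.667
```

The point of the sketch: every non-target patient removed from the denominator mechanically raises measured achievement, which is exactly the margin the authors worry about.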

How exactly do the authors investigate gaming? Gravelle, Sutton, and Ma propose two tests for gaming behavior with respect to exception rates:

1. Do characteristics of the health practice correlate with exception rates?
The authors posit that, conditional on patient characteristics, exception reporting should have no relation to measurable characteristics of clinics such as the number of doctors working there or the number of local competitors each clinic faces (assessed by the Herfindahl index). While this assumption may be too strong for some tastes (including mine), the data contain a wealth of patient information including demographics, morbidities, ethnic and religious background, poverty status, etc. – after working for years in low-income countries, I find the data-rich environment of the NHS incredible!

Controlling for all of these patient and community characteristics, the authors find exception rates to be significantly correlated with the number of doctors per patient (the more doctors in the practice, the higher the exception rate) and with the Herfindahl index (the less market competition, the higher the exception rate). The latter finding is consistent with the authors’ speculation that practitioners facing less competition have less need to provide quality to attract patients and thus derive a higher reward from gaming their reports instead.
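For readers who want to try the spirit of this first test on their own data, a bare-bones version can be sketched as a correlation between exception rates and a single practice characteristic. This is only illustrative: the paper runs a full multivariate regression with patient controls, and every number below is made up.

```python
# Sketch of test 1: do exception rates move with a practice characteristic
# such as the Herfindahl index? All data here are hypothetical placeholders.

def pearson_corr(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Hypothetical practices: higher Herfindahl index (less competition)
# paired with higher exception rates, mimicking the paper's finding.
hhi = [0.10, 0.25, 0.40, 0.60, 0.80]
exceptions = [0.05, 0.06, 0.07, 0.08, 0.09]

corr = pearson_corr(hhi, exceptions)  # strongly positive in this toy data
```

A raw correlation like this is of course only a starting point; the whole force of the authors' test comes from conditioning on the rich set of patient characteristics first.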

2. Does exception reporting respond to non-linearities in the payment schedule?

To calculate the payment tied to a particular indicator, the QOF scales the ratio according to indicator-specific thresholds. If the indicator falls below a low threshold (say 25%), no payment is given. Payment is maximized at a high threshold (say 80%), and an indicator score above that does not bring additional funds. In between these thresholds, payment rises linearly with increases in health care quality. This approach creates a non-linear compensation function that practitioners can quickly learn to navigate. For example, if a clinic does not reach the upper threshold in the first year of the QOF, the practitioners should realize that one way to increase payment in subsequent years is to except more patients from the denominator. Clinics that have already reached the upper threshold, and believe they can stay there, have less incentive to game in subsequent years.
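The payment schedule just described can be written as a simple piecewise-linear function. The 25% and 80% thresholds are the illustrative values from the text; the maximum payment is a hypothetical scale, not the actual QOF point value.

```python
# Sketch of the threshold-based payment rule: zero below the lower
# threshold, capped above the upper one, linear in between.
# Threshold values follow the text's example; max_payment is hypothetical.

def qof_payment(ratio, lower=0.25, upper=0.80, max_payment=1000.0):
    """Piecewise-linear payment as a function of the indicator ratio."""
    if ratio <= lower:
        return 0.0
    if ratio >= upper:
        return max_payment
    # Linear interpolation between the two thresholds.
    return max_payment * (ratio - lower) / (upper - lower)
```

The kinks are what matter for incentives: a practice already above the cap (say at 85%) earns nothing extra from further improvement, while a practice between the thresholds gains from every patient moved into the numerator or excepted from the denominator.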

The authors do indeed observe differential reporting behavior by initial level of quality. They estimate the 2005 exception rate at 7.2% for practices that exceeded the upper threshold in 2004; in contrast, the 2005 exception rate for practices that were under the threshold in 2004 is 8.6%. The difference may be slight, but it is noticeable and precisely estimated.

Taken together, it appears that UK doctors did deliberately misapply the exception criteria, although the magnitude of such gaming is relatively slight.

What are the take-away lessons for developing country programs? While the wealth of data analyzed here will likely not be reflected in low-resource settings, the basic approach can still be of value for discerning possible gaming behavior.

It’s not clear whether population health outcomes should legitimately be related to facility or provider characteristics after controlling for broad population characteristics. The level of comfort with this assumption will depend on the setting and the detail of the available data. However, exploring how population indicators vary with local service characteristics can be a useful approach, either to highlight correlates of unobserved quality or to flag provider characteristics that may be linked to gaming. These findings can then mark service providers with certain characteristics for further investigation of their needs and, possibly, a greater likelihood of audit.

If the pay-for-performance scheme exhibits non-linearities in how performance measures are converted into funding (and non-linearities through threshold effects are not uncommon), then reporting behavior on either side of the “kinks” in the payment function can be contrasted to get a handle on the extent of strategic behavior. Results from this analysis may also flag service providers reporting certain levels of achievement for a greater likelihood of audit.
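As a sketch of this kink contrast, one could group providers by whether their initial achievement was below the payment cap and compare subsequent exception rates, in the spirit of the 7.2% versus 8.6% comparison above. The records below are hypothetical placeholders, not real program data.

```python
# Sketch of test 2: contrast mean exception rates for providers that
# started below vs. above the upper payment threshold ("the kink").
# All records here are hypothetical illustrations.

def mean(xs):
    return sum(xs) / len(xs)

def kink_contrast(records, upper=0.80):
    """records: list of (year-1 achievement ratio, year-2 exception rate).

    Returns mean exception rate of below-cap providers minus that of
    above-cap providers; a positive value is consistent with gaming.
    """
    below = [e for r, e in records if r < upper]
    above = [e for r, e in records if r >= upper]
    return mean(below) - mean(above)

# Providers below the cap in year 1 have room to gain from excepting
# patients in year 2, so we would expect a positive contrast.
sample = [(0.70, 0.086), (0.75, 0.090), (0.85, 0.072), (0.90, 0.070)]

contrast = kink_contrast(sample)  # 0.088 - 0.071 = 0.017 in this toy data
```

In a real application the contrast would need the same caveats as the paper's version: controls for provider and population characteristics, and a precision estimate, before any provider is flagged for audit.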