Published on Data Blog

Data quality in research: what if we’re watering the garden while the house is on fire?

Michael M. Lokshin

July 05, 2018

A colleague stopped me by the elevators while I was leaving the office.

“Do you know of any paper on (some complicated adjustment) of standard errors?”

I tried to remember, but nothing came to mind – “No, why do you need it?”

“A reviewer is asking for a correction.”

I mechanically took off my glasses and started to rub my eyes – “But it will make no difference. And even if it does, wouldn’t it be trivial compared to the other errors in your data?”

“Yes, I know. But I can’t control those other errors, so I’m doing the best I can, where I can.”

This happens again and again. How many times have I been in his shoes? In my previous life as an applied micro-economist, I happily delegated control of data quality to “survey professionals” (national statistical offices or international organizations involved in data collection), without much interest in the nitty-gritty details of how those data were collected. It was only after I got directly involved in survey work that I realized the extent to which data quality is affected by myriad extrinsic factors, from the technical (survey standards, protocols, methodology) to the practical (a surprise rainstorm, buggy software, broken equipment) to the contextual (the credentials and incentives of the interviewers, proper training and piloting), and a universe of other factors that are obvious to data producers but usually hidden from data users.

Many data problems are difficult to detect

Figure 1: The share of households responding positively to three categories of questions declined with the progression of the survey. For example, the share of households that experienced any accidents over the last year dropped by half by the fifth month of the survey being in the field.

An analysis of a recent survey generated some interesting findings: the proportion of households reporting chronic diseases declined by more than 30 percent over the course of the fieldwork (Figure 1). With few exceptions, the characteristics of households interviewed in the beginning of a survey should not differ from those interviewed towards the conclusion of the survey period. Were I a skeptical person, I might wonder whether the interviewers gradually came to understand that positive answers to specific questions generated a lot more work to do (a range of follow-up questions about history of illness, treatments, medical expenses, and more). Accordingly, these interviewers might have learned that nudging respondents to give a different answer might minimize their workload (“Come on, my blood pressure is twice as high as yours and I’m fine!”) or else simply mis-record respondents’ replies (“no” instead of “yes”) to free up the rest of the day. This may be a caricature, but errors like these are difficult to catch even if you check data in real time and apply sophisticated validation algorithms.
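As a rough illustration of the kind of check behind Figure 1, the sketch below computes the share of positive answers by month of fieldwork and the relative change between the first and last month. It is a minimal sketch, not the analysis used for the survey discussed here: the record layout (month of fieldwork, yes/no answer), the function names, and the toy numbers are all assumptions made for illustration.

```python
from collections import defaultdict


def positive_share_by_month(records):
    """Share of 'yes' answers per fieldwork month.

    `records` is an iterable of (month, answered_yes) pairs; this layout
    is a hypothetical one chosen for the example.
    """
    yes = defaultdict(int)
    total = defaultdict(int)
    for month, answered_yes in records:
        total[month] += 1
        if answered_yes:
            yes[month] += 1
    return {month: yes[month] / total[month] for month in sorted(total)}


def relative_change(shares):
    """Relative change in the share between the first and last month."""
    months = sorted(shares)
    first, last = shares[months[0]], shares[months[-1]]
    if first == 0:
        raise ValueError("share in the first month is zero")
    return (last - first) / first


# Toy data (invented): 100 interviews in month 1 and 100 in month 5.
records = (
    [(1, True)] * 40 + [(1, False)] * 60
    + [(5, True)] * 20 + [(5, False)] * 80
)
shares = positive_share_by_month(records)
change = relative_change(shares)
print(shares, change)
```

A steady decline of this kind does not prove interviewer shirking, since seasonality or the order in which areas were visited can produce the same pattern. Breaking the same shares down by interviewer is one natural next step.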

There are many similar situations in which the quality of data is seriously compromised by substandard interviewer training, insufficient supervision, and interviewers shirking their responsibilities. For obvious reasons, many (if not most) data producers are unwilling to reveal information about problems in the field to the agencies that fund surveys and to the people working with the data.

Poor data can be amplified into bad policy

This problem is well recognized by the leading statistical institutions in developed countries, which use sophisticated econometric techniques to better understand how data collection progresses through the enumeration cycle, to identify strategic opportunities, to evaluate new collection initiatives, and to improve how they conduct and manage their surveys (e.g., Groves and Heeringa, 2006).

Unfortunately, developing country statistical offices lack the resources to establish such practices. As such, survey data quality can become suspect. Imagine being the researcher analyzing a relationship between the presence of chronic disease and poverty. The economic model is complex, with a highly non-linear econometric specification that relies on an instrumental variable approach to address the issue of reverse causation. Even small errors in the data could lead to large divergences in the estimation results. Unless a researcher is directly involved in the fieldwork, they might never realize the magnitude of the problem with the data. They might write a paper based on this incorrect data, which might then generate a report with policy recommendations, which may next justify a large investment in that country’s health care system to implement the reform — a cascade of causation on the basis of faulty data.

It is hard to say how damaging this situation is for a particular program. The effectiveness of economic policies is a result of complex interactions of many factors, of which empirical justification might not be the most important one. A positive result bias of economic publications, common sense, political interests, and bureaucracy will likely dampen the negative effects of incorrect conclusions. However, the impact of systematic errors in multiple surveys could be extremely serious and lead to the adoption of concepts which might be otherwise difficult to refute.

Data quality is unglamorous, but economists need to take it seriously

Recent efforts to improve the replicability of economic research (e.g., Maniadis and Tufano, 2017) focus on empirical methodology and algorithms, and so miss the potential errors coming from the data. Trying to replicate the results of economic analysis using new data could be expensive and problematic because of intertemporal consistency problems (e.g., Siminski et al., 2003). Papers based on erroneous data could still be published in good journals, because journal referees usually have little means to validate the quality of micro-data coming from developing countries.

Survey work is, unfortunately, not always considered a “brainy” activity. All too often, it is delegated to less experienced staff. For example, among almost 400 World Bank staff registered as users of the Survey Solutions data collection platform, 87% are the institution’s most junior staff or consultants. We see a similar demographic among the attendees of the seminars and workshops that are focused on survey design and logistics of the fieldwork.

If quality data is indeed important for policymaking, the status quo (and the attitudes which inform it) must change. The economics profession must acknowledge and own the responsibility to provide informed advice to practitioners in developing countries and propose better mechanisms for data quality validation and replication of results. Nothing less than a serious appraisal of these hidden data quality issues, and clear actions to counteract them, is needed to end the potentially pervasive problem of bad data and to mitigate its consequences.

The good news is that data quality can absolutely be improved through the right combination of resources and human capital, including mainstreaming proper survey supervision, randomly repeated interviews, the use of advanced technologies to monitor interviews in the field, and, more broadly, efforts to strengthen statistical capacity holistically in developing countries (Statistics Canada 2008). Understanding the problem is the first, key step.

Authors

Michael M. Lokshin

Lead Economist, Office of the Regional Chief Economist, Europe and Central Asia
