Few will argue against the idea that data is essential for the design of effective policies. Every international development organization emphasizes the importance of data for development. Nevertheless, raising funds for data-related activities remains a major challenge for development practitioners, particularly for research on techniques for data collection and the development of methodologies to produce quality data.
If we focus on the many challenges of raising funds for microdata collected through surveys, three reasons stand out in particular: the spectrum of difficulties associated with data quality; the problem of quantifying the value of data; and the (un-fun) reality that data is an intermediate input.
First things first – survey data quality is hard to define and even harder to measure. Every survey collects new information; it’s often prohibitively expensive to validate this information and so it’s rarely done. The quality of survey data is most often evaluated based on how closely the survey protocol was followed.
The concept of Total Survey Error sets out a universe of factors which condition the likelihood of survey errors (Weisberg 2005). These conditioning factors include, among many other things: how well the interviewers are trained; whether the questionnaire was tested and piloted and to what degree; whether the interviewers’ individual profiles could affect the respondents’ answers, etc. Measuring some of these indicators precisely is effectively impossible, since most of them are subjective by nature. It may be even harder to separate the individual effects of these components on the total survey error.
Imagine you are approached with a proposal to conduct a cognitive analysis of your questionnaire. Take the question: "How often were you bothered by the pain in the stomach over the last year?" A cognitive psychologist will tell you that this is a badly formulated question: the definition of "stomach" varies drastically among respondents; "last year" could be interpreted as the last calendar year, the 12 months back from now, or the period from January 1st until now; and even "bothered" is ambiguous. As one respondent said: "It hurt like hell, but it did not bother me, I am a Marine..." (from a seminar by Gordon Willis).
In a cognitive analysis, a team of psychologists and linguists analyzes the questions to be asked during interviews to determine how well those questions are understood by a typical respondent, and corrects the questions that focus-group respondents find difficult. This exercise can be expensive and time-consuming. And how much would such an analysis be worth? This is different from willingness to pay; the question is: what’s the intrinsic value of the analysis itself? $30,000? $300,000? It’s hard to decide because you do not know whether or how such an analysis will affect the quality of your data.
And what if I told you that the Total Survey Error could be reduced by 15% were you to make several adjustments to the survey instrument? Is this improvement in precision worth the money? To answer these questions, you need to understand the value of your data.
The research community has long been struggling to offer reliable methods to quantify the value of data. Many approaches have been proposed and no consensus yet exists on the best metric. One thing everyone agrees on is that it’s difficult and that more investment is needed to crack this nut (Slotin 2017). Most research focuses on the estimation of aggregate indicators, such as the value of data for a country or an industry, and very few studies try to assess returns on investment for a particular survey. The problem might be especially acute for large, multi-topic surveys that contain several hundred questions about various aspects of household well-being. The results of these surveys might be used by many agencies and researchers to design a wide range of policy interventions. These policies rely on quantitative analysis to different degrees, so the importance of quality data will vary widely from one policy to another.
Data as an intermediate input
Okay, now suppose that your survey has the well-defined, narrow focus of identifying people with abdominal pain to assist them in getting treatment. At first glance, this case looks trivial. You calculate the program’s leakage and undercoverage caused by survey error and compare these losses with the cost of achieving higher precision.
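To make that calculation concrete, here is a minimal sketch in Python. It assumes a common targeting-analysis convention that the post itself does not spell out: leakage is the share of people selected by the survey who do not actually have the condition, and undercoverage is the share of people with the condition whom the survey misses. The function name `targeting_errors` and the sample data are hypothetical, for illustration only.

```python
def targeting_errors(eligible, selected):
    """Return (leakage, undercoverage) for two parallel lists of booleans.

    eligible[i] -- person i truly has the condition (assumed known here)
    selected[i] -- the survey flagged person i for the program

    Assumed definitions (not taken from the post):
      leakage       = wrongly selected / all selected
      undercoverage = wrongly missed   / all eligible
    """
    if len(eligible) != len(selected):
        raise ValueError("eligible and selected must have the same length")

    pairs = list(zip(eligible, selected))
    n_selected = sum(1 for _, s in pairs if s)
    n_eligible = sum(1 for e, _ in pairs if e)
    wrongly_selected = sum(1 for e, s in pairs if s and not e)
    wrongly_missed = sum(1 for e, s in pairs if e and not s)

    # Guard against division by zero when nobody is selected or eligible.
    leakage = wrongly_selected / n_selected if n_selected else 0.0
    undercoverage = wrongly_missed / n_eligible if n_eligible else 0.0
    return leakage, undercoverage


# Hypothetical data: five people, three truly in pain, three flagged by the survey.
truly_in_pain = [True, True, True, False, False]
flagged_by_survey = [True, True, False, True, False]
leakage, undercoverage = targeting_errors(truly_in_pain, flagged_by_survey)
```

Running the same function on data from a more precise survey instrument would give a second pair of error rates; the difference between the two, expressed in program money, is what would be weighed against the cost of the extra precision.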
However, the outcome of any program is the result of complex interactions among multiple components. The effectiveness of the program depends not only on how well we can identify the recipients, but also on the efficiency of the distribution mechanism, the qualifications of the people administering the program, the willingness of the recipients to accept the treatment, political support for the program, and a range of other factors. A priori, it is difficult to say whether you should invest more in collecting better data or, for example, in improving the administrative system. In other words, data itself is not an output; it is an input into the production of outputs that organizations or people are interested in and are willing to pay for.
A product everyone agrees is crucial but few are willing to finance
So here we are: the value and quality of survey data are hard to measure and even the highest quality survey data are an intermediate input into the production of downstream outputs which often only indirectly depend on the data itself. To all of this we should add the ubiquitous free rider problem associated with the generation of data as a global public good, and we find ourselves in a weird place: a product everyone agrees is crucial, but few are willing to finance.
Development data (done right) is expensive. The need for broader and more consistent investment in development data is significant (and growing) and long-term returns on investments in data could be large. However, with national governments in low and middle-income countries constantly weighing crucial development priorities (in which data needs are often underfunded or entirely crowded out), with virtually zero private sector interest in investing in development data, and with the current reality in which a handful of international organizations are essentially subsidizing the planet’s development data needs, the outlook for development data may not be great.
It may seem natural to conclude that the current handful of international development organizations should continue to play a leading role in supporting research and methodological improvements in data production and in sponsoring large data collection programs in poorer countries. However, the share of Official Development Assistance spent on data and statistics was only 0.3% in 2015 (PARIS21, 2017). It remains to be seen whether this level of financing reveals the actual demand for microdata, or whether it indicates the scale of the collective action problem we face.
Looking for an answer might first require a clearer definition of "development data", one tied to the quality issue by pointing to the needs of its users (international development organisations?), for whom it is so important.
Chronically under-financed? But who must finance it? Its main users, countries, IOs, CSOs? There are so many important domains that are chronically under-financed; how do we single out "data for development"?
Another question is: which organisations must be in charge of producing the data and get the necessary financial resources?
If it is under-financed, I guess it is because its importance has not been effectively demonstrated and proven, or properly advertised to those who have to provide the needed resources.
Brilliant piece! Cogent yet well explained. Especially true for the human subject. Sociologists study this as part of graduation studies. I just wonder why it has not been addressed so clearly, earlier. Nonetheless, happy to read and retweet and spread this important information.