Opening Up Microdata Access in Africa

By Gabriel Demombynes

Recently I attended the inaugural meeting of the Data for African Development Working Group put together by the Center for Global Development http://www.cgdev.org/ and the African Population & Health Research Center http://www.aphrc.org/ here in Nairobi. The group aims to improve data for policymaking on the continent and in particular to overcome “political economy” problems in data collection and dissemination.

In my view, the key data problem in Africa is data access. Typically, data from a household survey or census is collected, used to produce a single survey or report, and then sits nearly untouched for years. For administrative data like that collected by health and education ministries, the situation is typically even worse: great effort is expended to collect detailed school and health clinic data, and the data is never used for anything beyond producing a few aggregate summary statistics.

One reason that data is hidden away is that data producers are often embarrassed by the quality of the underlying data and unwilling to have someone sniffing around pointing out problems. A second reason is that “Data is power,” but not in a good way. Organizations keep a tight grip on their data because it is a thing of value. As long as they hold exclusive access, they have the possibility of receiving contracts for analyzing the data or outright selling the data.

One example is the 2005-06 Kenya Integrated Household Budget Survey (KIHBS), the country’s most recent multi-purpose consumption survey. This survey should be a keystone reference for understanding poverty, agriculture, employment and many other issues. Unfortunately, although in principle the data is available on request from the Kenya National Bureau of Statistics, in practice it has been made available to only a very small circle of researchers (including those at the World Bank) under the proviso that it not be shared more widely. As a result, there are just 157 citations in Google Scholar for the KIHBS since 2008, and many of those cite just simple published tabulations. In contrast, there have been nearly 26 times as many citations (4020) for the Kenya Demographic and Health Survey, for which the microdata is widely available.

Another example is the long-running Kenya rural panel survey conducted by the Tegemeo Institute. This data is unusually rich and could be used to explore a variety of crucial policy questions for the country. The data is not generally accessible to researchers not associated with Tegemeo http://www.tegemeo.org/ , and as a result has been greatly under-used. (Very recently, Tegemeo did announce that data collected more than 8 years ago will be made public, and the institute will consider requests for more recent data.)

A similar problem exists with researchers who generate new data. They often clutch their data because they want exclusive access for their own research, fear others pointing out problems in their analysis, and do not want to take the time to put their data into a publicly available form. To take one example, I asked the Millennium Villages Project (MVP) to share the data used in its published evaluation work, for which the initial data was collected in 2005. I was told that the project would consider sharing some data once the entire research project was complete and all resulting work had been published, i.e., sometime after 2020. Such an extreme “closed data” attitude is hardly unique to the MVP.

You might think that data producers and researchers are within their rights to control access to their data. But with few exceptions, data collection in Africa (and elsewhere) has been paid for with public money, either by the country’s citizens paying taxes to their governments or by taxpayers abroad who fund organizations like the UN and World Bank that support data collection. It is those citizens who are the rightful owners of that data. As a broad principle, publicly funded data should be freely available to the public.

Of course this principle should be subject to some conditions. Individual identifying information should be stripped from datasets, and adequate time should be allowed for the researcher or data producer to process and take a “first cut” at the data. A sensible rule might be that all public data should be available to the public on request within 2-3 years of collection.
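For readers who prepare data for release, the two conditions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names, the list of identifiers, and the three-year window are hypothetical choices, not part of any agency's actual procedure. Dropping direct identifiers is also only a first step; it does not by itself guarantee that respondents cannot be re-identified.

```python
from datetime import date

# Hypothetical field names treated as direct identifiers (an assumption
# for this sketch; a real survey would define its own list).
DIRECT_IDENTIFIERS = {"name", "phone", "national_id", "gps_lat", "gps_lon"}


def strip_identifiers(records, identifiers=DIRECT_IDENTIFIERS):
    """Return copies of the records without the direct-identifier fields."""
    return [
        {key: value for key, value in row.items() if key not in identifiers}
        for row in records
    ]


def release_due(collected_on, years=3):
    """Latest public-release date: `years` after data collection ended."""
    try:
        return collected_on.replace(year=collected_on.year + years)
    except ValueError:
        # 29 February has no counterpart in a non-leap target year.
        return collected_on.replace(year=collected_on.year + years, day=28)


# Made-up example records, not real survey data.
survey = [
    {"household_id": 1, "name": "Respondent A", "phone": "000-000",
     "district": "Nairobi", "consumption": 5400},
    {"household_id": 2, "name": "Respondent B", "phone": "000-001",
     "district": "Kisumu", "consumption": 3100},
]

public_file = strip_identifiers(survey)
due_date = release_due(date(2006, 6, 30), years=3)
```

Here `public_file` keeps the household ID, district, and consumption fields but drops names and phone numbers, and `due_date` is the date by which the hypothetical three-year rule would require release.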

There are already a number of laudable data access models, such as the Afrobarometer http://afrobarometer.org/ , the Demographic and Health Surveys http://www.measuredhs.com/ , and the International IPUMS census project https://international.ipums.org/international/ . Making these models the rule rather than the exception will require governments and organizations that fund data collection to do two things: 1) make Open Data access the norm for funding agreements, and 2) ensure that data dissemination is funded from the start.

There are a number of details and possible exceptions that would have to be considered in such policies. For example, as a colleague pointed out to me last week, it can be particularly difficult to ensure anonymity for respondents in qualitative data. This suggests that Open Data policies should differentiate between quantitative and qualitative surveys. Open Data policies don’t need to be absolute. But given how much data access is currently closed, we have a long way to go in the direction of Open Data to make sure that data is placed within reach of the citizens who ultimately pay for it.

Comments

Submitted by Juan Bonilla on
Hi, I am a regular reader of your blog and enjoy it very much. Lately, however, I haven't been able to get email notifications for new posts. I subscribed again and got an automated reply saying I am already a subscriber. I also checked my spam folder to make sure the emails were not ending up there. They are not. Is there anything I can do to get the notifications again? Is it the Gmail account? Many thanks, Juan

Hi Juan, We have to manually click a button to ensure the posts get sent to subscribers. It seems the default has somehow shifted from having this ticked automatically to having it unticked. So if it is just a couple of posts you have missed, blame us for not ticking this. But if you are missing all the recent posts in your email, I am not sure of the reason. David

Dear Gabriel and others, Thanks for an insightful post. I completely agree. Certainly, part of the challenge is to make funders insist that data should be public. But the practical side matters too. As part of a project where we have committed ourselves to making data public three years after the last data collection, I have been looking for something like http://www.measuredhs.com/, but where I can pay a fee to get someone else to distribute my data with data protection procedures and meta-data in place. Do you (or others) know of such a service? Ole

Submitted by Gabriel on
Ole, This is a great question, and I applaud the fact that your project is committed to making data public three years after the last data collection. I don't know of an organization that does what you're looking for, but I will ask some colleagues if they have suggestions. If I come across something, I will put that information in a reply to this post. There is the International Household Survey Network, which provides free software and guidance on data dissemination: http://www.surveynetwork.org/ This software provides something roughly like the basic interface you get with the DHS or Afrobarometer websites, although not with the sophisticated online tool. Best, Gabriel

Submitted by A. Tasso on
I generally have most of my feet in medicine and public health, but on occasion I dip my toes back into economics (my first academic home). I continue to be struck by the level of ignorance, on both sides, about the professional standards for the two fields. An economist who has a hard-money position can make a successful career out of conducting secondary analyses of other people's data. Collecting data is, generally, not the primary emphasis -- the emphasis is on the theory that informs the econometric models, and on the extent to which the econometric models permit or do not permit causal inferences to be drawn. However, economists also don't spend 10 years of a career analyzing the same dataset. First, economists (and sociologists) also generally jam all of their data into a single 40-60-page journal article that explores all of the different possible mediators, back channels, etc. Second, it's just not part of the professional culture to spend your entire career analyzing a single dataset. You are expected to get a paper or two out of a single dataset and then move on. So, once you analyze a dataset (even if you have collected it yourself) and you produce one paper, maybe two papers, you are pretty much done with that dataset. There's absolutely no professional cost in posting the data online. So I find it unsurprising that economists are at the forefront of calls for data sharing. In medicine and public health, most positions are soft money. You spend all of your time writing grants, and in general it is harder to obtain grant funding for secondary analyses of already-collected data. So because of this structural deficiency in how researchers are paid, you have all these researchers writing grants to fund new data collection (rather than thorough secondary analyses of existing data) every 3-5 years. Furthermore, you don't jam all of the variables into a single paper. There might be an article reporting the primary findings in a high-profile medical journal. Another paper reporting the child health outcomes in a pediatrics journal. Another paper exploring mediators and moderators of the treatment effect. And so on and so forth. One might cynically label this salami slicing, and in many cases that does accurately characterize what is going on. The PIs of the large cohort studies (NHS, HPFS, etc) are probably the most guilty of this practice -- how else does someone like Walt Willett accumulate >1000 publications? You have one paper looking at diet and CVD risk, one paper looking at diet and obesity, one paper looking at diet and stroke, etc. But, practically speaking, you could never get a 60-page manuscript published. Word limits are strictly enforced, and editors know that it is difficult to get a reviewer to read a lengthy manuscript. And it's just not how things are done. Unfortunately, it takes time to write all of these manuscripts, and meanwhile you are on a 3-5 year funding cycle, so after you've published the primary findings and perhaps 1-2 other companion papers in specialty journals, you have to move on to writing more grants (but you still don't have any academic incentive for giving up on the other papers that could potentially be written). So in general while I support your proposal to make data publicly available, and while I typically agree with the general badmouthing of medical and public health journals that takes place on this blog, I find your 2-3 year window to be absurd.

Submitted by Gabriel on
Thanks very much for your thoughtful comment. My post mixes up two rough categories of cases. On the one hand you have datasets that have a potentially wide variety of uses--such as national censuses and household surveys like the DHS and household budget surveys. For these datasets, which are the primary focus of my post, it's hard to see any good argument that the data shouldn't be made public as soon as it is practically possible. On the other hand you have datasets collected by researchers for more narrowly tailored research projects. Here the issue of "data is our currency", as one academic put it to me once, comes in. In these cases, the researcher wants to hold on to the data as long as possible to milk it for all it's worth before some other researcher can come along and beat him or her to the punch. In this second case, I see two considerations. First, for work that's already been published, I don't think it's defensible for the underlying data to not be released--at the very least the microdata with the variables used in the publication. There are some prominent cases in economics where re-analysis of data in published work revealed errors which overturned the original work. A researcher who refuses to make available the data behind a peer-reviewed publication is saying "Just trust me", and that shouldn't be good enough. Second, I think the standard for how long someone can sit on their data if it's collected with public funds should be *entirely* determined by the public good. Public bodies--the NSF, the World Bank, the Government of Kenya, etc.--don't fund data collection so that researchers can accumulate lots of publications. Taxpayers pay for research to advance the frontier of knowledge and make the world a better place. Perhaps for specialized datasets, the benefit of more rapid dissemination is outweighed by the incentives a period of "closed data" access creates for the researcher contemplating undertaking a study. I'm open to this possibility. Maybe having that "closed data" period makes it more attractive for the researcher to invest energy in the data collection in the first place. (Indeed, an economist I know told me that if he had to release data within 3 years of collection from his study, he wouldn't carry out the study in the first place.) But I think this incentive argument could take you in the opposite direction. Perhaps a time limit on "closed data"--maybe 4 or 5 years if we judge 3 to be too short--would push researchers to put out their best findings as quickly as possible, rather than sitting on data for a decade out of the hope that they might be able to squeeze out one last paper when they have time.

Submitted by Calogero Carletto on
Gabriel, needless to say, I wholeheartedly agree that data should be made publicly available within a "reasonable" time. The problem is often defining "reasonable" among less reasonable folks! Most data have a rather short shelf life, and while national governments and researchers should be given a head start to analyze data they have spent much effort and resources to collect, restricting data access beyond a year or two results in enormous wastage. I would also like to add to your list of "laudable" data access models the Living Standards Measurement Study - Integrated Surveys on Agriculture (LSMS-ISA). The program has been working with several countries in Sub-Saharan Africa to establish a long-term system of multi-topic panel household surveys modeled on the LSMS, a thirty-plus year initiative in the Development Research Group at the World Bank which, since its onset, has been at the forefront of making microdata publicly available. In the LSMS tradition and going beyond, all data collected under the LSMS-ISA are made publicly available within 12 months of completion of data collection. And unlike the MVP, by completion of data collection we mean completion of each survey wave and not the end of the project! Thus, data collected, say, between Jan and Dec 2010 are made available on the web for free download by Dec 2011. Data and documentation are generally made available both on the respective NSO website and at www.worldbank.org/lsms-isa. All surveys are geo-referenced at the household level, and to overcome the obvious problem of distributing geo-references, we have developed a protocol to create a set of geovariables at the household level which can be safely distributed without compromising confidentiality. These household-level geovariables, which include information on distance from services/infrastructures, climate, soil and vegetation, inter alia, are also distributed with the main dataset, together with full documentation facilitating the use of the data. I also agree with you that there is often a strong correlation between data quality and data access, which suggests that continuous support to countries in improving the quality of their statistical systems will ultimately result in more open data policies. Raising the local demand for microdata beyond the rather small group of mostly foreign researchers will also contribute to greater opening by statistical offices and other data producers.

Submitted by Ron on
Gabriel and Ole: One of the best and most cost-effective ways to share your data is to use the Dataverse Network project, a free service hosted by the Institute for Quantitative Social Science at Harvard. Individual researchers, departments, journals, institutes as well as international organizations that have focussed on development research, including notable ones like J-PAL and IFPRI, have uploaded their datasets from past projects onto the site. Very useful for research and teaching: http://dvn.iq.harvard.edu/dvn/ Ron

Submitted by Gabriel on
Gero: Apologies for forgetting to mention LSMS-ISA, which is a model for how data should be disseminated! Ron: Thanks a bunch for the link to the Harvard site, which I think is what many researchers may be looking for.

Submitted by Abhijeet Singh on
Gabriel, Thanks for a great post. To your list of reasons why data does not get archived/disseminated better, I'd like to add the following based on my experience working in a statistical agency for two years: 1. While collecting survey data is necessarily time-bound and outputs such as poverty figures and MDG targets are time-sensitive, dissemination of the data (as opposed to the findings) never faces similar time pressure. The only plausible external source of pressure comes from those (few) donor agencies like the World Bank and a few UN agencies which have the capacity to engage with raw data; they are easy to appease by releasing data selectively and don't usually push for the general good. There are kudos for new (hard-copy) survey reports, none for archiving data. 2. Data archiving, cleaning and documentation are tedious. It is sometimes difficult to focus manpower on these tasks, especially if staff know they could be getting DSA on more interesting survey work elsewhere in the organization. These two things together (and the reasons you mentioned) make it very difficult to get the organization geared up to go that last mile towards archiving even if you are committed to it in principle. Thanks for mentioning the International Household Survey Network - it's a great initiative and hopefully will see a lot more use, with users putting up the actual microdata and not just the data description as they seem to do now. Best, Abhijeet

Submitted by Gabriel on
Abhijeet, Thanks for the comment. Yes, I know that there are always a million other things to do other than prepare the data for dissemination, and all those data prep steps take a lot of work. That's why I said (but didn't emphasize sufficiently) in my original post, that funders of data collection need to make distribution of the data part of the plan from the beginning--so that public distribution of the data is a "deliverable" under the funding agreement, and so that that funding includes money explicitly for the tedious data prep work. As I understand it, IHSN actually does not have the mandate to distribute data. The Harvard Dataverse mentioned in the comments looks to be one good option for those looking for a cost-free distribution option. Gabriel

Submitted by Olivier Dupriez on
Hi Gabriel, Thanks for this great post. A few comments: The fear of being criticized for imperfect data (the "embarrassment" factor) and the "Data is power" issue only partially explain why statistical agencies do not provide open and free access to their microdata. To be fair to statistical agencies, I would add the following:
- The "contradiction issue": I had talks with several heads of statistical agencies in Africa who told me that the main reason why they are reluctant to share microdata is not that the data are of poor quality. Their main fear is that official results (often generated with ad-hoc external support) might be challenged, and that they have limited capacity/expertise to defend them if/when needed. Agencies and consultants who provide technical support to analysis should make sure to provide statistical agencies with all they need to be able to replicate the analysis. There is room for improvement in this area.
- Legal issues: many statistical legislations are outdated and still forbid microdata dissemination. We must encourage statistical agencies to publish more of such data, but this must be done in the framework of an enabling legislation. Increasingly, these legislations are being modernized. This is an area where more support and advice may be needed in Africa.
- Technical issues: technical support is needed to support data producers in documenting, anonymizing and disseminating microdata. Such support is rarely included as a component of technical support to survey implementation. A specific program supported by the World Bank is however addressing this, providing support to microdata archiving and dissemination in over 50 countries, many in Africa. And the IHSN will soon release a collection of free tools and guidelines for assessing and reducing the statistical disclosure risk in microdata (this remains a bottleneck to more open microdata dissemination).
- Financial issues: How many survey budgets have a budget line for documentation, anonymization, dissemination and preservation of the microdata? Not many, I think. This could be solved easily.

There is good reason to be optimistic. More and more countries in Africa are adopting international standards (such as the DDI metadata standard) and good practices for microdata documentation, and I am convinced that a lot more data will be available soon. This will be the consequence of cultural and technical changes (availability of tools and training on microdata anonymization and dissemination, upgrade of legislations, contractual obligations imposed by survey sponsors, momentum of the Open Data movement, etc.).

I also have a comment on "when should data be released"? My opinion is that we need to distinguish data produced by statistical agencies/line ministries from the ones generated by researchers. I understand the reasons why a researcher who spent time raising funds and collecting data needs exclusive access to the data for a few years. But the mandate of statistical agencies is not to publish in academic journals. Their role is to provide timely and relevant data. Ideally, the microdata should be released immediately after publication of the official survey results. Some time may be needed after the first results are published to "package" the datasets, but that should not take more than 2 to 3 months. The policy of LSMS-ISA -- publish the microdata no more than 12 months after end of data collection -- is another good option.

Last: as mentioned in one of your replies, the IHSN does not have the mandate to publish microdata. This may change if there is demand for that. Ole's post makes me think (as IHSN coordinator) that we need to gauge the need for a non-profit central repository for data (maybe on the model of the Inter-University Consortium for Political and Social Research - ICPSR?). The World Bank's Microdata Library (http://microdata.worldbank.org) is another possible option.