
Are we good at sharing... data?


During the recent holiday season, many people likely made charitable donations. For those of us donating children’s toys, games, and clothes that our children have grown out of, there would inevitably be a plea from the receiving charity to avoid donating clothes with stains or holes, or toys and games that are missing parts or instructions. One might wonder why donors need this warning. Isn’t it obvious that a child can’t play with a toy that’s missing parts? And anyone who has seen how complex games can be knows it’s nearly impossible to figure out their logic and rules without the instructions. Nevertheless, donors do need this warning.
 

It turns out this is also true for those of us sharing data. Making data available through data repositories is becoming increasingly common among researchers conducting surveys and randomized controlled trials in low- and middle-income countries. Not only does data sharing further objectives such as research transparency and the creation of global public goods, but many peer-reviewed journals and some funders of research have also made it a requirement. At the Strategic Impact Evaluation Fund (SIEF), for example, we make our disbursements conditional on submission of data to the World Bank’s main data repository for survey data, the MicroData Catalog. We require data to be submitted within 6 months of collection, and this data should be made available for licensed use within 2 years and publicly available within 4 years. SIEF has funded stellar evaluation research which has been published in top tier journals and which has had real policy impact. When we review our record on data sharing, however, we see we’re getting too many toys with holes and games without instructions.


We’re not ready to share data that can be used by others.

For data to be shared ethically and in a way that permits use by people who were not part of the original research team, the data must meet a few basic criteria. First, and most obviously, it must be discoverable in an established data repository, not filed away on a laptop or available only on a personal website or shared drive. Second, data users should not be able to identify individuals or families in the dataset, either directly (through information such as IDs, phone numbers, or addresses) or indirectly (through a combination of variables that could uniquely identify them, such as village location, birth date, family size, and the occurrence of a recent illness). Third, the data must be documented in a way that allows users to understand what exactly is in the shared datasets. This usually means users can access the survey instruments and survey protocols that generated the data. Variables in the dataset would have names or at least labels that explain what survey questions they correspond to and what the response options are. Users should understand why some variables may have missing observations. If users can only see that a variable is called m1q24_final_clean and that response options are 1, 2, and -99, this isn’t helpful, especially if the questionnaire implies there should be more or fewer response options.
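To make this concrete, here is a minimal sketch in Stata of what documenting that variable could look like. The variable name comes from the example above, but the label text, value labels, and skip-pattern note are purely illustrative, not taken from any SIEF dataset.

* Illustrative example only: label text and codes are hypothetical
* Attach a descriptive label tied to the questionnaire item
label variable m1q24_final_clean "Module 1, Q24: child attended school last week (cleaned)"
* Define and attach value labels, including the code used for refusals
label define yesno_lbl 1 "Yes" 2 "No" -99 "Refused to answer", replace
label values m1q24_final_clean yesno_lbl
* Explain why some observations are missing, so users do not have to guess
notes m1q24_final_clean: Missing when the household reported no school-age children (skip pattern from Q20).

With this kind of documentation attached, a user who opens the dataset can interpret the variable without having to email the research team.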
 

From 2023 to 2024, we took stock of the data funded by SIEF. When we started in 2023, only around 60 percent of the datasets that should have been submitted (corresponding to around half of all funded studies) had been published in the MicroData Catalog (see Figure 1 below). Some researchers (25 percent of teams) just never shared at all: they either never submitted data or made an initial submission that was so incomplete the MicroData Catalog couldn’t even start to process it. A small share of datasets, less than 3 percent, required minimal inputs and were ready to publish in the catalog when they were submitted. Around one third of all datasets, and the majority of datasets that had not been published in the repository, were not ready to be used by third parties.


Figure 1. Publication of datasets funded by SIEF in the MicroData Catalog in 2023 



What was wrong with these datasets? Why weren’t they ready to be published? Many of them did not meet the basic criteria for the ethical sharing of data usable by third parties (Figure 2). Around 10 percent of datasets lacked the survey questionnaires that generated them; that is, users had no way of knowing what exactly was asked of survey respondents. Almost 30 percent of datasets contained personally identifiable information (PII), including one that even contained bank account information. This is despite standards and training created to address this specific problem, such as HIPAA, NIH, CFR, and HHS training and the CITI Program, and despite human subjects review boards that require research teams to protect the privacy of research subjects. Nearly 40 percent of the unpublished datasets lacked descriptive variable names or labels. This means that variables appeared in the dataset as hh_a2_trgt_prnt, ht_college, std_grade1, score_sum, and std_score, with no labels explaining what these variables measure. Some of these problematic datasets were already published in other data repositories, including some posted alongside papers published in reputable academic journals. Many datasets (45 percent) could not be published because other datasets from the same study had problems.

 

Figure 2. Issues in datasets flagged as not ready to be published by the MicroData Catalog



We take our time sharing usable data.

These problematic datasets came from 22 studies. Once we and the MicroData Catalog team flagged data issues to research teams, they certainly took their time to resubmit publishable data (Figure 3). In fact, in many cases, we even had to hire consultants to do the work required to make the datasets usable. On average, 6.5 months elapsed before the data from these studies could be published in the repository, which required an average of 9 follow-up emails from us. The most difficult cases took longer than 13 months and required 10 emails from us, while the more responsive research teams could help us get publishable data in just under 4 months and required an average of 8 emails from us.
 

Figure 3. Time and effort required to make data shareable, by quartile of months elapsed



There are easy ways to get better at data sharing.

Despite these experiences, we are optimistic that everyone can get better at sharing data that can be used by others. There are resources available to help researchers de-identify data, including guidelines for managing potentially identifying variables during data collection and for identifying which variables could contain personally identifiable information (DIME Wiki on deidentification, J-PAL guide to deidentification). Before publishing their data, researchers can conduct additional checks to assess the risk of disclosing private information for the dataset as a whole and review all the variables once more to assess whether they are ready to be published (video on measuring disclosure risk, video on anonymization methods for microdata, both produced by the World Bank’s Data Group; a practical guide for sdcMicro; J-PAL guide to publishing research data). These steps are ideally taken in real time, during data collection and when data are first being cleaned, so that researchers need not rely on recall many years later.
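As an illustration, a simple pre-publication check along these lines might look roughly like the following in Stata. The variable names are hypothetical, and a rough check like this is not a substitute for a full disclosure-risk review with tools such as sdcMicro.

* Hypothetical variable names, for illustration only
* 1. Drop direct identifiers before creating the public-use file
drop respondent_name phone_number gps_latitude gps_longitude

* 2. A rough check on indirect identification: count households that are
*    uniquely identified by a combination of quasi-identifiers
bysort village birth_year hh_size: gen cell_size = _N
count if cell_size == 1
display "Households singled out by village x birth year x household size: " r(N)

* 3. Coarsen risky variables before release, for example replacing exact birth year with five-year age bands
gen age_band = 5 * floor((2024 - birth_year) / 5)

If the count in step 2 is large, the quasi-identifiers need to be recoded or suppressed before the dataset can be released.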
 

Similarly, there are packages available to help create descriptive labels for variables and response options before analysis, such as iecodebook (DIME Wiki iecodebook), which allows batch changes to names and labels. Before publishing data, Stata commands such as describe, label list, and labellacking provide a quick and effective check for missing labels across the whole dataset. Accompanying labels with notes that explain skip patterns and missing observations will increase the usability of the data and reduce the risk of misinterpretation. Lastly, questionnaires and manuals are the heart of the data generating process and a low-cost way of providing context on the variables and observations in a dataset. We recommend sharing enumerator manuals so users can understand the conditions under which variables were collected (for example, whether enumerators probed or who else was present in the room with the respondent). Submitting metadata that describes the content, context, and structure of the data will help users understand what exactly is in the dataset and use it appropriately (video on metadata).
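A minimal sketch of that labeling workflow in Stata, assuming the iecodebook command from DIME’s iefieldkit package and an illustrative file path, might look like this:

* Install the package that provides iecodebook (once per machine)
ssc install iefieldkit
* Export a codebook template, fill in names and labels in Excel, then apply it in one pass
* ("codebook.xlsx" is an illustrative path)
iecodebook template using "codebook.xlsx"
iecodebook apply using "codebook.xlsx"
* Quick checks for undocumented variables and value labels before submission
describe
labelbook

describe lists every variable with its type and any attached labels, and labelbook reports the contents of each value label, so gaps in documentation are easy to spot before the data are submitted.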


While these strategies alone will not be sufficient for the ethical sharing of usable data, they should make it easier to get data ready for publication. Sharing data should certainly be easier than tracking down the missing pieces for that game you’ve been meaning to donate.


Alaka Holla

SIEF Program Manager

Laura Becerra Luna

Research Analyst, World Bank Human Development Chief Economist Office

Meyhar Mohammed

Consultant, World Bank
