Co-authors and I are soon to complete (fingers crossed) some new work on climatic shocks and neonatal mortality. But our findings are not the topic of this post. Rather, I want to discuss the behind-the-scenes data construction work that had to take place before the first regression could be estimated. The work involved aggregating fifty-plus national-level microdata sets (from the Demographic and Health Surveys) and then merging them with geo-coded historical weather data (from NOAA). While doing this painstaking work, it struck me that there has to be a better way: our team was surely not the first to do an exercise of this type and, unfortunately, we won’t be the last. Indeed, similar basic data infrastructures have been built several times before. But how many times must the wheel be reinvented?
Well, fortunately, it appears that one day there will be a better way. DHS and NOAA are just two examples among many of high-quality, widely dispersed data sets that are potentially linkable. Micro-data and geo-referenced data across a variety of topics can, in principle, be integrated, but they currently have incompatible or inadequate geographic identifiers (one of the biggest translation challenges is the inevitable shifting of administrative boundaries over time). Merging these disparate sources requires extensive legwork: matching line by line or coordinate by coordinate. Now an international effort led by the University of Minnesota is starting to do this type of work en masse and make it available to the global research community.
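To make the coordinate-by-coordinate matching concrete, here is a minimal sketch in Python of one common approach: assigning each survey cluster to its nearest weather station by great-circle distance. This is not the procedure our team or TerraPop uses. The field names, identifiers, and coordinates below are all made up for illustration, and real DHS and NOAA files have their own layouts.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    earth_radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))

def nearest_station(cluster, stations):
    """Return the station closest to a survey cluster (brute-force search)."""
    return min(
        stations,
        key=lambda s: haversine_km(cluster["lat"], cluster["lon"], s["lat"], s["lon"]),
    )

# Hypothetical survey clusters and weather stations (illustrative values only).
clusters = [
    {"cluster_id": 1, "lat": -14.0, "lon": 33.8},
    {"cluster_id": 2, "lat": -15.7, "lon": 35.0},
]
stations = [
    {"station_id": "A", "lat": -13.96, "lon": 33.79},
    {"station_id": "B", "lat": -15.79, "lon": 35.01},
]

# Attach the nearest station and its distance to every cluster record.
merged = []
for c in clusters:
    s = nearest_station(c, stations)
    merged.append({
        **c,
        "station_id": s["station_id"],
        "distance_km": haversine_km(c["lat"], c["lon"], s["lat"], s["lon"]),
    })
```

A brute-force search like this is fine for a few thousand clusters and stations. The harder problems mentioned above, such as administrative boundaries that shift over time, are not addressed by a nearest-point match at all.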
Terra Populus (aka TerraPop) is a new data infrastructure that will help researchers merge spatially based data from a variety of sources. Currently under development, TerraPop aims to become a sustainable international organization that will preserve data and grant access for many years to come, continually updating as data storage technology inevitably changes.
Let’s have TerraPop describe its own efforts to date in more detail:
[we] have assembled the world’s largest collection of spatiotemporal population data. This massive effort was financed with approximately $100 million contributed by funding agencies in North America and Europe. TerraPop will merge these human population data with a vast body of environmental data derived from government land-use statistics, satellite imaging, and climate records. Project collaborators have already gathered census-based land-use and land-cover records for over 22,000 different political units around the world, and are extending the collection back as far as possible. TerraPop will fuse these census-based records with major global land-cover databases derived from satellite imagery. We will also incorporate historical data on climate—including station measurements of temperature, precipitation, and cloud cover and geospatially-gridded data products—which sometimes date back to the nineteenth century
TerraPop is starting with four major categories of data: 1. population, 2. land use, 3. land cover, and 4. climate. As the data architecture is developed, global data on these topics will be made available to any interested user at very low levels of geographic resolution. The architecture will also be expanded to include further data categories such as economic activity, conflict, transport infrastructure, and so on. Any data that is explicitly or implicitly geo-referenced is fair game.
As part of its mandate, TerraPop also invests in data preservation, which has included activities such as spiriting the decomposing data tapes of the 1973 Sudanese census to a laboratory before they were lost to posterity. To preserve and integrate social and natural science data and metadata from across the globe is truly a global public good. It is critically important that we collectively invest in the infrastructure for international data sharing and the early funders for TerraPop, such as the National Science Foundation, should be applauded.
This is the edge of a slow-building wave of interlinked geo-referenced social and natural science data. I can imagine even greater integration in the future, hopefully involving the World Bank as well through the integration of data projects such as the Mapping for Results Platform and Aidflows.
The TerraPop prototype release will cover two countries: Brazil and Malawi. Sign up with them now and perhaps you will be able to test the beta version for these countries and give important feedback…