
A new global data network on population and environment

Jed Friedman

Co-authors and I are soon to complete (fingers crossed) some new work on climatic shocks and neo-natal mortality. But our findings are not the topic of this post. Rather, I want to discuss the necessary behind-the-scenes data construction work that had to take place before the first regression could be estimated. The work involved the aggregation of fifty-plus national-level microdata sets (from Demographic and Health Surveys) and then a merger with geo-coded historical weather data (from NOAA). While doing this painstaking work, it struck me that there has to be a better way – our team is surely not the first to do an exercise of this type and unfortunately we won’t be the last. Indeed, similar basic data infrastructures have been replicated several times before. But how many times must the wheel be reinvented?

Well, fortunately it appears that, one day, there will be a better way. DHS and NOAA are just two examples among many of high-quality, widely dispersed data sets that are potentially linkable. Micro-data and geo-referenced data across a variety of topics can, in principle, be integrated but currently have incompatible or inadequate geographic identifiers (one of the biggest translation challenges is the inevitable shifting of administrative boundaries over time). Merging these disparate sources requires extensive legwork of line-by-line or coordinate-by-coordinate matching. Now an international effort led by the University of Minnesota is starting to do this type of work en masse and make it available to the global research community.
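To give a flavor of what coordinate-by-coordinate matching involves, here is a minimal Python sketch that pairs each survey cluster with its nearest weather station by great-circle distance. This is not our actual pipeline: the record layout, the IDs, the coordinates, and the 100 km cutoff are all made-up assumptions for illustration.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearest_station(cluster, stations, max_km=100.0):
    """Return (station_id, distance_km) for the closest station,
    or (None, None) if no station lies within max_km (an assumed cutoff)."""
    best_id, best_d = None, float("inf")
    for s in stations:
        d = haversine_km(cluster["lat"], cluster["lon"], s["lat"], s["lon"])
        if d < best_d:
            best_id, best_d = s["id"], d
    if best_d > max_km:
        return None, None
    return best_id, best_d

# Hypothetical survey clusters and weather stations (illustrative values only).
clusters = [
    {"cluster_id": "C1", "lat": -13.98, "lon": 33.78},
    {"cluster_id": "C2", "lat": -15.80, "lon": 35.00},
]
stations = [
    {"id": "S_A", "lat": -13.96, "lon": 33.70},
    {"id": "S_B", "lat": -15.68, "lon": 34.97},
]

# Map each cluster to its nearest station and the distance to it.
matches = {c["cluster_id"]: nearest_station(c, stations) for c in clusters}
```

A brute-force loop like this is fine for a few thousand clusters and stations. With fifty-plus surveys you would probably want a spatial index, and you would still need separate handling for identifiers based on administrative boundaries that shift over time.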

Terra Populus (aka TerraPop) is a new data infrastructure that will help researchers merge spatially based data from a variety of sources. Currently under development, TerraPop aims to become a sustainable international organization that will preserve data and grant access for many years to come, through continual updating as data storage technology inevitably changes.

Let’s have TerraPop describe its own efforts to date in more detail:

[we] have assembled the world’s largest collection of spatiotemporal population data. This massive effort was financed with approximately $100 million contributed by funding agencies in North America and Europe. TerraPop will merge these human population data with a vast body of environmental data derived from government land-use statistics, satellite imaging, and climate records. Project collaborators have already gathered census-based land-use and land-cover records for over 22,000 different political units around the world, and are extending the collection back as far as possible. TerraPop will fuse these census-based records with major global land-cover databases derived from satellite imagery. We will also incorporate historical data on climate—including station measurements of temperature, precipitation, and cloud cover and geospatially-gridded data products—which sometimes date back to the nineteenth century

TerraPop is starting with four major categories of data: 1. population, 2. land use, 3. land cover, and 4. climate. As the data architecture is developed, global data on these topics will be made available to any interested user at very fine levels of geographic detail. And as the architecture is developed it will also be expanded to include further data categories such as economic activity, conflict, transport infrastructure, and so on – any data that is explicitly or implicitly geo-referenced is fair game.

As part of its mandate, TerraPop also invests in data preservation, which has included activities such as spiriting the decomposing data tapes of the 1973 Sudanese census to a laboratory before they were lost to posterity. To preserve and integrate social and natural science data and metadata from across the globe is truly a global public good. It is critically important that we collectively invest in the infrastructure for international data sharing, and the early funders of TerraPop, such as the National Science Foundation, should be applauded.

This is the edge of a slow-building wave of interlinked geo-referenced social and natural science data. I can imagine even greater integration in the future, hopefully involving the World Bank as well through the integration of data projects such as the Mapping for Results Platform and Aidflows.

The TerraPop prototype release will cover two countries: Brazil and Malawi. Sign up with them now and perhaps you will be able to test the beta version for these countries and give important feedback…

Comments

This is an excellent and important initiative, Jed. There is quite a bit of exploratory work being done at Purdue University in this area, including "proof of concept" work that I am doing (in collaboration with scientists at NASA) to link DHS, LSMS, and MODIS (remotely-sensed) data in Nepal and Uganda. The goal there is to better understand the links between agricultural potential/performance and nutrition outcomes. There is also a very active group organized at CIFOR (the Poverty Environment Network, or PEN) that is working with global, georeferenced survey data from forest margin areas. Those data are rich in household and institutional detail. I would be very interested in interacting with others on methods and findings.

Gerald, thanks very much for the comment - it sounds like you are doing some very important work! Your post leads me to think that a bulletin board/wiki board would be a very useful forum for researchers interested in this type of geo-spatial data mash-up...

Jed, I cannot count the number of times I have been psyched to perform an analysis only to realize that there was a lot of data scrubbing to be done, if it could be done at all. This effort appears to take some of that issue "off the table", allowing folks to get to the process of developing insights sooner.

Thanks for the comment, David, that was very similar to my initial thought as well! Unfortunately, for analysis that absolutely needs the micro-data we may still have to do the legwork, since TerraPop cannot release that level of data due to confidentiality concerns... however, even for small-area aggregates it should suffice.