If the data and related metadata collected for impact evaluations were more readily discoverable, searchable, and available, the world would be a better place. Well, at least the research would be better. It would be easier to replicate studies and, in the process, to expand them by, for example, trying other outcome indicators, checking robustness, and looking for heterogeneous effects (e.g., by gender). There is also a wealth of other things one could do with the related metadata, including looking at how different wordings of survey questions generate different answers and getting parameters for power calculations. Last but not least, making these data available would allow for a wide range of non-impact evaluation research.
So this is arguably a good thing. But there isn’t that much of it. Why? First, researchers need some return on their investment. We spend a lot of time developing the instrument, negotiating the entire setup of both the survey and the evaluation, finding the money, etc. Second, making data available is a pain. You have to document variables you might not use. The format has to be somewhat friendly. Concealing confidential information takes careful attention and some creativity. And then people might email you with questions such as: “Why does the age of respondent 4319 go down across survey rounds?” Third, there are no rewards or incentives in the economics profession as a whole for bearing this cost or pain.
So it would seem that this is doomed. And it might be, but I am optimistic. Let’s look at the arguments above a bit more. 1) Return on the investment. Yes, but if there is agreement that exclusive use of the data will not go on forever, then this becomes a time-bound option instead of a monopoly. 2) The pain. It’ll never be zero, but with global outsourcing (as is already the case with data entry, for example) these costs are rapidly declining. 3) The incentives. One big move to solve this (and to set some bounds on the return on investment as well) is that journals are increasingly requiring you to make the data available when you submit a paper, and in some cases the data go on the web when you publish. One nice example of this is the American Economic Journals’ data policy. The journal policy is key because it lines up availability and incentives. But it is primarily geared towards replication, not access for broader use.
A broader option is to make the whole dataset (and the attendant documentation) available. And this is starting to happen. Two examples are JPAL’s website and the World Bank’s Impact Evaluation Microdata Catalog. Some coauthors and I were guinea pigs for the early work on the World Bank site, and we deposited three rounds of our earlier Kenya HIV work. The experience was fairly easy: there was a discussion of what had to be stripped for confidentiality, there were a bunch of questions on the documentation, and then the folks who maintain the site did the processing and put it up. This was a dataset where we had already done most of the papers we had planned on. And, critically, the processing (organizing the data files and documentation, setting it all up) was done for us. So now the datasets are there and folks can use them to see what the answers were, do comparisons and the like (I encourage readers interested in exploring the World Bank site to check out the guide on how to work it – it wasn’t obvious to me). What has made me particularly happy is the number of requests we have gotten for the full dataset. The way it works here is that you can ask for the full dataset, but you have to explain what you are going to do with it. This is cool for me, as people come up with topics I hadn’t thought of using it for – apparently there are people looking at the effects of shocks on children’s anthropometric outcomes and others looking at access to community savings groups.
The JPAL website has 14 datasets and the Bank has 17, which is a start. So what I’d like to do is get a discussion started on how we might grow these things. Clearly a single centralized repository isn’t the answer: in addition to the sites above, there are other sites with surveys from developing countries that could be of use (for example, ICPSR at Michigan and the UK Data Archive, to name two that cover both developing and developed countries). So what we need is a way to aggregate all of these – maybe something like a Travelocity of development surveys. It could find surveys and, heck, if we get the metadata attached to the surveys right, it could even go inside them, finding variables, giving us a range of values, etc. But what would you like to see? Do you know of other sites which put data and some form of documentation up in a way that is pretty easy to download (let’s say both for impact evaluations in particular and surveys more generally)? What other ideas are out there to make these things more available?