As I wondered which of the many fascinating ideas from the World Bank’s inaugural annual Data Day to recap in a blog post, it occurred to me that there was likely selection bias among those who chose to attend. Presumably, some skeptics of big data chose to skip the day entirely. So this post is aimed first and foremost at the skeptical.
The Bank has traditionally been a leader in collecting, normalizing, harmonizing, and disseminating data to the wider research community. But for most of our history, we’ve started from the questions we wanted to answer. For example, “How can we compare income across countries?” led us to collect prices and expenditures across the globe and produce purchasing power parity (PPP) weights (the International Comparison Program). In contrast, big data evangelists start from whatever humongous data set they can access and then look for questions it can answer. While that may at first seem backwards to traditionalists, these big data sets (in combination with appropriate machine learning techniques to extract nuggets of wisdom) hold great promise for helping to answer important questions, often at lower cost and in a more timely way than traditional data collection would allow.
Looking through recent work, I’d highlight five categories of data where we’ve seen value:
- Imagery (satellite, photos, videos).
- Geo-locational data (e.g. call detail records (CDR), Uber/Lyft).
- Network data (e.g. LinkedIn, Facebook).
- Transaction and price data (e.g. credit cards, Amazon, Alipay, the Billion Prices Project (BPP)).
- Text mining (e.g. news, Google searches, tweets, text messages, e-mails).
Suppose we wanted to use big data techniques to figure out who had actually come to my talk on Data Day. If we had cameras panning the audience, we could have matched the images against the Bank’s internal directory photos to identify attendees. Or we could have used the geo-locational data (stored by many apps on your cell phone) to establish where you were. We could also have built up your network of frequent collaborators from your calling records and then run algorithms to try to identify clusters of data lovers and data haters. Or, failing that, we might have resorted to looking at all your e-mails, combining text mining and sentiment labelling to extract your views on data.
Which brings me to the elephant in the room: PRIVACY. That last paragraph was deliberately written to be creepy. While big data can be applied to answer many questions, it must be used judiciously. The World Bank can and should play a key role in defining protocols and serving as a trusted intermediary. Big data is a great opportunity, but it is also a great responsibility.
I hope I’ve generated some interest in at least a few skeptics.