
Revolutionizing Data Collection: From “Big Data” to “All Data”

By Nobuo Yoshida

The limited availability of data on poverty and inequality poses major challenges to the monitoring of the World Bank Group’s twin goals – ending extreme poverty and boosting shared prosperity. According to a recently completed study, for nearly one hundred countries at most two poverty estimates are available over the past decade. Worse still, for around half of them there was either one or no poverty estimate available.* Increasing the frequency of poverty data is critical to effectively monitoring the Bank’s twin goals.
Against this background, the science of “Big Data” is often looked to as a potential solution. A famous example of this science is “Google Flu Trends (GFT)”, which uses Google search queries to predict flu outbreaks. This technology has proven extremely quick to produce predictions and is also very cost-effective. The rapidly increasing volume of raw data and the accompanying improvements in computer science have enabled us to fill other kinds of data gaps in ways that we could not even have dreamt of in the past.

However, recent articles have raised some concerns. One of these is that “Big Data” can fall into a so-called “Drunkard’s search” or “Streetlight effect”. This can be roughly illustrated as follows:
A drunkard is looking for his lost key under a streetlight. A policeman asks, “What did you lose?” The man answers, “A key, but I can’t find it.” The policeman asks him, “Do you remember where you lost the key?” He replies, “Yes, over there.” The policeman, confused, asks, “Then why don’t you look for it over there?” The drunkard answers, “Because there is no light!”
Statisticians use this “Drunkard’s search” to explain a type of observational bias where people search for answers in places where looking is easiest. “Big Data” science demonstrates how tremendous advances can be made via easy access to massive data and the high-powered software and computational power needed to analyze it. But is there any guarantee that “Big Data” will generate insights on the questions we are most interested in? Or does “Big Data” simply pertain to information that can be easily collected and assembled? Notwithstanding the recent improvements in internet access in developing countries, many of the poor remain without access to the internet or the necessary computer literacy. The “Drunkard’s search” is thus a real concern for the use of “Big Data” in poverty estimation and analysis.
Tim Harford raises an interesting example of this kind in the Financial Times. To predict whether Franklin Delano Roosevelt would win the presidential election, The Literary Digest, a magazine, conducted a massive postal opinion poll, with the ambition of reaching 10 million people, a quarter of the electorate. After processing around 2.4 million returns, the Literary Digest predicted that the Republican candidate, Alfred Landon, would win, while a far smaller survey conducted by the opinion poll pioneer George Gallup predicted that Franklin Roosevelt would win. We all know who won this contest.
Why did Gallup win? Gallup’s sample was carefully designed to reflect the distribution of Republican, Democratic, and independent voters. The Literary Digest, on the other hand, mailed out forms to people on a list compiled from automobile registrations and telephone directories: a sample that was disproportionately representative of the rich. Worse, Landon supporters turned out to be more likely to respond to the survey. The combination of these two sampling peculiarities resulted in a large bias in predicting the winner of the election.
The moral of this story is that without proper sampling, any statistic derived from a survey can be biased. Increasing the sample size may reduce sampling error but will not eliminate the bias. A large but biased sample will produce “precisely wrong statistics”: statistics with an extremely small sampling error that still reflect the underlying bias.
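The point can be made concrete with a small simulation. In this hypothetical setup (the population, poverty rate, and response probabilities are all invented for illustration), the poor are less likely to end up in the “big” sample, so its estimate stays wrong no matter how large it grows, while a small representative sample lands near the truth:

```python
import random

random.seed(0)

# Hypothetical population: 30% poor (indicator 1), 70% non-poor (0).
population = [1] * 30_000 + [0] * 70_000
true_rate = sum(population) / len(population)  # 0.30

# Small but representative sample: unbiased, with modest sampling error.
small = random.sample(population, 500)
small_estimate = sum(small) / len(small)

# Large but biased sample: assume the poor are half as likely to respond.
biased = [x for x in population
          if random.random() < (0.25 if x == 1 else 0.50)]
biased_estimate = sum(biased) / len(biased)

print(f"true poverty rate:      {true_rate:.3f}")
print(f"small representative:   {small_estimate:.3f}")  # close to the truth
print(f"large biased (n={len(biased)}): {biased_estimate:.3f}")  # well below it
```

The biased sample here is roughly eighty times larger than the representative one, yet its estimate is the “precisely wrong” statistic: tiny sampling error around a badly shifted number.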
Whether “Big Data” methods work or not comes down to whether the data reflect the distribution of the population of interest. Analysis using a database of customers, which is often massive, will be extremely useful for a retailer because those customers are most likely the retailer’s population of interest, i.e., future customers. But “Big Data” collected through the internet or compiled in data clouds might not be particularly useful for poverty estimation and analysis because the data providers might not include the poor in whom we are most interested. Even if the data include the poor, the relative frequency of observations describing the rich will likely be higher than for the poor, and thus simple averages of this kind of data will be biased towards the rich.
To address these issues with “Big Data”, some are now proposing blending the convenience of “Big Data” approaches with the statistical rigor of “Small Data” approaches. Lazer et al. (2014) call this an “All Data Revolution”. The article shows how the Centers for Disease Control and Prevention (CDC) successfully reduced the error of GFT by using information collected through a traditional “Small Data” approach. The US Census Bureau and Labor Department are also proposing several ideas for how best to achieve a blend of “Big Data” with “Small Data”.
The World Bank Group is also now proposing various initiatives that blend “Big Data” with “Small Data” for poverty estimation. SWIFT (Survey of Well-being via Instant and Frequent Tracking) is one such initiative. Like typical “Small Data” efforts, SWIFT collects data from samples that are representative of the underlying populations of interest. Like typical “Big Data” approaches, SWIFT applies a series of formulas/algorithms, as well as the latest ITS technology, to cut the time and cost of data collection and poverty estimation. For example, SWIFT does not estimate poverty from consumption or income data, which are time-consuming to collect, but uses formulas to estimate poverty from poverty correlates, which can be collected easily. Furthermore, by embedding the formulas into the SWIFT data management system, the correlates are converted to poverty statistics instantly. To further cut the time for data collection and processing, SWIFT uses Computer-Assisted Personal Interviewing (CAPI) linked to data clouds and, where possible, adopts a cell phone data collection approach.
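A toy sketch may help convey the proxy idea behind this two-step approach. Everything below is invented for illustration – the single “asset score” correlate, the consumption model, and the poverty line are assumptions, and SWIFT’s actual formulas and survey design are far more sophisticated – but the logic is the same: fit a model on a small survey that measures consumption, then predict poverty from a large, cheap survey that collects only the correlates:

```python
import random
import statistics

random.seed(1)

POVERTY_LINE = 100  # hypothetical consumption poverty line

# Step 1: a small "Small Data" survey measuring both the correlate
# (a household asset score, 0-10) and full consumption.
def household():
    assets = random.uniform(0, 10)
    consumption = 60 + 12 * assets + random.gauss(0, 15)  # assumed relationship
    return assets, consumption

survey = [household() for _ in range(300)]

# Fit consumption = a + b * assets by ordinary least squares (one predictor).
xs = [a for a, _ in survey]
ys = [c for _, c in survey]
mx, my = statistics.fmean(xs), statistics.fmean(ys)
b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
     / sum((x - mx) ** 2 for x in xs))
a = my - b * mx

# Step 2: a large, cheap survey collecting only the correlate; poverty is
# estimated from predicted consumption instead of measured consumption.
quick_assets = [random.uniform(0, 10) for _ in range(5000)]
predicted_poor = sum(1 for x in quick_assets if a + b * x < POVERTY_LINE)
print(f"predicted poverty rate: {predicted_poor / len(quick_assets):.3f}")
```

The payoff is that Step 2 never asks the time-consuming consumption questions; once the fitted formula is embedded in the data system, each new batch of correlates converts to a poverty estimate immediately.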
“Big Data” science is still at its early stages and innovations in this field are rolling out at the speed of light. Such innovations might yield entirely new solutions in the near future. But I strongly believe that it is very important to recognize the risks that could attend the use of “Big Data” methods. Before conducting any analysis using “Big Data”, we must carefully check whether the underlying data accurately capture the populations of interest; otherwise, we might simply be trapped in a “Drunkard’s search”.

* Uematsu, Hiroki, Umar Serajuddin, Christina Wieser, and Nobuo Yoshida (2014). “Monitoring Poverty and Shared Prosperity: Data in Developing Countries,” mimeo, World Bank.


Submitted by Will Durbin

This is an excellent and timely piece addressing an inconvenient feature of Big Data often overlooked. I think the potential for bias in Big Data is similar to one of the other most exciting and promising areas of data collection: mobile phone surveys. Like Big Data, mobile phone surveys offer the possibility of much faster, cheaper and more frequent data collection. But we know that the sample of mobile phone users in the developing world is likely to be biased towards wealthier, more educated, younger households, and towards more men than women. (For example, see Blumenstock and Eagle’s analysis of mobile phone use in Rwanda (2012).)

With all these new tools, now is a very exciting time to be involved in data collection, whether through Big Data, mobile phones, tablets, or more traditional methods. But I’m glad this article reminds us of some of the potential pitfalls of these tools – most importantly, that we could miss the very people we most seek to reach: those without access to these new technologies.

Thanks for your comments. As you said, the choice of technology can cause biases in responses, which can result in severe biases in poverty and other key socio-economic indicators. In addition to the paper you mentioned, the Listening to LAC study also produced many useful and careful studies on this topic. The good news is that we were able to successfully minimize response biases in our Serbia telephone survey pilot by carefully designing the data collection process based on lessons learned from Listening to LAC and others. I think it is very important to keep improving and sharing our knowledge and experience on the use of new technology for data collection, as well as on how to limit possible biases.

Thanks for sharing your thoughts, Nobuo. I’m particularly keen to find out more about SWIFT. A few teams in the Bank are working on using call detail records – CDRs (mobile phone metadata, a popular big data source) – to estimate socio-economic variables.

I agree that as with every data source, it's important to understand biases and try to estimate and correct for them. In the case of CDRs, one sample bias correction technique that appears to work well is to consider the systematic bias to be a function of cell phone penetration and to correct accordingly. This sort of technique of course relies on having good baseline data (e.g. the new DHS which includes questions on mobile phone ownership) and is somewhat limited by the geographic granularity at which official statistics are usually available.
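The correction described above can be sketched in a few lines. The two groups, their population shares, penetration rates, and indicator values below are all hypothetical stand-ins for what a DHS-style baseline might provide; the point is only the mechanics: a phone sample over-represents high-penetration groups, and weighting by inverse penetration (equivalently, by baseline population shares) undoes that:

```python
# Hypothetical groups with baseline population shares, phone penetration
# rates, and the true value of some indicator within each group.
groups = {
    "poor":     {"pop_share": 0.40, "penetration": 0.30, "indicator": 0.80},
    "non_poor": {"pop_share": 0.60, "penetration": 0.80, "indicator": 0.20},
}

# A phone sample's share of each group is proportional to
# pop_share * penetration, so low-penetration groups are under-represented.
total = sum(g["pop_share"] * g["penetration"] for g in groups.values())
naive = sum(g["pop_share"] * g["penetration"] / total * g["indicator"]
            for g in groups.values())

# Correction: weight respondents by 1 / penetration, which restores the
# baseline population shares from the survey (e.g. DHS).
corrected = sum(g["pop_share"] * g["indicator"] for g in groups.values())

print(f"naive phone-survey estimate:   {naive:.3f}")
print(f"penetration-corrected estimate: {corrected:.3f}")
```

As the comment thread notes, this only works at the granularity for which baseline penetration figures exist, and it assumes phone owners within a group resemble non-owners in that group – a strong assumption in practice.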

I'd be keen to have your input on a project we're starting in this area and I'll get in touch shortly. Thanks again!

Many thanks for your comments. We are certainly interested in your project, particularly the sample bias correction technique. We are also studying sample bias correction and contacting sampling experts in academia. It would be great if we could exchange ideas on this.
