Published on Development Impact

The data behind your data


As we do more data collection by phone, and using tablets, there are some things that change by necessity. For example, lengthy in-person training and piloting exercises are now being replaced with online training sessions. (It wouldn’t be uncommon for us to spend 2-3 weeks training for a large and complex survey, piloting, and then coming back to the classroom for revisions, with research assistants, field supervisors, and at least one principal investigator all together with the enumerators.) Monitoring and supervision from afar are possible, but difficult and costly, especially if we are talking about international calls with spotty connections. Some survey software does allow you to record interviews or even potentially listen in, but each of these options comes with added concerns and complications. So, what do we do if we want to ensure high quality data collection and fidelity to established interview protocols?

One thing that I recently became aware of is what else is on the secure server that houses your data. See, when we think of data, we are thinking of the answers to each of the questions in a survey, for example. That’s what you or someone on your team downloads, cleans, and makes ready for analysis. But when data from the tablets are uploaded to the server, it is not only the data you care about that are uploaded, but also data about those data. Until recently, I had never examined these “data behind the data” carefully enough to see what they contain. It turns out there is quite a bit in there, and it can be quite useful for gaining insight into how the data are being collected.

When we did a deep dive into these other data on our server recently (we use Survey Solutions, a free survey software from the World Bank’s Development Economics Group), the first thing that I learned is that we do have an “action log.” In our case, the tablets are being used offline by the enumerators, meaning that they do not have to be on 3G or wi-fi, i.e. connected, to be able to conduct the interview. At some point later in the day, they hotspot for a few minutes and sync the tablet to the cloud. So, suppose an enumerator does a mock interview for practice, or something happens that makes them decide to delete an interview rather than sync it (say, they realized they were interviewing the wrong person). We, the study team, would not see these pretend or real data, because they were deleted prior to the sync. Obviously, you could have protocols that discourage such deletions (more on this below), but a large part of the point is that you don’t know whether the protocol is being followed by everyone on your team. The action log contains information about every interview that was created, started, stopped, restarted, paused, resumed, completed, closed, and deleted. Such data can be an important source of information about whether survey protocols are being adhered to and how they can be improved.
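To make this concrete, here is a minimal sketch of the kind of tally one could run on an action log. The rows, field layout, and action labels below are made up for illustration; they are not the actual Survey Solutions export format, which you would need to check on your own server.

```python
from collections import Counter

# Hypothetical action-log rows: (enumerator, interview_id, action).
# The layout and labels are assumptions for illustration only.
ACTION_LOG = [
    ("enum_a", "int_001", "created"),
    ("enum_a", "int_001", "completed"),
    ("enum_b", "int_002", "created"),
    ("enum_b", "int_002", "deleted"),
    ("enum_b", "int_003", "created"),
    ("enum_b", "int_003", "paused"),
    ("enum_b", "int_003", "resumed"),
    ("enum_b", "int_003", "completed"),
    ("enum_c", "int_004", "created"),
    ("enum_c", "int_004", "deleted"),
    ("enum_c", "int_005", "created"),
    ("enum_c", "int_005", "deleted"),
]


def summarize_actions(log):
    """Count interviews created and deleted by each enumerator."""
    created = Counter()
    deleted = Counter()
    for enumerator, _interview_id, action in log:
        if action == "created":
            created[enumerator] += 1
        elif action == "deleted":
            deleted[enumerator] += 1
    # Report every enumerator who created at least one interview.
    return {
        e: {"created": created[e], "deleted": deleted[e]}
        for e in sorted(created)
    }


summary = summarize_actions(ACTION_LOG)
for enumerator, counts in summary.items():
    print(enumerator, counts)
```

A high share of deletions for one enumerator is not evidence of a problem by itself, but it is a natural place to start a conversation with the field supervisor.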

The second, and much more detailed, source of information is about the interviews that were synced to the server. What Survey Solutions calls “paradata” is essentially a digital footprint of every action during an interview, each with a timestamp. It may help to know if a lot of people are entering the wrong answer first, deleting it, then entering the correct one, for example, so that you can improve the design of the question or its answer options. It definitely helps to know where things are slowing down and where they may be going too fast. Oftentimes, good enumerators will have found better ways of doing things than you originally considered. Examining the paradata and discussing some of the interesting or unexpected patterns with the field supervisors and enumerators can help researchers find out what is going on, make the operation more efficient, and raise the quality of the data. Of course, sometimes the most efficient way is not what you want: smart enumerators may be optimizing for themselves and not for your specific study goals. In those cases, going over the paradata with the enumerators, showing them how their surveys have different patterns than others, and discussing whether this is desirable or not may help greatly if you need them to slow down and take their time during section X, questions Y-Z, for example.
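As a rough illustration of what such a look at paradata might involve, the sketch below sums the time elapsed before each answer event and counts how often an answer was entered more than once. The rows, event names, and question names are hypothetical stand-ins; a real paradata export will have its own columns and event vocabulary.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical paradata rows: (interview_id, timestamp, event, question).
# Event and question names are assumptions for illustration only.
PARADATA = [
    ("int_001", "2021-03-01T09:00:00", "InterviewStarted", None),
    ("int_001", "2021-03-01T09:00:40", "AnswerSet", "consent"),
    ("int_001", "2021-03-01T09:01:10", "AnswerSet", "age"),
    ("int_001", "2021-03-01T09:01:15", "AnswerRemoved", "age"),
    ("int_001", "2021-03-01T09:01:25", "AnswerSet", "age"),
    ("int_001", "2021-03-01T09:03:25", "AnswerSet", "income"),
]


def question_timings(rows):
    """Return (seconds spent per question, number of re-entered answers per question)."""
    seconds = defaultdict(float)
    answer_sets = Counter()
    previous = {}  # last timestamp seen in each interview
    for interview_id, stamp, event, question in sorted(rows, key=lambda r: (r[0], r[1])):
        now = datetime.fromisoformat(stamp)
        # Attribute the time since the previous event to the question touched now.
        if question is not None and interview_id in previous:
            seconds[question] += (now - previous[interview_id]).total_seconds()
        if event == "AnswerSet":
            answer_sets[(interview_id, question)] += 1
        previous[interview_id] = now
    # An answer set n times within one interview counts as n - 1 revisions.
    revisions = Counter()
    for (_interview_id, question), n in answer_sets.items():
        if n > 1:
            revisions[question] += n - 1
    return dict(seconds), dict(revisions)


seconds, revisions = question_timings(PARADATA)
print(seconds)
print(revisions)
```

Attributing the gap between two events to the later question is a simplification: a long gap could just as well be a pause, a phone call, or a respondent thinking, which is exactly the sort of thing worth asking the enumerator about.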

The deep dive into these two data sources on our servers taught us a couple of things: First, while it may be a bit more work for the team to sort through and weed out incomplete surveys or distinguish real surveys from practice ones, telling your team to not delete anything, i.e. upload/sync everything, may be the optimal strategy. If you’re like me, you want to know everything that is happening. That means everything. There is plenty of space in the comments section, etc. for the enumerators to let the supervisor know why there was a long pause in the interview, or why some survey was left incomplete, etc. Second, getting a sense of not just the duration of interviews, but their flow can be really useful to gain insights into both the survey design and how different enumerators are operating: everyone has a different style, but there are general patterns in a survey of how long various questions take to answer, how long it takes to obtain consent, and so forth. When needed, these can be extremely valuable – perhaps as valuable as the data that will go into Table 2 in the main paper.
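For the second point, one simple way to spot enumerators whose flow differs from the general pattern is to compare each person's median time on a step, say obtaining consent, with the median across the whole team. The sketch below does that on made-up durations; the 0.5 cutoff is an arbitrary assumption, not a recommended threshold.

```python
from statistics import median

# Hypothetical consent durations in seconds, by enumerator (illustrative only).
CONSENT_SECONDS = {
    "enum_a": [95, 110, 120],
    "enum_b": [100, 105, 130],
    "enum_c": [20, 25, 30],
}


def flag_fast(durations, ratio=0.5):
    """List enumerators whose median duration is below ratio * the overall median."""
    overall = median(s for values in durations.values() for s in values)
    return sorted(
        enumerator
        for enumerator, values in durations.items()
        if median(values) < ratio * overall
    )


flagged = flag_fast(CONSENT_SECONDS)
print(flagged)
```

A flag like this is only a prompt for discussion: a fast enumerator may be rushing through consent, or may simply have found a clearer way to explain it.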

Perhaps many of you were already aware of these datasets (of which there are more) and have made regular use of them in the past. More likely, most principal investigators have never really spent hours and days poring over action logs and paradata, at least not since the time they were research assistants themselves. There can be even more data to examine if your enumerators work online (connected) and you choose to enable the various options that are available to survey teams. So, apologies if this is old hat to you. If you have had the pleasure of more experience than I have, please share any useful insights in the comments section below.

Happy dives into all of your data…


Berk Özler

Lead Economist, Development Research Group, World Bank
