The data behind your data



As we do more data collection by phone and on tablets, some things change by necessity. For example, lengthy in-person training and piloting exercises are now being replaced with online training sessions. (It wouldn’t be uncommon for us to spend 2-3 weeks training for a large and complex survey, pilot, and then come back to the classroom for revisions, with research assistants, field supervisors, and at least one principal investigator all together with the enumerators.) Monitoring and supervision from afar is possible, but difficult and costly, especially if we are talking about international calls with spotty connections. Some survey software does allow you to record interviews or even potentially listen in, but each of these options comes with added concerns and complications. So, what do we do if we want to ensure high quality data collection and fidelity to established interview protocols?

One thing that I recently became aware of is what else is on the secure server that houses your data. See, when we think of data, we are thinking of the answers to each of the questions in a survey, for example. That’s what you or someone on your team downloads, cleans, and makes ready for analysis. But when data from the tablets are uploaded to the server, it is not only the data you care about that are uploaded, but also data about those data. Until recently, I had never examined these “data behind the data” carefully enough to see what they contain. Turns out, there is quite a bit in there, and it can be quite useful for getting an insight into how the data are being collected.

When we did a deep dive into these other data on our server recently (we use Survey Solutions, a free survey software from the World Bank’s Development Economics Group), the first thing that I learned is that we do have an “action log.” In our case, the tablets are being used offline by the enumerators, meaning that they do not have to be connected via 3G or wi-fi to conduct the interview. At some point later in the day, they hotspot for a few minutes and sync the tablet to the cloud. So, if an enumerator does a mock interview for practice, or something happens that makes them decide to delete an interview rather than sync it (say, they realized they were interviewing the wrong person), we, the study team, would not see the pretend or real data collected in such sessions, because they were deleted prior to the sync. Obviously, you could have protocols that discourage such deletions (more on this below), but a large part of the point is that you don’t know whether the protocol is being followed by everyone on your team. The action log contains information about every interview that was created, started, stopped, restarted, paused, resumed, completed, closed, and deleted. Such data can be an important source of information on whether survey protocols are being adhered to and how they can be improved.
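To make the idea concrete, here is a minimal sketch of the kind of tally an action log allows: how many of the interviews each enumerator created ended up deleted. The record layout, the enumerator names, and the action labels are all made up for illustration; a real export will look different.

```python
from collections import Counter

# Hypothetical action-log records: (enumerator, interview_id, action).
# Field names and action labels are assumptions, not a real export format.
ACTION_LOG = [
    ("enum_a", "int-01", "created"),
    ("enum_a", "int-01", "completed"),
    ("enum_a", "int-02", "created"),
    ("enum_a", "int-02", "deleted"),
    ("enum_b", "int-03", "created"),
    ("enum_b", "int-03", "paused"),
    ("enum_b", "int-03", "resumed"),
    ("enum_b", "int-03", "completed"),
    ("enum_b", "int-04", "created"),
    ("enum_b", "int-04", "deleted"),
    ("enum_b", "int-05", "created"),
    ("enum_b", "int-05", "deleted"),
]


def deletion_rates(log):
    """Return {enumerator: (n_deleted, n_created, share_deleted)}."""
    created = Counter(e for e, _, action in log if action == "created")
    deleted = Counter(e for e, _, action in log if action == "deleted")
    return {e: (deleted[e], n, deleted[e] / n) for e, n in created.items()}


RATES = deletion_rates(ACTION_LOG)
for enumerator, (n_deleted, n_created, share) in sorted(RATES.items()):
    print(f"{enumerator}: {n_deleted} of {n_created} interviews deleted ({share:.0%})")
```

A high share is not proof of a problem (practice interviews get deleted too), but it is a natural starting point for a conversation with the field supervisor.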

The second, and much more detailed, source of information is about the interviews that were synced to the server. What Survey Solutions calls “paradata” is essentially a timestamped digital footprint of every action during an interview. It may help to know if a lot of people are entering the wrong answer first, deleting it, then entering the correct one, for example, in order to improve the design of the question or answer options. It definitely helps to know where things are slowing down and where they may be going too fast. Oftentimes, good enumerators have likely found better ways of doing things than you originally considered. Examining the paradata and discussing some of the interesting or unexpected patterns with the field supervisors and enumerators can help researchers find out what is going on, make the operation more efficient, and raise the quality of the data. Of course, sometimes the most efficient way is not what you want: smart enumerators may be optimizing for themselves and not to your specific study goals. In those cases, going over the paradata with the enumerators, showing them how their surveys have different patterns than others, and discussing whether this is desirable or not may help greatly if you need them to slow down and take their time during section X, questions Y-Z, for example.
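As a rough illustration of how timestamped events turn into “where is it slow, where is it fast,” the sketch below computes the gap between consecutive answers within each interview and then the median gap per question. The event tuples and question names are invented, and attributing the whole gap to the question being answered is a simplifying assumption.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

# Hypothetical paradata events: (interview_id, enumerator, question, ISO timestamp).
EVENTS = [
    ("int-01", "enum_a", "consent", "2020-05-01T09:00:00"),
    ("int-01", "enum_a", "q1", "2020-05-01T09:02:00"),
    ("int-01", "enum_a", "q2", "2020-05-01T09:02:40"),
    ("int-02", "enum_b", "consent", "2020-05-01T10:00:00"),
    ("int-02", "enum_b", "q1", "2020-05-01T10:00:20"),
    ("int-02", "enum_b", "q2", "2020-05-01T10:00:25"),
]


def seconds_per_question(events):
    """Seconds since the previous answer, keyed by (interview_id, question)."""
    by_interview = defaultdict(list)
    for interview_id, _, question, stamp in events:
        by_interview[interview_id].append((datetime.fromisoformat(stamp), question))
    durations = {}
    for interview_id, answers in by_interview.items():
        answers.sort()  # chronological order within the interview
        # The first answer has no predecessor, so it gets no duration.
        for (prev_time, _), (time, question) in zip(answers, answers[1:]):
            durations[(interview_id, question)] = (time - prev_time).total_seconds()
    return durations


def median_by_question(durations):
    """Median seconds per question across interviews."""
    grouped = defaultdict(list)
    for (_, question), seconds in durations.items():
        grouped[question].append(seconds)
    return {question: median(values) for question, values in grouped.items()}


DURATIONS = seconds_per_question(EVENTS)
MEDIANS = median_by_question(DURATIONS)
for question, seconds in MEDIANS.items():
    print(f"{question}: median {seconds:.1f}s")
```

Comparing one enumerator's durations against these medians is one simple way to spot the sections where someone is moving much faster or slower than the rest of the team.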

The deep dive into these two data sources on our servers taught us a couple of things: First, while it may be a bit more work for the team to sort through and weed out incomplete surveys or distinguish real surveys from practice ones, telling your team to not delete anything, i.e. upload/sync everything, may be the optimal strategy. If you’re like me, you want to know everything that is happening. That means everything. There is plenty of space in the comments section, etc. for the enumerators to let the supervisor know why there was a long pause in the interview, or why some survey was left incomplete, etc. Second, getting a sense of not just the duration of interviews, but their flow can be really useful to gain insights into both the survey design and how different enumerators are operating: everyone has a different style, but there are general patterns in a survey of how long various questions take to answer, how long it takes to obtain consent, and so forth. When needed, these can be extremely valuable – perhaps as valuable as the data that will go into Table 2 in the main paper.

Perhaps many of you were already aware of these datasets (of which there are more) and had made regular use of them in the past. More likely, most principal investigators never really spent hours and days poring over action logs and paradata – at least not since the time they were research assistants themselves. There can be even more data to examine if your enumerators work online (connected) and you choose options to utilize various things that are available to survey teams. So, apologies if this is old hat to you. If you have had the pleasure of more experience than I have, please share any useful insights in the comments section below.

Happy dives into all of your data…


Berk Özler

Lead Economist, Development Research Group, World Bank

Join the Conversation

Sarah Hughes
May 12, 2020

Berk, thanks for pointing to the paradata from Survey Solutions. The use of paradata to improve survey processes and understand interviewer behavior is very common in the U.S. and Europe. The key to effective use of paradata is to program paradata reports and determine how to provide rapid feedback to interviewers and supervisors before starting your main data collection. In resource-poor environments, examining paradata may take lower priority than callbacks and supervisor observation. However, quickly catching suboptimal interviewer behavior through paradata analysis can reduce errors, save time and anguish later, and open up the black box of data collection. The rich literature on paradata use is too large to cite in a comments section, but Choumert-Nkolo et al., 2019 presents results from a study in Tanzania and includes many oft-cited references on paradata from Beegle, Carletto, Himelein, Couper, Kreuter and others.

May 12, 2020

Hi Sarah,

Great comment - thanks for this. I will make use of the literature you cited and more use of the paradata ;-)


Klaus Blass
May 15, 2020

Hi Berk,
At last someone pointing out the importance of one of the most useful features of Survey Solutions (in my opinion)!
Indeed, the wealth of information hidden in the paradata is absolutely essential to find out where your survey is going wrong. As a Survey Solutions consultant I am using this information on a regular basis to monitor every survey I am involved in. If you ask any NSO they will confirm your statement that “smart enumerators may be optimizing for themselves and not to your specific study goals”, sometimes even by bypassing the respondent when generating the answers.
The problem is the sheer volume of the paradata which can reach several gigabytes for the full survey. No way to manually inspect them in an editor. Generating reports, as mentioned by Sarah in her comment, is one way to identify problems – if you know what to look for. Another way is to visualize paradata graphically to detect strange things in seconds. I have written a paradata viewer to step through the interviews. One look tells me if something seems wrong or not. Here is an example of a normal interview: []

The interview starts about 6:50 in the morning and ends 20 minutes later. The blue line shows the timestamp for each answered question. At the upper right I see all the information for the currently selected question (here the first one): question text or label (here the label “RB010”), question variable, response time (less than 1 second), when answered and what the answer was (here the start time and date).
(I can quickly step through all questions and look at the answers and response times.)
Compare this to the following interview: []

This one started at 6 am with a single question (a GPS reading), continued 9 hours later at 3 pm, another one hour break and another 10 minutes of interview. Then it continues the next day at 5:30 in the morning and finishes an hour later.
The strange thing is the isolated GPS question at the beginning. Filtering interviews for this enumerator (and highlighting the GPS question) it turned out to be his standard behavior: Taking the GPS reading for a number of interviews and doing the actual interviews during the next days, often at night. Looking for this behavior we immediately found another one working the same way.
Another useful view on the paradata is the generation of interactive reports for interviews satisfying several irregular conditions: []

At the lower right you see the interview keys (or codes or enumerator names, as selected at the top) for interviews which were spread over 3 or more days and had an isolated GPS reading long before the actual interview and had parts of the interview done during the night.
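For readers who want to try this kind of filter themselves, the three conditions can be sketched in a few lines. This is not the viewer described above, only an illustration: the thresholds (a two-hour gap for an “isolated” GPS reading, 22:00 to 04:59 as “night”), the question name "gps", and the calendar-span reading of “3 or more days” are all assumptions.

```python
from datetime import datetime, timedelta

NIGHT_START, NIGHT_END = 22, 5    # assumed: 22:00-04:59 counts as night
GPS_GAP = timedelta(hours=2)      # assumed minimum gap for an "isolated" GPS reading


def flag_interview(answers, gps_question="gps"):
    """answers: list of (ISO timestamp, question name). Returns three boolean flags."""
    parsed = sorted((datetime.fromisoformat(stamp), q) for stamp, q in answers)
    if not parsed:
        return {"multi_day": False, "isolated_gps": False, "night": False}
    times = [t for t, _ in parsed]
    # Calendar days from the first answer to the last one, inclusive.
    span_days = (times[-1].date() - times[0].date()).days + 1
    isolated_gps = (
        len(parsed) > 1
        and parsed[0][1] == gps_question
        and times[1] - times[0] >= GPS_GAP
    )
    night = any(t.hour >= NIGHT_START or t.hour < NIGHT_END for t in times)
    return {"multi_day": span_days >= 3, "isolated_gps": isolated_gps, "night": night}


SUSPECT = [
    ("2020-05-01T06:00:00", "gps"),
    ("2020-05-02T23:10:00", "q1"),
    ("2020-05-03T05:30:00", "q2"),
]
NORMAL = [
    ("2020-05-01T06:50:00", "gps"),
    ("2020-05-01T06:51:00", "q1"),
    ("2020-05-01T07:10:00", "q2"),
]

SUSPECT_FLAGS = flag_interview(SUSPECT)
NORMAL_FLAGS = flag_interview(NORMAL)
print("suspect:", SUSPECT_FLAGS)
print("normal: ", NORMAL_FLAGS)
```

Interviews where all three flags are true are the ones to pull up first; each flag on its own can have an innocent explanation.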
I also generate a “fingerprint” for each interview, the colorful barcode at the top of the diagram:
It shows a vertical bar for each answer, representing the relative response time for that question (relative to the other interviews). It shows a red bar if the response time is shorter than in 97% of the interviews for the same question, grey if it isn’t shorter than in 75% of the interviews, and yellow or orange for values in between.
This lets you easily identify questions which were answered without having been asked. A good candidate for this are food security questions which have long and repetitive text. I understand the enumerators who do not like to read these long questions out over and over again, but if they don’t you get no real data to analyze.
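The colour rule for such a fingerprint can be sketched as follows. This is only an illustration of the thresholds described above, not the actual viewer code; in particular, the cutoff that separates yellow from orange (0.86 here) is an assumption, since only the 75% and 97% boundaries are given.

```python
def share_slower(response_time, all_times):
    """Share of interviews whose response time for this question was longer."""
    return sum(t > response_time for t in all_times) / len(all_times)


def bar_color(response_time, all_times, orange_cutoff=0.86):
    """Colour of one bar; orange_cutoff is an assumed yellow/orange split."""
    share = share_slower(response_time, all_times)
    if share >= 0.97:   # shorter than in 97% of the interviews
        return "red"
    if share < 0.75:    # not shorter than in 75% of the interviews
        return "grey"
    return "orange" if share >= orange_cutoff else "yellow"


# Made-up response times (seconds) for one question across 100 interviews.
ALL_TIMES = list(range(1, 101))

# One colour per probe value, in the same order as the tuple.
FINGERPRINT = [bar_color(seconds, ALL_TIMES) for seconds in (2, 10, 20, 50)]
print(FINGERPRINT)
```

Running `bar_color` over every answered question in an interview, in order, gives the sequence of colours that makes up one barcode.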
I hope I wasn’t too elaborate in my comment, I just feel one cannot overemphasize the importance of the paradata for the quality of the survey data.
(unfortunately I could not include screenshots or links in the comment section, so I put the link urls between brackets)