Issues of data collection and measurement
About five years ago, soon after we started this blog, I wrote a post titled "Economists have experiments figured out. What's next? (Hint: It's Measurement)." Soon after the post, folks from IPA emailed me saying we should experiment with some important measurement issues, making use of IPA's network of studies around the world. At the time, we made some effort to organize an initial workshop to discuss what we know and what we need to know, but things fizzled out. Fast forward to this spring, when IPA did send out an invitation for a workshop on measurement issues (attendees found out that this was actually the second such workshop; the first took place in 1999, which was when Dean Karlan met Chris Udry for the first time, and the two organized last week's workshop at Yale).
The workshop, with its 30 or so attendees mixing academics from diverse fields with data collection/technology firms, had a nice format. There were no formal presentations; they were all but discouraged. Each of the five themes (electronic data collection; survey design; respondent disclosure and hard-to-measure concepts; enumerator effects; and behavioral response to research and disclosure to respondents) was assigned discussion leads, each of whom spoke for about 5-10 minutes (sometimes making use of a few slides) and then opened up the discussion. Sessions were separated by long breaks, which allowed us not to worry about going overtime and to continue interesting discussions over coffee/lunch. This worked quite well; I recommend that more workshop organizers adopt such flexible schedules, and not having formal paper presentations helps…
Here are a few things that stuck with me throughout the day:
Electronic data collection:
- Shawn Cole talked about some experiments they ran comparing phone-based surveys with in-person ones. Data collection by mobile phone leads to substantially higher attrition rates than in-person surveys, although part of this could be due to the setup of the experiment. Furthermore, if surveys by mobile phone are much cheaper, one could draw larger samples to account for the higher attrition: supporting this point was the fact that there did not seem to be differential attrition in phone surveys.
- People also talked about outfits, such as GeoPoll and Premise, crowdsourcing data in developing countries using the latest technologies available through mobile phones. While these seem promising and useful for some specific uses or clients, it's hard to see them becoming mainstream for development economists in the very short run: GeoPoll's samples were said to be wealthier, more urban, and younger, while the examples from Premise, such as creating detailed time-series price data through crowd-sourced (and validated) photographs from around the country, raise issues of representativeness.
- Niall Kelleher mentioned work using call detail records: people are exploring uses similar to poverty maps (small area estimation) with such data. But access is not yet easy, and coverage is still an issue. A few years from now, we may not be worried about such issues.
- Computer-assisted modules are quite cool, but not exactly new. We were experimenting with things made much easier by the use of tablets or UMPCs almost a decade ago, such as drop-down menus of respondents' networks, frontloading data to verify mentions of names or retrospective questions in real time, using the tablets to help with random sampling during listing, etc. But these functions are now much more enhanced and routine, including audio recordings of conversations, embedding photos and videos into the survey, etc.
- Finally, two people from SurveyCTO and IPA talked about the use of technology for improving data quality within data collection outfits. Random audio audits, high-frequency checks, etc. are being used to reduce the prevalence of fraudulent or low-quality data (a minimal sketch of one such check follows this list).
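To make this concrete, here is a minimal sketch, in Python, of the kind of high-frequency check mentioned above: flagging implausibly short interviews and enumerators with unusually high "don't know" rates. The column names (enumerator_id, duration_min, n_dont_know, n_questions) and thresholds are hypothetical, and nothing here reflects SurveyCTO's or IPA's actual tools.

```python
# Minimal sketch of a high-frequency check on incoming survey data.
# Column names and thresholds are hypothetical, for illustration only.
import pandas as pd

def high_frequency_checks(df: pd.DataFrame,
                          min_duration: float = 20.0,
                          dk_rate_z: float = 2.0) -> pd.DataFrame:
    """Flag suspiciously short interviews and enumerators with
    unusually high 'don't know' rates."""
    out = df.copy()
    # Flag interviews completed implausibly fast.
    out["flag_short"] = out["duration_min"] < min_duration

    # Compute each enumerator's average share of 'don't know' answers
    # and flag those more than `dk_rate_z` SDs above the team mean.
    out["dk_rate"] = out["n_dont_know"] / out["n_questions"]
    by_enum = out.groupby("enumerator_id")["dk_rate"].mean()
    z = (by_enum - by_enum.mean()) / by_enum.std()
    flagged = set(z[z > dk_rate_z].index)
    out["flag_dk_outlier"] = out["enumerator_id"].isin(flagged)
    return out

# Example usage (with a hypothetical submissions file):
# df = pd.read_csv("submissions.csv")
# report = high_frequency_checks(df)
# print(report[report.flag_short | report.flag_dk_outlier])
```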
Survey design:
Andrew Dillon talked about survey design, including experiments he ran with my colleagues Kathleen Beegle, Jed Friedman, and John Gibson on the length of survey modules, short vs. long recall, whom to ask the questions (direct respondent vs. proxy), what keywords to use, the order of questions, the definition of a household, etc. We'll have him blog on what he called the "methodology of survey methodology," so I won't get into the details here. He also talked about the use of wearable products, such as Fitbits or accelerometers, to measure physical activity for research purposes: this is an interesting avenue, but it has its challenges, such as non-compliance and also influencing subjects' behaviors by the simple act of having them wear the device or providing them with data…
Sensitive topics:
I was in a session with Julian Jamison and Jonathan Zinman on sensitive topics and hard-to-measure concepts. Julian gave a couple of examples from his work where they used the randomized response technique and list experiments (see this short paper by Blair for a description of these techniques). I reflected on these a little in the context of asking sensitive questions, say, on sexual behavior or domestic violence, in developing countries. List experiments require people to count/add, possibly introducing noise into the data, especially if the list is long. Furthermore, unless the "innocent" questions are completely unrelated and have a known distribution, there is a chance that the treatment in your RCT might have an effect on their distribution. However, as Julian pointed out, designing your innocuous questions that way makes your sensitive ones stand out even more. Overall, I am not sold on list experiments for my own purposes, but I am willing to experiment with them to see if they can work.
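For readers unfamiliar with how list experiments recover prevalence, here is a minimal simulation of the standard difference-in-means estimator. All numbers are made up, and the Blair paper mentioned above is the place to go for the proper treatment.

```python
# Minimal sketch of the difference-in-means estimator for a list experiment.
# All numbers are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 2000                      # respondents, randomly split into two halves
true_prevalence = 0.25        # share engaging in the sensitive behavior

# Each respondent reports only a COUNT of items that apply to them.
# Control list: 4 innocuous items; treatment list: the same 4 plus the
# sensitive item.
innocuous = rng.binomial(4, 0.5, size=n)            # count of innocuous items
sensitive = rng.binomial(1, true_prevalence, size=n)
treat = rng.binomial(1, 0.5, size=n)                # random list assignment

reported = innocuous + treat * sensitive            # respondents add items up,
                                                    # which is where counting
                                                    # noise can enter

# Prevalence estimate: mean count (treatment list) - mean count (control list)
est = reported[treat == 1].mean() - reported[treat == 0].mean()
print(f"estimated prevalence: {est:.3f} (truth: {true_prevalence})")
```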
The randomized response technique seems more promising, especially now that we have tablets to help the respondent self-administer the module in private. The problem with this technique is that the respondent has to believe that her responses truly cannot be identified. Furthermore, even when prompted to answer "yes" regardless of the truth, she might refuse to do so. I feel that there are technological solutions and behavioral nudges here that can reduce these problems and reveal a "closer to the truth" level of behaviors: when the tablet randomly picks "yes" as opposed to "truth," it can automatically record "yes;" when it randomly asks for the true answer, it can immediately crumple up the response and toss it into a rubbish bin that visually destroys it within seconds, etc. I intend to run such experiments in future work, comparing them with more traditional uses of this technique to see whether such nudges make a difference…
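To make the mechanics concrete, here is a minimal sketch of the "forced response" variant alluded to above, in which the device forces a "yes" with a known probability and asks for the truth otherwise; because the forcing probability is known, the population rate can be backed out. The numbers are made up for illustration.

```python
# Minimal sketch of the forced-response randomized response technique.
# Probabilities and prevalence are made up for illustration.
import numpy as np

rng = np.random.default_rng(1)
n = 5000
p_forced_yes = 0.3            # device forces a "yes" with this known probability
true_prevalence = 0.15        # true rate of the sensitive behavior

truth = rng.binomial(1, true_prevalence, size=n)
forced = rng.binomial(1, p_forced_yes, size=n)

# The respondent says "yes" if forced to, or if the truthful answer is "yes".
answers = np.where(forced == 1, 1, truth)

# Back out prevalence: P(yes) = p_forced + (1 - p_forced) * prevalence
p_yes = answers.mean()
est = (p_yes - p_forced_yes) / (1 - p_forced_yes)
print(f"estimated prevalence: {est:.3f} (truth: {true_prevalence})")
```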
Jonathan Zinman talked about his recent work on measuring financial condition (see working paper here), which I really liked. They asked U.S. households a set of simple questions about their wealth, savings, and distress: "If you sold all your assets, would it be enough to cover all of your liabilities? If so, how much would you have left?" or "To what extent, if any, are finances a source of stress in your life?" They find that individuals who are "behavioral" in at least one dimension (i.e., deviate from classical assumptions about consumer choice) do worse in financial wellbeing on average, controlling for a set of relevant correlates. The elicitation of being behavioral is also interesting, but the idea that we could elicit financial wellbeing by asking a small number of simple questions is intriguing and probably deserves more exploration in developing-country contexts.
Enumerator effects:
Sarah Baird and Dean Karlan discussed enumerator effects, which ended up being one of the more lively and interesting discussions of the day – at least for me. Among the issues discussed were:
- Randomly assigning enumerators (or enumerator teams) to clusters or subjects vs. having your really good field supervisors optimize the assignments.
- Hiring and compensation strategies for enumerators to maximize both enumerator welfare/satisfaction and data quality, given the complex intrinsic and extrinsic motivations of field workers.
- Controlling for enumerator fixed effects in regression analysis: this is currently usually not pre-specified, which presents a dilemma. Most of us do think that there are enumerator fixed effects, which would improve power but might also alter treatment effects (when enumerators have not been randomly assigned but rather purposively assigned by field supervisors), so controlling for them seems like a good idea. But then we're subject to the Friedman-type criticism of ad hoc regression adjustment (see here and here): if we control for this, why not throw in a bunch of other baseline covariates? One solution could be to pre-specify the choice ahead of time (will or will not include enumerator dummies as controls in impact estimation). Another is to commit ahead of time to a method of choosing baseline covariates (e.g., will pick the set of X baseline covariates that minimizes the standard error of the treatment effect); a sketch of both ideas follows this list.
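As a rough illustration of that last bullet, here is a minimal sketch, with hypothetical variable names (y, treat, enum_id, x1-x3), of (1) an impact regression with enumerator fixed effects and (2) a pre-committable rule that picks the subset of baseline covariates minimizing the standard error of the treatment coefficient. This is not anything proposed at the workshop, just the idea in code.

```python
# Minimal sketch: (1) a treatment effect with enumerator fixed effects,
# (2) a pre-committable rule selecting the subset of baseline covariates
#     that minimizes the standard error of the treatment coefficient.
# Variable names (y, treat, enum_id, x1..x3) are hypothetical.
from itertools import combinations
import pandas as pd
import statsmodels.formula.api as smf

def fit_with_enum_fe(df: pd.DataFrame) -> float:
    """Treatment effect controlling for enumerator fixed effects."""
    m = smf.ols("y ~ treat + C(enum_id)", data=df).fit()
    return m.params["treat"]

def select_covariates(df: pd.DataFrame, candidates=("x1", "x2", "x3")) -> tuple:
    """Among all subsets of candidate baseline covariates, return the one
    that yields the smallest standard error on the treatment coefficient."""
    best, best_se = (), float("inf")
    for k in range(len(candidates) + 1):
        for subset in combinations(candidates, k):
            rhs = " + ".join(("treat",) + subset)
            se = smf.ols(f"y ~ {rhs}", data=df).fit().bse["treat"]
            if se < best_se:
                best, best_se = subset, se
    return best

# Example usage (with a dataframe containing the columns above):
# df = pd.read_csv("endline.csv")
# print(fit_with_enum_fe(df), select_covariates(df))
```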
Behavioral response to research and disclosure to respondents:
Markus Goldstein, using some of the great data collected under the Gender Innovation Lab that he leads, discussed some of the implications of comparing spouses' individual responses to the same questions, say, on labor market participation, ownership/control of plots, or power over consumption decisions. This is an inadequately explored but important area for thinking about intra-household bargaining power (as well as about asking a question of a person directly vs. a proxy), and Markus had some really interesting graphs to show us. One idea was to actually have the spouses play a version of the "dating game," where they try to predict each other's answers to questions, and to use this to sort out whether they're simply trying to sync their individual answers with their partners'…
We ended the day with Chris Udry asking people about their field practices for providing feedback to respondents. People agreed that some data collection, biomarker data for example, is immediately useful to the respondents by providing them previously unknown information about an important topic, say, health status, and linking them to resources to address any revealed problems. People also shared their experiences with sharing survey data with the respondents or their communities, such as preparing an interesting fact sheet and sharing it with each respondent during the next round of interviews or through community meetings, forums, notice boards, etc. Incentives for such activities are low, and PIs do worry that providing such information is an intervention in itself, so these are things to think about. My own view is that sharing impact evaluation results (or the design, at first follow-up) at the beginning of training in each round is greatly motivating for field teams, many of whom might work for the same project over multiple rounds (benefiting all parties involved): seeing the purpose of the questions, the impacts of the interventions, and the data that confirm or deny their priors can be greatly enjoyable, if not productive, for the field teams. The same worries about providing such information can be raised for field teams as above, but there are ways to minimize any potential bias from such exercises.
Let us know if you have ideas for measurement experiments to be embedded into RCTs, using the comments section, and we'll see if we can put you in touch with the right people…
Treatment and Measurement - keep 'em separated.
Berk, great post - measurement is a crucial area that needs much more interest and work in our field (development economics). Just wanted to say something that, I think, is understood (perhaps painfully obvious even) and you kind of get at it but just to be clear - it's important to keep treatment and measurement as separate from each other as possible. Here's what I mean. You say "My own view is that sharing impact evaluation results (or design at first follow-up) at the beginning of training at each round is greatly motivating for field teams". This is fine as long as: (a) field teams are not the same guys who administered the treatment (duh); and (b) field teams are kept away from knowledge of respondent treatment status (as much as this is feasible). So, what you say is fine as long as the surveyors are not clued in about the treatment status of their set of respondents (e.g., in what you say, it is fine to provide coarse results or sample averages or some such at first follow up but nothing more specific than that).
Hi Ali,
Thanks. Yes, (a) goes without saying. As for (b), you do the best you can, but in many cases, especially long-term studies with multiple rounds of data collection, enumerators do learn...
In the specific case where we have done this, it was in the fourth round of data collection with the results widely available in the media, so we calculated the downsides to be smaller than the upside of rewarding the people, most of whom had been repeat enumerators in our project over a period of five years...
Survey methodology is a highly developed field in which issues of measurement are central, so I wonder if this workshop included any of the many researchers who specialize in measurement and in conducting data collection in support of evaluations of economic development programs. The literature in this field is quite large and growing, so evaluation economists who are not familiar with it might start with work in the area of Total Survey Error (TSE) and the very practical website Comparative Survey Design Guidelines, which recommends best practices in each step of survey development, including selecting data collection companies, appropriate and tested methods for translation, standards for fieldwork observation, etc. and includes references to many of the foundational texts in this field in recent decades (http://ccsg.isr.umich.edu/). For researchers interested in the state of the discipline, the Second International Conference on Survey Methods in Multinational, Multiregional and Multicultural Contexts (3MC 2016) which will be held July 25 - 29, 2016 in Chicago could be of interest. The first 3MC resulted in the volume, "Survey Methods in Multinational, Multiregional and Multicultural Contexts," Wiley (2010).
Not to put too fine a point on it, but I imagine a few survey methodologists would be a tiny bit amused to discover that, while "economists have experiments all figured out," measurement will be the next focus of their attention.
Yes, Yu-Chieh (Jay) Lin from University of Michigan's Survey Research Center was in attendance and extended an invitation to the attendees to 3MC 2016 at the end of his talk.
Some of the research comparing GeoPoll surveys to face-to-face surveys is presented in this webinar, which I found quite useful: https://www.youtube.com/watch?v=DWHODx77ZLA
Thanks, Berk, for the summary. It sounds like a fascinating workshop.
Was there any discussion of the biases that potentially arise from the enumerators knowing the treatment assignments and/or of the respondents knowing the purpose of the survey? There is a very good story about a horse - one renowned for its skills in computation (a.k.a. "Clever Hans") - that illustrates what can go wrong when the enumerator desires a particular response and the respondent wants to keep the enumerator happy. In the case of IE-specific surveys, if enumerators are aware of the treatment assignments (as they are for most interventions clustered at anything above the individual level) and have priors over the treatment effects (which they will do for interventions with some prominence) and if the respondents (maybe out of pure hospitality) are keen to provide the answers that the enumerator wants to hear, then the same problem arises.
There are very few impact evaluations that could conceivably be made double-blind, but what may we do otherwise to mitigate the potential bias? Obviously, informing enumerators of treatment assignments should be avoided if possible, although this also precludes the addition to surveys of treatment-specific monitoring instruments (which are often useful) and what not. In addition, not being entirely forthcoming to respondents about the ultimate purpose of the survey also may help mitigate the risk of responses being affected by respondents' views on the treatment. Even with these measures, though, it's inevitable in most multi-staged surveys that enumerators will discern the treatment assignments and respondents will discern the purpose of the survey. Hence, there is often some potential for bias.
Given this problem, the other issues discussed in your earlier blog, and all the complications of trying to discern behavior from survey responses, I wonder whether we shouldn't be placing a higher premium on data from sources other than IE-specific surveys.
Hi Andrew,
Thanks. These are tough issues, which the one-day workshop did not spend a lot of time discussing. Generally, enumerators are blinded to treatment status, but, as you say, they can find out. Economics treatments obviously cannot be blinded, and there is not much we can do about that. Your point about other data sources is valid, but some IE-specific data collection will almost always be needed. Thinking hard about these issues, making calculated trade-offs, and triangulating information as much as possible seems like the best I can think of...
Thanks for the quick response.
In so far as these "Clever Hans effects" are a potential source of bias for many IEs that use their own data, and in so far as that bias ordinarily goes in a direction that aligns with the preferences of the research principals and the researchers, there is probably room for IE reports and papers to discuss how the potential for such biases has been mitigated. For instance, I don't think I've seen many reports/papers that discuss how the survey was introduced to respondents, what the enumerators knew of the treatment, and/or any differences between the contents of the survey instruments administered in the control and treatment groups. However, these details are probably as important as the many methodological details that are now de rigueur in the literature.
Even the CONSORT guidelines, which mandate standardized reporting, do not get into that kind of detail. Journal templates with space limitations do not help, either. Finally, unless you're experimenting with these aspects, the bias is unknown. So, yes, agreed that these are important considerations, but the solutions aren't easy to devise...
That would be a nice experiment, although brave is the soul that risks half of their power to run it on a full-blown IE. In the meantime, though, it'd be nice to see these issues incorporated into the guidelines and lists of best practices that are out there. As of now, I suspect that it's relatively common for enumeration practices to substantively differ by treatment status and, for all we know, this could be generating a fair few false positives.
I want to thank Sarah Hughes for her comment. She is entirely correct about the extent of currently available literature. Among the list of publications is the book Hard-to-Survey Populations edited by Tourangeau, Edwards, Johnson, Wolter, and Bates, published 2014. For research related to sensitive questions, the Demographic and Health Survey project funded by USAID has been asking questions about sexual activity and domestic violence in the developing world for many years.
Other organizations have also been conducting multiple surveys in the developing world including the Multiple Indicator Cluster Surveys (MICS) by UNICEF, and the Living Standards Measurement Surveys (LSMS) by the World Bank. Most of the data collected from these surveys is available online, with a required registration but no fee.
Until just the last few years there has been a real disconnect between those conducting surveys and research in survey methodology in the US and those conducting surveys elsewhere; the two groups were largely unaware of each other, much to the detriment of both groups. Fortunately this is beginning to change. For example, the World Bank has begun participating in the conferences held by the American Association for Public Opinion Research (AAPOR), and is forming a survey research advisory panel that includes the most recognized names in survey research.
Let's hope the trend continues.