About five years ago, soon after we started this blog, I wrote a post titled “Economists have experiments figured out. What’s next? (Hint: It’s Measurement).” Shortly afterwards, folks from IPA emailed me to say that we should experiment with some important measurement issues, making use of IPA’s network of studies around the world. At the time, we made some effort to organize an initial workshop to discuss what we know and what we need to know, but things fizzled out. Fast forward to this spring, when IPA did send out an invitation to a workshop on measurement issues (attendees found out that this is actually the second such workshop; the first took place in 1999, when Dean Karlan met Chris Udry for the first time, and the two organized last week’s workshop at Yale).
The workshop, with its 30 or so attendees mixing academics from diverse fields with data collection/technology firms, had a nice format. Formal presentations were all but discouraged, and there were none. Each of the five themes (electronic data collection; survey design; respondent disclosure and hard-to-measure concepts; enumerator effects; and behavioral response to research and disclosure to respondents) was assigned discussion leads, each of whom spoke for about 5-10 minutes (sometimes making use of a few slides) and then opened up the discussion. Sessions were broken up by long breaks, which allowed us not to worry about going overtime and to continue interesting discussions over coffee/lunch. This worked quite well, and I recommend that more workshop organizers adopt such flexible schedules: not having formal papers to present helps…
Here are a few things that stuck with me throughout the day:
Electronic data collection:
- Shawn Cole talked about some experiments they ran comparing phone-based surveys with in-person ones. Data collection by mobile phone led to substantially higher attrition rates than in-person surveys, though part of this could be down to the setup of the experiment. Furthermore, if surveys by mobile phone are much cheaper, one could draw larger samples to account for the higher attrition: supporting this point, there did not seem to be differential attrition in the phone surveys.
- People also talked about outfits such as GeoPoll and Premise that crowdsource data in developing countries using the latest technologies available through mobile phones. While these seem promising and useful for some specific uses or clients, it’s hard to see them becoming mainstream for development economists in the very short run: GeoPoll’s samples were said to be wealthier, more urban, and younger, while the examples from Premise, such as creating detailed time-series price data through crowdsourced (and validated) photographs from around the country, raise issues of representativeness.
- Niall Kelleher mentioned work using call detail records – people are exploring uses similar to poverty maps (small-area estimation) with such data. But access is not yet easy and coverage is still an issue; a few years from now, we may not be worried about either.
- Computer-assisted modules are quite cool, but not exactly new. We were experimenting with things made much easier by tablets or UMPCs almost a decade ago, such as drop-down menus of respondents’ networks, frontloading data to verify mentions of names or retrospective answers in real time, using the tablets to help with random sampling during listing, etc. But these functions are now much more enhanced and routine, including audio recordings of conversations, embedding photos and videos into the survey, and so on.
- Finally, two people from SurveyCTO and IPA talked about the use of technology to improve data quality within data collection outfits: random audio audits, high-frequency checks, and the like are being used to reduce the prevalence of fraudulent or low-quality data.
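To make the high-frequency-check idea concrete, here is a minimal sketch of the kind of automated flags a field team might run on each day’s incoming interview data; this is not any particular firm’s implementation, and the column names and thresholds are hypothetical.

```python
import pandas as pd

# Hypothetical daily extract of interviews; all column names are illustrative.
# Expected columns: interview_id, enumerator, duration_minutes,
#                   n_dont_know, n_questions
def high_frequency_checks(df: pd.DataFrame) -> dict:
    flags = {}

    # 1. Duplicate interview IDs (possible re-submission or fabrication).
    flags["duplicate_ids"] = df[df.duplicated("interview_id", keep=False)]

    # 2. Implausibly short interviews (the 15-minute cutoff is arbitrary).
    flags["too_short"] = df[df["duration_minutes"] < 15]

    # 3. Enumerators whose "don't know"/refusal rate is an outlier relative
    #    to the team (more than two standard deviations above the mean).
    totals = df.groupby("enumerator")[["n_dont_know", "n_questions"]].sum()
    dk_rate = totals["n_dont_know"] / totals["n_questions"]
    flags["high_dk_enumerators"] = dk_rate[dk_rate > dk_rate.mean() + 2 * dk_rate.std()]

    return flags
```

Random audio audits complement such checks: a random subset of interview IDs is drawn and the recordings are reviewed against the submitted data.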
Survey design:
Andrew Dillon talked about survey design, including experiments he ran with my colleagues Kathleen Beegle, Jed Friedman, and John Gibson on the length of survey modules, short vs. long recall, whom to ask the questions (direct respondent vs. proxy), what keywords to use, the order of questions, the definition of a household, etc. We’ll have him blog on what he called the “methodology of survey methodology,” so I won’t get into the details here. He also talked about the use of wearable products, such as Fitbits or accelerometers, to measure physical activity for research purposes: this is an interesting avenue, but it has its challenges, such as non-compliance, and the simple act of having subjects wear a device or providing them with data may itself influence their behavior…
Sensitive topics:
I was in a session with Julian Jamison and Jonathan Zinman on sensitive topics and hard-to-measure concepts. Julian gave a couple of examples from his work where they used the randomized response technique and list experiments (see this short paper by Blair for a description of these techniques). I reflected on these a little in the context of asking sensitive questions, say, on sexual behavior or domestic violence, in developing countries. List experiments require people to count/add, possibly introducing noise to the data – especially if the list is long. Furthermore, unless the “innocent” items are completely unrelated to the study and have a known distribution, there is a chance that the treatment in your RCT has an effect on their distribution. However, as Julian pointed out, designing your common items that way makes your sensitive ones stand out even more. Overall, I am not sold on list experiments for my own purposes, but I am willing to experiment with them to see if they can work.
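For readers unfamiliar with the design, here is a minimal sketch of the standard difference-in-means estimator for a list experiment, using simulated data; the prevalence, list length, and item probabilities are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000                       # respondents, split randomly between the two lists
true_prevalence = 0.25         # assumed share engaging in the sensitive behavior
p_innocuous = [0.3, 0.5, 0.7]  # assumed probabilities of the 3 innocuous items

# Each respondent reports only a COUNT of items that apply to them.
innocuous = rng.binomial(1, p_innocuous, size=(n, 3)).sum(axis=1)
sensitive = rng.binomial(1, true_prevalence, size=n)
treat = rng.integers(0, 2, size=n)   # 1 = long list (includes the sensitive item)
counts = innocuous + treat * sensitive

# Prevalence estimate: difference in mean counts between the two lists.
est = counts[treat == 1].mean() - counts[treat == 0].mean()
print(f"estimated prevalence: {est:.3f}")  # close to 0.25 in large samples
```

The concerns above map directly onto this setup: counting errors add noise to the reported totals, and if the RCT’s treatment shifts the distribution of the innocuous items, comparisons of estimated prevalence between treatment and control become harder to interpret.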
The randomized response technique seems more promising, especially now that we have tablets to help the respondent self-administer the module in private. The problem with this technique is that the respondent has to believe that her responses truly cannot be identified. Furthermore, even when prompted to answer “yes” regardless of the truth, she might refuse to do so. I feel that there are technological solutions and behavioral nudges here that can reduce these problems and reveal a “closer to the truth” level of behaviors: when the tablet randomly picks “yes” as opposed to “truth,” it can automatically record “yes;” when it randomly asks for the “true answer,” it can immediately crumple up the response and toss it into a rubbish bin that visually destroys it within seconds, etc. I intend to run such experiments in future work, comparing them with more traditional uses of this technique to see if such nudges make a difference…
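To fix ideas on why the forced-“yes” design still identifies prevalence, here is a minimal sketch of the estimator, under the strong assumption that respondents comply (i.e., answer truthfully whenever the device asks for the truth); the forcing probability and prevalence are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
true_prevalence = 0.15   # assumed share with the sensitive trait
p_forced_yes = 0.3       # probability the tablet records "yes" regardless of truth

truth = rng.binomial(1, true_prevalence, size=n)
forced = rng.binomial(1, p_forced_yes, size=n)
# Observed answer: forced "yes" with probability p, otherwise the truth.
observed_yes = np.where(forced == 1, 1, truth)

# P(yes) = p + (1 - p) * prevalence  =>  invert to recover prevalence.
y_bar = observed_yes.mean()
est = (y_bar - p_forced_yes) / (1 - p_forced_yes)
print(f"estimated prevalence: {est:.3f}")  # close to 0.15 in large samples
```

The worry in the paragraph above is precisely that compliance fails: if some respondents refuse the forced “yes,” the observed rate falls and the estimate is biased downward, which is what the proposed nudges are meant to address.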
Jonathan Zinman talked about his recent work on measuring financial condition (see working paper here), which I really liked. They asked a bunch of simple questions to households in the U.S. about their wealth, savings, and distress: “If you sold all your assets, would it be enough to cover all of your liabilities? If so, how much would you have left?” or “To what extent, if any, are finances a source of stress in your life?” They find that individuals who are “behavioral” in at least one dimension (i.e., deviate from classical assumptions about consumer choice) do worse in financial wellbeing on average, controlling for a bunch of relevant correlates. The elicitation of being behavioral is also interesting, but the idea that we could elicit financial wellbeing with a small number of simple questions is intriguing and probably deserves more exploration in developing country contexts.
Enumerator Effects:
Sarah Baird and Dean Karlan discussed enumerator effects, which ended up being one of the more lively and interesting discussions of the day – at least for me. Among the issues discussed were:
- Randomly assigning enumerators (or enumerator teams) to clusters or subjects vs. having your really good field supervisors optimize the assignments.
- Hiring and compensation strategies for enumerators that maximize both enumerator welfare and satisfaction and data quality, given the complex intrinsic and extrinsic motivations of field workers.
- Controlling for enumerator fixed effects in regression analysis: this is currently usually not pre-specified, which presents a dilemma. Most of us do think that there are enumerator effects; controlling for them would improve power but might also alter treatment effect estimates (when enumerators have been assigned purposively by field supervisors rather than randomly), so including them seems like a good idea. But then we’re subject to the Friedman-type criticism of ad hoc regression adjustment (see here and here): if we control for this, why not throw in a bunch of other baseline covariates? One solution could be to pre-specify the choice ahead of time (will or will not include enumerator fixed effects as controls in impact estimation). Another is to commit ahead of time to the method of choosing baseline covariates (e.g., pick the set of X baseline covariates that minimizes the standard error of the treatment effect). A small simulation below illustrates the with/without comparison.
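Here is that small, entirely hypothetical simulation, in which enumerators are randomly assigned to interviews; the variable names, magnitudes, and data-generating process are made up.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n, n_enum = 1200, 12

df = pd.DataFrame({
    "treat": rng.integers(0, 2, size=n),            # randomized treatment
    "enumerator": rng.integers(0, n_enum, size=n),  # random assignment of interviews
})
enum_shift = rng.normal(0, 0.5, size=n_enum)        # enumerator-specific effects
df["y"] = 1.0 * df["treat"] + enum_shift[df["enumerator"]] + rng.normal(0, 1, size=n)

no_fe = smf.ols("y ~ treat", data=df).fit()
with_fe = smf.ols("y ~ treat + C(enumerator)", data=df).fit()

# With random assignment of enumerators, the two point estimates should be
# similar, but the fixed-effects specification soaks up enumerator noise and
# typically yields a smaller standard error on 'treat'. Under purposive
# assignment, the point estimates can diverge, which is the source of the
# dilemma discussed above.
print(no_fe.params["treat"], no_fe.bse["treat"])
print(with_fe.params["treat"], with_fe.bse["treat"])
```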
Behavioral response to research and disclosure to respondents:
Markus Goldstein, using some of the great data collected under the Gender Innovation Lab that he leads, discussed some of the implications of comparing spouses’ individual responses to the same questions, say, on labor market participation, ownership/control of plots, or power over consumption decisions. This is an under-explored but important area for thinking about intra-household bargaining power (as well as about asking a question of a person directly vs. a proxy), and Markus had some really interesting graphs to show us. One idea was to actually have the spouses play a version of the “dating game,” in which they try to predict each other’s answers and use this to sort out whether they’re simply trying to sync their individual answers with their partners’…
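A minimal sketch of the kind of spousal comparison involved, assuming a hypothetical couple-level dataset in which each spouse’s answer to the same yes/no question is stored in its own column (all variable names and values are invented):

```python
import pandas as pd

# Hypothetical couple-level data: each spouse's answer to the same yes/no
# question is recorded in a separate column (1 = yes, 0 = no).
df = pd.DataFrame({
    "husband_owns_plot": [1, 0, 1, 1, 0],
    "wife_owns_plot":    [0, 0, 1, 1, 1],
    "husband_wife_decides_purchases": [1, 1, 0, 1, 0],
    "wife_wife_decides_purchases":    [1, 1, 1, 1, 0],
})

# Discordance rate by question: the share of couples whose answers disagree.
for q in ["owns_plot", "wife_decides_purchases"]:
    disagree = (df[f"husband_{q}"] != df[f"wife_{q}"]).mean()
    print(f"{q}: {disagree:.0%} of couples disagree")
```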
We ended the day with Chris Udry asking people about their field practices for providing feedback to respondents. People agreed that some data collection, biomarker data, for example, is immediately useful to respondents, giving them previously unknown information about an important topic, say, health status, and linking them to resources to address any revealed problems. People also shared their experiences with sharing survey data with respondents or their communities, such as preparing an interesting fact sheet and distributing it to each respondent during the next round of interviews, or through community meetings, forums, notice boards, etc. Incentives for such activities are low, and PIs do worry that providing such information is an intervention in itself, so these are things to think about. My own view is that sharing impact evaluation results (or the design, at first follow-up) at the beginning of training at each round is greatly motivating for field teams, many of whom might work on the same project for multiple rounds (benefitting all parties involved): seeing the purpose of the questions, the impacts of the interventions, and the data that confirm or deny their priors can be greatly enjoyable, if not productive, for field teams. The same worries about providing such information apply to field teams as well, but there are ways to minimize any potential bias from such exercises.
Let us know in the comments section if you have ideas for measurement experiments to be embedded into RCTs, and we’ll see if we can put you in touch with the right people…