Researchers put a lot of effort into developing survey questionnaires designed to measure key outcomes of interest for their impact evaluations. But every now and then, despite efforts to pilot and fine-tune surveys, some questions end up “not working”. The result is data that are so noisy, or missing for so many observations, that you may not want to use them in the final analysis. Just as pre-analysis plans have a role in specifying in advance which variables you will use to test which hypotheses, perhaps we should also specify some rules in advance for when we won’t use the data we’ve collected. This post is a first attempt at doing so.
What do I mean by the questions “didn’t work”?
By “didn’t work” I mean that the question likely has an incredibly high level of noise relative to any signal it contains, or that it was not answered by enough people to provide credible evidence, or that in practice it did not end up measuring the concept you intended it to measure. Let me give a couple of examples from my own recent impact evaluations as illustrations:
- High non-response: In a recent matching grant evaluation in Yemen, we had to quickly field an endline survey of firms by phone before civil conflict broke out. Despite the survey firm expressing reservations that Yemeni firms would be willing to disclose financial records over the phone in this environment, we considered sales important enough to try to measure as the last question in the survey. Unfortunately, 51 percent of firms claimed they did not know, or refused to answer. (All but one of the firms were willing to answer a follow-up question asking which of 13 ranges their sales fell into.)
- High noise: In my Nigeria business plan competition evaluation, and in an ongoing business training evaluation in Kenya, one approach we tried for eliciting firm sales and profits was to ask firms i) what their main product or service was; ii) the price per unit; iii) the cost of making it per unit; and iv) how many units they had sold in the last week or last month. The aim was to use this to measure sales and mark-up profits on the main item. However, in both countries the feedback from the field was that these were some of the hardest questions for firm owners to answer. The resulting coefficients of variation for mark-up profits and sales of the main product in Kenya (even after truncating at the 1st and 99th percentiles) were around 2.05, compared to 1.26 and 1.43 for my usual questions asking about profits and sales directly. This makes a huge difference to power: power for detecting a 20% increase in profits with treatment and control groups of 1,000 each is 94% with a CV of 1.26, versus 58% with a CV of 2.05. Mark-up profits are also less correlated than directly asked profits with baseline values of education, capital stock, number of workers, business practices, etc. (R2 of 0.03 vs. 0.10).
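The power comparison above can be checked with a back-of-the-envelope calculation. The sketch below is my own illustration, not code from the evaluations: it assumes a two-sided z-test at the 5% level and normalizes the control mean to one, so the standardized effect size is simply the proportional treatment effect divided by the CV.

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def power_cv(effect_frac, cv, n_per_arm, z_crit=1.96):
    """Approximate power of a two-sided two-sample z-test for a
    proportional treatment effect, with outcome noise expressed as a
    coefficient of variation (sd / mean)."""
    d = effect_frac / cv                 # standardized effect size
    se = math.sqrt(2.0 / n_per_arm)      # SE of the standardized difference in means
    return norm_cdf(d / se - z_crit)

print(round(power_cv(0.20, 1.26, 1000), 2))  # -> 0.94
print(round(power_cv(0.20, 2.05, 1000), 2))  # -> 0.59 (about 58-59%)
```

The small gap from the 58% quoted in the text likely reflects rounding or the use of a t-test rather than a z-approximation.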
- Not measuring what we thought it measured: In work on female employment in Jordan, we wanted to measure “empowerment” as one outcome of being employed. We used some relatively standard questions, including a set on whether women could travel alone to places like a friend’s house and the market. My concern here is that in practice these questions seemed to be picking up how much non-work time people had: with more time on their hands, people went to more places.
So how might we pre-specify when not to use a variable?
Sometimes this issue arises when you are trying several questions to measure a difficult-to-measure concept (e.g. empowerment or firm profits) and are not sure which will work best in a large-sample setting. You will then have multiple measures of the outcome of interest. One approach is to aggregate them to extract the common underlying signal. But another is simply to throw out the variables that seem least informative. This could involve deciding on rules like:
- Do not use a variable if you have another variable measuring the same outcome that i) has a coefficient of variation at least 25% lower and ii) has a stronger correlation with key baseline characteristics you would expect to be correlated with the outcome, and iii) the first variable’s realized coefficient of variation leaves you with less than 80 percent power to detect your treatment effect of interest.
- Do not use a variable if it has more than 20% item non-response and you have another variable with a higher response rate that also satisfies the first bullet point. Non-response should be defined in terms of cases where the respondent should have answered (so do not count firms which have shut down as non-responses for sales and profits, or people who are not working as having missing labor income, etc.).
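To make these rules concrete, here is a rough sketch of how they might be checked mechanically at analysis time. All function names here are my own invention for illustration; the baseline-correlation condition (part ii of the first rule) is left out since it requires baseline data merged in, and the power formula assumes a two-sided z-test at the 5% level.

```python
import math
import statistics

def cv(values):
    """Coefficient of variation (sd / mean) of the non-missing values."""
    obs = [v for v in values if v is not None]
    return statistics.pstdev(obs) / statistics.mean(obs)

def nonresponse_rate(values):
    """Share of observations with a missing (None) response."""
    return sum(v is None for v in values) / len(values)

def power_at(effect_frac, cv_value, n_per_arm, z_crit=1.96):
    """Power of a two-sided z-test for a proportional effect, given a CV."""
    z = effect_frac / cv_value * math.sqrt(n_per_arm / 2.0) - z_crit
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def drop_candidate(candidate, alternative, effect_frac, n_per_arm,
                   cv_gap=0.25, max_nonresponse=0.20, min_power=0.80):
    """Pre-specified screen: drop `candidate` in favour of `alternative` if
    (rule 1) the alternative's CV is at least 25% lower and the candidate's
    realized CV leaves power below 80%, or (rule 2) the candidate has more
    than 20% item non-response while the alternative responds better.
    (The baseline-correlation condition is not checked here.)"""
    cv_c, cv_a = cv(candidate), cv(alternative)
    rule1 = (cv_a <= (1.0 - cv_gap) * cv_c
             and power_at(effect_frac, cv_c, n_per_arm) < min_power)
    rule2 = (nonresponse_rate(candidate) > max_nonresponse
             and nonresponse_rate(alternative) < nonresponse_rate(candidate))
    return rule1 or rule2
```

For example, a noisy sales variable with 30% of values missing would be dropped in favour of a fully answered ranged-sales variable under rule 2, while a variable compared against an equally good alternative is kept.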
As noted, this post outlines my current and evolving thinking on this. It would be great to hear whether others have experience or examples they would like to share on this point, or thoughts on the above criteria.