An addendum to pre-analysis plans: Pre-specifying when you won’t use data collected


Researchers put a lot of effort into developing survey questionnaires designed to measure key outcomes of interest for their impact evaluations. But every now and then, despite efforts at piloting and fine-tuning surveys, some of the questions end up “not working”. The result is data that are so noisy and/or missing for so many observations that you may not want to use them in the final analysis. Just as pre-analysis plans have a role in specifying in advance what variables you will use to test which hypotheses, perhaps we also want to specify some rules in advance for when we won’t use the data we’ve collected. This post is a first attempt at doing so.

What do I mean by the questions “didn’t work”?
By “didn’t work” I mean that the question likely has an incredibly high level of noise relative to any signal it contains, or that it was not answered by enough people to provide credible evidence, or that in practice it didn’t end up measuring the concept that you intended it to measure. Let me give a few examples from my own recent impact evaluations as illustrations:

  1. High non-response: In a recent matching grant evaluation in Yemen, we had to quickly field an endline survey of firms by phone before civil conflict broke out. Despite the survey firm expressing reservations about whether Yemeni firms would be willing to disclose financial records over the phone in this environment, we considered sales important enough to try to measure as the last question in the survey. Unfortunately, 51 percent of firms claimed they did not know, or refused to answer. (All but one firm was willing to answer a follow-up question asking which of 13 ranges their sales fell into.)
  2. High noise: In my Nigeria business plan competition evaluation, and an ongoing business training evaluation in Kenya, one of the approaches we tried to elicit firm sales and profits was to ask firms i) what the main product or service they sold was; ii) what the price per unit was; iii) what their cost of making it per unit was; and iv) how many units they had sold in the last week or last month. The aim was then to use these answers to measure sales and mark-up profits on the main item. However, in both countries the feedback from the field was that these were some of the hardest questions for firm owners to answer. The resulting coefficients of variation for mark-up profits and sales of the main product in Kenya (even after truncating at the 1st and 99th percentiles) were around 2.05, compared to 1.26 and 1.43 for my usual questions just asking profits and sales directly. This makes a huge difference to power: power for detecting a 20% increase in profits with treatment and control groups of 1000 each is 94% with a CV of 1.26, vs 58% with a CV of 2.05. The mark-up profits are also less correlated with baseline values of education, capital stock, number of workers, business practices, etc. than are directly asked profits (R² of 0.03 vs 0.10).
  3. Not measuring what we thought it measured: In work on female employment in Jordan, we wanted to measure “empowerment” as one outcome of being employed. We used some relatively standard questions, including a set of questions on whether women could travel alone to a set of places like a friend’s house and the market. Here my concern is that in practice these questions seemed to be picking up just how much non-work time people had: with more time on their hands, people went to more places.
You can also imagine other types of “didn’t work”, such as questions which were misinterpreted or mistranslated, questions like test questions where it seemed there was cheating involved, etc. It is also not uncommon to get feedback from the enumerators and field team saying that a particular question just didn’t seem to be understood by people, or seemed to be answered in a way we didn’t anticipate.
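To make the power comparison in the high-noise example concrete, here is a minimal sketch using a normal approximation for a two-sided, two-sample test of means. It assumes equal variances and equal arm sizes, a 5% significance level, and reads the 20% increase as a fraction of the control mean, so that the standardized effect is 0.20 divided by the CV. The function name `approx_power` is a hypothetical label, and this is not necessarily the calculation behind the figures quoted above, although it gives values close to them.

```python
from statistics import NormalDist


def approx_power(cv, pct_effect, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided two-sample test of means.

    Assumptions (not from the post): normal approximation,
    equal arm sizes, equal variances in both arms.
    """
    std_normal = NormalDist()
    # Effect in SD units: (pct_effect * mean) / (cv * mean) = pct_effect / cv
    effect_in_sd = pct_effect / cv
    # Standard error of the difference in means, also in SD units
    se = (2.0 / n_per_arm) ** 0.5
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Ignores the negligible chance of rejecting in the wrong direction
    return std_normal.cdf(effect_in_sd / se - z_crit)


# CVs reported for directly asked profits (1.26) and mark-up profits (2.05)
for cv in (1.26, 2.05):
    print(f"CV = {cv:.2f}: power = {approx_power(cv, 0.20, 1000):.2f}")
```

Under these assumptions the two cases come out at roughly 94% and a little under 60%, in line with the numbers above. Small differences from any given software package are to be expected, since packages differ in the approximation they use.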

So how might we pre-specify when not to use a variable?
Sometimes this issue arises when you are trying several questions to measure a difficult-to-measure item (e.g. empowerment, firm profits, etc.) and are not sure which will work best in a large sample setting. Then you will have multiple measures of the outcome of interest. One approach is to attempt to aggregate them to extract the common underlying signal. But another is to just throw out the variables that seem least informative. This could involve deciding rules like:
  • Do not use a variable if you have another variable also measuring the same outcome that i) has a coefficient of variation that is at least 25% lower; ii) also has a stronger correlation with key baseline characteristics you would expect to be correlated with the outcome; and iii) you do not have at least 80 percent power to detect your treatment effect of interest given the realized coefficient of variation.
  • Do not use a variable if you have more than 20% item non-response for it, and you have another variable which has higher response rates and also satisfies the first bullet point. Non-response here should be defined only over respondents who should have answered (so do not count firms which have shut down as non-responses for sales and profits, or people not working as non-responses due to missing labor income, etc.).
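One way to see how the two rules above could be written down in advance is to encode them as simple checks on summary statistics. The sketch below is only an illustration: the `Measure` class, its field names, and the function names are hypothetical, and the second rule is implemented under one possible reading, namely that the alternative variable must have a higher response rate and meet conditions i) and ii) of the first rule. The CV, R², and power values in the example are the Kenya figures reported earlier; non-response rates were not reported for those variables, so that field is left at a placeholder of zero.

```python
from dataclasses import dataclass


@dataclass
class Measure:
    """Summary statistics for one candidate outcome variable (hypothetical layout)."""
    name: str
    cv: float                  # coefficient of variation
    baseline_r2: float         # R^2 from regressing the outcome on baseline covariates
    power: float               # power to detect the treatment effect of interest
    nonresponse: float = 0.0   # item non-response among those who should have answered


def alternative_is_cleaner(candidate, alternative, cv_gap=0.25):
    """Conditions i) and ii): alternative has a CV at least 25% lower
    and a stronger correlation with baseline characteristics."""
    return (alternative.cv <= (1 - cv_gap) * candidate.cv
            and alternative.baseline_r2 > candidate.baseline_r2)


def drop_for_noise(candidate, alternative, min_power=0.80):
    """First rule: i), ii), and iii) candidate has less than 80% power."""
    return (alternative_is_cleaner(candidate, alternative)
            and candidate.power < min_power)


def drop_for_nonresponse(candidate, alternative, max_nonresponse=0.20):
    """Second rule, under the reading described in the lead-in (an assumption)."""
    return (candidate.nonresponse > max_nonresponse
            and alternative.nonresponse < candidate.nonresponse
            and alternative_is_cleaner(candidate, alternative))


# Kenya profit measures, using the figures quoted in the post
markup = Measure("mark-up profits", cv=2.05, baseline_r2=0.03, power=0.58)
direct = Measure("directly asked profits", cv=1.26, baseline_r2=0.10, power=0.94)

print(drop_for_noise(markup, direct))   # is the mark-up measure dropped?
print(drop_for_noise(direct, markup))   # is the direct measure dropped?
```

With these inputs the first rule drops the mark-up measure and keeps the directly asked one. The thresholds are passed as arguments so that a pre-analysis plan could state its own values.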
The more difficult question is what to do when this is the only measure you have of a variable of interest, and therefore you are deciding whether to report impacts on this outcome or not. The idea here is to avoid a situation where you either find a significant effect and so leave it in, or find an insignificant effect and then say “this is probably because the variable wasn’t measured very well”. It would be better to decide ex ante under what circumstances you deem the variable so tainted that it is better not to report the outcome and so avoid suspect interpretations, versus reporting it with caveats. Here I can imagine versions of the above two bullets, but perhaps with higher thresholds for when you shouldn’t use the variable.

As I note, this post is designed to outline my current and evolving thinking on this. It would also be great to hear whether others have experience or examples they would like to share on this point, or thoughts on the above criteria.


David McKenzie

Lead Economist, Development Research Group, World Bank

Join the Conversation

May 02, 2016

hi David -- these are really hard issues. My first instinct would be to worry about what biases might accompany any rules like this. For example it seems obvious that there is no point using a variable that has no variation. But of course a variable with no variation is one for which there was no treatment effect (all the patients are still sick). So discarding such variables could introduce a kind of fishing. Similarly a noisy variable might only be noisy in one arm because of an absence of an effect.
So some other quick reactions:
* Whatever rule you use, investigate the kind of effects it might have for bias or MSE, e.g. use Monte Carlo simulation (soon should be able to help with this I hope!) to get a handle on the conditions under which the rule introduces biases
* Such exercises might reveal the need for some form of correction
* Use criteria that do not sneak in information on treatment effects -- eg look at variation in the control group only.
* Be cautious about applying the rule if the criterion performs differently in treatment and control groups -- eg if there is differential non response (of course failure to find a difference between groups is not a guarantee that there are no differences)
* Rather than dropping why not report everything but provide flags (eg indicate which analyses are underpowered)

David McKenzie
May 02, 2016

Thanks for these thoughts. I totally agree in terms of wanting to use control group information only to avoid treatment effect issues - although I think it is worth also thinking more carefully about what to do if the treatment group decides not to respond to one set of measures (perhaps because they are fatigued from your intervention) while the control group does - this is clearly a treatment effect, but may then be completely uninformative for telling you about a particular outcome.
I'm not sure "report everything but provide flags" can completely deal with this issue, since i) we may not want to include in multiple testing and indices measures which are not useful and just serve to make it harder for us to detect impacts on our other measures after correcting for multiple testing or averaging in some noise; ii) I do worry about appendix arms races, where the appendices get to be several times longer than (30-40 page) papers, and see part of the role of the pre-analysis plan as being to prioritize what you will look at and report.
But totally agree these are tough questions and I'm not sure what I suggest here is the right approach either. So more thoughts/comments/critiques very welcome.