Guest Post by Eva Vivalt
Impact evaluations have exploded in development. But how much do we learn from them? At the most basic level, this depends on how well researchers report their results. If a paper’s findings are not clear, much of the value of having done an impact evaluation is lost. Apart from the funds allocated to the evaluation itself (an average of approximately $500,000 per impact evaluation at the World Bank[1]), it’s a wasted opportunity.
In this post, I describe some facts based on a database of impact evaluation results collected by AidGrade, a U.S. non-profit that I founded. AidGrade focuses on gathering the results of impact evaluations and analyzing the data in different ways, including through meta-analysis. The most comprehensive data they have on impact evaluation results come from their meta-analyses, of which they have currently completed ten, covering such areas as microfinance and school feeding programs, with another batch on the way. They use very broad search criteria, so that a paper is counted as an impact evaluation so long as it tries to identify a counterfactual. Both working papers and published articles or books are included. Characteristics of each paper are coded so that one can later restrict attention to, for example, only those papers that used a particular method (e.g. a randomized controlled trial) or that focused on a particular geographical area.
The subset of the data on which I focus here consists of the papers that passed all screening stages in the meta-analyses. Again, the search and screening criteria were very broad: after passing the full-text screening that determined a paper was in fact an impact evaluation on the subject, the vast majority of papers excluded at later stages were excluded merely because they had no outcome variables in common with the other papers on the same intervention (others did not report comparable data for a variety of quirky reasons, such as displaying results only graphically). A more detailed discussion of the methodology and codebook is provided here. The small overlap of outcome variables is a very surprising feature of the data and warrants discussion in a follow-up post. Ultimately, the data I draw upon today consist of 704 results (double-coded and then reconciled by a third researcher) across 98 papers covering the ten types of interventions; a larger dataset is expected to be released in the fall.
Some stylized facts immediately stand out:
1) Only about 45% of papers clearly report whether their results represent the intent-to-treat estimates or the treatment effect on the treated.
Intent-to-treat (ITT) estimates look at the effect on everyone assigned to receive the treatment, regardless of whether or not they actually took advantage of the program. The alternative is to estimate the treatment effect on the treated (TOT). For example, suppose that only 10% of people who were offered a bed net used it, and suppose for the sake of argument that bed nets were 90% effective at preventing malaria among those who used them. The TOT estimate would be 90%, while the ITT estimate would be only 9% (0.10 × 90%), since the effect is diluted across everyone offered a net. Clearly, if the authors don’t take care to explain which they are reporting, we really don’t know how to interpret the results!
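To make the arithmetic concrete, here is a minimal sketch in Python of how the two estimands relate under one-sided non-compliance (only people offered the program can take it up); the take-up rate and the effect on users are the hypothetical numbers from the example above, not figures from any real study.

```python
# Minimal sketch of the ITT/TOT arithmetic under one-sided non-compliance.
# The numbers below are the hypothetical bed-net figures from the example
# above, not data from any actual evaluation.

take_up_rate = 0.10   # share of those offered a bed net who actually used it
tot_effect = 0.90     # assumed effect on those who used the net (TOT)

# The intent-to-treat effect is the effect on users diluted by take-up:
itt_effect = take_up_rate * tot_effect

print(f"TOT estimate: {tot_effect:.0%}")   # 90%
print(f"ITT estimate: {itt_effect:.0%}")   # 9%
```

Running the arithmetic the other way (dividing an ITT estimate by the take-up rate) is the standard way a TOT estimate is recovered when only the offer is randomized, which is exactly why a reader needs to know which of the two a paper is reporting.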
The AidGrade coding scheme records whether the results were explicitly reported as ITT or TOT, but it also contains codes for cases in which the authors did not state this explicitly but it can be inferred from the rest of the paper. For example, if the authors wrote that an intervention had a certain effect on the treatment group and did not break this down further or discuss those who actually received the treatment, the result would be coded as one that could be inferred to be an ITT estimate even though ITT is not explicitly mentioned.
Overall, 45% of studies explicitly stated whether they were reporting ITT or TOT estimates. If we include the implied cases, closer to 90% of papers report this. While that is a great improvement, it still means that for roughly 10% of papers it is really unclear what they are reporting. Notably, this variable was the hardest to code, judging by the disagreement among coders, and we can presume that laypeople would have even greater difficulty interpreting the results; for them, the 45% statistic might be the more relevant one. This issue affected both working papers and published papers; published papers actually fared slightly worse, although the sample of working papers was small, so this difference isn’t very meaningful.
2) Only 75% of papers discuss attrition.
The definition of “discussing attrition” here is very broad: AidGrade used a liberal coding rule that recorded whether a paper made any mention of attrition whatsoever. The 75% of papers that discussed attrition didn’t necessarily take any measures to adjust their results accordingly; they simply mentioned whether anyone dropped out over the course of the study. Again, as readers, if we don’t know about the possible self-selection bias embedded in a paper’s results, we really do not know what the results mean.
3) Details about the intervention and its context are frequently left unstated.
For example, the timing of the intervention vis-à-vis the follow-up data collection is often unclear. Authors might write that a program started in 2010 and data collection was completed in 2012. Missing from this description: when the intervention ended, when data collection began, and finer-grained information such as the relevant months. The relative timing of the intervention and any follow-up data collection matters most for time-sensitive interventions, but a lot of papers could do better here.
Costs are also largely invisible. Few papers discuss the costs and benefits of the programs they evaluate, apart from an occasional back-of-the-envelope calculation at the end of the paper. When different kinds of interventions targeting the same outcome can vary in cost a thousandfold, costs become an essential part of determining whether an intervention is worth pursuing in the first place.
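To illustrate why this matters, here is a minimal back-of-the-envelope sketch; the two programs, their costs, and their effect sizes are purely hypothetical assumptions, not figures drawn from the AidGrade data.

```python
# Hypothetical back-of-the-envelope cost-effectiveness comparison.
# All programs, costs, and effect sizes below are illustrative assumptions.

programs = {
    # name: (cost per participant in USD, effect in standard deviations of test scores)
    "program_A": (2.0, 0.10),
    "program_B": (2000.0, 0.15),
}

for name, (cost, effect) in programs.items():
    cost_per_sd = cost / effect  # dollars per standard deviation of improvement
    print(f"{name}: ${cost_per_sd:,.0f} per standard deviation gained")
```

With broadly similar effect sizes but a thousandfold cost gap, the cost-effectiveness ranking is driven almost entirely by the cost side, which stays invisible unless authors report it.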
There is also a more general disregard for context. Knowing about the context of a program might be seen as a bit of a luxury: it doesn’t affect the internal validity of the study, and it matters little if one assumes the results have no external validity anyway. Still, the norms of development economics at least require authors to mention which country an impact evaluation was conducted in. Why not other, equally important information about how the program was implemented?
Apart from these general points, miscellaneous errors abound. Much as Tolstoy observed that happy families are all alike but each unhappy family is unhappy in its own way, well-reported impact evaluations are all alike, but every poorly reported impact evaluation is poorly reported in its own way. Sometimes an impact evaluation reports test scores, for example, without giving any clue as to the scale of the test or whether the scores are normalized. Or a paper provides the results only in chart form (a format that can be quite helpful for conveying results) without providing the actual numbers anywhere.
While researchers do a good job overall of presenting their findings, there is clearly still room for improvement. Authors, knowing their own papers very well, may sometimes neglect to mention key details that could help others interpret their results. When so many resources have been put into an impact evaluation, and when results could potentially feed back into policy, it’s important to pay more attention to reporting.
Eva Vivalt is a Young Professional in the Development Research Group at the World Bank.
[1] Independent Evaluation Group (2012). “World Bank Group Impact Evaluations: Relevance and Effectiveness”.