Suppose the government launches a new program to try to help firms export. One reason we do impact evaluations and conduct randomized experiments is that we are not sure whether such a program will have its intended impact. However, when sample sizes are relatively small, or outcomes are noisy, we can end up with low statistical power: our standard hypothesis tests are unable to reject the null of no impact, even when the point estimates are positive and of a magnitude that would pass a cost-effectiveness test.
But being “not sure” does not mean that governments, policymakers, and the participating firms themselves go into these programs completely agnostic about the likely effects. It would be unusual if, after carefully designing such a program, we thought it equally likely that the program will have massive negative effects and stop all firms from exporting, have very modest effects, or create a whole batch of export superstars. Yet this is effectively how we proceed in standard frequentist estimation – as if we had a completely diffuse prior. In practice, past experience from similar projects, knowledge of local context and constraints, and economic theory are all likely to inform beliefs about what the impact of such a program will be.
In a new working paper with Rachael Meager and Leonardo Iacovone, we show how Bayesian analysis can be used to formally incorporate informative priors into impact evaluation. We do this in the context of a multi-million dollar experiment testing a program designed to increase firm exports among 200 Colombian firms through improving their management practices. We summarize the steps and issues involved in applying Bayesian impact evaluation in a real-world setting, give examples of how this can help sharpen our understanding of the program’s impacts, and conclude by offering suggestions of settings where we think this approach will be most and least useful. A separate 2-page policy note summarizes the results of the experiment and implications for policy.
Steps involved in using Informative Priors for Bayesian Impact Evaluation
Step 1 - Elicit Informative Priors: The goal here is to bring in useful outside information – information external to the treatment and control outcome data that a standard frequentist impact evaluation relies on. Like a good newspaper story, this involves answering several questions:
What priors to elicit? Here you need to be very clear what parameter(s) you will be estimating, and what type of model will be used. In a linear regression framework, this means deciding on whether you want to obtain priors on the ITT, TOT, or some other parameter. We chose, and recommend, the ITT, since then priors can be combined across respondents – whereas under treatment heterogeneity, respondents who hold different views about the proportion and characteristics of compliers can end up giving priors over a different LATE than other respondents, or than the one that applies in the data. See our paper for discussion of more complexities involved in non-linear models.
When should priors be elicited? We think they are best elicited when the design of the program is clear, but before any follow-up data are collected – corresponding nicely with the pre-analysis plan and registration process.
Whose priors do we want? One approach would be to use the existing literature, either from a meta-analysis or related literature. But when trying something new, there may be limited existing literature. We think it useful to get priors of the policymakers who will be making decisions about the program (we collected priors from 7 high-level Colombian policymakers), and from academics who can bring in the existing literature and economic theories. We also collected priors from firms participating in the program.
How should priors be elicited? For Bayesian analysis we need a full prior distribution, not just a prediction of the mean effect size. We use a beans and bins approach to elicit full subjective distributions (e.g. respondents have a grid with different bins for the impact on the likelihood a firm exports, and then if they put 2 out of 20 beans or stones in the bin [10,15), it means they think there is a 10 percent chance the ITT is between 10 and 15 percentage points).
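To make the beans-and-bins idea concrete, here is a minimal Python sketch of turning one respondent's bean counts into a subjective probability distribution over the ITT. The bin edges and bean counts are hypothetical, chosen so that 2 of 20 beans sit in [10,15), matching the example above.

```python
# Convert a respondent's beans-and-bins answers into a subjective
# probability distribution over the ITT (in percentage points).
# Bin edges and bean counts below are hypothetical illustrations.

def bins_to_probs(bin_edges, beans):
    """Return (bin, probability) pairs from bean counts."""
    total = sum(beans)
    return [((lo, hi), b / total) for (lo, hi), b in zip(bin_edges, beans)]

def implied_mean(dist):
    """Expected ITT, using each bin's midpoint as its value."""
    return sum(((lo + hi) / 2) * p for (lo, hi), p in dist)

# 20 beans spread over bins of width 5 p.p.
edges = [(-5, 0), (0, 5), (5, 10), (10, 15), (15, 20)]
beans = [2, 6, 7, 2, 3]          # 2 beans in [10,15) -> a 10% chance
dist = bins_to_probs(edges, beans)
mean_itt = implied_mean(dist)    # midpoint-based expected ITT
```

The midpoint approximation is crude but shows how the elicited histogram already pins down both the center and the spread of a respondent's beliefs.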
Step 2 – Aggregate Priors and Fit a Distribution: Collecting priors from multiple respondents and aggregating them allows for a wisdom-of-crowds effect, making the results less noisy and less sensitive to the priors of any single individual. We aggregate the priors by type (academic, policymaker, firm) and then fit skewed normals and finite mixtures of Gaussians to obtain probability distributions we can use in analysis.
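A simple way to see the aggregation step is an equal-weight "linear opinion pool": averaging respondents' bin probabilities within a group. This Python sketch uses made-up respondent histograms and stands in for the paper's fuller approach of then fitting skew-normal or Gaussian-mixture distributions to the pooled histogram.

```python
# Equal-weight linear opinion pool: average several respondents'
# bin probabilities (over the same bins) into one group prior.
# Respondent histograms below are hypothetical.

def pool_priors(respondent_probs):
    """Average lists of bin probabilities across respondents."""
    n = len(respondent_probs)
    return [sum(ps) / n for ps in zip(*respondent_probs)]

policymakers = [
    [0.05, 0.25, 0.40, 0.20, 0.10],   # respondent 1
    [0.00, 0.15, 0.45, 0.30, 0.10],   # respondent 2
    [0.10, 0.20, 0.40, 0.20, 0.10],   # respondent 3
]
pooled = pool_priors(policymakers)    # still sums to one
```

Because averaging valid probability vectors yields another valid probability vector, the pooled histogram can then be smoothed by fitting a parametric distribution for use as the prior.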
Step 3 – Estimate the treatment effects and obtain the posterior distribution: We use an OLS framework, so that the only difference between our Bayesian and frequentist estimation comes from adding informative priors, not from modelling the data generating process in a different way. This involves using a Gaussian model with the informative priors elicited above for the treatment coefficient. A practical complication occurs when we have many other regressors in the model, such as when we include many randomization strata fixed effects. We place a hierarchical model on these coefficients to regularize them and shrink them towards the group mean. Then, once we have the likelihood model for the data and our priors, we use Markov chain Monte Carlo (MCMC) methods, via the Stan package in R, to compute the posterior distribution.
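The paper computes posteriors by MCMC in Stan, but the intuition for how the prior and the data combine can be seen in a toy closed-form case: when both the prior on the treatment coefficient and the sampling distribution of the OLS estimate are approximately normal, the posterior mean is a precision-weighted average of the two. All numbers in this Python sketch are hypothetical.

```python
# Normal-prior / normal-likelihood update for a single treatment
# coefficient: posterior mean = precision-weighted average of the
# prior mean and the OLS estimate. Numbers are hypothetical; the
# paper itself computes posteriors by MCMC in Stan.

def normal_update(prior_mean, prior_sd, estimate, se):
    w_prior = 1 / prior_sd ** 2        # prior precision
    w_data = 1 / se ** 2               # data precision
    post_var = 1 / (w_prior + w_data)
    post_mean = post_var * (w_prior * prior_mean + w_data * estimate)
    return post_mean, post_var ** 0.5

# Prior: ITT of +7 p.p. with sd 4; data: -0.5 p.p. with SE 4.3
post_mean, post_sd = normal_update(7.0, 4.0, -0.5, 4.3)

# With a much smaller SE, the posterior is pulled nearly all the
# way to the data - the "almost complete updating" case.
strong_mean, strong_sd = normal_update(7.0, 4.0, -0.5, 0.5)
```

The posterior standard deviation is always smaller than either the prior sd or the SE alone, which is exactly the precision gain when priors and data agree.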
Step 4 – Also use the fitted posteriors for decision analysis: once we have the posteriors, we can calculate the answers to questions like “what is the probability that the treatment effect is large enough to make the program cost-effective?”.
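Given posterior draws from the MCMC output, such decision questions reduce to counting draws. This Python sketch approximates the posterior with a normal distribution and uses a made-up cost-effectiveness threshold; with real output one would use the MCMC draws directly.

```python
import random

# Decision analysis from posterior draws: estimate the probability
# that the ITT exceeds a cost-effectiveness threshold. The posterior
# here is a stand-in normal; the threshold (2 p.p.) is hypothetical.

def prob_exceeds(draws, threshold):
    """Share of posterior draws above the threshold."""
    return sum(d > threshold for d in draws) / len(draws)

random.seed(42)
posterior_draws = [random.gauss(3.5, 2.0) for _ in range(10_000)]
p_cost_effective = prob_exceeds(posterior_draws, 2.0)
```

This is a statement frequentist confidence intervals cannot directly make, but one that maps naturally onto the question a policymaker actually asks.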
How do the Bayesian estimates compare to standard frequentist estimation?
We estimate the impacts of the Colombian program on different pre-specified export outcomes. Figure 1 looks at the extensive margin: whether firms export at all. 54 percent of the control group exported in 2019. The red interval at the bottom shows what we would obtain from our usual frequentist impact evaluation: the estimated ITT is -0.5 percentage points (p.p.), with a 95 percent confidence interval of [-9 p.p., +8 p.p.]. We show intervals that cover the 2.5th to 97.5th percentiles of the different prior distributions in light blue – showing that policymakers, firms, and academics all expected the program to increase the likelihood of exporting. The posterior intervals then look very similar to the red frequentist interval, showing almost complete updating – the signal in the data is strong enough relative to these priors that the posteriors move almost all the way towards the data.
Figure 1: Impact on the Extensive Margin of Whether Firms Export at all
Figure 2 examines a measure of how much variety there is in exporting, measured by the number of distinct product-country combinations that firms are exporting. This is an outcome for which the priors were that the program would not have much effect. This is also what the data show, with our frequentist confidence interval being centered at zero, and with a width of 3.8 product-countries. When our informative priors are consistent with the data, the resulting posterior intervals are narrower than we would get using the data alone – firms and policymakers did not expect much of an impact on variety, the data also show this, so they can be even more confident that there was not much increase in variety. This highlights a great potential use case for Bayesian IE – when results coincide with what was expected, we can improve precision.
Figure 2: Impact on Export Variety in 2019
Finally, Figure 3 looks at the impact on the value of exports. This outcome is highly skewed, with a mass at zero and a very long tail: in 2019, 94 of the 200 firms had zero exports, median exports were $5,029, mean exports were $396,000, the standard deviation was $1.2 million, and the 99th percentile was $6.7 million. Even when taking the inverse hyperbolic sine, or logs, statistical power is low for this outcome. Our point estimate is negative, but with a very wide confidence interval. Here we see the posterior distributions look much more similar to the priors than to the data. The data are noisy and not very informative, so whatever you thought was the likely effect going in, this study should not update your beliefs very much.
Figure 3: Impact on Export Value
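As an aside, the inverse hyperbolic sine transformation mentioned above behaves like the log for large values but is defined at zero – which matters when 94 of 200 firms have zero exports. A short illustration in Python (`math.asinh` is the IHS), echoing the summary statistics in the text:

```python
import math

# asinh(x) = log(x + sqrt(x^2 + 1)): defined at zero, and close to
# log(2x) for large x, so it accommodates firms with zero exports.
exports = [0, 5_029, 396_000, 6_700_000]     # values echo the text
transformed = [math.asinh(x) for x in exports]

# asinh(0) == 0, while log(0) is undefined; for the largest value,
# asinh(x) is essentially log(x) + log(2).
```

Even after this transformation, the heavy right tail leaves the outcome noisy, which is why power remains low here.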
When is Bayesian IE likely to be most useful?
These three cases above illustrate the different roles informative priors can play in analysis, and how this differs depending on how precisely the data enable you to estimate impacts on the outcome. For outcomes like whether firms export at all, the data are very informative, and having the priors can help communicate to a policy audience that despite the program not working as they expected, they should strongly update their beliefs given the data. A second use case is one in which the informative priors coincide with the data, allowing more precision – so in small samples one can be more confident in the results than if using the data alone. And finally, the results also make clear when an impact evaluation is not that informative for policy, given the noisiness of some outcomes.
Just as with pre-analysis plans, we see the use of this Bayesian approach with informative priors as being particularly useful for expensive and time-consuming experiments: if you are running a lab or online experiment and can easily redo the experiment or add a lot more sample, the gains are likely to be lower. We see these methods as especially useful for experiments with relatively small sample sizes, which is common when working with SMEs, health clinics, village-level programs, or schools in many settings. Finally, it is also likely to be useful when extended to looking at treatment effect heterogeneity, since power tends to be much lower for heterogeneity than main effects.
Join the Conversation
This is so great to see! Thank you for writing this clear, informative post on applying Bayesian inference to improve reporting of policy experiments. The only part I'd take issue with is the example of when not to use Bayesian methods for impact evaluation, "lab or online experiment." We found Bayesian hierarchical modeling perfect for an online experiment precisely because it had so many treatment arms (72 in our case!), something you can't often do with field trials.
Here is the paper: https://www.tandfonline.com/doi/abs/10.1080/19345747.2020.1716905
And a backgrounder on the power analysis, how we could use pooling to test so many treatment factors: https://journals.sagepub.com/doi/abs/10.1177/0193841X18818903
We varied 5 factors each with 2 or 3 levels, so 3x3x2x2x2 = 72, but they followed a hierarchical structure.
The intervention we tested was variations on web design, to test how low-income parents use information to choose schools for their kids.