# Design sandbox: Power calculations and optimal design for cost effectiveness (Part 1: The case of cash benchmarking)

## Intuition

When measuring the value of development interventions, it is useful not only to account for the impact of the interventions, but also to test whether those impacts are being achieved “cost effectively” — that is, whether an alternative intervention could achieve the same outcomes at lower cost. Unconditional cash transfers are growing in popularity as a cost effectiveness “benchmark” for other interventions. This is unsurprising given the flexibility of both scale and targeting for unconditional cash transfers, and the evidence showing unconditional cash transfers generate positive impacts on a broad range of development outcomes for households.

In a forthcoming paper, we analyze power in cash benchmarking experiments. While the focus of the paper is a meta-analysis of unconditional cash transfer size and persistence, in the process we realized that statistical power when comparing cash transfers of different sizes (or, more generally, comparing interventions with different associated program costs) is surprisingly limited. One obstacle to achieving well-powered studies is that designing an RCT to maximize our ability to test for differential cost effectiveness across arms is not immediately intuitive — our back-of-the-envelope calculations quickly proved insufficient here! This motivated us to provide power calculations and optimal design guidelines for cash benchmarking experiments, and a companion dashboard (with code here) implementing these calculations. Since the associated paper is still under construction, feedback from DI readers is very much appreciated!

In contrast to a traditional impact evaluation, with a treatment group that receives the intervention and a control group, a cash benchmarking experiment also includes a cash transfer arm, so that the impacts of the intervention can be compared to the impacts of the cash transfer. This complicates power calculations for a number of reasons. First, researchers have more comparisons of interest than in a traditional impact evaluation — program impacts, cost adjusted program impacts relative to cash transfer impacts, and cash transfer impacts themselves may all be of interest. Each additional comparison stretches the statistical power afforded by the experimental design. Second, the additional experimental arm gives researchers additional degrees of freedom for shaping statistical power — researchers may choose not only the number of observations in the control group, the treatment group, and the cash transfer group, but often also the size (or sizes) of the cash transfers.

## Power calculations and optimal design for evaluating cost effectiveness

To provide power calculations for cash benchmarking experiments, we begin by formalizing the researcher's problem. The researcher designs a cash benchmarking experiment to compare the impacts of the intervention of interest to the impacts of cost equivalent cash transfers. Specifically, they consider the following general model of the outcome \( Y_{i} \) of individual \( i \):

\[ Y_{i} = \alpha + f(\text{Cost}_{i}) + \beta \text{Program}_{i} + \epsilon_{i} \]

In this model, the flexible function \( f \) governs outcomes as a function of the cost of the cash benchmark — for an outcome improved by cash transfers, \( f \) would be increasing in the size (and therefore the cost) of the cash transfers delivered. The coefficient \( \beta \) then provides the excess impact of the intervention of interest relative to a cost equivalent cash transfer. The researcher then designs an experiment that randomizes individuals across the intervention of interest, a cash transfer (or multiple cash transfers), and a control group, in order to estimate \( f \) (the effect of cash transfers relative to the control group) and \( \beta \) (the effect of the intervention relative to cash transfers).
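To make the model concrete, the simulation below is a minimal sketch (our illustration, not the paper's code) that generates data from this model with a linear \( f \) and recovers the parameters by OLS. The arm sizes, transfer cost, and effect sizes are all hypothetical assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 3000  # observations per arm (illustrative)

# Three arms: control (Cost = 0), a UCT arm (Cost = 100), and a program arm
# (Cost = 100, Program = 1). Under the model, the program arm carries the
# cost-equivalent cash effect f(Cost) plus the excess program impact beta.
cost = np.repeat([0.0, 100.0, 100.0], m)
program = np.repeat([0.0, 0.0, 1.0], m)

# Assumed data-generating process: linear f(Cost) = 0.004 * Cost, beta = 0.1
y = 1.0 + 0.004 * cost + 0.1 * program + rng.normal(0.0, 1.0, 3 * m)

# OLS of Y on an intercept, Cost, and the Program dummy recovers
# (alpha, the slope of f, beta)
X = np.column_stack([np.ones(3 * m), cost, program])
alpha_hat, f_slope_hat, beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
```

Because the program arm's cost enters \( f \) directly, the estimated \( \beta \) is already net of the cost-equivalent cash effect.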

A few additional details are necessary to formalize the researcher's problem:

- First, the researcher chooses how much weight to put on the precision of different estimates (for example, the impact of the intervention relative to the control group vs. the impact of the intervention relative to a cost equivalent cash transfer).
- Second, the researcher assumes a particular model of \( f \) — with a linear model of the impact of cash transfers, for example, only one cash transfer arm would be necessary to estimate \( f \), while additional cash transfer arms would be necessary to estimate more flexible models.
- Third, given this model, the researcher chooses the number of cash transfer arms, and the size of the cash transfers in each arm.
- Fourth, given these details, the researcher chooses the number of observations to assign to the control group, the intervention group, and the cash transfer group(s), to maximize their preferred weighted average of the precision of the different estimates.

We note that this general formulation nests the optimal design problem for a number of more specific research designs — cash benchmarking experiments with one cash transfer arm, cash benchmarking experiments with multiple cash transfer arms in the presence of cost uncertainty (see discussion here), and estimation of the differential impacts, per unit of transfer, of relatively larger cash transfers. To support the design of these experiments, we developed a dashboard that solves for the optimal design in each of these three cases and provides associated power calculations. In today's post, we focus on applying the dashboard to the design of cash benchmarking experiments with one cash transfer arm, and return to the other cases in a second post.

## Cash benchmarking dashboard

**Researcher's problem**

To start, the dashboard (presented in the figure above) allows the user to select the key parameters of the researcher's problem described above. For a cash benchmarking experiment with one cash transfer arm, the dashboard focuses on estimation of the following reduced form:

\[ Y_{i} = \beta_{0} + \beta_{1} \text{UCT}_{i} + \beta_{2} \text{Program}_{i} + \epsilon_{i} \]

We allow the researcher to be interested in two estimands in this model. The first is the cost effectiveness of the program at increasing outcome \( Y \) — that is, \( \beta_{2} / \text{Program cost} \). The second is the cost effectiveness of the program at increasing outcome \( Y \) relative to the cash transfer arm — that is, \( \frac{\beta_{2}}{\text{Program cost}} - \frac{\beta_{1}}{\text{UCT cost}} \). The researcher's objective is to minimize a weighted average of the variances of the estimates of these two parameters (cost effectiveness and relative cost effectiveness). The researcher sets the weight, which governs the extent to which they care about estimating cost effectiveness precisely vs. estimating relative cost effectiveness precisely.

We then allow the researcher to select the size of the cash transfer, that is, \( \text{UCT cost} \). By comparing the cost effectiveness of the program to the cost effectiveness of the UCT, even when the program and the UCT have different costs, we implicitly assume that the effect of the UCT is linear in its size in order to infer the impacts of a cash transfer that is cost equivalent to our program of interest.

Conditional on the above choices, the dashboard then minimizes the weighted average of the variances of the estimates of cost effectiveness and relative cost effectiveness, by choosing the optimal allocation of observations across the control group, the program group, and the cash transfer group. The dashboard then reports this optimal design, the variance of the two estimates, and associated power calculations.
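As a rough illustration of what the solver does (our sketch, not the dashboard's actual code), the snippet below writes down the standard asymptotic OLS variances for the two-dummy regression, forms the weighted objective, and grid-searches over allocation shares. Costs are normalized, \( \sigma^{2} = 1 \), and the weight of 0.5 is an illustrative choice.

```python
import numpy as np

def weighted_variance(pi_c, pi_u, pi_p, w, cost_u=1.0, cost_p=1.0):
    """Weighted average of Var(CE) and Var(RCE), scaled by N / sigma^2.

    Asymptotic OLS variances for the regression of Y on the UCT and Program
    dummies, with shares pi_c (control), pi_u (cash), pi_p (program):
      Var(beta1)        = 1/pi_u + 1/pi_c
      Var(beta2)        = 1/pi_p + 1/pi_c
      Cov(beta1, beta2) = 1/pi_c
    """
    v1 = 1.0 / pi_u + 1.0 / pi_c
    v2 = 1.0 / pi_p + 1.0 / pi_c
    cov = 1.0 / pi_c
    var_ce = v2 / cost_p**2
    var_rce = v2 / cost_p**2 + v1 / cost_u**2 - 2.0 * cov / (cost_p * cost_u)
    return w * var_ce + (1.0 - w) * var_rce

# Coarse grid search over allocation shares (the dashboard solves this exactly)
grid = np.arange(0.01, 1.0, 0.01)
pi_c, pi_u, pi_p = min(
    ((pc, pu, 1.0 - pc - pu) for pc in grid for pu in grid if pc + pu < 0.99),
    key=lambda s: weighted_variance(*s, w=0.5),
)
```

With equal costs and equal weights, this reproduces the pattern in the example output below: the program arm receives the largest share (about 0.41), while control and cash split the remainder evenly.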

**Optimal design and minimized variances**

In the figure below, we present an example output of the dashboard, which includes the optimal design (the fraction of observations assigned to control, to cash transfer, and to the program), and the variances of the estimates of the program's cost effectiveness and relative cost effectiveness (both scaled by the sample size). To facilitate comparisons across alternative parameter choices (which lead to alternative designs), we provide two panels for which the researcher can specify different choices of the key parameters (the size of the cash transfer, and the relative weight placed on minimizing the variance of cost effectiveness).

The example below is for the case where the program and the cash transfer have the same cost, while the researcher places equal weight on the variances of the estimates of cost effectiveness and relative cost effectiveness. In this example, the optimal design assigns more observations to the program arm than either the control arm or the cash transfer arm. This is because the estimate of cost effectiveness only uses the control group and the intervention arm, while the estimate of relative cost effectiveness only uses the cash arm and the intervention arm (because the cash arm and the intervention arm have the same cost by assumption); this means the intervention arm matters for both estimates, while the control group and the cash arm each only matter for one estimate.

However, as discussed above, under our maintained assumptions, even with a larger cash transfer arm we can infer what the effect of a cash transfer that is cost equivalent to the program would have been. The graph in the example above plots the optimal design as a function of the size of the cash transfer used for benchmarking; other graphs allow the dashboard user to explore the effects of varying the weight placed on the variances, and to see how the variances of the two estimates of interest are affected. Note that the optimal allocation to the cash transfer arm shrinks as the size of the cash transfer increases — this is because the cost effectiveness of the cash transfer is more precisely estimated as the size of the cash transfer increases. We caution that this is an asymptotic result — with a very small sample, if the “optimal” number of observations assigned to the cash transfer arm is very small (e.g., fewer than 15), this asymptotic approximation will be meaningfully biased.
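This comparative static can be checked numerically. The self-contained sketch below (our illustration, using standard asymptotic OLS variance formulas with normalized costs and equal weights) grid-searches for the optimal cash-arm share at two transfer sizes; the share falls as the transfer grows.

```python
import numpy as np

def optimal_cash_share(cost_u, w=0.5, cost_p=1.0):
    """Share of observations optimally assigned to the cash arm, by grid
    search over the weighted-variance objective (asymptotic formulas)."""
    def objective(pc, pu, pp):
        v1 = 1.0 / pu + 1.0 / pc          # N * Var(beta1) / sigma^2
        v2 = 1.0 / pp + 1.0 / pc          # N * Var(beta2) / sigma^2
        cov = 1.0 / pc                    # N * Cov(beta1, beta2) / sigma^2
        var_ce = v2 / cost_p**2
        var_rce = var_ce + v1 / cost_u**2 - 2.0 * cov / (cost_p * cost_u)
        return w * var_ce + (1.0 - w) * var_rce

    grid = np.arange(0.01, 1.0, 0.01)
    best = min(((pc, pu, 1.0 - pc - pu) for pc in grid for pu in grid
                if pc + pu < 0.99), key=lambda s: objective(*s))
    return best[1]

# A transfer three times the program's cost warrants a much smaller cash arm
share_equal, share_large = optimal_cash_share(1.0), optimal_cash_share(3.0)
```

Note that the grid floor of 0.01 is an arbitrary bound; as the caution above suggests, very small "optimal" shares should not be taken literally in finite samples.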

**MDE and required sample sizes**

Lastly, in the figure below, we present the power panel of the dashboard, which allows calculation of minimum detectable effects and required sample sizes as a function of the selections in the first part of the dashboard. These calculations are standard transformations of the variances presented above — here, at least, back-of-the-envelope calculations suffice!
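The transformations are the textbook ones; a minimal sketch follows (a two-sided test at level `alpha` with target power `power`, where `variance_per_obs` denotes the estimator's variance multiplied by the sample size — an assumption about the normalization).

```python
from math import sqrt
from statistics import NormalDist

def mde(variance_per_obs, n, alpha=0.05, power=0.80):
    """Minimum detectable effect: (z_{1-alpha/2} + z_{power}) * SE,
    where SE = sqrt(variance_per_obs / n)."""
    z = NormalDist().inv_cdf
    return (z(1.0 - alpha / 2.0) + z(power)) * sqrt(variance_per_obs / n)

def required_n(variance_per_obs, effect, alpha=0.05, power=0.80):
    """Sample size at which the MDE equals `effect` (inverts mde)."""
    z = NormalDist().inv_cdf
    return ((z(1.0 - alpha / 2.0) + z(power)) / effect) ** 2 * variance_per_obs
```

For example, with a normalized variance of 1 and n = 100, the MDE at 5% size and 80% power is roughly 0.28.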

We present one possible result below, continuing our example above, where the cash transfer and the program have equal cost, and the researcher puts equal weight on the variances of the estimates of cost effectiveness and relative cost effectiveness. In this case, the minimum detectable effects for cost effectiveness and relative cost effectiveness are equal. However, reasonable values of cost effectiveness may be larger than those of relative cost effectiveness, if the cash transfer also has an effect on the outcome of interest. As a result, the required sample size to detect relative cost effectiveness may be larger than that for cost effectiveness. In this case, decreasing the relative weight placed on the variance of cost effectiveness in the previous part of the dashboard would result in a design that reduces the required sample size to detect both cost effectiveness and relative cost effectiveness.

## Join the Conversation

Very cool post and a super-useful tool that I expect to be widely used and cited.

You caution that the asymptotic approximations you are using can lead to some unreliable results, such as very few observations assigned to the cash arm if the cash transfer is large. But does it make sense to use asymptotics here at all? This seems like a situation where what we want are the design-based standard errors of Abadie et al. (https://onlinelibrary.wiley.com/doi/full/10.3982/ECTA12675). The relevant uncertainty comes from treatment assignment, not sampling from a larger population.

Hi Jason: Thanks for the great comment, we were being a little loose in our language here. While we agree that in principle design-based SE may be most appropriate for this type of design, we just wanted to caution that we do not put any constraint on the number of observations assigned to any one arm -- so the dashboard can return some pretty extreme values such as one observation in the cash arm when cash transfer size --> infinity ... :)

I agree that it's good to have a caveat about the lack of constraints. Practically speaking, maybe the dashboard should throw up a warning if the number of treated obs gets "too low" (e.g. below 25, using my intuition based on the literature on clustering.)

>such as one observation in the cash arm when cash transfer size --> infinity

That is a case where design-based inference would help a lot. With just a single treated observation, there are real limits on what RI could tell you about the probability of observing your data under the null. I realize this is pretty annoying to implement, though. There is a real need for someone to develop canned code to output the Abadie et al. design-based SEs.