How big data helped us estimate the impact of an intervention with 0.8% take-up
When asked if he would like to have dinner at a highly-regarded restaurant, Yogi Berra famously replied “Nobody goes there anymore. It’s too crowded”. This contradictory situation of very low take-up combined with large overall use is common with some financial products – for example, the response rate to direct mail credit card solicitations had fallen to 0.6 percent by 2012, yet lots of people have credit cards.
It is also a situation we recently found ourselves in when working on a financial education experiment in Mexico with the bank BBVA Bancomer. They worked with over 100,000 of their credit card clients, inviting the treatment group to attend their financial education program Adelante con tu futuro (Go ahead with your future). Over 1.2 million participants took this program between 2008 and 2016, yet only 0.8 percent of the clients in the treatment group attended the workshop. A second experiment, which tested personalized financial coaching, also had low take-up, with 6.8 percent of the treatment group actually receiving coaching.
In a new working paper (joint with Gabriel Lara Ibarra), we discuss how the richness of financial data on clients allows us to combine experimental and non-experimental methods to still estimate the impact of this program for those clients who do take up the program.
What would pure experimental analysis tell us?
73,654 clients were assigned to the treatment group for the workshop. Of these, contact was attempted with only 47 percent, and only 12 percent (8,900) could actually be reached. Of those reached, 2,672 agreed to participate, and only 583 (0.8 percent of the treatment group) attended and completed the workshop.
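The take-up funnel can be checked with a few lines of arithmetic. This is only a sketch using the counts quoted in the text; the 8,900 figure appears to be rounded, and the stage labels are our own.

```python
# Counts quoted in the post; "contacted" (8,900) looks rounded in the source.
funnel = [
    ("assigned", 73_654),
    ("contacted", 8_900),
    ("agreed", 2_672),
    ("attended", 583),
]

assigned_count = funnel[0][1]

# Share of the full treatment group remaining at each stage.
share_of_assigned = {stage: count / assigned_count for stage, count in funnel}

# Conversion from each stage to the next one.
step_conversion = {
    funnel[i][0]: funnel[i][1] / funnel[i - 1][1] for i in range(1, len(funnel))
}

for stage, count in funnel:
    print(f"{stage:>10}: {count:>7,d}  ({share_of_assigned[stage]:.2%} of assigned)")
```

Most of the loss happens before anyone is asked: roughly one in eight assigned clients was reached, about three in ten of those agreed, and about one in five of those who agreed completed the workshop.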
With such low take-up, it is not surprising that the treatment and control group averages follow one another closely after the intervention, just as they had done before it (Figure 1). The ITT estimates are then extremely close to zero. The LATE estimates are statistically insignificant, with wide confidence intervals. For example, the confidence interval for the LATE impact on having a delay in payment is [-20%, +30%].
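To see why the ITT is tiny while the LATE is imprecise, it helps to recall how the two are linked: with random assignment, the LATE is the ITT divided by the difference in take-up between the two groups (the Wald estimator). The sketch below uses invented toy data, not the study's data, and the function name is hypothetical.

```python
def itt_and_late(outcome, assigned, attended):
    """Return (ITT, take-up gap, LATE) from unit-level lists.

    outcome  : numeric outcome per client
    assigned : 1 if randomly assigned to the invitation, else 0
    attended : 1 if the client actually attended, else 0
    """
    def mean(values):
        return sum(values) / len(values)

    treat_y = [y for y, z in zip(outcome, assigned) if z]
    ctrl_y = [y for y, z in zip(outcome, assigned) if not z]
    treat_d = [d for d, z in zip(attended, assigned) if z]
    ctrl_d = [d for d, z in zip(attended, assigned) if not z]

    itt = mean(treat_y) - mean(ctrl_y)          # effect of being invited
    take_up_gap = mean(treat_d) - mean(ctrl_d)  # share induced to attend
    late = itt / take_up_gap                    # Wald estimator
    return itt, take_up_gap, late


# Toy data (invented): 1,000 invited, 8 attend, 1,000 controls.
assigned = [1] * 1000 + [0] * 1000
attended = [1] * 8 + [0] * 992 + [0] * 1000
# Binary outcome: all 8 attendees succeed, half of everyone else does.
outcome = [1] * 8 + [1] * 496 + [0] * 496 + [1] * 500 + [0] * 500

itt, take_up_gap, late = itt_and_late(outcome, assigned, attended)
print(itt, take_up_gap, late)  # ITT is about 0.004, LATE about 0.5
```

Because the ITT is divided by a take-up rate of 0.008, any sampling noise in the ITT is magnified 125 times in the LATE, which is why the confidence intervals become so wide.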
Figure 1: Treatment vs Control Time Paths
We are of course being killed here by low take-up: the inverse square rule means that with 0.8% take-up we need 15,625 times the sample we would need with 100% take-up to have the same power. Even with our large sample sizes, power is then very low.
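The 15,625 figure comes from a back-of-the-envelope argument. If only a share p of the treatment group takes up the program, the average effect across the whole group shrinks by a factor of p, and the sample needed to detect an effect grows with the inverse square of its size. The sketch below assumes the program affects only those who take it up and that the outcome variance is unchanged; the function name is our own.

```python
def sample_inflation_factor(take_up: float) -> float:
    """Factor by which the sample must grow to keep the same power
    when only a share `take_up` of the treatment group is treated."""
    if not 0 < take_up <= 1:
        raise ValueError("take_up must be in (0, 1]")
    return 1.0 / take_up ** 2


workshop_factor = sample_inflation_factor(0.008)  # workshop: 0.8% take-up
coaching_factor = sample_inflation_factor(0.068)  # coaching: 6.8% take-up

print(round(workshop_factor))  # 15625
print(round(coaching_factor))  # 216
```

The same rule applied to the coaching experiment's 6.8 percent take-up still implies a sample roughly 216 times larger than full take-up would require.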
This is where standard analysis using experimental methods would stop. We would conclude that there is no significant impact of either intervention, but that we have insufficient power to rule out a wide range of positive and negative impacts. We therefore turn to combining non-experimental methods with the experiment to obtain more informative results.
Combining non-experimental and experimental methods with lots of data
We have a large sample of over 130,000 clients, and have monthly data on their credit card outcomes for up to 18 months before the intervention. We use this large dataset (of 660 MB) to combine propensity score matching and difference-in-differences with the experiment in order to estimate the treatment effect for those who actually take up training. In doing so, we can overcome some of the common concerns one usually has with these non-experimental approaches:
- With matching we are usually concerned with at least two types of selection on unobservables. The first is dynamic selection: for example, people might be more willing to engage in financial education if they suddenly find themselves struggling with their credit card, compared to individuals who look similar in the cross-section but who have been struggling for a while. We can match on the entire trajectory of payment behavior over 18 months to alleviate this concern. The second concern is that if these groups are so similar, why did only one group take the intervention? By matching only to individuals in the control group, we have an answer: they randomly weren’t invited.
- With difference-in-differences we are concerned about the assumption of common trends. This is more credible if the two groups are more similar to begin with (which is where matching helps), and if we see they have the same dynamics pre-intervention. With survey data, researchers are typically lucky if they have two periods of pre-intervention data to test for a common linear trend over that short period. Here we can use all 18 months of pre-intervention data to show they follow the same non-linear trends pre-intervention.
We explore five different approaches to obtaining a non-experimental counterfactual, varying which variables we match on and whether we restrict ourselves to all matches in the common support or to just using the nearest neighbor matches. This allows us to show robustness of our results to the choice of counterfactual.
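The general idea of matching participants to never-invited controls and then differencing can be sketched as follows. This is not the paper's specification: for brevity it swaps propensity score matching for direct nearest-neighbor matching on the squared distance between pre-intervention trajectories, and the function name, data layout, and toy numbers are all hypothetical.

```python
def nearest_neighbor_did(participants, controls, n_pre):
    """Matched difference-in-differences.

    Each unit is a list of monthly outcomes: the first `n_pre` entries are
    pre-intervention months, the rest are post-intervention months.
    `controls` should come only from the experimental control group, so a
    close match is someone who was simply never invited.
    """
    def pre_distance(a, b):
        # Squared distance between the two pre-intervention trajectories.
        return sum((x - y) ** 2 for x, y in zip(a[:n_pre], b[:n_pre]))

    def change(unit):
        # Post-period mean minus pre-period mean for one unit.
        pre, post = unit[:n_pre], unit[n_pre:]
        return sum(post) / len(post) - sum(pre) / len(pre)

    effects = []
    for p in participants:
        # Nearest neighbor, matched with replacement.
        match = min(controls, key=lambda c: pre_distance(p, c))
        effects.append(change(p) - change(match))
    return sum(effects) / len(effects)


# Toy data (invented): 3 pre-intervention months, 2 post-intervention months.
participants = [[0.2, 0.3, 0.4, 0.6, 0.6]]
controls = [
    [0.2, 0.3, 0.4, 0.45, 0.45],  # same pre-trajectory as the participant
    [0.8, 0.8, 0.8, 0.8, 0.8],    # very different history
]

effect = nearest_neighbor_did(participants, controls, n_pre=3)
print(effect)  # about 0.15
```

Matching on the whole pre-period trajectory rather than a single cross-section is what addresses dynamic selection, and differencing removes any remaining fixed gap between a participant and the matched control.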
Figure 2 then shows the results for the workshops (the coaching intervention had similar effects). The results of our preferred specification suggest that participating in the workshop results in an 11 percentage point increase in the likelihood of paying more than the minimum payment, a 3.4 percentage point reduction in the likelihood of delaying payment, 63.7 percent higher monthly spending on the credit card, and a 2.7 percentage point increase in the likelihood of owning a deposit account with our partner bank.
Figure 2: Trajectories of financial outcomes of those receiving workshops compared to nearest neighbor matched control group
The training and coaching get clients to pay their accounts on time and pay more of their bills, but do not get them to cut back on spending. In fact, perhaps because they are not experiencing as many payment problems, they spend more on their cards. The result is that this training does increase the likelihood these clients remain profitable for the bank.
A by no means unique situation
While we always would prefer to have much higher take-up for our interventions and to just rely on pure experimental comparison of treatment versus control groups, there are many “low touch” interventions carried out by companies that have really low response rates, but can reach large numbers of clients – e.g. the click rate for email marketing campaigns by the banking/finance industry in the U.K. was 0.39%. This combination of rich data on customers with the initial experiment may provide a way to salvage a credible measure of the treatment impact in these situations. So even if you made too many wrong mistakes, remember your evaluation ain’t over till it’s over, and with today’s big data, you can observe a lot by just watching.
Terrific - nice job Yogi! This is a great paper and shows the value of multiple complementary approaches.
The pure experimental results are still the correct way of estimating the effect of treatment where "treatment" includes the initial outreach etc. This is often the variable of interest for policy-makers with a fixed budget, or for the financial institutions involved. Here the low take-up rate is not killing the results; it is the result. Nothing needs to be salvaged in that case.
Naturally the LATE is also often of interest. You write that "participating in the workshop results in an 11pp increase..." which is true (and again wonderful / impressive that you are able to estimate this so cleanly!) but has an implicit preceding modifier: "For the relatively small number of clients who want to hear from their bank and who want to participate in a workshop, participating in the workshop results in an 11pp increase...".
Of course this is no different for you than for any other LATE analysis, but it's important (as you know) to be clear about the counterfactual when interpreting the results. Here the low take-up rate not only reduces power (as would additional random noise), but also narrows the subgroup to whom the result applies.
Thanks Julian, and I totally agree on being clear who the treatment effects apply to. In the paper we write: "It is important to note that these treatment effects are for the set of clients who will take up the interventions when invited. We have seen there is positive selection into participation, so that individuals who have the worst initial financial behavior in terms of late payments, not paying more than the minimum required, etc., are less likely to participate. The treatment effect may be larger for these individuals if they could be induced to participate, since they have more room for improvement, or potentially smaller if such individuals are less likely to implement the changes suggested in the workshops."
Hi David and Claudia, thanks so much for this post; it is incredibly timely. We at IDinsight have been experimenting with these techniques to extract LATE estimates in conditions with low take-up, and although I knew others must have been doing similar things, this is the first paper I've seen. Just last week I finished up a matching algorithm that started as the equivalent of your "nearest neighbor LASSO" approach, though it then deviated for reasons that aren't interesting to get into here.
I'd be really interested to see (or write perhaps?) some papers that stress-test this approach. For instance, would it give an estimate similar to IV in conditions where IV is appropriate (and confidence intervals are tighter)? Should we be using LASSO or one of the myriad of other machine-learning-for-prediction techniques? How do we know if the data is rich enough to give reliable estimates? Can we figure all this out with simulation? Etc.
If anyone knows any papers dealing with these topics, I'd love to see them.