Notes from the AEAs: Present bias 20 years on + Should we give up on S.D.s for Effect Size?


I just got back from the annual meetings of the American Economic Association (AEAs) in Boston. It’s been a couple of years since I last went, and after usually going to just development conferences, it was interesting to see some of the work going on in other fields. Here are a few notes:
  • Present Bias after 20 years: I went to a great session on this topic, which was looking back on 20 years since David Laibson’s dissertation work on present bias. The session included Ted O’Donoghue and Matthew Rabin on “lessons learned and to be learned”, David Laibson on “why there isn’t more demand for commitment” and Charles Sprenger on “judging experimental evidence on dynamic inconsistency”. None of the papers appear to be online yet, so something to look forward to in due course. But a few key take-aways:
    • The problems with measuring time inconsistency using money questions: it has been relatively standard to assess time inconsistency by asking someone to choose between an amount today and an amount in 1 month, and then between an amount in 5 months and an amount in 6 months, and to see whether the implied discount rates differ. However, as O’Donoghue noted, present bias should operate on utility, not money. As a result, individuals should maximize wealth, and then present bias should determine how they allocate that wealth as consumption over time. So if they can borrow and save, they should arbitrage away any difference between interest rates and the rates at which you offer them money today vs in the future. Money discounting therefore has three problems: it requires either that individuals cannot borrow or lend, or that they do not think about doing so; the answers depend on what the external consumption choices are; and there are the well-known issues of confounding with transaction costs and payment reliability. So there appears to be a move away from these questions. But this raises two issues: 1) these questions do seem to provide reliable indicators of behavior in some settings, so we need to know under what settings; and 2) there is no generally accepted alternative – people have been playing around with non-monetary choice problems where behavior over time or other things can be used.
    • Why is there not more demand/so much demand for commitment? David Laibson presented work calibrating a model of choice for commitment technology. He showed that there is a large range of parameters in which a sophisticated present-biased decision-maker will choose commitment products when there are no costs. But as soon as you add either partial naiveté about time preferences, or some relatively small costs of entering into commitment contracts, the demand for commitment products almost entirely disappears. This is consistent with the fact that few firms offer commitment contracts, and that the main examples we have are products introduced by researchers. But as Nava Ashraf, one of the discussants, noted, some of these products in developing countries have had very large impacts – suggesting more people should be demanding commitment devices.
    • Heterogeneity in discount rates is very hard to disentangle from heterogeneity in other parameters.
    • Hard and soft commitments – one of the frontier issues is thinking about continuums of commitment, and thinking about a sweet spot where individuals can commit to some extent, but still retain some flexibility to back out if they need to. Nava pointed to recent work by Karlan and Linden as an example, where ear-marking beat strong commitment.
  • Should we be using Units of Standard Deviation to Compare Effect Sizes Across Studies? This came up in both of the discussions I gave, as well as in thinking about my own presentation (new work on measuring business practices in small firms that I’ll blog about when a full paper is available). This should be particularly familiar for readers working on health and education - it is very common to hear that “intervention X led to a 0.2 S.D. increase in test scores”. But the more I think about this, the more I think that this is not a good measure for comparing across studies, or for power calculations:
    • Comparing across studies: I discussed Eva Vivalt’s paper on external validity (which she blogged about previously on our blog). One of her findings is that interventions run by NGOs/Academics have larger effect sizes than those run by governments. But consider the following example, where both run the same intervention to try to improve test scores in India. The NGO works with a very homogeneous group (control mean score 50%, std dev of 5%). The NGO increases test scores by 1 percentage point, which is a 0.2 S.D. improvement. The Government works with a much more diverse set of kids, with the same control mean (50%), but std dev of 20%. The Government program increases test scores by 2 percentage points. Despite this being twice as large as the NGO effect, when converted into units of S.D., it is only half the size (0.1 S.D.). That is, comparing effect sizes in units of standard deviations artificially inflates the effectiveness of interventions done on more homogeneous groups, all else equal. But as I found when trying to compare the estimates in my study to those in other work, other ways of scaling magnitudes can also raise concerns when comparing across studies.
    • Power calculations: I discussed the GiveDirectly evaluation by Haushofer and Shapiro. They noted they had powered their study to detect a 0.2 S.D. impact. But this got me thinking about why we should care about S.D. when thinking about impacts. In particular, consider the impact on business revenue, which is quite heterogeneous (the std dev is about twice the mean in the control group). A 0.2 S.D. increase is then approximately a 37% increase in business revenue. If they had screened the sample to make it more homogeneous, then 0.2 S.D. might be a 20% or even a 10% revenue increase. It seems to me much more natural to think in terms of return on investment, or of the percentage or absolute increase we would like to see, than in terms of S.D.
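
To make the arithmetic in the two S.D. examples above concrete, here is a minimal Python sketch. The function names are hypothetical, chosen only for illustration. The ratio of 1.85 between the control-group std dev and mean is an assumption backed out from the 37% figure (the post only says "about twice"), as are the ratios of 1.0 and 0.5 used for the more homogeneous samples.

```python
def standardized_effect(effect, control_sd):
    """Express a raw treatment effect in units of the control-group std dev."""
    return effect / control_sd


def percent_of_mean(effect_in_sd, sd_to_mean_ratio):
    """Convert an effect in S.D. units into a percent change relative to the
    control mean, given the ratio (control std dev / control mean)."""
    return 100.0 * effect_in_sd * sd_to_mean_ratio


# Test-score example: effects and std devs are in percentage points.
ngo_effect_sd = standardized_effect(effect=1.0, control_sd=5.0)   # NGO, homogeneous group
gov_effect_sd = standardized_effect(effect=2.0, control_sd=20.0)  # Government, diverse group

print(f"NGO:        1 pp effect = {ngo_effect_sd:.2f} S.D.")
print(f"Government: 2 pp effect = {gov_effect_sd:.2f} S.D.")

# Business revenue example: the same 0.2 S.D. target means very different
# percent increases depending on how dispersed the sample is.
# ASSUMPTION: 1.85 is backed out from the 37% figure; 1.0 and 0.5 are
# illustrative ratios for progressively more homogeneous samples.
for ratio in (1.85, 1.0, 0.5):
    pct = percent_of_mean(effect_in_sd=0.2, sd_to_mean_ratio=ratio)
    print(f"sd/mean = {ratio:.2f}: 0.2 S.D. is about a {pct:.0f}% revenue increase")
```

The raw Government effect is twice the NGO effect, yet its standardized effect is half as large, because only the denominator differs. The same logic explains why a fixed 0.2 S.D. target corresponds to a smaller percentage increase in a more homogeneous sample.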


David McKenzie

Lead Economist, Development Research Group, World Bank

Jon de Quidt
January 05, 2015

This paper by Carvalho, Meier and Wang has a beautiful demonstration of the importance of liquidity constraints in measuring present bias over money…

David McKenzie
January 05, 2015

Thanks, this paper on how elicited present bias varies before and after payday was indeed referenced several times, especially in Sprenger's talk.

Stuart Buck
January 05, 2015

The effect size point is very important, especially given that people are routinely taught that using effect sizes is a way to compare the magnitude of effects across different contexts.
For a discussion of the same point in the education context, see this 2008 MDRC paper (pointing out that the standard deviation of educational achievement is very different when based on individual students versus when based on schools, so the same intervention may look much more or much less effective depending on how it is measured/reported).…

David McKenzie
January 05, 2015

Thanks, this looks like a very useful paper. I like the point that standardized effect sizes ease interpretation when the outcome being measured doesn't have an inherently meaningful metric (like social and emotional outcome scales) but that "In contrast, outcome measures for vocational education programs —like earnings (in dollars) or employment rates (in percent) — have numeric values that represent units that are widely known and understood. Standardizing results for these kinds of measures can make them less interpretable and should not be done without a compelling reason."


Eva Vivalt
January 13, 2015

Cheers, David, on your commentary.
While I agree that different standard deviations are something one should watch out for in many contexts, it is certainly not driving my paper's results. If you notice, I explained in the data section that many papers do not report SDs in order to make this normalization using paper-specific SDs, hence needing to turn to the SD of other papers in the same intervention-outcome combination, i.e. the scaling is often done on a common factor. This means a) it wouldn't be driving the government vs. NGO/academic results, as it's essentially saying that even within an intervention-outcome's raw data (e.g. in percentage points) the two implementer types yield different results, and b) if you wanted to look between intervention-outcomes and were to think of it like introducing a common error by intervention-outcome, that's fine - it's better than leaving it raw (where the issue is only magnified) and I do cluster by intervention-outcome. It is very well-known that SDs have these problems, and nobody using SDs would be unaware of this, yet they are still extensively used in the literature for want of a better alternative when comparing across outcomes.