Published on Development Impact

Should we evaluate more algorithms?

Berk Özler

February 07, 2024

When it comes to using algorithms built on machine learning models in pertinent policy decisions, my impression is that people fall into three groups: the first group, perhaps older and more seasoned, says “No, thank you.” Don’t these things have bias? Will people accept them? Many won’t understand them, etc. The second group, perhaps younger and more tech savvy, might be too eager to deploy them: this can save so many lives, so much money! The third group, perhaps the largest, falls in between: they might see that the new technology has potential – empathizing with the second group – but also might be a bit wary of the downside risks – siding with the first. This group, I like to think, would like to know more to proceed cautiously…

A new working paper by Ludwig, Mullainathan, and Rambachan provides some fuel to this third group and, if subsequent efforts are successful, may shrink the other two groups by winning them over to the moderate middle. They make two claims:

1. Algorithms, at least in several key policy areas, have huge potential to improve social welfare.

2. There is enough uncertainty about even the ones with the highest potential that we should invest in more evaluations (algorithm R&D in their parlance) before setting them loose in real life settings.

The authors’ gauge of potential is the Marginal Value of Public Funds, or MVPF, proposed by Hendren and Sprung-Keyser (2020, 2022). When a policy’s benefits to society are positive and the cost of deploying that policy is lower than the savings to the government, the net cost becomes negative and the (societal) benefit/(governmental) cost ratio is treated as infinite: the policy provides a free lunch to society. In a carefully selected set of important policy decisions, such as criminal justice (reducing crime and/or detentions), health (reducing heart attacks and/or unnecessary and expensive tests), education (matching students to classes), and workplace safety regulations (reducing workplace injuries and reprioritizing inspections) – all of which lend themselves well to algorithmic decision-making – they show that algorithms, effective to a degree that may seem surprising or even unreasonable, have infinite MVPFs: the net costs to the government are easily projected to be negative and the benefits to society are, on average, positive. The reason algorithms are successful in these areas, the authors argue, is that they do so much better than what they call the ‘human benchmark’ – not by expanding the existing programs efficiently to the remaining cases with the highest marginal benefit, but by re-prioritizing all the cases to eliminate huge deadweight loss (see Figure 1 in the paper).
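To make the arithmetic concrete, here is a minimal sketch of the MVPF logic described above. The function name, its arguments, and the dollar figures are all hypothetical and chosen for illustration only; they do not come from Ludwig et al. or from Hendren and Sprung-Keyser. The sketch also assumes that a zero net cost with positive benefits counts as infinite, and it refuses to handle cases the paragraph above does not discuss.

```python
import math


def mvpf(benefit_to_society, deployment_cost, government_savings):
    """Illustrative MVPF: societal benefit divided by net government cost.

    All names and inputs here are hypothetical, for illustration only.
    """
    # Net cost to the government: what it spends minus what it saves.
    net_cost = deployment_cost - government_savings

    if net_cost > 0:
        # Ordinary case: a finite benefit/cost ratio.
        return benefit_to_society / net_cost

    if benefit_to_society > 0:
        # The policy pays for itself and still benefits society:
        # the "free lunch" case, treated as an infinite MVPF.
        # (Counting net_cost == 0 here is an assumption of this sketch.)
        return math.inf

    # Non-positive benefits with non-positive net cost need a separate
    # treatment that this sketch does not attempt.
    raise ValueError("case not covered by this illustrative sketch")


# Made-up numbers: the tool costs 50 to deploy and saves the government 80.
print(mvpf(100, 50, 80))
# Made-up numbers: the tool costs 50 to deploy and saves only 10.
print(mvpf(100, 50, 10))
```

In the first call the savings exceed the deployment cost, so the net cost is negative and the function returns infinity; in the second, the net cost is 40 and the ratio is finite.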

Their main example is the decision, by a judge, of whether to release or detain a person who has been arrested to await trial. There are several interesting issues that come up while thinking of a decision-aid for a judge that uses real-time data and I urge you to read this section of the paper. [Better, if you can catch Sendhil talking about this stuff at a conference or seminar, you will likely enjoy it and retain so much more.] One issue that comes up, and that I’d like to note, is the need for policymakers to make decisions on key normative policy questions. This is something that comes up regularly among economists: somehow most of us are obsessed with providing a positive analysis of a policy question, setting important normative issues aside. Of course, while that is useful and is our comparative advantage, it does not erase the fact that policymakers still have to make those difficult decisions (and deal with the consequences). In their criminal justice example, the policymaker must decide whether to reduce crime (holding detention rates constant), reduce (unnecessary) detention rates (holding crime constant), or pursue a combination thereof. In deciding tax rates for meat and poultry, you need to make an explicit decision as to whether animal welfare enters directly into the social welfare function at all. Basu (2003) argues that normative considerations are important in guiding decision-making when there is no Pareto dominance between policies. Algorithms still need (maybe even more so than usual) normative considerations to guide the policy decisions…

So, that’s the large potential of using algorithms in areas where they’re suitable and feasible. Before getting to priorities when setting up evaluations/trials of such algorithms, it’s also useful to think a little bit about how these algorithms are created. Yes, you need data, and you need data scientists to train ML models, but there are also other equally important tasks to undertake, which might be harder and costlier (affecting the denominator of the MVPF, namely the net cost to the government): these are the human costs that are harder to measure. For example, judges (or nurses) might simply ignore the job-support tool that you provide to them: if take-up is zero, even a perfect algorithm will fail. Apart from the direct user, there are other stakeholders whose opinions about the use of this algorithm will matter a ton in real life, such as civil rights advocates, defense lawyers, and the police in the criminal justice example.

In Cameroon, we designed a job-support tool that uses an algorithm (not ML-based) to help family planning nurses recommend to clients contraceptive methods that are tailored to their individual needs and preferences, using data collected by the tool during a counseling session. I mention this because I will use it as an example below when I discuss the key issues on which Ludwig et al. encourage much more research and impact evaluations. You can find the paper on the evaluation of the job-support tool here and listen to a 20-minute podcast with my co-author Susan Athey here. In preparation for that IE, we spent a lot of time conducting formative qualitative research, then continuous consultations with a wide range of stakeholders on the design of both the algorithm and the tablet-based job-support tool, and then extensive testing with experts and providers to make sure that the job-support tool was helpful, accepted and liked (by both providers and clients), and accurate (in terms of matching client needs) before even starting a pilot evaluation. The process proposed by Ludwig et al. is the same, and the costs of the consultations and real-life piloting can easily exceed the cost of collecting and curating the data and developing the algorithm.

One more issue before we dive into IE priorities: ‘data drift.’ You develop your algorithm today and before you know it the demographic composition of the cases/clients/patients/students changes due to some shock or secular trends over time. This might render your algorithm less effective than when it was first designed. Perhaps worse, you might have behavioral changes on the part of the stakeholders in response to the changes brought on by the algorithm: for example, seeing that the algorithm is causing a much higher rate of release of arrested individuals to await trial, people tasked with bringing initial charges against an alleged offender might start bringing higher charges than before. Ideally, you would want your algorithm to be immune to these kinds of threats. One way to do that is to retrain the models and adjust the algorithm frequently: you want it to be relevant to the current population of cases, while at the same time, you want an algorithm that stakeholders can understand and accept but have limited ability to ‘game.’
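As a rough illustration of how one might monitor the compositional side of data drift, the sketch below compares the mix of case types in the data an algorithm was developed on with the mix seen today. It uses total variation distance, which is just one simple choice of mine and not a method taken from Ludwig et al.; the function name and the charge categories are made up for illustration.

```python
from collections import Counter


def composition_shift(baseline_cases, current_cases):
    """Total variation distance between two categorical compositions.

    Returns 0.0 when the category shares are identical and 1.0 when
    the two populations share no categories at all.
    (Hypothetical helper, for illustration only.)
    """
    if not baseline_cases or not current_cases:
        raise ValueError("both samples must be non-empty")

    baseline = Counter(baseline_cases)
    current = Counter(current_cases)
    categories = set(baseline) | set(current)

    # Sum the absolute differences in category shares, then halve.
    # Counter returns 0 for categories missing from one sample.
    return 0.5 * sum(
        abs(baseline[c] / len(baseline_cases) - current[c] / len(current_cases))
        for c in categories
    )


# Made-up example: the share of cases with a "felony" charge rises from
# one half to three quarters after the algorithm is deployed.
at_design = ["felony", "felony", "misdemeanor", "misdemeanor"]
today = ["felony", "felony", "felony", "misdemeanor"]
print(composition_shift(at_design, today))
```

A rising value over time would be one signal that the population has moved away from the one the algorithm was built for. It cannot say whether the cause is a demographic shock or a behavioral response such as higher initial charges, and how large a shift should trigger retraining is a judgment call that this sketch does not settle.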

So, what are the three key issues Ludwig et al. highlight as being important to tackle sooner rather than later in evaluative research? The first is the algorithm’s benchmark, i.e., the human. There is now sufficient evidence that this is a low bar for a variety of reasons: not only because of the human use of heuristics and biases in making decisions/predictions but also because the data we rely on might be noisy or biased. Humans aren’t very good, on their own (i.e., with their training and some data available to them), at guessing correctly who is having a heart attack, who is in serious pain, who needs extensive tests or a knee replacement. So, it should not be surprising that a well-trained algorithm can better read a medical image, better predict who will fail to appear at trial or which workplaces are more likely to have accidents in the future, and so on.

However, equally, the same humans do see the patients in front of them, talk to the accused, hear from other stakeholders, and so on – not all of which is taken into account by the algorithm. This is why, often, the decisions are not automated by an algorithm but made by an expert human who is being aided by an algorithm (or a rule-based decision). Because the algorithms have to make a trade-off between predictive accuracy and explainability (stakeholders will be much less likely to use or accept the decisions of an algorithm that is a black box they cannot comprehend), the human decision-maker might deviate from the recommendation of a decision-support aid based on context or information that is not used by the model. For example, in our trial in Cameroon, the algorithm did not take into account the importance of discretion (for contraceptive use) to the client, despite the fact that the nurse counselor discusses this issue with the client and a binary answer is recorded in the tablet. But the issue (the kind and degree of discretion needed by the client) is not easily boiled down into hard data, such as a categorical or continuous variable that can be used by the algorithm, so we chose to let the well-trained and experienced nurses weigh this issue at their own discretion when considering the recommendation of the algorithm. Such leeway can open the process to biases and gaming, but as the algorithm, or rather the job-support tool using it, becomes its own data collection tool, it is not hard to track such issues in real time and take precautions against them – including the training and re-training of human decision-makers. Hence, understanding better what information can be worth collecting (even if costly) for the algorithm to consider vs. what types of information are of real value (as opposed to a distraction) to a human can be quite valuable.

The second issue the authors raise is one also mentioned above, namely automation vs. decision aids. In our work in Cameroon, we had a rule-based recommendation made by the algorithm using a few key variables, which is a very rudimentary version of a more complex and well-trained algorithm using more data. One could imagine eliminating the human altogether and automating some decisions. Or we could vary the leeway the human has, all the way to completely ignoring the decision aid. You can see an example of such a trial, applied to property tax decisions, here.

The third and final issue Ludwig et al. mention, also discussed above already, is context dependence or external validity. Data drift falls into this category, so how often (or when) to update an algorithm can be useful knowledge. Are the predictors of offending or failure to appear at trial when released by a judge the same in Accra as in Yaoundé? In big cities vs. small towns? What about costs? What if nurses start gaming the algorithm to achieve contraceptive outcomes based on their personal biases towards different demographic groups? (This recent working paper also suggests that people favor their own preferences over those of older adults when making surrogate decisions for them.) Research on generalizability, as well as on shoring up the reliability of algorithms against strategic behavior by humans, is another important area.

If you are interested in trialing policy evaluations in this area, the Development Research Group has a lot of capable researchers, such as Anja, Oyebola, David, and others. Impact evaluations manipulating some pertinent choices in the development and deployment of algorithms may themselves have infinite MVPF 😉… As Ludwig et al. conclude, whether the promise of algorithms bears out or not is yet to be seen. There is only one way to find out.

Authors

Berk Özler

Lead Economist, Development Research Group, World Bank
