Published on Development Impact

Learning from How Tech Companies Do Experiments: Lessons from The Power of Experiments: Decision-Making in a Data-Driven World


Cover page of The Power of Experiments

I recently read the new book The Power of Experiments: Decision-Making in a Data-Driven World, written by Michael Luca and Max Bazerman, who are both faculty at Harvard Business School. The book is a quick read and is written at a non-technical level. It is divided into three parts: the first briefly summarizes the history of experimentation in psychology, economics, and policymaking; the second goes through seven or so case studies of how experiments have been used in tech firms; and the third focuses on experimenting for social good, covering experiments on increasing school attendance, get-out-the-vote campaigns, and savings and health behaviors. There is relatively little on experiments in developing countries, and some of the examples (e.g. the Behavioural Insights Team, aka the Nudge Unit) will be overly familiar to many of our readers. The part that will be newest for our readers is likely the set of case studies from the tech industry. In contrast to some other accounts of experimentation in firms that are vague about the details, these case studies are informative and concrete. I thought I’d summarize some of the lessons and examples.

How to organize experimentation within an organization: The authors discuss how major tech companies have easy-to-use experimental infrastructure that allows virtually every team in the organization to run experiments. They give the example of Booking.com, where approximately 80% of product development teams actively run experiments on all parts of the business, from customer-facing platform changes to partner-facing ones to customer service and marketing, resulting in approximately 1,500 employees running experiments! They note that once experiments are run, results are logged in a centralized repository that lets anyone on a team see what has previously been tested, and that by getting more employees engaged in experimentation, teams can draw on experimental evidence for most product-related decisions.

This culture of experimentation is also highlighted in another lesson, which is to think about learning through a series of experiments, rather than using a single experiment to ask and answer important business questions. For example, they consider the case of eBay deciding whether or not it should be spending $50 million per year on Google ads, when people who search for eBay may click through even without ads. eBay ran a series of experiments to measure the financial returns, turning Google ads on and off in different markets and for different search terms – finding that much of the money spent on advertising was wasted, but that it was useful to advertise when people search for items less commonly associated with eBay.

This got me thinking about the difference between this system of constant, organization-wide experimentation in tech companies and what we see in most governments and the World Bank. A big difference of course is the lower cost of doing tech experiments and the greater ease of measuring outcomes. But you can imagine the potential benefits of a series of experiments on how people are hired, how frequently meetings are organized, the trade-offs between bureaucratic procedures and risk mitigation, telework, delegation, etc. It is less clear, though, how much the tech companies are running organizational experiments on themselves, as opposed to experimenting on how best to package and sell their products.

Experiments are just one part of the process of research and innovation, but can be a vital tool in persuasion: A first example comes from Airbnb, and the issue of whether customers and property managers practice racial discrimination on the platform, as well as what to do about it. Chapter 5 discusses how Mike Luca and a co-author first wrote a case study of Airbnb to understand how trust is built in these transactions, in which they raised the possibility that making personal profiles prominent might increase trust but also discrimination. Then came complaints from black customers of being turned away, and correlational analysis suggested that African American hosts were earning less per night than white hosts with similar listings. The authors note that this evidence was suggestive but not definitive, allowing the company to deny that discrimination was occurring. Mike and his co-authors therefore conducted an audit experiment, sending rental inquiries that were identical except for the name of the guest – and finding that inquiries from guests with distinctively African American-sounding names were 16% less likely to get a yes from hosts than those with white-sounding names. This experimental evidence, and the media attention it generated, helped create pressure on the company to make design changes to reduce discrimination. Airbnb created a data science team to study the issue and explore potential solutions, but to date has not made public the results of these experiments.
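To make the audit-experiment logic concrete, here is a minimal sketch (not from the book, and with made-up counts) of how one might analyze such a design: because the inquiries are identical except for the randomly assigned guest name, a simple comparison of acceptance rates across the two name groups captures the effect of the name signal.

```python
# Hypothetical audit-experiment analysis: compare the share of inquiries that
# receive a "yes" across the two randomly assigned name groups.
# All counts below are illustrative, not the study's actual data.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

yes = np.array([312, 266])   # "yes" responses: [white-sounding, African American-sounding]
n = np.array([640, 640])     # inquiries sent per group

rates = yes / n
gap = rates[0] - rates[1]                       # difference in acceptance rates
z_stat, p_value = proportions_ztest(yes, n)     # two-sided test of equal proportions

print(f"Acceptance rates: {rates[0]:.3f} vs {rates[1]:.3f}")
print(f"Gap: {100 * gap:.1f} percentage points (z = {z_stat:.2f}, p = {p_value:.3f})")
```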

A second example comes from Uber, where innovation typically starts with lower-cost data-gathering activities such as analysis of historical data, customer discussions, and simulations. New products are then piloted in one or two cities to make sure the product functions as expected (a key step we sometimes omit in our haste to roll out development experiments), before turning to experiments. Uber will conduct large market-level experiments (see the point below), followed by an iterative series of smaller experiments aimed at refining products.

The challenges of knowing what to measure and the right timeframe: this is certainly an issue for many development projects, but it is also a challenge for tech companies. The authors note the danger that experiments can lead to a focus on overly narrow or short-term outcomes. For example, they discuss experiments by StubHub on whether or not to shroud transaction fees – that is, whether to show them only at the final checkout screen rather than upfront. The simple experimental analysis shows that shrouding increases revenue, since customers are both more likely to complete the purchase and willing to pay a higher ticket price when they only see the fees at checkout. But the authors note that short-term profit is not all that matters – perhaps customers dislike these hidden fees and are less likely to come back for future transactions, so it is important to measure longer-term effects. StubHub did track customers for a year afterwards and found that treated customers were less likely to come back over the next few months, but this effect is dominated by the revenue gain from increased short-term sales. Even this one-year window, however, cannot capture the potential longer-term reputation effects of StubHub moving away from its “no surprise fees” policy.
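As a purely hypothetical sketch of this measurement point (simulated data and invented column names, nothing from StubHub), one could estimate the effect of fee shrouding on both a short-run outcome, such as revenue in the first month, and a longer-run one, such as whether the customer returns within a year:

```python
# Hypothetical sketch: when evaluating a change like fee shrouding, look beyond
# the immediate revenue lift and also estimate effects on longer-run outcomes
# such as whether the customer returns. Data and effect sizes are made up.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({"treated": rng.integers(0, 2, n)})   # 1 = fees shown only at checkout

# Simulated outcomes: shrouding raises short-run revenue but lowers return rates
df["revenue_30d"] = 20 + 3 * df["treated"] + rng.normal(0, 10, n)
df["returned_12m"] = rng.random(n) < (0.40 - 0.03 * df["treated"])

def diff_in_means(data, outcome):
    """Simple treatment-control difference in means for one outcome."""
    means = data.groupby("treated")[outcome].mean()
    return means[1] - means[0]

print("Effect on 30-day revenue:", round(diff_in_means(df, "revenue_30d"), 2))
print("Effect on 12-month return rate:", round(diff_in_means(df, "returned_12m"), 3))
```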

A second example of thinking carefully about what to measure comes from Uber, and its decision of whether or not to launch a new product (Uber Express Pool). The challenge here is two types of spillovers: first, if some riders are offered the new product and others are not, the entire market will be affected, since the control group of riders is in the same market and served by the same set of drivers; and second, the rollout of a new product can affect demand for all existing Uber products – Uber needs to track not just what happens to Uber Express Pool, but also what happens to all its existing alternatives. To deal with these issues, Uber tends to run market-level experiments for some of its bigger changes, rolling them out to a random subset of markets, but to all users within those markets. But then, despite the massive amounts of data Uber has, it starts running into small-sample problems like we often do in development, and needs to use small-sample methods carefully in the analysis. Moreover, it then needs to understand treatment effect heterogeneity carefully – since impacts are likely to differ in the suburbs versus the city, and during rush hour versus off-peak hours.
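For intuition on why small-sample methods matter here, below is a minimal, hypothetical sketch of randomization inference for a market-level rollout: with only a handful of markets randomized, one can compare the observed treatment-control difference with the distribution of differences obtained by re-randomizing which markets are treated. All numbers are simulated; nothing here comes from Uber.

```python
# Hypothetical randomization inference for a market-level experiment:
# compute the treatment-control difference in market-level outcomes, then
# compare it with the distribution of differences under re-randomization.
import numpy as np

rng = np.random.default_rng(42)

n_markets = 16
treated = np.zeros(n_markets, dtype=bool)
treated[rng.choice(n_markets, size=8, replace=False)] = True

# Illustrative market-level outcome (e.g., trips per capita), with a true effect of 5
outcome = rng.normal(loc=100, scale=8, size=n_markets) + 5 * treated

observed_diff = outcome[treated].mean() - outcome[~treated].mean()

# Permutation distribution under the sharp null of no effect in any market
n_perm = 10_000
perm_diffs = np.empty(n_perm)
for i in range(n_perm):
    fake = np.zeros(n_markets, dtype=bool)
    fake[rng.choice(n_markets, size=8, replace=False)] = True
    perm_diffs[i] = outcome[fake].mean() - outcome[~fake].mean()

p_value = np.mean(np.abs(perm_diffs) >= abs(observed_diff))
print(f"Observed difference: {observed_diff:.2f}, permutation p-value: {p_value:.3f}")
```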

Thinking about transparency and the ethics of experimentation: The first line of the preface asks “How many experiments do you think you’ve participated in over the past year?” and notes that most of us have likely participated in many experiments without realizing it. Chapter 10 discusses the backlash Facebook faced over an experiment which tested whether happy and sad posts affect users’ moods, by seeing whether users were more likely to write positive or negative things after seeing more positive or negative posts from their friends in their feed. The authors note that Facebook had been an outlier among corporations in publicly reporting some of its more interesting experiments – but that many companies took away the message that “transparency is dangerous, and experiments are best kept under wraps”. They remark that many tech companies avoid saying they are doing experiments; instead, they “run A/B tests”. The authors argue the opposite – that firms should be publishing far more of their findings, and are less likely to face blowback if they are transparent that they are constantly trying to improve their products for users.

 


Authors

David McKenzie

Lead Economist, Development Research Group, World Bank
