Published on Data Blog

Newly released dataset maps 30,000 road crashes in Nairobi using crowdsourced data

Heavily congested traffic on Thika Road, one of Nairobi's main thoroughfares. Photo: Shutterstock

On April 1, 2024, a five-vehicle car crash on the Nairobi-Mombasa highway killed 11 people, seven of whom were from the same family returning from Easter holidays and many of whom were children. Tragedies like this are common in Kenya. In fact, road traffic crashes are the leading cause of death for young adults and the 12th most common cause of death across all age groups. So, the question is, could this tragedy have been avoided?

The World Bank has been working for years in Kenya to develop AI tools that detect and georeference road crashes. The intention has been to find the most treacherous stretches of road and focus road safety efforts on them.

The reason for doing this is simple: years of road safety recommendations have not enabled countries to make their investments matter. Road networks are extensive, road safety investments expensive, and information scarce.

Data can make a difference. Data can help identify high-risk corridors where crashes are concentrated, allowing policy makers to target the few kilometers of road with the highest crash risk. Moreover, data can be used to evaluate investments and trigger course corrections. A road safety campaign in Texas that displayed traffic fatalities on electronic billboards led to more crashes because the messages distracted drivers; the US Federal Highway Administration has recently urged simpler messages on electronic freeway signs to minimize distractions.

While official data can be scarce, new sources of information can be leveraged to obtain good data. Indeed, 92% of crash fatalities occur in low- and middle-income countries (LMICs) that typically lack digital systems for recording crashes. Where crashes are recorded at all, they are recorded on paper, and these records can undercount crashes. But times have changed. Today, bystanders report crashes on social media in large numbers. With some ingenuity, the World Bank team was able to demonstrate that one can turn tweets into a publicly available dataset. In the case of Nairobi, this resulted in a dataset and map of over 30,000 geocoded crashes.

The World Bank Smart and Safe Kenya Transport (smarTTrans) team focused on Kenya because at a rate of 28 road traffic fatalities (RTF) per 100,000, the country exemplifies the tragedy taking place across sub-Saharan Africa (27 RTF/100,000).

So, what did we do? We leveraged crowdsourced reports of crashes by using a popular X (formerly Twitter) account in Kenya, @Ma3Route, where users share information about transport and traffic conditions. @Ma3Route has over 1.4 million followers, and followers frequently post information on crashes (see example posts below).



We queried over 1 million posts from @Ma3Route from August 2012 through July 2023 and developed algorithms to (1) identify posts that report a crash and (2) geolocate the crash based on information within the text of the post (few posts have geolocation enabled, so we relied on references to roads and landmarks to geolocate crashes). Reports of crashes were then clustered into individual crashes, since multiple posts could report the same crash; this process yielded 31,064 individual crashes. For more details on the algorithm to produce the crash dataset, see our paper here and the algorithm R package here.
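The step of grouping multiple reports into one crash can be illustrated with a minimal Python sketch. This is a simplified greedy rule, not the algorithm in the paper or the R package. The `Report` class, the function names, the 0.5 km and 2-hour thresholds, and the sample coordinates are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from math import asin, cos, radians, sin, sqrt


@dataclass
class Report:
    """One geolocated crash report (hypothetical structure)."""
    posted_at: datetime
    lat: float
    lon: float


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def group_reports(reports, max_km=0.5, max_gap=timedelta(hours=2)):
    """Greedy grouping: a report joins the first crash whose earliest report
    is within max_km and max_gap of it; otherwise it starts a new crash.
    Both thresholds are illustrative assumptions."""
    crashes = []  # each crash is a list of reports; crash[0] is the earliest
    for report in sorted(reports, key=lambda r: r.posted_at):
        for crash in crashes:
            first = crash[0]
            close_in_time = report.posted_at - first.posted_at <= max_gap
            close_in_space = haversine_km(first.lat, first.lon, report.lat, report.lon) <= max_km
            if close_in_time and close_in_space:
                crash.append(report)
                break
        else:
            # No existing crash matched, so this report opens a new one.
            crashes.append([report])
    return crashes


# Made-up coordinates and times, for illustration only.
reports = [
    Report(datetime(2020, 1, 6, 8, 0), -1.2195, 36.8880),
    Report(datetime(2020, 1, 6, 8, 20), -1.2197, 36.8883),  # same place, 20 minutes later
    Report(datetime(2020, 1, 6, 8, 30), -1.3000, 36.8200),  # several kilometers away
    Report(datetime(2020, 1, 7, 8, 5), -1.2195, 36.8880),   # same place, next day
]
crashes = group_reports(reports)
print(len(crashes))  # 3
```

In this toy input, the first two reports collapse into one crash, while the distant report and the next-day report each count as separate crashes.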

Publicly releasing the dataset

We recently made the crash dataset publicly available; it can be downloaded here. We hope the dataset can be a valuable tool for researchers and policymakers alike to inform road safety policies.

There are, however, important considerations to keep in mind when using the dataset:
 

  • Not all crashes are reported by bystanders to @Ma3Route. The dataset represents crashes from times and locations where bystanders are more likely to see and report crashes. Moreover, more visible crashes, such as crashes that result in traffic delays, may be more likely to be reported by bystanders irrespective of their severity. Conversely, fatal crashes that happen at night may be less likely to be reported.

  • Posts do not reliably capture information on fatalities or injuries. Some posts reference that a crash resulted in injuries or fatalities; however, not all posts contain this information and the ability of bystanders to accurately ascertain injury information is uncertain. Not having data on fatalities and injuries is therefore a limitation of the dataset as road safety interventions aim to reduce deaths and injuries from crashes (for example, SDG target 3.6).

  • The algorithm to geolocate crashes is not perfect. We developed and implemented an algorithm that geolocates crashes based on the text of the post, using references to landmarks and roads. To test the accuracy of the algorithm, we manually coded the locations of one year of posts to create a truth dataset. The algorithm determines the correct location for 65% of crashes in the truth dataset (recall). Among the crashes where the algorithm produces a location, the location is correct 81% of the time (precision).

  • Crash reports depend on usage of @Ma3Route. After @Ma3Route started in late 2012, the number of reported crashes grew until mid-2015 and then started to decrease (see Figure 1). The reduction in reported crashes after 2015 may indicate declining use of @Ma3Route rather than a reduction in actual crashes. Similarly, in March 2020, there was a sharp reduction in reported crashes after social distancing measures, including a curfew, were implemented in response to COVID-19. This reduction could result from a reduction in crashes, which has been reported in other contexts, but could also result from fewer users on the road to report crashes. Our working paper here further explores the impact of the curfew on crashes in Nairobi.
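To make the geolocation accuracy figures in the list above concrete, here is a minimal Python sketch of how precision and recall follow from three counts. The function name and the counts are hypothetical; the counts were chosen only so that the results round to the reported 81% and 65%, and they are not the actual sizes of the truth dataset.

```python
def precision_recall(n_truth, n_located, n_correct):
    """Compute (precision, recall) for a geolocation algorithm.

    n_truth:   crashes in the manually coded truth dataset
    n_located: crashes for which the algorithm returned any location
    n_correct: crashes whose returned location matches the truth
    """
    if n_truth <= 0 or n_located <= 0:
        raise ValueError("n_truth and n_located must be positive")
    precision = n_correct / n_located  # correct among those given a location
    recall = n_correct / n_truth       # correct among all true crashes
    return precision, recall


# Hypothetical counts, picked only to reproduce the reported rates.
precision, recall = precision_recall(n_truth=1000, n_located=802, n_correct=650)
print(f"precision={precision:.2f}, recall={recall:.2f}")  # precision=0.81, recall=0.65
```

Under these definitions, recall divided by precision equals the share of truth crashes for which the algorithm returned any location, which would be roughly 80% here (0.65 / 0.81). The rest of the gap to 100% recall comes from locations that were returned but incorrect.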
     

Figure 1. Trends in reported crashes over time



Despite these limitations, the data can still be useful to identify high-risk locations in the city. To map high-risk crash locations, we group crashes within 500 meters of each other into clusters; clusters with the most crashes could be considered blackspots. The map and corresponding table below show that crashes are spatially concentrated. Ten locations (clusters) represent 10% of all crashes, while 100 locations (out of 716) represent about half of all crashes.

Figure 2. Map of crash clusters (left) and table summarizing crashes across clusters (right) from January 2020 through July 2023. Individual crashes within 500 meters are grouped into clusters using Ward’s hierarchical algorithm. “% Crashes” shows the percent of reported crashes out of all 716 clusters within Nairobi.
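For readers working with the dataset, the concentration statistic (the share of crashes that fall in the top k clusters) is straightforward to compute once each crash has a cluster label. The Python sketch below assumes such labels already exist. The function name and the toy labels are hypothetical, and the sketch does not perform the Ward clustering itself.

```python
from collections import Counter


def top_k_share(cluster_ids, k):
    """Share of all crashes that fall in the k clusters with the most crashes.

    cluster_ids holds one cluster label per crash.
    """
    counts = Counter(cluster_ids)
    if not counts:
        raise ValueError("no crashes supplied")
    top = sum(n for _, n in counts.most_common(k))
    return top / sum(counts.values())


# Toy labels: 10 crashes spread over 4 clusters (illustrative only).
cluster_ids = ["A"] * 5 + ["B"] * 3 + ["C", "D"]
print(top_k_share(cluster_ids, 1))  # 0.5
print(top_k_share(cluster_ids, 2))  # 0.8
```

Applied to real cluster labels, the same function with k=10 and k=100 would give the kind of figures quoted above.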



Our aspiration for this dataset is to fully harness the potential of crowdsourced efforts. While the dataset is generated from crowdsourced reports, by making the dataset public we hope to crowdsource analysis of the data. We look forward to seeing how others use the dataset to improve road safety.

Acknowledgements
This work received funding from the ieConnect for Impact program, which is a collaboration between the World Bank’s DIME group and Transport Global Practice. The ieConnect program is funded by UK International Development from the UK government. We thank Sarah Williams and Elizabeth Resor for supporting the research that informed this blog.


Arianna Legovini

Director, Development Impact Evaluation, World Bank

Robert Marty

Research Analyst, Development Impact Evaluation (DIME), World Bank

Sveta Milusheva

Senior Economist, Development Impact Evaluation

Guadalupe Bedoya

Senior Economist, Development Impact Evaluation (DIME), World Bank
