One of the most exciting and challenging stages of designing research projects is figuring out how to measure key outcomes—especially when the data is poor or non-existent. This is often the case for sensitive topics like criminal activity, particularly in low- and middle-income countries, where administrative records are limited or unreliable.
I’m always eager to learn about new methods for collecting high-quality data in these difficult settings. In this post, I highlight two innovative research projects that have tackled this measurement challenge head-on: one studying drug cartel presence in Mexico, and another documenting the behavior and organization of criminal groups in Medellín, Colombia. These projects exemplify creative approaches to gathering new data where none existed before.
Much of what we know about criminal organizations comes from secondary sources, such as judicial proceedings, police investigations, or journalistic accounts. But in the cities most affected by crime today—precisely where better understanding of the topic is most needed—such high-quality secondary data is often unavailable. That’s where these researchers came in with novel, and very different, measurement strategies.
Measuring Cartel Presence in Mexico
Official data on drug trafficking in Mexico is sparse, and no comprehensive panel dataset records which organizations operate in which locations after 2010. To fill this gap, Sabino (2020) developed a novel method using machine learning and online news sources.
Her approach starts with Google News. Using a web crawler—an automated script that systematically browses the internet—Sabino collected articles that mention both specific municipalities and cartels. The assumption is that local and national media outlets contain regular, detailed, and systematic coverage of when and where criminal organizations are operating. She then applied natural language processing to assess whether an article truly discusses a cartel operating in a particular location.
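To make the first filtering step concrete, here is a minimal sketch of flagging articles that mention both a municipality and a cartel. It is an illustration under stated assumptions, not Sabino's code: the name lists, the function name `co_mentions`, and the sample sentences are all hypothetical.

```python
import re

# Hypothetical name lists; the study's actual lists are not given in this post.
MUNICIPALITIES = ["Culiacán", "Tijuana", "Ciudad Juárez"]
CARTELS = ["Cártel de Sinaloa", "Los Zetas"]


def _pattern(names):
    # Match any of the names as a whole phrase, ignoring case.
    alternatives = "|".join(re.escape(name) for name in names)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


MUNI_RE = _pattern(MUNICIPALITIES)
CARTEL_RE = _pattern(CARTELS)


def co_mentions(article_text):
    """Return every (municipality, cartel) pair named anywhere in the article."""
    munis = {m.group(0).lower() for m in MUNI_RE.finditer(article_text)}
    cartels = {c.group(0).lower() for c in CARTEL_RE.finditer(article_text)}
    return {(muni, cartel) for muni in munis for cartel in cartels}


# Both hypothetical articles produce the same candidate pair, but only the
# first one actually reports a cartel operating in the municipality.
report = "Autoridades reportaron que Los Zetas operan en Tijuana."
unrelated = "Una conferencia en Tijuana analizó un libro sobre Los Zetas."
print(co_mentions(report))
print(co_mentions(unrelated))
```

A co-mention is only a candidate. The second sample article names both a city and a cartel without reporting any cartel activity there, which is why a classification step is needed on top of the keyword match.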
Specifically, she used a semi-supervised convolutional neural network (CNN)—a type of algorithm commonly used for image recognition but increasingly applied to text classification (Kim 2014).[1] She manually labeled 5,000 sentences to train the model to distinguish between actual reports of cartel presence and unrelated mentions.
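For readers who want intuition for the core operation in a Kim (2014)-style text CNN, the sketch below runs a forward pass in plain Python: a filter slides over windows of consecutive word vectors, each filter's outputs are reduced by max-over-time pooling, and the pooled features feed an output unit. This is a simplified illustration, not Sabino's model. The toy embeddings, filter values, and the single logistic output (in place of a softmax layer) are assumptions, and training and the semi-supervised component are omitted.

```python
import math


def conv_feature_map(embeddings, filt, bias=0.0):
    """Apply one filter to every window of len(filt) consecutive word vectors.

    Assumes the sentence has at least len(filt) words. Uses a ReLU activation.
    """
    height = len(filt)
    feature_map = []
    for start in range(len(embeddings) - height + 1):
        window = embeddings[start:start + height]
        total = bias + sum(
            weight * value
            for filt_row, word_vec in zip(filt, window)
            for weight, value in zip(filt_row, word_vec)
        )
        feature_map.append(max(0.0, total))
    return feature_map


def sentence_score(embeddings, filters, out_weights, out_bias=0.0):
    """Max-over-time pool each filter's feature map, then apply a logistic unit."""
    pooled = [max(conv_feature_map(embeddings, filt)) for filt in filters]
    logit = out_bias + sum(w * p for w, p in zip(out_weights, pooled))
    return 1.0 / (1.0 + math.exp(-logit))


# Toy input: a three-word sentence with two-dimensional word vectors
# and one filter spanning two words (all values are made up).
toy_sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_filter = [[1.0, 0.0], [0.0, 1.0]]

print(conv_feature_map(toy_sentence, toy_filter))
print(sentence_score(toy_sentence, [toy_filter], out_weights=[1.0], out_bias=-2.0))
```

In a trained model the filter and output weights are learned from the labeled sentences, and the score would be read as the estimated probability that a sentence reports cartel presence.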
To validate her data, Sabino compared it to other sources, including the dataset from Coscia and Ríos (2017), two hand-collected datasets from local newspapers (Sánchez Valdés 2015, 2017), and state-level U.S. DEA data. The results were strongly correlated, supporting the credibility of her new measure.
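A validation check of this kind can be sketched as a correlation between two presence indicators defined over the same municipality-year cells. The post does not say which statistic was used, so the Pearson coefficient here is an assumption, and the indicator values are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical 0/1 presence indicators for the same six municipality-years:
# one from the news-based measure, one from a benchmark dataset.
news_based = [1, 0, 1, 1, 0, 0]
benchmark = [1, 0, 1, 0, 0, 0]

print(round(pearson(news_based, benchmark), 3))
```

With binary indicators the Pearson coefficient is the phi coefficient, so a value near 1 means the two sources flag largely the same municipality-years.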
Mapping Medellín’s Criminal Ecosystem
To understand how Medellín’s complex web of criminal organizations operates, a team of researchers (Chris Blattman, Santiago Tobon, Gustavo Duncan, Ben Lessing, Juan Martinez, and Arantxa Rodriguez) launched a large-scale mixed-methods research project in 2016.
Over seven years, they conducted hundreds of qualitative interviews and thousands of surveys, and consulted a wide range of secondary sources and experts. Their interviewees included:
- Community members, leaders, and shopkeepers, who described the services gangs provide, the fees they charge, and their own perceptions of both state and criminal actors.
- Public officials, such as prosecutors, police officers, and local leaders, who offered institutional perspectives.
- Journalists, including Medellín’s most experienced organized crime reporter, who shared information from news articles, court transcripts, and private sources.
- Gang members themselves—the heart of the project. By 2024, the team had interviewed 180 members from 80 different criminal groups, ranging from neighborhood "combos" to higher-level mafia-like structures called razones.
Interviews took place both in communities and in Medellín’s three prisons. Inside prisons, the team used a snowball sampling approach: wardens initially facilitated access to incarcerated gang members, and the researchers built a referral network to expand their reach. One of their key collaborators was a former gang member turned government outreach worker, who became a full-time “research associate” and helped connect the team with current and former gang affiliates.
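The referral logic behind snowball sampling can be sketched as a breadth-first walk over a referral network: start from a few seed contacts, then follow referrals wave by wave. The participant labels, the `referrals` mapping, and the cap on waves below are hypothetical and only illustrate the general technique, not the team's actual protocol.

```python
from collections import deque


def snowball_sample(seeds, referrals, max_waves):
    """Return {participant: wave} for everyone reached within max_waves referral rounds."""
    wave_of = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        person = queue.popleft()
        if wave_of[person] == max_waves:
            continue  # do not follow referrals beyond the last wave
        for referred in referrals.get(person, []):
            if referred not in wave_of:  # each participant is recruited once
                wave_of[referred] = wave_of[person] + 1
                queue.append(referred)
    return wave_of


# Hypothetical referral network: who introduces the researchers to whom.
referrals = {"A": ["B", "C"], "B": ["D"], "D": ["E"]}
print(snowball_sample(["A"], referrals, max_waves=2))
```

Recording the wave at which each participant entered the sample is useful later, since people reached through long referral chains may differ systematically from the initial contacts.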
They also surveyed about 10,000 seventh and eighth graders enrolled in Medellín’s 100 most violent schools. The surveys measured antisocial behavior, impulsivity, parental supervision, and, using vignettes, how youth perceive the financial and social rewards of joining gangs (combos).
What kinds of research and policy-relevant questions can this data help answer? Quite a few, and they are part of a highly novel and impactful research agenda. For instance, the data allows the team to study the market structure and political economy of organized crime, as presented in Blattman et al. (2024). It also provides insights into long-term criminal trajectories, enabling analysis of occupational choices among 15,000 teenage boys in Medellín.[2]
In addition, the research has served as the foundation for building partnerships with governments and civil society organizations to design and evaluate pilot interventions aimed at improving security outcomes. Some of these initiatives, such as those described in Blattman et al. (2022), have the potential to be scaled up in the future.
For other researchers working on similar topics, the team—together with additional collaborators—has created a private, encrypted wiki called WikiCombo. This collaborative platform is well-suited for managing the networked and decentralized nature of the data, especially when collected by many contributors. All primary and secondary sources are uploaded and encrypted, and the information is organized into thematic pages on each criminal group and related topics, with direct links to supporting documents whenever possible. Details on how to request access to WikiCombo and additional materials can be found [here].
These two projects show what’s possible when researchers step outside the bounds of traditional data collection. Whether scraping news websites with machine learning or building trust with gang members to gather firsthand accounts, both approaches reveal new ways to study hard-to-measure phenomena—and to do so with rigor and creativity.
[1] To understand a bit better how CNNs work, check this useful blog as a starting point.
[2] Blattman, Tobon, Rodriguez-Uribe (work in progress).