
Making Analytics Reusable

Fernando Hoces de la Guardia (BITSS) leads an interactive session using R Markdown to create dynamic, reproducible documents blending code and writing.

This is a guest blog post by the Berkeley Initiative for Transparency in the Social Sciences (BITSS), DIME Analytics, and Innovations in Big Data Analytics teams. This post was written by Benjamin Daniels, Luiza Andrade, Anton Prokopyev, Trevor Monroe and Fernando Hoces de la Guardia. The workshop also included presentations by Mireille Raad and Dunstan Matekenya.

Since 2005, the share of empirically-based papers published in development economics journals has skyrocketed, reaching more than 95% by 2015. Today, lab-style research groups and teams typically maintain in-house capacity for the entire research workflow. This development means that new, scalable methods for ensuring high-quality research design, data collection, analysis, and publications are needed for evidence to remain transparent and credible. We call these workflows “reusable analytics”, because they are research processes that can be verified by outside teams, or repurposed for a different analysis by the same team later on. Research teams almost universally plan to adopt such processes, but there is also a pervasive sense that actually making analytics reusable is costly and difficult. Therefore, our analytics teams are currently putting extensive effort into selecting and developing flexible tools and processes that can be used over and over again—so we can deliver recommendations and trainings for easy-to-learn and easy-to-use reusable analytics.

The DIME Analytics group, the Innovations in Big Data Analytics Team, and the Berkeley Initiative for Transparency in the Social Sciences (BITSS) recently co-hosted a hands-on workshop on “Making Analytics Reusable”. The workshop offered hands-on training in some core tools for reusable analytics: code collaboration using Git and GitHub, dynamic documents using Stata, R, and Python, and team task management using GitHub issues and project boards. The over-subscribed attendance reflects growing demand for modern principles and practices that accelerate learning, transparency, reproducibility and efficiency in research and policy analysis.

It seems that everybody wants to make their research workflows reusable, because doing so enhances the quality and credibility of work across the board. But truly reusable workflows have two components. The obvious one is technical understanding: the practical knowledge of how to write code, programmatically access and manage data, or use Git. That seems to be what most people who attend our trainings or workshops think they need most, and there are many technically-oriented resources already available online. However, we think people also need to develop a conceptual understanding of reusable workflows, which is much more subtle and much harder to teach. In this post we will describe what we’ve learned about these processes through teaching them, and how you can start thinking about reusable analytics more effectively.

The more we taught these courses, the more we realized that the mental model most people have of how to get data work done is out of sync with the goal of reusable processes. Making analytics reusable means re-imagining your workflows from the perspective of a different user trying to retrace your steps, be that your future self, your team members, or someone across the globe trying to reproduce your work. This perspective shapes how you approach tasks, down to the level of managing file structures and collaboration workflows. During the workshop, we held hands-on sessions on two sets of tools to help in this process.

Benjamin Daniels (DIME Analytics) teaches participants how to use Git and GitHub in collaborative workflows with version control and simultaneous editing.

Git and GitHub are tools for version control that can be used to track changes in code, results, papers and reports more efficiently than naming files like “final_paper_HJAL_rev_v7.doc”. They can make your life a lot easier, but can also be overwhelming at first glance. One principle essential to working with Git, however, can be incorporated into any workflow: mentally breaking down massive tasks into smaller and smaller parts. This makes it easier to think about how bits and pieces of the workflow can be re-designed with collaboration and reusability in mind. Breaking down tasks leads naturally to breaking down files. In practice, that could mean gradually modularizing code and documents so that multiple people can work on them simultaneously. Instead of my-report.docx, you compile chapter-01.tex, chapter-02.tex, figure-01.eps, and so on. This kind of modularization reorganizes familiar objects like folders, files, and versions around human-scale micro-tasks that can proceed in parallel and are designed to avoid getting entangled with each other.
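To make the file-splitting idea concrete, here is a minimal Python sketch that assembles a master LaTeX file from separate chapter files. It was not part of the workshop materials; the function name build_master and the article document class are assumptions chosen for illustration, and the chapter file names simply follow the pattern mentioned above.

```python
from pathlib import Path


def build_master(chapter_names):
    """Return the text of a master LaTeX file that pulls in each chapter.

    Hypothetical helper: each chapter lives in its own file, so different
    people can edit different chapters without touching the same file.
    """
    lines = [r"\documentclass{article}", r"\begin{document}"]
    for name in sorted(chapter_names):
        # \input takes the file name without its .tex extension
        stem = Path(name).stem
        lines.append(rf"\input{{{stem}}}")
    lines.append(r"\end{document}")
    return "\n".join(lines)


master_text = build_master(["chapter-02.tex", "chapter-01.tex"])
print(master_text)
```

Because the master file only lists the pieces, a change to one chapter shows up in version control as a change to one small file, which keeps collaborators' edits from colliding.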

The other focus of the workshop was how to create dynamic documents in various statistical software packages. R, Stata, or Python can be used to produce interactive “notebooks”, containing both the writing and the code that produces results. Such notebooks, most recently popularized by Jupyter, are arguably an entirely new way to write scientific papers; one example is Fernando Hoces de la Guardia’s (BITSS) “The Effects of a Minimum-Wage Increase on Employment and Family Income”. This is the most advanced way to create dynamic documents, but if you’re not ready to fully invest in it yet, there are other, less technical steps you can take to make documents more dynamic and easier to reproduce. An introduction to the world of dynamic documents usually starts with switching from Word to LaTeX, so you can reference tables and figures without having to copy and paste results every time they are updated. Modularization can help here as well: if instead of my-analysis.do, you have figure-01.do, figure-02.do, and so on, it’s much easier to find how each result in a document was created, and to change results individually or make different people responsible for them.
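The core idea of a dynamic document can be sketched in a few lines of Python: the numbers in the text are computed from the data at the moment the document is built, so nothing is copied and pasted by hand. This is only an illustration under stated assumptions; the data values, the template wording, and the function name render_report are all hypothetical and do not come from the workshop or from any paper mentioned here.

```python
# Hypothetical data: number of workers at four made-up sites.
employment = [12, 15, 9, 20]

# The prose lives in a template; the numbers are filled in by code.
TEMPLATE = "Average employment across {n} sites was {avg:.1f} workers."


def render_report(values):
    """Fill the template with statistics computed from the data."""
    average = sum(values) / len(values)
    return TEMPLATE.format(n=len(values), avg=average)


print(render_report(employment))
# prints "Average employment across 4 sites was 14.0 workers."
```

If the data change, re-running the script updates the sentence automatically. Tools such as R Markdown and Jupyter apply the same principle to whole papers.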

Teaching these processes as a workflow mindset is new to us—we are only just elaborating the full details for ourselves. What we do know is that making your work fully reusable, reproducible, and transparent is a long process—but done right, every step along the way makes all the rest of your work easier, not harder. As we continue to deliver trainings on the technical contents of these workflows, we plan to focus on a more balanced combination of technical skills and conceptual understanding. We want to teach people to realize that the way to truly learn most of these skills is one day at a time, through a process of acculturation, rather than all at once—and that a workshop like “Making Analytics Reusable” is meant to be only the beginning of a journey.
