How to make user-written Stata commands really reproducible

Anyone who has tried to run someone else’s Stata files has probably hit the error that user-written commands needed for the project are not installed. This is annoying when you are trying to re-run a colleague’s code, but it becomes a real headache when creating reproducibility packages.

 

This issue is more complex than it may seem! People usually solve it by adding a line to their scripts that installs the command from SSC (Stata’s standard platform for distributing user-written commands), but this only enables the file to run; it does not mean that outputs will be reproducible. That’s because SSC is not versioned: only the most recent version of each user-written command is available. If you’re lucky, the new version will still generate the same results as the version the author originally used. This requires more and more luck as time goes by, and very likely there will come a point when the version available on SSC is so different that it breaks your code or changes your results.

  

In other programming languages commonly used in research, the standard platforms for distributing user-written commands provide a solution to this by allowing users to install older versions of commands. See for example renv for R or pip for Python. Since most economists still code in Stata, we (DIME Analytics) developed a technical solution by creating a new option adopath() to our existing command ieboilstart. We think this provides an adequate and easy-to-use solution to this common problem.

 

Our new option does not use virtual computer images as suggested here. While that approach is better, it is not yet an easy-to-use solution for a typical researcher, especially when working with licensed software like Stata. We welcome the day this is made easy, but in the meantime, we recommend using our new option adopath().

 

The adopath() option is part of the ieboilstart command in the ietoolkit package. This command is meant to be used at the top of a project’s main-script (previously called master-script) as a substitute for boilerplate code such as set version, set maxvar, and set more off. See the helpfile for ieboilstart for features other than adopath().

 

What does this new feature do?

 

You might have heard of “setting the ado-path”. This tells Stata where to look for user-written commands (ado-files). Modifying the ado-paths manually is, unfortunately, not straightforward, and it is even harder to do it correctly and persistently. We created a simple command option - adopath() - that does this in one line (if you want the technical details, see our continuing education session on the topic - slides, recording - after reading this blog). 

 

The intended workflow is to create a project-specific folder where all user-written commands used for that project are installed, which we call the project ado-folder. The project ado-folder can be created anywhere and called anything. However, we suggest calling it “ado” and strongly recommend saving it in the same location as that project's other do-files to ensure it is included when sharing code with team members (using GitHub, OneDrive etc.) or when creating a replication package.

 

After creating the project ado-folder, you simply use adopath("/path/to/ado-folder", strict) to point to it. Once you’ve done this, any command you install will automatically be placed in the project ado-folder, and Stata will no longer find any commands you installed prior to setting the option (i.e. those that are not saved in the project ado-folder). This ensures that all the user-written commands your project needs are indeed included in the project ado-folder. These settings apply only to the current session: Stata’s default ado-paths are restored when you restart Stata.
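As a rough sketch, the call described above could look as follows at the top of a main-script. The root path, the global name myproject, the sub-folder code/ado and the version number 13.1 are all placeholders chosen for illustration, and the sketch assumes the ado-folder has already been created.

```stata
* Hypothetical root path; replace with the location of your own project
global myproject "C:/Users/username/Documents/myproject"

* Point Stata to the project ado-folder (13.1 is only an example version)
ieboilstart, version(13.1) adopath("${myproject}/code/ado", strict)
`r(version)'

* List the ado-paths Stata now searches
adopath
```

Running adopath without arguments simply lists the current search path, which is a quick way to confirm that the project ado-folder is in use.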

 

Using adopath() and including the project ado-folder when sharing your code means that your code should no longer include net install or ssc install. Instead, you share the exact version of each command your project’s code needs, and adopath() makes sure that all users running your code use that exact version and nothing else.

 

How is this new feature used?

 

Figure 1 shows how ieboilstart, adopath() is intended to be included at the top of a main-script (if you are not familiar with main-scripts and how to set root paths, read this section of the DIME Analytics Data Handbook).

 

Figure 1 - Download a similar do-file example here: https://osf.io/p95va

 

These file paths correspond to the folder diagram in figure 2. The folder called ado is the project ado-folder. After running the code in figure 1, you install commands to the project ado-folder by using net install/ssc install in your main Stata window. Do not run net install/ssc install from the do-file editor, as these lines should not be part of the code you share.
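For illustration, an install typed into the Command window might look like the sketch below. The estout package is only an example of a real SSC package and is not one the post mentions; the which line is an optional check that should report a location inside the project ado-folder if the setup worked as described.

```stata
* Typed in the Command window after the main-script has run ieboilstart
* (estout is an example package, used here only for illustration)
ssc install estout

* Show where Stata finds the command that was just installed
which esttab
```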

 

In figure 2, two fictitious commands - nicetable and specialreg - have been installed from SSC (note: we have simplified how the “ado” folder looks, as SSC creates some files and sub-folders here; users do not need to worry about those, just don’t move or delete them). Any project-specific ado-files are included in the adopath() workflow simply by saving them in the “ado” folder, disregarding any sub-folders.

 

Figure 2 - Suggested folder structure for a project using ieboilstart, adopath()

 

We recommend that you set up adopath() at the start of a project and install commands in the project ado-folder as the need arises. While it is also possible to set it up at the end when preparing a reproducibility package, you risk having to manually copy commands into the project ado-folder if the exact version of a command you need is no longer available on SSC.

 

One technical requirement

 

ieboilstart cannot install itself. ieboilstart, adopath() will throw an error if a user does not already have the package ietoolkit installed. We recommend including the code below before the line with ieboilstart, adopath(). This code tests if a user has a recent enough version of ietoolkit installed. If not, then the user is prompted to install it and is offered a one-click way to do so.

 

Figure 3 - Download a similar do-file example here: https://osf.io/3tf5g
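For readers without access to the figure, a check along these lines might look like the sketch below. It is not a copy of the downloadable example: it assumes that the ietoolkit command returns the installed version in r(version), and the minimum version 7.0 is a placeholder, so check the ieboilstart help file for the actual requirement.

```stata
* Stop with a clickable install link if ietoolkit is not installed
capture which ietoolkit
if _rc == 111 {
    display as error `"{phang}ietoolkit is not installed. Click to install: {stata ssc install ietoolkit}{p_end}"'
    error 111
}

* Assumption: the ietoolkit command stores the installed version in r(version)
ietoolkit
if `r(version)' < 7.0 {
    * 7.0 is a placeholder for the minimum version that supports adopath()
    display as error `"{phang}ietoolkit is out of date. Click to update: {stata ssc install ietoolkit, replace}{p_end}"'
    error 198
}
```

The {stata ...} directive makes the install command clickable in the Results window, which is one way to offer the one-click install described above.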

 

What about Stata’s built-in commands?

adopath() version controls user-written commands, but it is equally important to version control Stata’s built-in commands. ieboilstart already had support for this in the option version(). However, for this to work, `r(version)' must be included on the immediately following line, as shown in figures 1 and 3. See the help file for ieboilstart for more details on this.
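A minimal sketch of this two-line pattern, with 13.1 as an example version number only (use the version your project targets). The display line is an optional check added here for illustration.

```stata
* Example version number; an assumption for illustration
ieboilstart, version(13.1)
`r(version)'

* Optional check: show the version setting now in effect
display c(version)
```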

 

Feedback

The source code for this command can be found here. This is a new feature and we’re sure there are things we have not thought of yet, so we appreciate any feedback you would like to share.

Authors

Kristoffer Bjarkefur

Consultant, Impact Evaluation Unit, Development Research Group, World Bank

Join the Conversation

Camo
April 10, 2023

Thanks for a neat solution to an underappreciated problem!

The groundhog package seems the way to go these days for doing this in R. I'd link to it, but URLs are not allowed in comments.

Nicolas SATONGNON
April 10, 2023

This is a good initiative.