Five small things I’ve learned recently


As a change from my usual posts, I thought I’d note five small things I’ve learned recently, mostly to do with Stata, with the hope that they might help others, or at least jog my memory when I unlearn them again soon.

1. Stata’s random number generator has a limit on the seed you can set: 2,147,483,647.
Why did I learn this? We were doing a live random assignment for an impact evaluation I am starting in Colombia. We had programmed the code and tested it several times, and it worked fine. In our test code, we had set the seed for random number generation as the date “04112018”. Then, when my collaborator went to run this live, we decided to also add the time of the drawing at the end, so that the seed was set as “041120180304”. This generated an error and prevented the code from running. Luckily we could quickly fix it, and the live draw proceeded fine. But lesson learned: 2^31 - 1 is a large number, but it sometimes binds.
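A quick way to see the limit, using the seed values from the story above (the exact error message may vary by Stata version):

```stata
* Seeds at or below 2^31 - 1 are accepted; anything larger errors out.
set seed 04112018        // 8-digit date-based seed: works
set seed 2147483647      // 2^31 - 1, the largest allowed seed: works
set seed 041120180304    // 12-digit date+time seed: exceeds the limit, errors
```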

2. Avoid Stata’s rank command if you want replicable random assignment.
This is a lesson I have to keep re-learning. The situation arose as part of the same random assignment. We wanted to rank firms within each stratum according to an export practices index (EPindex), and then randomize within each stratum among quadruplets of firms with similar practice levels. The problem arises when observations are tied with the same value of the variable. If you use:

egen rankEPindex = rank(EPindex), by(strata) unique

Then the unique option breaks ties, but it does so “arbitrarily”, and differently each time you run the code, regardless of how you have set the seed. This is a problem for replicability.

Instead, you are better off using the following two lines:
sort strata EPindex, stable
by strata: gen rankEPindex = _n

The stable option to sort preserves the order the data are already in when observations are tied, so the code will yield the same results each time you run it.
[update: Arthur Alik Lagrange also helpfully informs me that Stata has a separate command set sortseed that sets the seed used for breaking ties when data are sorted] 
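Putting these pieces together, here is a sketch of replicable within-stratum ranking and grouping. EPindex and strata are the variables from above; the quadruplet variable and the ceil() step are my illustrative additions:

```stata
* Replicable within-stratum ranking and quadruplet formation (sketch).
set seed 04112018          // must be <= 2,147,483,647 (lesson 1)
set sortseed 04112018      // ties in sorts are then broken reproducibly
sort strata EPindex, stable          // stable sort: same result every run
by strata: gen rankEPindex = _n      // 1, 2, 3, ... within each stratum
by strata: gen quadruplet = ceil(rankEPindex/4)   // groups of 4 similar firms
```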

3. How to map longitude and latitude in Stata
I currently have a survey in the field in Nigeria, where we are collecting the GPS coordinates of businesses as part of the survey. I wanted to take a quick look at where surveying had taken place, and to see visually whether my different treatment groups appear to be randomly dispersed spatially. I found this guide, which helped quickly put the data on a map:

Step 1: download a shape file for your region. A quick google found this map library of Africa, and I could then get a shape file for Lagos in .shp form.

Step 2: convert the shape file into Stata format using shp2dta
ssc install shp2dta
shp2dta using "NIR-24.shp", data("lagos_data") coor("lagos_coordinates")

Step 3: Plot your data using the tmap command
tmap dot, by(treatment) ycoord(gps_Latitude) xcoord(gps_Longitude) map("lagos_coordinates.dta")

This gives a basic map, like the one below. You can of course then make it prettier, but from the point of view of quickly seeing where interviews have taken place, I found this very useful.

(note: I have randomly perturbed points for this illustration to maintain anonymity of respondents).

4. Getting rid of commas, dollar signs, etc. from numeric variables stored as strings
I’ve been working with follow-up survey data from a project in the Western Balkans, where some of the data were entered with commas and dollar signs. The data then come in as a string variable, with values like $1,000,000. I don’t know why it took me this long to learn Stata’s subinstr() function, which can strip out the commas and dollar signs.

E.g. if you want to get rid of the commas:
replace amountinvested = subinstr(amountinvested, ",", "", .)
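The full cleaning then takes two more steps: strip the dollar signs as well, and convert the result to a numeric variable. A sketch of the usual pattern, using the amountinvested variable from above:

```stata
* Strip commas and dollar signs, then convert the string to numeric.
replace amountinvested = subinstr(amountinvested, ",", "", .)   // "$1,000,000" -> "$1000000"
replace amountinvested = subinstr(amountinvested, "$", "", .)   // "$1000000"  -> "1000000"
destring amountinvested, replace                                // string -> numeric
```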

5. Apparently not everyone thinks winsorizing and truncating are the same thing
I’ve never used the term winsorize in my papers, but instead have distinguished between truncation (by which I meant that values above the 99th percentile are truncated or shortened by replacing them with the value of the 99th percentile) and trimming (by which I meant that values above the 99th percentile are trimmed or cut from the dataset and set to missing). I had thought that people were using winsorize and truncation to mean the same thing. Only recently have I discovered that most people use truncation and trimming to mean the same thing (both meaning excluding data above or below some percentile), and winsorizing alone to mean replacing outliers above or below some percentile with that percentile. So whenever you read truncate in my old papers, you should take it to mean winsorized.
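In Stata terms, the distinction looks like this (profits is a hypothetical variable name; both operations use the 99th percentile as above):

```stata
* Winsorizing vs trimming at the 99th percentile (sketch).
summarize profits, detail
local p99 = r(p99)
gen profits_w = profits
replace profits_w = `p99' if profits > `p99' & !missing(profits)   // winsorize: cap at p99
gen profits_t = profits
replace profits_t = . if profits > `p99' & !missing(profits)       // trim: set to missing
```

The !missing() guard matters because missing values are treated as larger than any number in Stata comparisons.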
What small thing have you learned recently, that you wish you had known earlier or think others might appreciate knowing? Please share in the comments.


David McKenzie

Lead Economist, Development Research Group, World Bank

May 29, 2018

Super helpful. In terms of new things learned:
1. Stata doesn't seem to have a built-in command (or any user-contributed package) for sample size calculations for estimating a proportion with a given precision, but we can do them by using power/sampsi with power fixed at 0.5 and the difference set to the precision (and of course, using oneproportion/onesample). Kind of cheating, since we are not technically doing a hypothesis test, but the relevant equations are identical at 50% power. The 'trick' works for cluster sampling as well. [h/t: this deck from JHU…]
2. In terms of mapping in Stata, there's a great package called spmap that we used recently to map coverage of a public program in Zambia. Comes with a lot of handy options. Not sure, though, how it compares w/ tmap. [h/t: Jeff M. at IDi]
3. Came across 'wyoung': a Stata package that controls the family-wise error rate (FWER) when performing multiple hypothesis tests by estimating adjusted p-values using the free step-down resampling methodology of Westfall and Young (1993). Looks really neat, can incorporate the stratified design of an experiment while resampling.
Highly likely, though, that people are already pretty familiar w/ these!

May 29, 2018

@DM: On the note of winsorization, have a query: What's best practice in winsorizing aggregated variables/outcomes (e.g., total HH revenue from all sources)? Do we just winsorize the sum or do we winsorize the components and then take the sum to get the aggregate? Does it matter that much? Reviewing papers (or more appropriately, the footnotes therein), the former seems to be the standard practice but I'm not sure.
Also, do/should things change when we are reporting treatment effect estimates for both an aggregated outcome and its components?
Thanks in advance!

David McKenzie
May 29, 2018

I think standard practice is to winsorize the sum, and then if you are reporting effects for the components, winsorize them separately (without then forming a sum of these winsorized components).
What the "best" approach is depends on why you are doing this in the first place: whether you think some outliers are data entry errors or other mistakes (in which case trimming may be more appropriate), versus genuine but extreme observations. In the latter case, you might have a household with extreme levels of consumption of food outside the home whose overall consumption level is not that extreme. So you might not want to winsorize their total, even if you would want to winsorize their food-outside-the-home component if your interest was just in the treatment effect on that outcome.

May 29, 2018

Thanks! This is extremely useful.

Eva-Maria Egger
May 29, 2018

Similar to the subinstr() function, regexr() can be used for any type of replacement within string variables. For example, I had to match municipalities in Brazil with lots of funny characters, which I standardised using:
foreach x in à á â ã ä å {
replace municipio = regexr(municipio, "`x'", "A")
}
If you happen to work with cross-country data from different sources, I discovered the kountry command: it changes country name variables into a standardised naming and coding scheme, and you can get international standard codes (numeric and character), etc.

Adriana Camacho
May 29, 2018

These tips are great... I want to read more of these helpful blogs!

Nicholas J Cox
June 01, 2018

Interesting stuff.
Breaking ties the same way means that results are replicable. So far, so good. But different results from breaking ties differently is part of the legitimate uncertainty about results. In other words, you are asking for consistency in making arbitrary decisions! What is it that depends on breaking ties in exactly the same way? Wouldn't it make more sense not to use the -unique- option at all? (Historically, the -unique- option for ranking was driven by graphical needs, not a desire to produce ranks that are in any sense better or more useful than the default.)
A picky correction is that in Stata functions and commands are quite disjoint, i.e. function is not another name for command, and vice versa. This doesn't really bite except that people can (a) use the documentation a little more easily if they are clear that functions are documented separately (b) realise that allowed syntax is different, notably that functions may only be issued as part of a command. Thus -subinstr()- for example is a function. (I did say this was picky.)

David McKenzie
June 01, 2018

Thanks Nicholas, appreciate being corrected on language (see lesson 5).
On the ranks, a small example may help make clear how this is used in our context.
Suppose we have 6 observations, with values 1, 2, 3, 3, 4, and 5 for some index of business practices, where higher values indicate better business practices. I want to use these for an experiment, in which I randomly assign 2 of these to treatment 1, 2 to treatment 2, and 2 to control. In order to make the groups as comparable as possible, I want to stratify this randomization by forming triplets of observations that have similar values of the business practices.

In doing so, I need to choose which one of the two tied observations to put with {1, 2} and which one to put with {4, 5}. This can be arbitrarily/randomly split, and then I have two triplets {1, 2, 3} and {3, 4, 5}, and I then randomly allocate one unit within each triplet to each treatment group. These groups are just a way of making the observations more similar to one another, so it doesn't matter how I split the tie, just that I split it to get even-sized groups of 3.

But then I want to make sure that every time I run the code, I get the same allocation to the groups, so that I can run the code and get the exact same random assignment of observations to treatment and control as anyone else running the same code.
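This example can be sketched in code (variable names are hypothetical; set sortseed makes the tie-break replicable, as in lesson 2 of the post):

```stata
* Six observations with a tie at 3; form two triplets and assign treatments (sketch).
clear
input practices
1
2
3
3
4
5
end
set seed 04112018          // for runiform() below
set sortseed 04112018      // the tie at 3 is broken the same way every run
sort practices
gen triplet = ceil(_n/3)             // {1,2,3} and {3,4,5}
gen u = runiform()
sort triplet u                       // random order within each triplet
by triplet: gen treatment = _n       // 1 = treatment 1, 2 = treatment 2, 3 = control
```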

June 03, 2018

More fundamentally, why is there a focus on Stata? In the age of open data and open source, when will the WB support a free option like R, for instance?