
Generating Regression and Summary Statistics Tables in Stata: A checklist and code

Matthew Groh
As a research assistant working for David, I’ve had to create many, many regression and summary statistics tables. Just the other day, I sent David a draft of some tables for a paper that we are working on. After re-reading the draft, I realized that I had forgotten to label the dependent variables and add joint significance tests in a couple of regression tables. To avoid forgetting these details in the future, and to potentially help other researchers, I thought I’d post a checklist for generating regression and summary statistics tables.
 
Like a first draft of a paper, a first draft of a Stata .do file is prone to typos and other errors. The awesome thing about scripting in Stata is that if there’s a scripting error, then Stata will error out and simply stop. However, if there’s a non-scripting error that affects the data, the .do file will generate flawed results without any warning! So, before sharing treatment effects regressions or summary statistics with co-authors or your PI, first make sure you can answer yes to the following questions.
 
  1. Does the number of observations for each regression or summary statistic make sense?
  2. Do the magnitude and sign of each coefficient/summary statistic seem reasonable?
  3. Did you delete the constant term and add the control mean in the regression table?
  4. Did you check for joint significance of your covariates?
  5. Did you label the dependent variables/columns?
  6. Did you label the covariates/rows?
  7. Did you add a title?
  8. Is it clear what the estimation procedure is (e.g. OLS vs. probit)?
  9. Are the column widths the right size so as not to cut off text?
  10. Is the bordering consistent with your other tables?
  11. Are the numbers rounded to an appropriate level, so you don’t display too many decimal places?
  12. Do the notes to the table clearly indicate how standard errors have been estimated, and what control variables if any have been included but not shown?
 
If the answer is no to any of the above questions, then you know that you have to go back and make adjustments. If the number of observations in each regression is not consistent, then maybe you’re accidentally dropping observations. Or, maybe there’s a legitimate reason.  Your regressions need not fit your initial hypothesis and often they won’t. But, the more out-of-sync your regression is with expectations, the more you should consider re-checking your underlying data.
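
One way to check the first item is to compare the estimation sample with the full dataset. The lines below are only a sketch: they use the auto.dta system data, and the choice of variables is an assumption made for illustration, not part of the linked .do files.

```stata
* Sketch: compare the estimation sample with the full dataset.
* auto.dta and the variables below are stand-ins for your own data.
sysuse auto, clear

* Observations in the dataset
count
local n_data = r(N)

* rep78 has missing values, so this regression uses fewer rows
regress price mpg rep78
local n_reg = e(N)

display "In data: `n_data'   Used in regression: `n_reg'"

* See which variable is responsible for the dropped rows
misstable summarize price mpg rep78

* Inspect the observations that were left out of the regression
list make price mpg rep78 if !e(sample)
```

With auto.dta this reports 74 observations in the data and 69 in the regression, because rep78 is missing for five cars. Run the block all at once so the locals are still defined when display is reached.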
 
For regressions on treatment effects, the control mean is often a clearer, simpler statistic than the constant term. A test for joint significance (the F-test) is informative because it tests the null hypothesis that the coefficients on all of the tested covariates are jointly zero, that is, that none of them helps explain the dependent variable. The rest of the checklist simply involves crossing t’s and dotting i’s.
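
For the joint significance test, Stata's built-in test command can be run right after the regression. This is a minimal sketch on the auto.dta system data; treating foreign as if it were a treatment dummy, and the particular covariates, are assumptions for illustration only.

```stata
* Sketch: joint F-test on a set of covariates after a regression.
* foreign stands in for a treatment dummy purely for illustration.
sysuse auto, clear

regress price i.foreign mpg weight length, vce(robust)

* H0: the coefficients on mpg, weight and length are all zero
test mpg weight length

* test leaves the statistic and its p-value in r()
display "F = " %6.2f r(F) "   p-value = " %6.4f r(p)
```

The stored r(F) and r(p) can then be added to the table by hand or with whatever table command you use.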
 
As a reference for generating publication-quality tables, I’ve included two 10-step examples in Stata. These scripts rely on xml_tab and mat2txt for regression and summary statistics tables, respectively, but I try to make these packages easier to use by adding locals and walking you through each step. Below are example output tables (using system data) with very minor manual adjustments to add the control means and F statistic to the regression table and to add borders and a title to both tables. The Stata code can be found here for regression tables and here for summary statistics tables.


Please add in the comments any other key items that should go in this checklist that you find yourself, your co-authors, or your RAs frequently forgetting.

Comments

Submitted by Iván on

Hi, the code does not work. Using Stata 13, it stops at mat m = J(`nmodels',`ndep',.) and says "invalid syntax". Any suggestions?

Submitted by Matt on

Hi Ivan,

Indeed, it does run on Stata 13, 12, etc. Did you change part of the code? Did you download the mat2txt and xml_tab packages?

Matt

mat m = J(`nmodels',`ndep',.)

Submitted by Ruth Ann Church on

Hi Matt:
Thank you very much! I haven't tried it yet, but I expect this will be very helpful for my research.

Ruth Ann Church
President, Artisan Coffee Imports AND masters candidate at Michigan State University

Submitted by Iván on

Thanks, it does run. My mistake was running it line by line instead of running the whole set of commands at once.
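
For anyone hitting the same error: local macros only exist for the duration of the run that defines them, so a line that uses them fails if it is executed separately from the lines that set them. The sketch below illustrates this; the values 3 and 2 are hypothetical and are not taken from the linked .do file.

```stata
* Run this whole block in one go (for example, as a single do-file).
* The values 3 and 2 are made up for illustration.
local nmodels = 3
local ndep = 2

* Both locals are defined, so this creates a 3 x 2 matrix of missing values
matrix m = J(`nmodels', `ndep', .)
matrix list m

* If the "matrix m" line is run on its own in a later, separate
* execution, the locals are empty and Stata sees
*     matrix m = J(, , .)
* which stops with a syntax error.
```

Selecting the full script in the do-file editor before executing it, or running it with do, avoids the problem.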

Submitted by jrc on

Here is a little toy example using "esttab", which I really like for making regression tables. It looks nice, goes straight into LaTeX code (if you so choose), is easily adaptable, and there is great documentation online. It makes it really easy to update results in your paper when you change a spec because you just re-run the code and re-compile the LaTeX and voila! (Note: this won't run, it's just an example.)

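estimates clear
* Placeholders throughout: Y, Y2 and the locals X, X1, X2 and cluster are not defined here
* Options such as cluster() must come after a comma
eststo: xi: reg Y `X' `X1', cluster(`cluster')
qui sum Y
* Attach the mean of the dependent variable to the stored estimates
estadd scalar Mean = r(mean)

eststo: xi: reg Y2 `X' `X2', cluster(`cluster')
qui sum Y2
estadd scalar Mean = r(mean)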
......

esttab using NAME.tex, replace keep(`X') ///
title("Title Here") ///
mtitles("Model 1" "Model2" ... "Model N") ///
cells(b(star fmt(3)) se(par fmt(3))) nonumbers label ///
stats(Mean r2 N, fmt(2 3 0) label("Mean of Y" "R-sqrd" "Obs")) ///
addnote("Standard errors clustered at..." ///
"Regressions also include...." ///
"Tables report marginal effect in (Units or Something Here)" ///
"Symbols for p-values: + 0.10 * 0.05 ** 0.01 *** 0.001") ///
star(+ 0.10 * 0.05 ** .01 *** .001)

Submitted by Eric Melse on

Dear Matthew,

Can you provide some literature reference about your recommendation: Did you delete the constant term and add the control mean in the regression table?
I assume the control mean is the mean of Y(?)

Submitted by Matt Groh on

Hi Eric,

In the context of treatment regressions for randomized experiments, the control mean is a simpler, more intuitive data point than the constant term. Ultimately, the control mean is the counterfactual for the treatment mean, whereas the constant term is just a component of the regression. If a treatment regression is y=a+bT + e where T=1 for treatment and T=0 for control, then the estimated constant term equals the control mean (when both are computed on the same sample). But if a treatment regression includes stratification covariates (e.g. y=a+bT + cV + e), then the constant term is the predicted outcome when T=0 and V=0, which may be very different from the control mean. As for a literature reference, I'd just suggest you look at examples from papers reporting on randomized experiments (e.g. Miracle of Microfinance? Evidence from a Randomized Experiment and Estimating the Impact of Microcredit on Those who Take it Up: Evidence from a Randomized Experiment in Morocco).

-Matt
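
The difference between the constant term and the control mean is easy to see on system data. This is a rough sketch only: foreign is used as a stand-in for the treatment dummy T and weight as a stand-in for a covariate V, neither of which comes from an actual experiment.

```stata
* Sketch: constant term vs. control mean, with and without a covariate.
* foreign stands in for T and weight for V, purely for illustration.
sysuse auto, clear

* Control mean: mean outcome where the "treatment" dummy is 0
quietly summarize price if foreign == 0
local control_mean = r(mean)

* No covariates: the constant is the control mean
quietly regress price foreign
display "Constant, no covariates:  " %9.2f _b[_cons]

* With a covariate: the constant is the prediction at foreign = 0 and weight = 0
quietly regress price foreign weight
display "Constant, with covariate: " %9.2f _b[_cons]

display "Control mean:             " %9.2f `control_mean'
```

The first constant matches the control mean, while the second refers to a car with zero weight, which is why the control mean is usually the more readable number to report.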

Submitted by Dan Killian on

This code is very helpful. I especially like the summary stats.

Is there any way for the regression output table to include the R squared and adjusted R squared?

Submitted by Matt Groh on

Hi Dan,

Yes, you can include the adjusted R squared by adding r2_a to the stats() component. In the linked regression table .do file you'd write the following: xml_tab `tab1', title("Table 1. Example Regression Table") sheet(`depvar1') below stats(N r2_a) save(`folder'`filename') replace

For more information, see help xml_tab.
-Matt
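
The names N and r2_a used above match scalars that regress leaves behind in e(), alongside r2 for the unadjusted R squared. Assuming stats() refers to those e() scalars by name, a quick way to see the values before building the table is the sketch below, which uses the auto.dta system data and an arbitrary specification.

```stata
* Sketch: inspect the scalars that regress stores in e().
* The specification is arbitrary and only for illustration.
sysuse auto, clear
quietly regress price mpg weight

display "N           = " e(N)
display "R-squared   = " %5.3f e(r2)
display "Adjusted R2 = " %5.3f e(r2_a)

* Full list of stored results for the last estimation command
ereturn list
```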

Submitted by Melissa Rubio on

Dear Matt,

I am running the code for summary statistics, however it says that "matrix y not found". Any suggestions? Thanks a lot.

Submitted by Matt Groh on

Hi Melissa,

Try running the code all together. There is no matrix y in the code, so I'm not sure why you're getting that error. Could you be more specific about where you're getting the error? To be clear, the matrix is m and y is a local macro.

-Matt
