# A round-up of recent questions from my mailbox: verifying randomization, balance tests, treatment heterogeneity, and multiple testing in DiD

I took a much-needed week of holiday recently, and came back to find quite a few questions about impact evaluation in the pile of emails that had built up. Since some of these might be of broader interest, and/or might get asked again, I thought I’d share some of the questions and my thoughts on answers to them here. I recently came across Henry Ford’s quote “We have most unfortunately found it necessary to get rid of a man as soon as he thinks himself an expert because no one ever considers himself expert if he really knows his job.” – so please don’t think of these as expert answers, and correct me/improve these answers in the comments if you have other thoughts on them.

I have adjusted some details of the questions to protect the privacy of those asking, and, in some cases, to make them of more general applicability. The paraphrased questions are then *in italics.*

**Verifiable randomization:** *I need to randomly select a sample of projects to get evaluated by an outside consultant. My question is whether there is a standard “credibility” technology that I can use to credibly document that I didn’t draw lots of random samples until I found one that contains only projects that I want evaluated?*

This is a situation that also arises in randomized experiments, especially those with governments who may have a legal need to document the allocation procedure. The technology needed depends on whether the procedure has to be auditable, has to demonstrate trustworthiness to citizens, or just has to be documented for research purposes. Here are some technologies:

1) Public drawing: this has been done with public randomization ceremonies with physical names or numbers getting drawn; or running the code in public, with someone from the government or an auditor supplying a seed for the random number generator (perhaps after publicly demonstrating that different seeds lead to different assignments). One thing to watch out for is that Stata has limits on how large the seed can be – as we found out to our dismay in a live randomization ceremony in Colombia, as discussed in this previous post.

2) Double-blind selection: you get someone else to take the sample and randomly order it, then number it from 1 to N, and then you independently randomly select the treatment assignment of n out of N units, and then rematch this with the data. Or alternatively, you order the data in order of the unit identification numbers, and then give someone else the numbers only, and have them select the assignment.

3) Use the date and time, or 123 as an assignment seed – this perhaps makes it seem less likely that you have tried lots of seeds, but obviously is less concrete.

4) Have someone else select a seed and email it to you, so that you have documented evidence of where the seed came from.
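Options 3 and 4 can be sketched in a few lines. This is a minimal illustration in Python rather than Stata, under assumptions of my own: the project IDs, the sample size, and the seed value are all hypothetical, and the SHA-256 fingerprint of the project list is an extra step not mentioned above, added so that the list can be fixed and shared before the seed is known.

```python
import hashlib
import random

def list_fingerprint(project_ids):
    """SHA-256 fingerprint of the sorted ID list, to circulate before the seed is chosen."""
    canonical = "\n".join(sorted(project_ids))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def draw_sample(project_ids, n, seed):
    """Draw n projects reproducibly from an integer seed supplied by someone else."""
    rng = random.Random(seed)
    # Sort first so the draw does not depend on the order the list arrived in.
    return sorted(rng.sample(sorted(project_ids), n))

# Hypothetical inputs for illustration only.
projects = [f"P{i:03d}" for i in range(1, 21)]
fingerprint = list_fingerprint(projects)
selected = draw_sample(projects, n=5, seed=123456)
```

Anyone holding the emailed seed and the fingerprint can rerun the draw and confirm that both the list and the selection match what was reported.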

What technologies have others used?

**Balance tests after stratification:** *I have an experiment in which I have stratified by a variable with five different categories. I want to then test balance, and want to know whether I should do so for these variables that I stratified on? Also, should I include them in my joint orthogonality test?*

If you have stratified on a variable, then it will be balanced by design (subject to odd numbers in strata, and/or discretization of continuous variables). So in my balance tables I separate the variables used in forming strata from those on which the randomization was not stratified. I then show means for treatment and control for the stratified variables, but do not conduct balance tests for them.

Then for testing balance on other covariates, I suggest the following:

- For univariate balance tests, you want to run the same regression that you will be running for outcomes, with the baseline covariate taking the place of the outcome Y. This should mean conditioning on the stratifying variables. E.g. run:

Y = a + b*Treat + c’Strata dummies + e

- For the joint orthogonality test, also condition on strata, so run:

Treat = a + b’X + c’Strata dummies + e

And then test that b = 0 (i.e. that none of the X variables predict assignment to treatment, once you have conditioned on the variables you stratified on).
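As a rough sketch of the univariate test, the coefficient on Treat in a regression with strata dummies can be obtained by demeaning both the covariate and the treatment indicator within strata (the Frisch-Waugh-Lovell result). Note that for the p-value I have swapped in a within-strata permutation test rather than the usual regression t-test, simply to keep the example free of external libraries. The function names, the number of permutations, and the seed are all my own assumptions.

```python
import random
from collections import defaultdict

def demean_within(values, strata):
    """Subtract stratum means; equivalent to partialling out strata dummies."""
    groups = defaultdict(list)
    for v, s in zip(values, strata):
        groups[s].append(v)
    means = {s: sum(vs) / len(vs) for s, vs in groups.items()}
    return [v - means[s] for v, s in zip(values, strata)]

def strata_adjusted_coef(x, treat, strata):
    """Coefficient b in x = a + b*Treat + strata dummies + e."""
    x_t = demean_within(x, strata)
    t_t = demean_within(treat, strata)
    return sum(t * v for t, v in zip(t_t, x_t)) / sum(t * t for t in t_t)

def permutation_pvalue(x, treat, strata, reps=2000, seed=12345):
    """Two-sided p-value from reshuffling treatment labels within each stratum."""
    rng = random.Random(seed)
    observed = abs(strata_adjusted_coef(x, treat, strata))
    index_by_stratum = defaultdict(list)
    for i, s in enumerate(strata):
        index_by_stratum[s].append(i)
    extreme = 0
    for _ in range(reps):
        shuffled = list(treat)
        for idx in index_by_stratum.values():
            labels = [treat[i] for i in idx]
            rng.shuffle(labels)
            for i, lab in zip(idx, labels):
                shuffled[i] = lab
        # Small tolerance guards against floating-point ties.
        if abs(strata_adjusted_coef(x, shuffled, strata)) >= observed - 1e-12:
            extreme += 1
    return (extreme + 1) / (reps + 1)
```

Reshuffling within strata mirrors how the assignment was actually made, so the number treated in each stratum stays fixed across permutations.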

**Treatment heterogeneity by whether or not you know other subjects:** *I have a lab-in-the-field experiment which looks at how people make allocations in trust/investment games under different conditions: (information treatment) playing the game when the identity of the other party is known; and (no information control) playing when the identity of the other party is not known. These games were played at meetings in which some of the players knew one another, and other players did not know anyone else. When we look at the overall treatment effect of the information treatment it is positive, but not significant. But we do find larger and significant positive effects of the information treatment conditional on the experimental subject knowing at least one other person at the meeting. Given that whether a subject knows others was not experimentally assigned, is it still valid to interpret this effect?*

The key question is whether we should think of knowing one other person at the meeting as something predetermined/unchanged by the interventions (like gender, or education), or whether it is something that the games somehow changed. Assuming that whether they know others is measured before they play any games, then it would seem we are in the first category – and then I think it would be appropriate to look at treatment effect heterogeneity by this characteristic – and it makes sense that knowing the identities of the other players won’t make much difference if they are all strangers to you.

I think it would then be useful to try and characterize the two groups more in terms of baseline characteristics as well – so when you talk about this being an effect for the subset of subjects who know at least one other person, you can say a bit more about what types of people this treatment effect is applicable for.
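To make the comparison concrete, here is a small sketch of subgroup treatment effects by a characteristic measured before the games. The variable names and the data are hypothetical. The difference between the two subgroup effects equals the coefficient on the interaction term in a regression of the outcome on treatment, the characteristic, and their product. Only point estimates are computed here; standard errors would still be needed for inference.

```python
from statistics import mean

def diff_in_means(rows):
    """Treated-minus-control difference in mean outcomes."""
    treated = [r["y"] for r in rows if r["treat"] == 1]
    control = [r["y"] for r in rows if r["treat"] == 0]
    return mean(treated) - mean(control)

def effects_by_subgroup(rows, flag="knows_someone"):
    """Treatment effect within each level of a predetermined binary characteristic."""
    effect_yes = diff_in_means([r for r in rows if r[flag] == 1])
    effect_no = diff_in_means([r for r in rows if r[flag] == 0])
    return {
        "effect_knows": effect_yes,
        "effect_strangers": effect_no,
        "difference": effect_yes - effect_no,  # the interaction coefficient
    }

# Hypothetical data: y is the amount allocated in the game.
rows = [
    {"y": 5.0, "treat": 1, "knows_someone": 1},
    {"y": 7.0, "treat": 1, "knows_someone": 1},
    {"y": 3.0, "treat": 0, "knows_someone": 1},
    {"y": 5.0, "treat": 0, "knows_someone": 1},
    {"y": 4.0, "treat": 1, "knows_someone": 0},
    {"y": 4.0, "treat": 1, "knows_someone": 0},
    {"y": 4.0, "treat": 0, "knows_someone": 0},
    {"y": 4.0, "treat": 0, "knows_someone": 0},
]
results = effects_by_subgroup(rows)
```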

**Multiple hypothesis testing in difference-in-differences or with treatment effect trajectories:** *I recently read your incredibly helpful blog post on different methods for multiple hypothesis testing and how to implement them in Stata. I have a follow-on question. I’m currently conducting a study with a fairly large number of outcomes (20) in a difference-in-differences event study framework with staggered treatment adoption. So for each outcome, I run a regression with 4 “leads” and 4 “lags” corresponding to the pre & post effects of my treatment respectively (in addition to year and unit fixed effects, plus time-varying covariates). When you say that a particular method for correcting multiple hypothesis testing can handle multiple “treatments”, could I expand that to consider multiple treatment effect coefficients for a single treatment? For example, I am drawn to the Anderson sharpened q-values since I’d like to use reghdfe and it can handle multiple treatments. So I’m wondering if I can think of “treatments”, as you’re using the term, as the various leads/lags in an event study regression.*

I am glad this question is being asked, and it also applies to experiments in which researchers are interested in the trajectory of treatment impacts, and so might show treatment effects at 1, 2, 3, 4, and 5 years. Here are my thoughts, while acknowledging that I can’t think of good examples to point to that directly address the difference-in-differences question.

1) The standard way to deal with this in the difference-in-differences setting for a single outcome would be an F-test of joint significance. For example, you can test that all the pre-intervention effects are jointly zero, and then test that the post-treatment effects are jointly zero. So I think you could then take the 20 p-values you get from these F-tests (one for each outcome), and then apply the Anderson sharpened q-value adjustment to these. I think it might make sense to do this separately for the test of leads (i.e. for testing there is no treatment effect before the intervention), and then for the lags (for trying to detect a treatment effect after the intervention).

2) Conversely, I do not think you would want to use the Anderson sharpened q-value approach to adjust the p-values on individual tests of the lags and leads (e.g. the coefficient on the t-2 lag for outcome 3). The reason is that the different lags and leads are likely to be correlated with one another, and could even be negatively correlated in some cases (due to mean reversion for example). So I would be cautious about taking the full set of 160 coefficients (20 outcomes * 8 coefficients) and applying sharpened q-values to that. FWER correction approaches like Romano-Wolf or Westfall-Young do account for such correlation, but their power when applied to 160 coefficients may be relatively low, and so trying to adjust for testing all leads and lags and outcomes at once might kill all your p-values.

3) Since we might also be interested in the question “did the treatment have an effect one year after the intervention”, one could also make an argument for just doing multiple testing corrections across the outcomes one lead or lag at a time, calling each lead or lag a different outcome family. The Romano-Wolf or Westfall-Young FWER approaches could then be applied within each family.
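Option 1 above can be sketched as follows. This is my understanding of the two-stage Benjamini-Krieger-Yekutieli procedure that underlies the sharpened q-values, implemented in Python as a grid search over q in steps of 0.001. The step size, the function name, and the p-values are all assumptions for illustration (only five hypothetical outcomes are shown rather than 20). The correction is applied separately to the joint F-test p-values for the leads and for the lags.

```python
def sharpened_qvalues(pvals, step=0.001):
    """Two-stage sharpened q-values, found by stepping q down from 1 to `step`."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    sorted_p = [pvals[i] for i in order]
    q_sorted = [1.0] * m

    def largest_rejected_rank(level):
        # Largest rank r with p_(r) <= level * r / m, or 0 if there is none.
        hits = [r for r in range(1, m + 1) if sorted_p[r - 1] <= level * r / m]
        return max(hits, default=0)

    for k in range(round(1 / step), 0, -1):
        q = k * step
        q_adj = q / (1 + q)                      # first-stage level q' = q / (1 + q)
        r1 = largest_rejected_rank(q_adj)
        if r1 == m:
            r2 = m                               # everything already rejected
        else:
            # Second stage rescales by m / (estimated number of true nulls).
            r2 = largest_rejected_rank(q_adj * m / (m - r1))
        for r in range(r2):
            q_sorted[r] = q                      # smallest q at which rank r+1 is rejected so far

    qvals = [None] * m
    for rank, i in enumerate(order):
        qvals[i] = q_sorted[rank]
    return qvals

# Hypothetical joint F-test p-values, one per outcome.
lead_pvals = [0.62, 0.35, 0.81, 0.12, 0.47]
lag_pvals = [0.001, 0.04, 0.20, 0.008, 0.55]
lead_q = sharpened_qvalues(lead_pvals)
lag_q = sharpened_qvalues(lag_pvals)
```

The q-values are returned in the same order as the input p-values, so they can be matched back to outcomes directly.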

Does anyone have good examples of difference-in-differences papers that have dealt well with multiple testing across both outcomes and different leads and lags?
