coauthored with *Alaka Holla*

As we argued last week, we need more results that tell us what works and what does not for economically empowering women. And a first step would be for people who are running evaluations out there to run a regression that interacts gender with treatment. Now some of these will show no significant differences by sex. Does that mean that the program did not affect men and women differently? No. Alas, all zeroes are not created equal.

Some of these interactions will not be statistically significant (i.e. different from zero) simply because there aren’t enough observations to detect meaningful differences. Say you find a difference between men and women of 30 percentage points, with a standard error of 20 percentage points. Would you be comfortable saying there is no difference? Statistically, this is not different from zero at a standard level of confidence. At this point, though, you may want to ask: what was the smallest difference I could have detected, given the variance in the data? The answer to this question is pretty easy to calculate (with just standard errors): this paper by Don Andrews (ungated version here) shows you how. If the smallest detectable difference is quite high (i.e. a difference of that size would have been worthy of publication had it been statistically significant), then this is a zero that doesn’t tell us much.
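To make the 30-and-20 example concrete, here is a minimal sketch of the back-of-the-envelope version of this calculation. It is not Andrews's exact procedure; it uses the common normal approximation in which the smallest detectable difference is the standard error times the sum of two critical values. The function name, the 5 percent two-sided test, and the 80 percent power target are our assumptions for illustration.

```python
from statistics import NormalDist


def minimum_detectable_difference(std_error, alpha=0.05, power=0.80):
    """Smallest true difference a two-sided test would detect with the given power.

    Assumption: the estimate is approximately normal, so the answer is
    (z_{1 - alpha/2} + z_{power}) * standard error, roughly 2.8 * SE
    at the default settings.
    """
    z = NormalDist()  # standard normal
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = z.inv_cdf(power)          # extra margin needed to reach the power target
    return (z_alpha + z_power) * std_error


# Numbers from the example in the text, in percentage points.
estimate = 30.0
std_error = 20.0

t_stat = estimate / std_error  # 1.5, below the usual 1.96 cutoff
mdd = minimum_detectable_difference(std_error)

print(f"t-statistic: {t_stat:.2f}")
print(f"smallest detectable difference: {mdd:.1f} percentage points")
```

Under these assumptions the smallest detectable difference comes out at about 56 percentage points, so the insignificant 30-point gap is the kind of zero that doesn't tell us much.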

But when you are powered to detect a reasonable effect size, then the zero is quite interesting. This is basically telling us that the intervention benefits men and women about equally (or at least that any gap is smaller than the difference you were powered to detect), and this is something policymakers should know (if they care about gender). And we don’t see enough of these types of zero-results in published or (to a lesser extent) working papers. Why? Clearly some papers don’t meet the bar in terms of statistical power, so they’re out. But there are surely others where there actually is sufficient power, so why don’t we see more of this reported? (This is an optionally rhetorical question.)

Given these issues with statistical power, does this mean we should plan in advance so that we can adequately capture gender-differentiated effects? The first step would be, in the design stage, to do power calculations where you see what kind of sample size you would need to say something about gender differences in treatment impact. Unless you expect this program to have monstrously large differences between men and women or you have access to a significant source of additional funding, you will be depressed by what you see. Assuming a whole bunch of pretty routine things that are standard in power calculations, a rough rule of thumb is that to find a 50 percent or more difference in treatment impacts across men and women you need 4 times the sample size you would need if you were only interested in average treatment effects. That’s a lot more people, and that’s a significant chunk of change.
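Where does a factor of four come from? Here is a rough sketch of one textbook setup in which it shows up. The assumptions are ours, not spelled out in the post: the sample is split equally between men and women, half of each group is treated, outcomes have the same variance everywhere, and observations are independent (no clustering). The function name and the 0.2 standard deviation effect size are also just for illustration.

```python
from math import ceil
from statistics import NormalDist


def required_sample_size(effect, sigma=1.0, alpha=0.05, power=0.80,
                         gender_gap=False):
    """Total N needed to detect `effect` (in outcome units) with the given power.

    Assumptions (ours): 50/50 treatment split, 50/50 men/women,
    common outcome s.d. `sigma`, independent observations.
      Var(average treatment effect) = 4 * sigma^2 / N
      Var(male-female gap in effect) = 16 * sigma^2 / N
    The gap is a difference of two subgroup effects, each estimated on N/2.
    """
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    variance_factor = 16.0 if gender_gap else 4.0
    return ceil(variance_factor * (multiplier * sigma / effect) ** 2)


# Average effect of 0.2 standard deviations.
n_average = required_sample_size(0.2)

# A male-female gap as large as that average effect.
n_gap_same_size = required_sample_size(0.2, gender_gap=True)

# A male-female gap half as large as the average effect.
n_gap_half_size = required_sample_size(0.1, gender_gap=True)

print("average effect:", n_average)
print("gap equal to average effect:", n_gap_same_size,
      f"({n_gap_same_size / n_average:.1f}x)")
print("gap half the average effect:", n_gap_half_size,
      f"({n_gap_half_size / n_average:.1f}x)")
```

In this setup, a gap as large as the average effect needs about four times the sample, and a gap half that size needs about sixteen times. The multiple climbs quickly as the difference you care about shrinks, which is why the power calculation is depressing.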

So what do we do? If the data are already being collected on a wide scale (e.g. national surveys), you’re covered. However, in other cases, the increased sample will require interviewing more people (not to mention possibly offering the intervention to more people). And for these, we need a policy discussion on what questions are worth this extra investment. But given that we’re talking about throwing a lot of money at understanding gender differences, we could also use that money to understand the effects of interventions which are explicitly targeted at addressing things that cause gender inequality. Both types of investment make sense – for example, we would want to know the gender-differentiated impacts of a national labor market policy, but we might also want to know whether an intervention that aims to get women to shift into more productive sectors actually works. And this second type of intervention is what we’ll talk about next week.