Gender bias and getting grants


A little while back, I blogged about a paper that traced the effects of having a gendered language through to the labor market outcomes of today. Today, I am writing about a much narrower version of this problem – and one near and dear to researchers' hearts: grant applications. A fascinating new paper by Julian Kolev, Yuly Fuentes-Medel, and Fiona Murray looks at how we can still get gendered outcomes, even with a blind review process.
Let’s start with the data. Kolev and co. have access to the Grand Challenges Explorations program at the Bill and Melinda Gates Foundation and the 6,794 applications from US-based researchers from 2008-17, as well as the scores by reviewers and the funding decisions. Now, the Gates Foundation uses a particular process that is somewhat different from many other options for these life sciences researchers. First, reviews are blind – the reviewers don’t know whose grant application they are reading. Second, reviewers are from a range of fields and not always the narrow sub-field of the application. Third, all scoring is done independently – there is no conference over divergent scores. And fourth, the scoring system is not a continuous measure, but rather allows reviewers to award one gold (per pool of about 100 applications) and five silvers. So a somewhat coarse measure. Kudos clearly go to the Gates Foundation for making these data accessible and putting all these details out there for the researchers to use.
Kolev and co. start off with the simple fact that most of the proposals – 66 percent – are submitted by men. Beyond that, are women less likely to get funded in this blind review process? The answer is yes, by about 16 percent. But why? The process is blind, and the reviewers can’t know these are applications by female scientists.
Could it be the subject? Nope – this result holds controlling for topic-area fixed effects (within, of course, the broad category of life sciences). Maybe it’s the fact that most of the reviewers are men? Nope – the result holds when controlling for the gender of the reviewer and, interestingly, the interaction of female reviewer and female applicant is positive (but the aggregate effect isn’t significant since there aren’t that many female reviewers – more on this below).
OK, maybe it’s experience or publications? Kolev and co. look at the length of career (measured from first publication) and indeed, the women in the sample are less senior than the men. They also have fewer publications (even controlling for career length) but not a lower share of top journal publications (when controlling for career length). So Kolev and co. throw these variables into the regression looking at the probability of getting funded. And lo, publications do help you get funded (especially those in top journals). And gender? Still significant and negative, even when controlling for all the measures of career length and publications. So it doesn’t appear to be publications or experience.
Maybe it’s persistence? It turns out that repeat applications tend to score higher (this article is full of useful, data-driven tips on how to increase one’s odds of getting funding). But no: women do reapply less, but this appears to be driven by experience, and indeed, when Kolev and co. put it into the regression, the gender effect stays negative and significant.
Maybe it’s how the proposals are written?   Kolev and co. take to textual analysis and look at the words used by men and women. These turn out to be highly correlated but there are some differences. To try and get a handle on this, Kolev and co. single out “’narrow’ words (those which appear significantly more often in some topics than others) and ‘broad’ words (which appear at similar rates in all topic areas).”  Female proposal authors tend to use more of these “narrow” words, while male proposals tend to have more of the “broad” ones.  (It’s important to note that these are relative concepts of narrow and broad, driven by the proposals, not some broader notion).  
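To make the narrow/broad idea concrete, here is a minimal sketch of one way to sort words by how unevenly they show up across topic areas. This is not the authors' procedure: the paper relies on statistical significance, whereas this sketch uses a simple gap in usage rates with an arbitrary cutoff. The function names, the `spread_threshold` parameter, and the toy proposals are all made up for illustration.

```python
from collections import Counter, defaultdict

def word_rates_by_topic(proposals):
    """Share of each topic's proposals that contain each word.

    proposals: list of (topic, text) pairs (hypothetical input format).
    Returns {word: {topic: share}}.
    """
    topic_counts = Counter()
    doc_freq = defaultdict(Counter)
    for topic, text in proposals:
        topic_counts[topic] += 1
        for word in set(text.lower().split()):
            doc_freq[word][topic] += 1
    return {
        word: {t: doc_freq[word][t] / n for t, n in topic_counts.items()}
        for word in doc_freq
    }

def classify(rates, spread_threshold=0.75):
    """Label a word 'narrow' if its usage rate varies a lot across topics.

    The threshold is an illustrative assumption, standing in for the
    significance test a real analysis would use.
    """
    labels = {}
    for word, by_topic in rates.items():
        spread = max(by_topic.values()) - min(by_topic.values())
        labels[word] = "narrow" if spread >= spread_threshold else "broad"
    return labels

# Toy proposals, invented for this example only.
toy = [
    ("nutrition", "maize fortification novel approach"),
    ("nutrition", "maize yield approach"),
    ("sanitation", "latrine design novel approach"),
    ("sanitation", "latrine sensor approach"),
]
labels = classify(word_rates_by_topic(toy))
```

In this toy set, "maize" appears in every nutrition proposal and no sanitation proposal, so it comes out as narrow, while "approach" and "novel" appear at the same rate in both topics and come out as broad. Note that the labels depend entirely on the pool of proposals, which is the sense in which these are relative concepts.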
Kolev and co. then take words associated with higher or lower scores and use these to predict scores using calibration from male proposals only. Both broad and narrow words seem to make a dent in the gender coefficient, but narrow words are the ones that reduce the gender effect to insignificance (and to less than half of the base case). So a large chunk of the gender disparity in scores appears to be driven by word choice. Interestingly, the inclusion of the word choice variables (as well as a host of other grammatical controls) doesn’t budge the positive effect of female reviewers for female applicants – which suggests that female reviewers are forming their scores based on other factors.
Kolev and co. go on to look at the impacts of getting funding on future outcomes. They do this with a difference-in-differences among the proposals that received at least one positive review (silver or gold, as described above), comparing the funded with the non-funded proposals. It’s basically a crude regression discontinuity. Among this sample, funding on its own doesn’t do statistically significant wondrous things for publications – very little is significant. However, for women it makes a big difference, with the combined effect of female and getting funded offsetting the negative female coefficient on variables such as top journal articles and future large NIH funding (with the latter more than offset). As Kolev and co. put it, this funding seems to “level the playing field” for women.
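For readers who want the mechanics, a bare-bones difference-in-differences boils down to comparing the before/after change for funded applicants with the before/after change for unfunded ones. The sketch below shows just that arithmetic. The field names and the numbers are invented for illustration, and the paper's actual specification is a regression with controls and a female interaction term, which this does not attempt.

```python
from statistics import mean

def diff_in_diff(rows):
    """Average change for funded applicants minus average change for unfunded.

    rows: dicts with hypothetical keys 'funded' (bool) and 'pre' / 'post'
    (an outcome, e.g. publication counts, before and after the decision).
    """
    def avg_change(group):
        return mean(r["post"] - r["pre"] for r in group)

    funded = [r for r in rows if r["funded"]]
    unfunded = [r for r in rows if not r["funded"]]
    return avg_change(funded) - avg_change(unfunded)

# Toy data, invented for this example only.
toy = [
    {"funded": True, "pre": 2, "post": 5},
    {"funded": True, "pre": 3, "post": 4},
    {"funded": False, "pre": 2, "post": 3},
    {"funded": False, "pre": 4, "post": 5},
]
effect = diff_in_diff(toy)  # funded change (2) minus unfunded change (1)
```

Restricting both groups to proposals with at least one positive review, as the authors do, is what makes the unfunded group a more plausible comparison.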
All in all, this is an interesting look behind the scenes at grant funding and how even a blind review process clearly still has some biases. It also raises a host of questions: What if more reviewers were women? What if this were a set of fields where more applicants were women? Or where the majority were women? What if there were a different scoring process? What if more funding programs made their data available so we could answer these?
(Full disclosure: the team I work with holds two grants from the Gates Foundation. Neither of them was in the life sciences. The writing of one was led by a woman. The other by a man. And next time we will use a higher share of verbs.)


Markus Goldstein

Lead Economist, Africa Gender Innovation Lab and Chief Economists Office

A. Tasso
May 01, 2019

The data are accessible? Tried looking in the paper and online and can't seem to find them.

Eszter Czibor
May 01, 2019

Fascinating study indeed, and it tells a compelling story; however, there are a few potential issues that are worth mentioning.
First, the practice of exploring the drivers of a gap by adding covariates progressively and checking whether their inclusion reduces the estimated gap is widespread (I have certainly done it myself...), but the conclusions from this method are sensitive to the order in which these covariates are added (see e.g. Gelbach, 2014). It would thus be nice to see whether the paper's conclusions are robust to a) the sequence of adding covariates and b) using different decomposition methods.
Second, I find that the conclusions from certain empirical findings are up for debate; e.g. the fact that repeat applicants tend to receive higher scores does not necessarily imply that people *should* apply again after a rejection ("Our results therefore serve as a reminder of the value of persistence in the face of rejection; in light of the results in Panel A, this is advice that is particularly important for female researchers and innovators at the early stages of their career"): given that the underlying proposal quality is unobserved, this association could simply reflect the selection process into repeat applications (i.e. people only apply again if they have a bulletproof second idea/proposal). Similarly, the fact that getting funded is followed by better career outcomes for women but not for men does not necessarily mean that women *benefit* more from the funding, but could be a sign that the bar for funding was higher for female-led applications, so the process simply selects exceptional women. (Relatedly, it would be interesting to know whether, conditional on their ratings, women-led applications are still less likely to be funded.)
Third, there is a worry that the broad/narrow classification picks up differences between proposals that are more related to content and topic than to writing style (e.g. "community" "contraceptive" and "health" are narrow, while "bacteria" "therapy" and "device" are broad). I am not arguing with the validity of the particular classification, but I wonder whether they pick up differences in research questions not controlled for by the ten main topic areas the authors include in their analysis. As such, the conclusion that "there is significant scope for female applicants to improve their scores by altering the words they use to describe their proposals" might be premature. This is a concern that I have seen many people mention in online conversations about the paper, so rather than elaborate on their point further, I refer to the broader challenge of trying to uncover the underlying data generating process in algorithmic prediction exercises (very illuminating discussion on p.96 here:
Finally, to ensure a homogeneous pool of applicants, the authors restrict their sample to proposals submitted by applicants who have both an academic or non-profit research affiliation and a US contact address. While this choice is understandable, it does reduce their analysis sample from 17,311 proposals to just 6,794. It would be great to see sensitivity tests that also include *all* US-based applicants, or academics/non-profit researchers from *all* countries where English is an official language.
These points notwithstanding, it is a nice study on a really important research question, and I am grateful to the authors for bringing attention to this topic!

Derek Wallace
December 16, 2019

I agree