How does one establish this? To start with, we need a way of measuring the degree of gender homophily or heterophily in any field. I’m convinced that the right way to do this is with the coefficient of homophily, defined as follows. Let p be the probability that a randomly chosen co-author of a randomly chosen man author is also a man, and let q be the probability that a randomly chosen co-author of a randomly chosen woman author is a man. The coefficient of homophily alpha is the difference between these two quantities: alpha = p − q.
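This definition translates directly into code. A minimal sketch, assuming each paper is given as a list of author genders and that each authorship is equally likely to be sampled (the co-author is then drawn uniformly from that authorship's co-authors):

```python
def coefficient_of_homophily(papers):
    """Estimate alpha = p - q, where p is the chance that a random
    co-author of a randomly chosen man authorship is a man, and q is
    the same quantity for a randomly chosen woman authorship."""
    p_terms, q_terms = [], []
    for paper in papers:
        for i, author in enumerate(paper):
            coauthors = paper[:i] + paper[i + 1:]
            if not coauthors:
                continue  # solo papers have no co-author to sample
            frac_men = coauthors.count('M') / len(coauthors)
            (p_terms if author == 'M' else q_terms).append(frac_men)
    p = sum(p_terms) / len(p_terms)
    q = sum(q_terms) / len(q_terms)
    return p - q

# Perfectly assortative toy data gives the maximum value
print(coefficient_of_homophily([['M', 'M'], ['W', 'W']]))   # 1.0
```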

It turns out that alpha has some nice properties. First, it is equal to the Pearson correlation coefficient between the genders of authors on a paper. Second, for two-author papers it is equal to Sewell Wright’s correlation coefficient F, if we think of a paper as an individual and an author as a locus. All of this is written up formally in a short note that I just posted.
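The first of those equivalences is easy to check numerically. A quick sketch on made-up two-author papers, with genders coded 1 = man, 0 = woman, computing Pearson's r over the ordered (author, co-author) pairs:

```python
from statistics import mean

def alpha_and_pearson(papers):
    """For two-author papers (1 = man, 0 = woman), compare alpha = p - q
    with the Pearson correlation over ordered (author, co-author) pairs."""
    xs, ys = [], []
    for a, b in papers:
        xs += [a, b]     # each authorship viewed as the focal author...
        ys += [b, a]     # ...paired with its co-author
    p = mean(y for x, y in zip(xs, ys) if x == 1)
    q = mean(y for x, y in zip(xs, ys) if x == 0)
    mx = mean(xs)
    cov = mean((x - mx) * (y - mx) for x, y in zip(xs, ys))
    var = mean((x - mx) ** 2 for x in xs)   # xs and ys have identical margins
    return p - q, cov / var

# The two numbers agree on any such dataset
print(alpha_and_pearson([(1, 1), (1, 0), (0, 0), (1, 1)]))
```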

Continuing that population genetics metaphor a little, the stratification by gender into fields, subfields, etc. generates a <a href="https://en.wikipedia.org/wiki/Wahlund_effect">Wahlund effect</a>, namely, an apparent shortage of heterozygotes, i.e., mixed-gender author pairs. What we would like to know is to what degree authors within the same small subfield assort by gender, and to what degree the apparent homophily is due to differences in the gender composition of fields. This is equivalent, in our population genetics metaphor, to decomposing the coefficient of inbreeding into its components, Wright’s F_IS and F_ST.
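The Wahlund effect shows up in alpha itself. Under a stylized model (two-author papers, authors pairing at random within each subfield, genders i.i.d. with subfield-specific man-fractions m), a short calculation gives alpha for the pooled corpus as Var(m) / (mu(1 − mu)), where mu is the pooled man-fraction — which is Wright's F_ST. A sketch under those assumptions:

```python
def pooled_alpha(subfields):
    """alpha for a pool of subfields that each mix randomly internally.
    subfields: (weight, man_fraction) pairs, weight = share of authorships.
    Under these assumptions alpha = Var(m) / (mu * (1 - mu)), i.e. F_ST."""
    total = sum(w for w, _ in subfields)
    mu = sum(w * m for w, m in subfields) / total        # pooled man fraction
    ex2 = sum(w * m * m for w, m in subfields) / total   # E[m^2]
    return (ex2 - mu * mu) / (mu * (1 - mu))

# Two internally random-mixing subfields still show apparent homophily
print(round(pooled_alpha([(1, 0.8), (1, 0.2)]), 6))   # 0.36
# A single well-mixed field shows none
print(round(pooled_alpha([(1, 0.5)]), 6))             # 0.0
```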

To do this, of course, you need to be able to assign papers to disciplines, fields, subfields, etc. We did this on the JSTOR corpus using the hierarchical map equation; that provides the hierarchy of fields that you see at our website. If gender homophily is mostly due to differences in gender composition across disciplines, we would expect lower coefficients of homophily in small subfields. The graph linked here (I don’t think I can post it in-line) shows our results. Indeed, small subfields have low homophily (they actually appear to demonstrate heterophily), whereas large fields have high homophily.

Looks good, right? Unfortunately it’s not so easy, because even under random mixing of authors the test statistic alpha is not independent of the size of the field. The problem is essentially that authors don’t get to co-author with themselves. Consider a tiny field with four authors, two men and two women, and one paper for each of the six two-author combinations. Now if you pick a man author at random, you are twice as likely to pick the man in one of the four man-woman papers as you are to pick a man in the man-man paper. Therefore p = 1/3. But if you pick a randomly chosen woman, again you’re more likely to pick a man-woman combination than the one woman-woman combination, and q = 2/3. As a result, alpha = −1/3 even though the authorships seem to be distributed without gender bias.
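That toy field can be checked directly; a quick sketch:

```python
# All six two-author papers from a field of two men and two women
papers = [('M', 'M'), ('M', 'W'), ('M', 'W'),
          ('M', 'W'), ('M', 'W'), ('W', 'W')]

def cofrac(gender):
    """Among authorships of this gender, fraction whose co-author is a man."""
    views = [(a, b) for x, y in papers
             for a, b in ((x, y), (y, x)) if a == gender]
    return sum(b == 'M' for _, b in views) / len(views)

p, q = cofrac('M'), cofrac('W')
print(p, q, p - q)   # 0.333..., 0.666..., alpha = -1/3 despite unbiased mixing
```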

Because of this, it’s very hard to know how much of the pattern in our figure comes from this size-sensitivity of the test statistic and how much comes from the fact that we are filtering out the effects of different gender compositions across fields as we move toward small subfield sizes. (There are other problems with this graph as well, not the least of which is that the data points are not independent since the big fields are composites of the smaller subfields).

And that’s more or less where we stand with the problem. We’re now working with colleagues at UW on a statistical approach to distinguishing between deliberate assortment by gender within a subfield and structural assortment due to differences in gender across subfields, but this turns out to be really tricky. Hopefully we’ll get that to work shortly, and I’m happy to share as soon as we do.

This is a pretty reasonable assumption, and it doesn’t even need to operate at the “PI” level you’ve proposed in order to work. For example, imagine sampling each paper from an urn labeled “structural biology,” another labeled “ecology,” another labeled “molecular biology,” another labeled “metagenomics,” and so on. Collectively, those subfields will exhibit fairly significant variation in gender balance, and pooling samples from such a heterogeneous dataset could very naturally produce the imbalance you observe.
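That urn story is easy to simulate. The woman-fractions and group size below are made up purely for illustration; the point is that pooling homogeneous draws from heterogeneous urns inflates the variance beyond a single binomial:

```python
import random

random.seed(0)
urns = [0.25, 0.45, 0.40, 0.35]   # made-up woman-fractions for a few subfield urns
n = 10                            # trainees sampled per lab

def sample_lab():
    w = random.choice(urns)       # the whole lab draws from one urn
    return sum(random.random() < w for _ in range(n))

counts = [sample_lab() for _ in range(20000)]
mean_c = sum(counts) / len(counts)
var_c = sum((c - mean_c) ** 2 for c in counts) / len(counts)
p_hat = mean_c / n
# Pooled variance exceeds the binomial variance at the pooled proportion
print(var_c, n * p_hat * (1 - p_hat))
```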

Basically, if you model this as a beta-binomial sampling problem, the over-dispersion you observe is simply the consequence of compounding a binomial distribution with a proportion parameter that is itself beta-distributed. The question is why different samples have different proportions. One answer is what you propose: the PI/trainee match is biased. Another is that gender balance varies among the fields sampled in your dataset. These aren’t mutually exclusive, of course. I’m fairly confident that gender balance varies by field, because I’ve seen evidence of this in the past. I don’t know whether it varies by PI, though it wouldn’t surprise me. That does, however, seem to be a harder problem to address than field-to-field variation in gender balance.
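A sketch of that compounding, with an illustrative Beta(4, 6) spread of proportions across samples (the parameters are assumptions, not anything fitted to data). The simulated variance should track the beta-binomial formula and exceed the plain binomial variance:

```python
import random

random.seed(1)
n, a, b = 10, 4, 6   # group size, and an illustrative Beta(4, 6) across samples

def beta_binomial_draw():
    w = random.betavariate(a, b)   # each sample gets its own proportion...
    return sum(random.random() < w for _ in range(n))   # ...then draws binomially

counts = [beta_binomial_draw() for _ in range(20000)]
m = sum(counts) / len(counts)
v = sum((c - m) ** 2 for c in counts) / len(counts)

p = a / (a + b)
rho = 1 / (a + b + 1)   # within-sample correlation induced by the shared proportion
print(v)                                        # simulated variance
print(n * p * (1 - p) * (1 + (n - 1) * rho))    # beta-binomial variance, ~4.36
print(n * p * (1 - p))                          # plain binomial variance n*p*(1-p)
```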

*Obviously, first-author/last-author differences complicate this simple model.

The cohort effect might also partly explain the non-random gender associations. Older papers are more likely to have been written entirely by men; newer papers are more likely to include women authors. It’s the same deal as the field effect you mention. My initial guess would be that the two together (date and field) would explain a lot, but not all, of the assortment.

For gender inference, we used a web API with really robust data on author names, including per-name probabilities and confidence measures. The owner gave us API calls for free but asked that we not release the specific data. If you’d like to send me your data, I’d be happy to run it against the database we have (though our numbers were quite similar across multiple methods).
