Exploring the relationship between gender and author order and composition in NIH-funded research

Last week there was a brief but interesting conversation on Twitter about the practice of “co-first” authors on scientific papers that led me to do some research on the relationship between author order and gender using data from the NIH’s Public Access Policy.

I want to note at the outset that this is my first foray into analyzing this kind of data, so I would love feedback on the data, analyses and findings, especially links to other work on the subject, as I know some of these issues have been addressed elsewhere.

A long post follows, but here are some main things I found:

  • The number of female authors falls off as you go down the list of authors of a paper, with fewer than 30% of senior authors female.
  • Contrary to my expectation, there doesn’t seem to be a bias to put the male author first when there are male-female co-first author pairs.
  • There are, however, far fewer male-female co-first author pairs than there should be based on the number of male and female first and second authors.
  • The same thing holds true more generally for first-second author pairs. There is a deficit of cross gender pairs and a surplus of same gender pairs.
  • Part (and maybe most) of this effect is due to an overall skew in gender composition of authors on papers.
  • If you are female, there is a 45% chance that a random co-author on one of your papers is female. If you are male, there is only a 35% chance that a random co-author on one of your papers is female.

Before I explain how I got all this, let me start with a quick explainer about how to parse the list of authors on a scientific paper.

By convention in many scientific disciplines (including biology, which this post is about), the first position on the author list of a paper goes to the person who was most responsible for doing the work it describes (typically a graduate student or postdoc) and the last position to the person who supervised the project (typically the person in whose lab the work was done). If there are more than two authors an effort is made to order them in rough relationship to their contributions from the front, and degree of supervision from the back.

Of course a single linear ordering cannot do justice to the complexity of contribution to a scientific work, especially in an era of increasingly collaborative research. One can imagine many better systems. But, unfortunately, author order is currently the only way that the relative contributions of different people to a work are formally recorded. And when a scientist’s CV is being scrutinized for jobs, grants, promotions, etc… where they are in the author order matters A LOT – you only really get full credit if you are first or last.

Because of the disproportionate weight placed on the ends of the author list, these positions are particularly coveted, and discussions within and between labs about who should go where, while sometimes amicable, are often difficult and contentious.

In recent years it has become increasingly common for scientists to try and acknowledge ambiguity and resolve conflicts in author order by declaring that two or more authors should be treated as “co-first authors” who contributed equally to the work, marking them all with a * to designate this special status.

But, as the discussion on Twitter pointed out, this is a bit of a ruse. First is still first, even if it’s first among equals (the most obvious manifestation of this is that people consider it to be dishonest to list yourself first on the author list on your CV if you were second with a * on the original paper).

Anyway, during this discussion I began to wonder about how the various power dynamics at play in academia played out in the ordering of co-equal authors. And it seemed like an interesting opportunity to actually see these power dynamics at play, since the * designation indicates that the contributions of the *’d authors were similar and therefore any non-randomness in the ordering of *’d authors with respect to gender, race, nationality or other factors likely reflects biases or power asymmetries.

I’m interested in all of these questions, but the one that seemed most accessible was to look at the role of gender. There are probably many ways to do this, but I decided to use data from PubMed Central (PMC), the NIH’s archive of full-text scientific papers. Papers in PMC are available in a fairly robust XML format that has several advantages over other publicly available databases: 1) full names of authors are generally provided, making it possible to infer many of their genders with a reasonable degree of accuracy, and 2) co-first authorship is listed in the file in a structured manner.

I downloaded two sets of papers from PMC: 1,355,350 papers in their “open access” (OA) subset, which contains papers from publishers like PLOS that allow the free text to be redistributed and reused, and 424,063 papers from the “author manuscript” (AM) subset, which contains papers submitted as part of the NIH’s Public Access Policy. These papers are all available here.

I then wrote some custom Python scripts to parse the XML, extracting from each paper the author order, the authors’ given names and whether or not they were listed as “co-first” or “equal” authors (this turned out to be a bit trickier than it should have been, since the encoding of this information is not consistent). I will comment up the code and post it here ASAP.
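Until that code is posted, here is a minimal sketch of what this kind of parsing might look like. It is not the script used for this post. It assumes the simplest case, where equal contribution is marked with an `equal-contrib="yes"` attribute on each `contrib` element; as noted above, real PMC files encode this inconsistently (footnotes, for example), and this sketch does not handle those cases. The sample record and the function name `parse_authors` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed JATS-style record. Real PMC files are far
# larger and do not always mark equal contribution with this attribute.
SAMPLE_XML = """<article>
  <front>
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author" equal-contrib="yes">
          <name><surname>Doe</surname><given-names>Jane A.</given-names></name>
        </contrib>
        <contrib contrib-type="author" equal-contrib="yes">
          <name><surname>Roe</surname><given-names>John</given-names></name>
        </contrib>
        <contrib contrib-type="author">
          <name><surname>Poe</surname><given-names>Alex</given-names></name>
        </contrib>
      </contrib-group>
    </article-meta>
  </front>
</article>"""


def parse_authors(xml_text):
    """Return (position, given_names, equal_flag) tuples in author order."""
    root = ET.fromstring(xml_text)
    authors = []
    # iter() walks the tree in document order, which preserves author order.
    for contrib in root.iter("contrib"):
        if contrib.get("contrib-type") != "author":
            continue  # skip editors and other contributor types
        given = contrib.findtext("name/given-names", default="")
        equal = contrib.get("equal-contrib") == "yes"
        authors.append((len(authors) + 1, given.strip(), equal))
    return authors


for position, given_names, equal in parse_authors(SAMPLE_XML):
    print(position, given_names, equal)
```

A fuller version would also need to look for equal-contribution footnotes and handle authors listed without a structured name.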

I looked at several options for inferring an author’s gender from their given name, recognizing that this is a non-trivial challenge, with many potential pitfalls. I found that a program called genderReader, recommended by Philip Cohen, worked very well. It’s a bit out of date, but names don’t change that quickly, so I decided to use it for my analyses.

I parsed all the files (a bit of a slow process even on a fast computer) and started to look at the data. I’m going to focus on the AM subset here first, because these are all NIH funded papers and thus mostly from the US, so intercountry differences in authorship practices won’t confound the analyses, and because the set is likely more representative of the universe of papers as a whole than is the OA subset. I will try to note where these two datasets agree and disagree.

Of the 424,063 papers in AM, there are 2,568,858 total authors with a maximum of 496 and a wide distribution.

Author Number Histogram

There are 219,559 unique given names (including first name + middle initials), of which about 75% were classified successfully by genderReader as male, mostly male, female, mostly female or unisex. About 25% were not in their database. For the purpose of these analyses, I treated mostly male as male and mostly female as female. I’m sure there are some errors in this process, but I looked over a reasonable subset of the calls and the only clear bias I saw was that it didn’t do a good job of classifying Asian names – treating most of them as unisex, and thereby excluding them from my analysis. All together there were 1,206,616 male authors, 737,424 female authors and 624,818 who weren’t classified. Of the authors who were classified, 62% were male.

Of the 424K papers, 32,304 contained co-equal authors, and 28,184 contained two or more co-first authors (assessed by asking if the co-equal authors were at the beginning of the author list). Of these, 85% (24,087) had exactly two co-first authors and 12% (3,285) had three co-first authors (one had 20 co-first authors, which I’m just going to leave here for discussion). I decided to use only those with exactly two co-first authors for the next set of analyses.
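The "co-equal authors at the beginning of the author list" check can be sketched roughly as follows. This is one plausible reading of the rule, not the code used for the counts above: it takes a list of per-author equal-contribution flags in author order (a hypothetical input format) and counts the unbroken run of flagged authors starting at position one.

```python
def count_co_first(equal_flags):
    """Count the leading run of equal-contribution authors.

    equal_flags: list of booleans in author order (True = marked as equal).
    Returns 0 when fewer than two flagged authors lead the list, since a
    single flagged author at the front is not a co-first group.
    """
    n = 0
    for flag in equal_flags:
        if not flag:
            break
        n += 1
    return n if n >= 2 else 0


print(count_co_first([True, True, False, False]))  # two co-first authors
print(count_co_first([False, False, True, True]))  # co-equal, but not co-first
```

Papers where the returned count is exactly 2 would be the ones kept for the pair analysis.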

There were a total of 11,340 papers with exactly two co-first authors both of whose genders were inferred. Of these, the author order counts were as follows:


I will admit I expected to see a lot more papers with Male-Female than Female-Male orders amongst two co-first authors. That is, however, not what the data show.

However, that doesn’t mean there’s not something interesting going on with gender here. First, there are obviously a lot more male authors than female authors. In this set of papers, only 40.3% of authors in position 1 and 41.0% in position 2 are female. Given this you can easily calculate the expected number of MM, MF, FM and FF pairs there should be.


Although there doesn’t seem to be a bias in favor of M-F over F-M, there are significantly (p << .0000000001 by Chi-square) fewer mixed gender co-first author pairs than you’d expect given the overall number of male and female co-first authors.
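As a rough sketch of that calculation (not the exact script behind the numbers above), the expected counts come from multiplying the position-wise frequencies. The function names are illustrative, the inputs are the rounded percentages quoted above so the counts are approximate, and the p-value formula assumes a chi-square test with one degree of freedom, as for a 2x2 table.

```python
import math


def expected_pair_counts(n_pairs, frac_female_first, frac_female_second):
    """Expected MM/MF/FM/FF counts if the two positions were independent."""
    p1, p2 = frac_female_first, frac_female_second
    return {
        "MM": n_pairs * (1 - p1) * (1 - p2),
        "MF": n_pairs * (1 - p1) * p2,
        "FM": n_pairs * p1 * (1 - p2),
        "FF": n_pairs * p1 * p2,
    }


def chi_square_1dof(observed, expected):
    """Pearson chi-square statistic and its p-value for 1 degree of freedom."""
    stat = sum((observed[k] - expected[k]) ** 2 / expected[k] for k in expected)
    # For one degree of freedom, the upper tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# 11,340 two-co-first-author papers; 40.3% and 41.0% female in positions 1 and 2.
expected = expected_pair_counts(11340, 0.403, 0.410)
for pair, count in expected.items():
    print(pair, round(count))
```

Passing the observed MM, MF, FM and FF counts together with `expected` to `chi_square_1dof` gives the test statistic and p-value.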

What can explain this? Are young scientists less likely to collaborate across gender lines than within them? Are male and female pairs better able to resolve their authorship disputes, and are thus underrepresented amongst co-first authors? Or are there fewer opportunities for them to collaborate because of biased lab compositions?

First I wanted to ask if there was a similar bias if we looked at all papers, not just the relatively rare co-first author papers. Here is the fraction of female authors by position in author list for all papers (excluding the last author for now).

Author gender by position

Female authors are most common in the first author position and they are increasingly less represented as you go back in the author order. Maybe this has to do with the well documented problem of academia driving out women between graduate school and faculty position. So next I asked what fraction of senior authors are women.

Gender by Senior Author Position

Yikes. Only 28% of senior authors of NIH author manuscripts are female compared to 46% of first authors. That’s horrible.

So what about the question from above? Are mixed gender first and second author pairs less common across all papers, not just co-firsts? The answer is yes.


Again, there are lots of possible explanations for this, but I was curious about the effect of biased lab composition (if the gender composition of labs is skewed away from parity then you’d expect more same gender author pairs). It’s hard to look at this directly with this data, but if one were going to guess at a covariate for skewed lab gender it would be the gender of the PI, and this I can look at with this data.

So, I next broke the data down by the gender of the senior author.

Author gender by PI gender

And in tabular form since the data are so striking.

Position   PI Female   PI Male
1st        56.3 %      41.0 %
2nd        53.0 %      40.6 %
3rd        50.6 %      40.0 %
4th        48.5 %      39.0 %
5th        45.1 %      37.1 %

This data very strongly suggests that women are more likely to join labs with female PIs and men more likely to join labs with male PIs. But it doesn’t say why. It could be that people simply choose labs with a PI of their gender, or that PIs select people of the same gender for their labs. This could have to do with direct gender bias, or with lab style or many other things. Or it could be that there’s a hidden field effect here – that different fields have different gender biases, which would drive the gender distribution of labs on average away from parity.

But whatever the reason it’s a clear confounding factor in looking at gender and authorship. Interestingly, the bias against mixed gender first and second authorship is still there (p-values << .0000000001) even if you control for the gender of the PI.

Next I asked if we could detect a skew in the gender composition of the entire author list of papers. So I took sets of papers with number of authors ranging from 2 to 8 (these are the ones for which we have enough data), filtered out papers where one or more authors didn’t have an inferred gender, and compared the distribution of the number of female authors to that expected by the frequency of male and female authors at each position. There is very consistently a skew towards the extremes, with a significant excess in every case of papers with authors of one gender.

Gender skew
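The null expectation in this comparison, the number of female authors you would see if each position were filled independently at its own female frequency, can be sketched as below. This is my reconstruction of the idea rather than the code that produced the figure, and the per-position fractions in the usage example are placeholders, not values from the data.

```python
def female_count_distribution(frac_female_by_position):
    """P(k female authors) when each position is filled independently.

    frac_female_by_position: one female frequency per author position.
    Returns a list where entry k is the probability of exactly k females.
    """
    dist = [1.0]  # with no authors yet, zero females has probability 1
    for p in frac_female_by_position:
        new = [0.0] * (len(dist) + 1)
        for k, prob in enumerate(dist):
            new[k] += prob * (1 - p)   # author at this position is male
            new[k + 1] += prob * p     # author at this position is female
        dist = new
    return dist


# Placeholder frequencies for a three-author paper (illustrative only).
for k, prob in enumerate(female_count_distribution([0.45, 0.42, 0.30])):
    print(k, round(prob, 3))
```

Multiplying these probabilities by the number of papers of that length gives expected counts to set against the observed distribution; an excess at k = 0 and at k = number of authors is the skew towards the extremes described above.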

So there’s a pretty systemic skew in the gender composition of authors on papers, but where that skew comes from is unclear. Let’s look at the gender mix of all of the other authors on a paper as a function of the gender of the last author.

Gender skew by last author

Again, there’s a pretty strong skew. But is this due to the PI’s gender or to a more general gender imbalance? It’s a bit hard to tell from this data alone. It turns out the skew you see after dividing based on the gender of the last author is roughly the same if you divide based on the gender of any other position in the author order. Here, for example, is what you get for papers with six authors.

effect of reference author

There’s a lot more one could and should do with this data, and I will come back to it later, but for now I will end with this observation. If you are female, there is a 45% chance that a random co-author on one of your papers is female. If you are male, it goes down to 35%. That’s a pretty big and striking difference, and I’m curious if anyone has a good explanation for it.
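For anyone who wants to try this on their own data, here is one way a "random co-author" probability could be computed. It is a sketch under an assumption of mine: every (author, co-author) pair on every paper is weighted equally, which may not match the weighting behind the 45% and 35% figures. The input format and the function name are hypothetical, and the papers in the example are toy data.

```python
def coauthor_female_rate(papers):
    """Chance that a random co-author is female, split by the author's gender.

    papers: list of papers, each a list of 'F'/'M' labels (unclassified
    authors are assumed to have been removed beforehand).
    """
    female_coauthors = {"F": 0, "M": 0}
    total_coauthors = {"F": 0, "M": 0}
    for genders in papers:
        n_female = genders.count("F")
        for g in genders:
            # Every other author on the paper is a co-author of this one.
            total_coauthors[g] += len(genders) - 1
            # Do not count a female author as her own co-author.
            female_coauthors[g] += n_female - (1 if g == "F" else 0)
    return {
        g: female_coauthors[g] / total_coauthors[g]
        for g in total_coauthors
        if total_coauthors[g] > 0
    }


# Toy data: three small papers.
toy_papers = [["F", "F", "M"], ["M", "M"], ["F", "M", "M", "M"]]
print(coauthor_female_rate(toy_papers))
```

Because authors cannot co-author with themselves, small groups push these two rates apart even under random mixing, so the gap needs a null model to be interpreted.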



  1. m b goodman
    Posted September 28, 2016 at 9:07 pm | Permalink

    Hi! Your analysis is totally aligned with an analysis of a broader collection of publications from JStor. (By broader, I mean encompassing research fields outside of biomedicine.)

    And also this publication,

    This prior work doesn’t attack the question of co-first authorship, however. Thanks for doing the analysis!

  2. Craig Kaplan
    Posted September 28, 2016 at 9:25 pm | Permalink

    I feel like PI gender has to be main driver, but perhaps there is also lab gender makeup as bias. I think that if lab with female PI shows enrichment for females in lab, then possibly labs enriched for females show bias in distribution, then should get the result you see???

    Can you check this number: I was surprised that expected/observed for one of the data sets was identical- is that coincidence or error?
    Female-Female 48769 48769

    • Posted September 28, 2016 at 9:29 pm | Permalink

      sorry – error – i’ve fixed it – thanks for catching that

  3. Posted September 29, 2016 at 7:45 am | Permalink

    Melanie Stefan and I recently submitted a paper to PLoS Computational Biology using a similar approach (preprint is here, blog post summarizing top-line results here). We were not looking at co-first authorship, we were principally concerned with gender skew in biology vs computational biology, but we do have very similar findings w/r/t female PIs.

    For gender inference, we used a web API that has really robust data on author names, including specific probabilities and confidence measures for each name. The owner gave us API calls for free, but asked that we not release the specific data – if you’d like to send me your data, I’d be happy to run it against the database we have (though our numbers were quite similar using multiple methods).

  4. Daniel Weissman
    Posted September 29, 2016 at 12:07 pm | Permalink

    The gender gap between senior authors and first authors is a mix of two things: 1, women being less likely to move from grad student to PI, and 2, cohort effects: the current generation of grad students has the highest percentage of women, but they haven’t gotten a chance to become PIs yet. So in a way, the data are encouraging, if today’s first-author distribution is tomorrow’s PI distribution.

    The cohort effect might also partly explain the non-random gender associations. Old papers are more likely to have been written by all men, newer papers are more likely to have been written by women. Same deal as the field effect that you mention. My initial guess would be that the two together (date & field) would explain a lot, but not all, of the assortment.

  5. J.J. Emerson
    Posted September 29, 2016 at 2:25 pm | Permalink

    Another thing that could explain part of the effect is that you might not be able to assume that you’re sampling from a single urn. In a basic model*, your test of heterogeneity can be seen as repeatedly drawing pairs from an urn with a constant proportion of F and M balls with replacement and categorizing the draws as FF, FM, MF, and MM. If, on the other hand, you aren’t actually drawing from a single urn, you’re drawing from many different urns, each with different proportions of F and M balls, then you’d actually expect a monolithic test of homogeneity to reject the simple null model common in such tests.

    This is a pretty reasonable assumption, and it doesn’t even need to be on the “PI” level like you’ve proposed to work. For example, consider sampling from an urn labeled “structural biology” and an urn labeled “ecology” and an urn labeled “molecular biology” and an urn labeled “metagenomics” etc. for every paper. Collectively, those subfields will exhibit some fairly significant variation in gender balance, and consequently pooling samples from such a heterogeneous dataset could very naturally lead to the imbalance you observe.

    Basically, if you modeled this as a beta-binomial sampling problem, the over-dispersion you observe is simply a consequence of compounding a binomial distribution with the proportion parameter distributed according to a beta distribution. The question is, why do different samples have different proportions? One answer is what you propose: the PI/trainee match is biased. Another is that gender balance varies among fields sampled in your dataset. These aren’t mutually exclusive of course. I’m fairly confident that gender bias varies by field, because I’ve seen evidence of this in the past. I don’t know if it varies by PI, though it wouldn’t surprise me. It does however seem to be a more difficult problem to address than the variation in gender balance varying by field.

    *The first/last author differences complicate the simple model obviously.

  6. Posted September 29, 2016 at 4:25 pm | Permalink

    More in depth comments later, but this may save you and others from having to write custom scripts to tear apart the reference records in future analyses.


    • Posted September 29, 2016 at 7:37 pm | Permalink

      That’s great, but doesn’t do what I wanted to do, which is to parse the full manuscript XML from PMC.

      • Posted September 30, 2016 at 6:07 am | Permalink

        Right, but I posted because as open source it might be easier for those doing scientometrics to add functionality rather than start from scratch or use multiple packages.

        • Posted September 30, 2016 at 7:23 am | Permalink

          All in favor of you posting, just explaining why I didn’t use it.

  7. Posted September 29, 2016 at 9:32 pm | Permalink

    We’ve spent a fair bit of time thinking about this issue using the data linked in a comment above (http://www.eigenfactor.org/gender). My feeling is that a large majority of the gender homophily that you are observing comes from differences in gender composition of fields/subfields/etc. that many have mentioned, and from cohort effects like Daniel mentions.

    How does one establish this? To start with, we need a way of measuring the degree of gender homophily or heterophily in any field. I’m convinced that the right way to do this is using the coefficient of homophily, defined as follows. Let p be the probability that a randomly chosen co-author of a randomly chosen man author is also a man and q be the probability that a randomly chosen co-author of a randomly chosen woman author is a man. The coefficient of homophily alpha is the difference between these two quantities: alpha=p-q

    It turns out that alpha has some nice properties. First, it is equal to the Pearson correlation coefficient between the genders of authors on a paper. Second, for two-author papers it is equal to Sewall Wright’s correlation coefficient F, if we think of a paper as an individual and an author as a locus. All of this is written up formally in a short note that I just posted.

    Continuing that population genetics metaphor a little bit, the stratification by gender into fields, subfields, etc. generates a Wahlund effect (https://en.wikipedia.org/wiki/Wahlund_effect), namely, an apparent shortage of heterozygotes or mixed-gender author pairs. What we would like to know is to what degree authors in the same small subfield assort by gender, and to what degree the apparent homophily is due to differences in the gender composition of fields. This is equivalent, in our population genetics metaphor, to decomposing the coefficient of inbreeding into its components, Wright’s F_IS and F_ST.

    To do this, of course you need to be able to assign papers to disciplines, fields, subfields, etc. We did this on the JSTOR corpus using the hierarchical map equation; that provides the hierarchy of fields that you see at our website. If gender homophily is mostly due to differences in gender composition across disciplines, we would expect lower coefficients of homophily in small subfields. The graph linked here (I don’t think I can post it in-line) shows our results. Indeed, small subfields have low homophily (they actually appear to demonstrate heterophily) whereas large fields have high homophily.

    Looks good, right? Unfortunately it’s not so easy, because even under random mixing of authors the test statistic alpha is not independent of the size of the field. The problem is essentially that authors don’t get to co-author with themselves. Consider a tiny field with four authors, two men and two women, and one paper in each of the six two-author combinations. Now if you pick a man author at random, you are twice as likely to pick the man in one of the four man-woman papers as you are to pick a man in the man-man paper. Therefore p=1/3. But if you pick a randomly chosen woman, again you’re more likely to pick a man-woman combination than the one woman-woman combination, and q=2/3. As a result, alpha=-1/3 even though the authorships seem to be distributed without gender bias.

    Because of this, it’s very hard to know how much of the pattern in our figure comes from this size-sensitivity of the test statistic and how much comes from the fact that we are filtering out the effects of different gender compositions across fields as we move toward small subfield sizes. (There are other problems with this graph as well, not the least of which is that the data points are not independent since the big fields are composites of the smaller subfields).

    And that’s more or less where we stand with the problem. We’re now working with colleagues at UW on a statistical approach to distinguishing between deliberate assortment by gender within a subfield and structural assortment due to differences in gender across subfields, but this turns out to be really tricky. Hopefully we’ll get that to work shortly, and I’m happy to share as soon as we do.

    • Posted September 30, 2016 at 7:36 am | Permalink

      This is great. Very interested to see how that analysis turns out. Not sure there’s really good data to ask the next question, which is given all the complexities, can you distinguish between sorting based on gender distribution in field and sorting based on gender distribution in labs?

  8. Eric Johnson
    Posted September 30, 2016 at 11:29 pm | Permalink

    If there is a gender bias in co-first authorships, it may be masked by a tradition of having co-first authors be in alphabetical order. I think that is the default arrangement as it reinforces the idea that they are equal authors. However, there are papers that do not follow this practice, and thus the first co-author is more of a “first among equals”. Because this is a more subjective evaluation than being alphabetical, is there a way to easily remove the alphabetical co-firsts and compare the ratio of MF to FM in the counter-alphabeticals?
