Last week there was a brief but interesting conversation on Twitter about the practice of “co-first” authors on scientific papers that led me to do some research on the relationship between author order and gender using data from the NIH’s Public Access Policy.
I want to note at the outset that this is my first foray into analyzing this kind of data, so I would love feedback on the data, analyses and findings, especially links to other work on the subject, as I know some of these issues have been addressed elsewhere.
A long post follows, but here are some main things I found:
Before I explain how I got all this, let me start with a quick explainer about how to parse the list of authors on a scientific paper.
By convention in many scientific disciplines (including biology, which this post is about), the first position on the author list of a paper goes to the person who was most responsible for doing the work it describes (typically a graduate student or postdoc) and the last position to the person who supervised the project (typically the person in whose lab the work was done). If there are more than two authors, an effort is made to order them in rough relationship to their contributions from the front, and degree of supervision from the back.
Of course a single linear ordering cannot do justice to the complexity of contribution to a scientific work, especially in an era of increasingly collaborative research. One can imagine many better systems. But, unfortunately, author order is currently the only way that the relative contributions of different people to a work are formally recorded. And when a scientist’s CV is being scrutinized for jobs, grants, promotions, etc., where they are in the author order matters A LOT – you only really get full credit if you are first or last.
Because of the disproportionate weight placed on the ends of the author list, these positions are particularly coveted, and discussions within and between labs about who should go where, while sometimes amicable, are often difficult and contentious.
In recent years it has become increasingly common for scientists to try to acknowledge ambiguity and resolve conflicts in author order by declaring that two or more authors should be treated as “co-first authors” who contributed equally to the work, marking them all with a * to designate this special status.
But, as the discussion on Twitter pointed out, this is a bit of a ruse. First is still first, even if it’s first among equals (the most obvious manifestation of this is that people consider it to be dishonest to list yourself first on the author list on your CV if you were second with a * on the original paper).
Anyway, during this discussion I began to wonder about how the various power dynamics at play in academia played out in the ordering of co-equal authors. And it seemed like an interesting opportunity to actually see these power dynamics at play, since the * designation indicates that the contributions of the *’d authors were similar and therefore any non-randomness in the ordering of *’d authors with respect to gender, race, nationality or other factors likely reflects biases or power asymmetries.
I’m interested in all of these questions, but the one that seemed most accessible was to look at the role of gender. There are probably many ways to do this, but I decided to use data from PubMed Central (PMC), the NIH’s archive of full-text scientific papers. Papers in PMC are available in a fairly robust XML format that has several advantages over other publicly available databases: 1) full names of authors are generally provided, making it possible to infer many of their genders with a reasonable degree of accuracy, and 2) co-first authorship is listed in the file in a structured manner.
I downloaded two sets of papers from PMC: 1,355,350 papers in their “open access” (OA) subset, which contains papers from publishers like PLOS that allow the free text to be redistributed and reused, and 424,063 papers from the “author manuscript” (AM) subset, which contains papers submitted as part of the NIH’s Public Access Policy. These papers are all available here.
I then wrote some custom Python scripts to parse the XML, extracting from each paper the author order, the authors’ given names and whether or not they were listed as “co-first” or “equal” authors (this turned out to be a bit trickier than it should have been, since the encoding of this information is not consistent). I will comment up the code and post it here ASAP.
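To give a rough idea of what this kind of extraction involves, here is a minimal sketch. It is not the script used for this post. It assumes the paper marks co-first authors with the `equal-contrib="yes"` attribute on `<contrib>` elements, which is only one of the encodings found in practice; the inline `SAMPLE` document, the author names in it and the `extract_authors` function are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed article in the style of PMC XML.
# Real files are much larger and do not always flag equal
# contribution this way.
SAMPLE = """<article>
  <front><article-meta><contrib-group>
    <contrib contrib-type="author" equal-contrib="yes">
      <name><surname>Doe</surname><given-names>Jane</given-names></name>
    </contrib>
    <contrib contrib-type="author" equal-contrib="yes">
      <name><surname>Roe</surname><given-names>John</given-names></name>
    </contrib>
    <contrib contrib-type="author">
      <name><surname>Poe</surname><given-names>Mary</given-names></name>
    </contrib>
  </contrib-group></article-meta></front>
</article>"""


def extract_authors(xml_text):
    """Return the authors in listed order with an 'equal' flag each."""
    root = ET.fromstring(xml_text)
    authors = []
    for contrib in root.iter("contrib"):
        # Skip editors and other non-author contributors.
        if contrib.get("contrib-type") != "author":
            continue
        authors.append({
            "position": len(authors) + 1,
            "given_names": contrib.findtext("name/given-names", default="").strip(),
            "surname": contrib.findtext("name/surname", default="").strip(),
            # Assumption: equal contribution is recorded as an attribute.
            "equal": contrib.get("equal-contrib") == "yes",
        })
    return authors


if __name__ == "__main__":
    for author in extract_authors(SAMPLE):
        print(author)
```

A paper that records equal contribution some other way (for example in a footnote) would come out of this sketch with every `equal` flag set to `False`, which is the sort of inconsistency that makes the real parsing trickier.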
| Order | Count | Percent |
|---|---|---|
| Male-Male | 4286 | 37.8 |
| Male-Female | 2479 | 21.9 |
| Female-Male | 2399 | 21.1 |
| Female-Female | 2176 | 19.2 |

I will admit I expected to see a lot more papers with Male-Female than Female-Male orders amongst two co-first authors. That is, however, not what the data show.
However, that doesn’t mean there’s not something interesting going on with gender here. First, there’s obviously a lot more male authors than female authors. In this set of papers, only 40.3% of authors in position 1 and 41.0% in position 2 are female. Given this you can easily calculate the expected number of MM, MF, FM and FF pairs there should be.
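For anyone who wants to follow the arithmetic, here is a small sketch of that calculation, starting from the four counts in the table. It assumes the genders in the two positions are independent, so the expected share of each pair type is just the product of the two marginal frequencies. The function name `expected_pair_counts` is my own choice for this example.

```python
# Observed counts of two-author co-first pairs, from the table above.
observed = {"MM": 4286, "MF": 2479, "FM": 2399, "FF": 2176}


def expected_pair_counts(observed):
    """Expected pair counts if gender in position 1 and position 2
    were independent (the assumption behind this sketch)."""
    total = sum(observed.values())
    # Marginal fraction of female authors in each position.
    p_f1 = (observed["FM"] + observed["FF"]) / total
    p_f2 = (observed["MF"] + observed["FF"]) / total
    # For each gender: (fraction in position 1, fraction in position 2).
    frac = {"M": (1 - p_f1, 1 - p_f2), "F": (p_f1, p_f2)}
    return {
        first + second: total * frac[first][0] * frac[second][1]
        for first in "MF"
        for second in "MF"
    }


if __name__ == "__main__":
    expected = expected_pair_counts(observed)
    for pair, count in observed.items():
        print(f"{pair}: observed {count}, expected {expected[pair]:.1f}")
```

Comparing each observed count with its expected value shows which pair types occur more or less often than independent pairing would predict.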
|