Statistical BS from autism geneticist in New York Times

[UPDATE: There is a followup to this post here.]

Last week Nature published the results of three studies (1,2,3) looking at the sequences of protein-coding genes from hundreds of individuals with autism and their parents. The main results are that there is a higher rate of de novo mutations in affected individuals, that these primarily come from fathers, and that the affected genes are enriched for those involved in brain development and activity.

I think a bit too much is being made of these studies – they’re generally technically sound, but there remains no definitive link between any single mutation or groups of mutations and the disease. However, the authors of one of the papers have made a big deal about having found a mutation in the same gene in two unrelated individuals. This is described in a piece last week by Benedict Carey in the New York Times:

In one of the new studies, Dr. Matthew W. State, a professor of genetics and child psychiatry at Yale, led a team that looked for de novo mutations in 200 people who had been given an autism diagnosis, as well as in parents and siblings who showed no signs of the disorder. The team found that two unrelated children with autism in the study had de novo mutations in the same gene — and nothing similar in those without a diagnosis.

“That is like throwing a dart at a dart board with 21,000 spots and hitting the same one twice,” Dr. State said. “The chances that this gene is related to autism risk is something like 99.9999 percent.”

Wow. 99.9999 percent. That’s impressive. But I have no idea where it came from.

If the study had looked at exactly two families, and they had found a single de novo mutation in the affected individual in each family, and these had been in the same gene, then yes, it would have been like throwing a dart at a dart board with 21,000 spots (roughly the number of genes examined) and hitting the same one twice – or roughly 1 in 21,000. But this is not what they did.

The study actually examined 200 families with affected and unaffected siblings, and identified 125 variants with the potential to alter protein function. So the question is not how likely it is to hit the same spot if you throw two darts, but rather how likely it is that some spot gets hit twice if you throw 125 darts at a dart board with 21,000 spots. The answer is that you would expect at least two darts to hit the same spot 30.9% of the time. That is roughly one in three times. In fact, the 30.9% number is a conservative estimate that assumes that the odds of hitting any given gene are the same – this is undoubtedly not the case, as some genes are bigger than others – so the real odds that the authors would have found the same gene twice purely by chance are even greater. Either way, it’s a very far cry from 99.9999 percent odds against a chance result.
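The point about gene size can be checked with a small sketch. Nothing below comes from the paper: the gene "sizes" are a made-up pattern (relative sizes 1 through 10, repeated), and the function name is mine. It computes exactly the probability that 125 draws all land on different genes when each gene is hit in proportion to its size, and compares that with the equal-odds case.

```python
def prob_all_distinct_weighted(weights, n_draws):
    """Exact probability that n_draws independent draws all hit different
    genes, when gene i is hit with probability weights[i] / sum(weights)."""
    total = sum(weights)
    # f[k] = P(k draws are all distinct) using only the genes seen so far.
    f = [1.0] + [0.0] * n_draws
    for w in weights:
        p = w / total
        # Go from high k to low k so f[k - 1] is still the "old" value.
        for k in range(n_draws, 0, -1):
            f[k] += k * p * f[k - 1]
    return f[n_draws]

N_GENES = 21000
N_DRAWS = 125

# Equal odds for every gene: the simple product used for the 30.9% figure.
p_distinct_uniform = 1.0
for k in range(N_DRAWS):
    p_distinct_uniform *= (N_GENES - k) / N_GENES
p_repeat_uniform = 1 - p_distinct_uniform

# HYPOTHETICAL gene sizes, purely for illustration (not real gene lengths).
hypothetical_sizes = [1 + (i % 10) for i in range(N_GENES)]
p_repeat_unequal = 1 - prob_all_distinct_weighted(hypothetical_sizes, N_DRAWS)

print(f"equal odds:    {p_repeat_uniform:.3f}")
print(f"unequal sizes: {p_repeat_unequal:.3f}")
```

Any departure from equal odds pushes the chance of a repeat up, never down, so the 30.9% figure is a floor under this dart-board model.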

UPDATE: Now that I’ve had a chance to look at the paper in more detail, I realize the authors were making a more subtle point about the nature of the mutations involved – highlighting the fact that they found two nonsense or splice mutations in the same gene. The authors did some fairly sophisticated simulations of how often this would occur and found, if they restrict their analysis to genes expressed in the brain, that the probability of it happening by chance is ~0.8%.

This is not the same as throwing darts at a dartboard with 21,000 genes, as there are only 14,000 brain-expressed genes. But I agree with the authors that this is not a trivially expected result. I still have no idea, though, where the 99.9999% part of the quote came from. Four orders of magnitude is a big difference.

What’s annoying here is not so much that the NYT used this quote (though they really need someone around to check this kind of thing), but rather that the quote came from the lead author of the paper – Matthew State – a clinical geneticist at Yale.

I cannot believe State put it this way. I hope he was simply misquoted. But if he really said this, and assuming that he understands the basic statistics involved (which, given his position, I find highly likely), then he must have oversimplified and somewhat misrepresented the significance of his findings in order to make them sound more impressive in the popular press.


For those interested in how I came up with the 30.9% number, even if it might not be relevant, the question we want to ask is how likely it is that, if we picked a random gene 125 times from a set of 21,000, we would never pick the same gene twice. Think of it this way. The first gene we pick cannot overlap another gene. When we pick the second gene, 20999 times out of 21000 (probability .99995238) it will not be the gene we picked first. When we pick the third gene, we assume the first two were different (otherwise we’d be done already), so the odds go down slightly, to 20998 times out of 21000 (.99990476), and they keep going down slightly each time until we get to gene 125, when the odds are 20876 out of 21000 (.99409524).

The counterintuitive part of this is that even though at each step the odds of a repeat are low, in order to end up with all of the genes in different bins you have to be on the right side of that random probability at each of 124 different steps. And to calculate the odds of this, you have to multiply all of these numbers together: .99995238 * .99990476 * … * .99409524, which equals .69088693. That means that there is only a 69.1 percent chance that all 125 randomly chosen genes will be different – or a 30.9 percent chance that you’ll see at least one gene twice.
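For anyone who wants to reproduce the multiplication, here is a minimal sketch of that product. The function name is mine, and it assumes, as the text does, that every gene is equally likely to be hit.

```python
from math import prod

def prob_all_distinct(n_draws, n_genes):
    """Probability that n_draws uniform random picks from n_genes
    are all different (the k = 0 term is the first pick, which is 1)."""
    return prod((n_genes - k) / n_genes for k in range(n_draws))

p_distinct = prob_all_distinct(125, 21000)
p_repeat = 1 - p_distinct

print(f"all different:      {p_distinct:.4f}")  # 0.6909
print(f"at least one repeat: {p_repeat:.4f}")   # 0.3091
```

The same function works for any number of draws and any pool size, so it is easy to see how quickly the chance of a repeat grows as more mutations are counted.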

It’s the same logic as the classic probability question of how many people you have to have in a room for the odds to be greater than 50% that two of them share the same birthday – the answer being 23.

UPDATE: Several people here and on Twitter complained that my analysis did not take into account the controls in the paper – and implied that the results would be very different if I did.

The controls have a completely negligible effect. The critique the commenters raised is that the authors didn’t just observe two hits to the same gene in the autism cases; they also observed no hits to that gene in the controls. The paper states that there were 87 relevant mutations in the controls. So, conditioned on the observation that some gene was hit twice in the cases, we want to know how likely it would be that none of the 87 control mutations hit that gene. The answer is 99.6%.

So, whereas I stated originally that the probability of hitting the same gene twice by chance in 125 random samples from a pool of 21,000 genes was 30.9%, if we now ask what is the probability of hitting the same gene twice by chance in 125 random samples from a pool of 21,000 genes AND not hitting the same gene in a set of 87 controls, the answer is 30.8%.
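The arithmetic behind the 99.6% and 30.8% figures can be sketched as follows. This uses the same simplifying assumptions as the rest of the post (every gene equally likely, control mutations independent of case mutations, one doubly hit gene of interest); the variable names are mine.

```python
from math import prod

N_GENES = 21000      # genes examined
N_CASE = 125         # relevant mutations in cases
N_CONTROL = 87       # relevant mutations in controls

# Chance that at least one gene is hit twice among the case mutations.
p_repeat = 1 - prod((N_GENES - k) / N_GENES for k in range(N_CASE))

# Chance that all 87 control mutations miss one particular gene.
p_controls_miss = (1 - 1 / N_GENES) ** N_CONTROL

# Both together: a repeat in cases AND no hit to that gene in controls.
p_joint = p_repeat * p_controls_miss

print(f"controls miss the gene: {p_controls_miss:.3f}")
print(f"repeat and no control hit: {p_joint:.3f}")
```

Because each control mutation has only a 1 in 21,000 chance of landing on the gene in question, 87 of them barely move the result.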

