The 99.9999%: more thoughts on stats in the autism sequencing paper

Yesterday I got incensed about a quote in a story in the NYT from a prominent autism researcher about the significance of findings in their recent paper (which described the sequencing of protein-coding genes from autistic individuals, their parents and siblings).

The statement that so offended me, from the lead author of the paper, was that he was 99.9999% sure that a gene identified in his study plays a role in causing autism. It’s a ridiculous assertion, completely at odds with what his group says in the paper. It needs to be corrected.

However, my original post also included some statistical analyses that were based on a cursory reading of their work, and, as a result, didn’t directly speak to the central claims of the paper. My basic critique is unchanged – the 99.9999% claim is insupportable – but this got me interested in the fine details of their results and analysis, and I thought it would be useful to post my thoughts here.

Let me also say at the outset that, while I didn’t like what was said in the NYT, and don’t agree with everything the authors say in the paper, I am not trying to take them to task here. Analyzing this kind of data is difficult, and there are all sorts of complexities to deal with.

First, a summary of what they did and found. The core data are sequences of the “exomes” (essentially the protein-coding portion of the genome) from 238 children with autism spectrum disorders (ASD), both of their unaffected parents, and (in 200 cases) an unaffected sibling.

The analyses they present focus on the families with data for unaffected siblings, enabling them to compare the transmission of inherited variants (those present in the parents) and de novo mutations (those not present in the parents) between affected and unaffected siblings. They observed no differences in the transmission of inherited variants between affected and unaffected individuals, but found a significant increase in the number of de novo mutations that change proteins in affected individuals. Specifically, there were 125 de novo, protein-altering (missense, nonsense or splicing) mutations in affected individuals compared to 87 in siblings, a significant difference (p=.01).

This observation provides reasonable support for the hypothesis that de novo mutations are associated with autism. But it does not indicate which – if any – of these specific mutations is involved. After all, the fact that there were 87 de novo protein-altering mutations in unaffected siblings suggests that many of those identified in autistic individuals are not involved in the disease. There is also the possibility that the elevated mutation rate is a secondary consequence of some other factor that leads to autism, and that none of these specific mutations is actually directly involved in the disease.

Given that the observation of a de novo mutation in one gene in one affected individual conveys limited information about its involvement in autism, the authors focus on cases where independent mutations were observed in the same gene in different affected individuals. The reasoning is that such an observation would be unlikely to occur by chance in a gene that is not involved in ASD.

This is where I initially mistook what they did. I assumed the quote in the NYT story was referring to the chances of observing the same gene twice amongst the 125 de novo mutations in affected individuals, and pointed out that we actually expect this to happen at least 30% of the time (I say at least because the 30% number comes from assuming random mutations are equally likely to occur in all genes – which is not correct for reasons such as differences in gene size, GC content, etc… – all things the authors factor into their calculations). Indeed, the observation in the paper that two genes are hit twice in the set of 125 is not a statistically significant finding, and, by itself would offer no evidence that these genes are involved in ASD – and the authors do not assert that it does.
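For anyone who wants to check the 30% figure, here is a minimal sketch in Python of the equal-probability calculation. It treats the problem as a birthday problem and assumes 21,000 genes, every one equally likely to be hit; the function name is just mine for illustration, and this is not the authors' model, which accounts for gene size, GC content and so on.

```python
def prob_repeat_gene(n_mutations: int, n_genes: int) -> float:
    """P(at least one gene is hit two or more times) when each mutation
    lands on a gene chosen uniformly at random (the birthday problem)."""
    p_all_distinct = 1.0
    for i in range(n_mutations):
        # The (i+1)-th mutation must avoid the i genes already hit.
        p_all_distinct *= 1.0 - i / n_genes
    return 1.0 - p_all_distinct


# ASSUMPTION: 21,000 equally likely genes, 125 de novo mutations.
p_repeat = prob_repeat_gene(125, 21_000)
print(f"P(some gene hit at least twice) = {p_repeat:.2f}")
```

Under these assumptions the answer comes out at a bit over 0.3, which is why a repeated gene among the 125 is unremarkable on its own.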

Instead the authors focused on a small subset of mutations – those that introduce a premature stop codon into a gene (thereby generating a truncated protein, or, because of a process known as nonsense-mediated decay, likely decreasing the expression level) or alter a splice site (potentially affecting the structure of the gene product). The numbers here are a lot smaller. In the affected individuals there were ten nonsense and five splicing mutations, while there were only five nonsense and no splicing mutations in the unaffected siblings. And, crucially, in the set of 15 such mutations one gene – SCN2A – appeared twice.

So now the question is, is this a significant observation? Under the simplest of models, if you picked genes 15 times randomly from a set of 21,000 you’d expect to hit at least one gene twice with a probability of around .005 – making it a reasonably significant observation.

However, this is actually an overestimate of the significance, as differences in gene size, base composition, etc… make it more likely that a random mutation will land on some genes than others, thereby increasing the probability of seeing the same gene twice. The authors did extensive simulations that take this into account, and, restricting their analysis to the 80% of genes expressed in the brain, they conclude that the observation of two nonsense/splicing variants in the same brain-expressed gene is significant, at a p-value of 0.008.
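To illustrate why unequal genes push the chance of a double hit upward, here is a rough sketch in Python. It computes the exact probability that 15 random hits include a repeated gene, first with 21,000 equally likely genes and then with unequal per-gene weights. The weights are entirely hypothetical (drawn from a log-normal distribution I made up, not from real gene lengths or mutation rates), so the second number only shows the direction of the effect and is not a reconstruction of the authors' simulations.

```python
import math
import random


def prob_repeat_gene_weighted(weights, n_hits):
    """Exact P(at least one gene is hit twice) when each hit lands on
    gene i with probability weights[i] / sum(weights)."""
    total = sum(weights)
    # e[j] accumulates the j-th elementary symmetric polynomial of the
    # per-gene hit probabilities processed so far.
    e = [1.0] + [0.0] * n_hits
    for w in weights:
        p = w / total
        for j in range(n_hits, 0, -1):
            e[j] += p * e[j - 1]
    # P(all hits land on distinct genes) = k! * e_k(p_1, ..., p_N)
    p_all_distinct = math.factorial(n_hits) * e[n_hits]
    return 1.0 - p_all_distinct


N_GENES = 21_000  # gene count used in the simple model
N_HITS = 15       # nonsense + splicing mutations in affected individuals

uniform = prob_repeat_gene_weighted([1.0] * N_GENES, N_HITS)

# HYPOTHETICAL weights: a made-up log-normal spread of "mutability",
# standing in for differences in gene size and base composition.
rng = random.Random(0)
hypothetical_weights = [rng.lognormvariate(0.0, 0.8) for _ in range(N_GENES)]
skewed = prob_repeat_gene_weighted(hypothetical_weights, N_HITS)

print(f"equal genes:   {uniform:.4f}")
print(f"unequal genes: {skewed:.4f}")
```

The equal-genes case reproduces the roughly .005 figure from the simplest model, and any unequal set of weights gives a larger probability, so the naive number always overstates the significance to some degree.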

However, it is worth noting (from the authors’ Figure S8) that under conservative but reasonable estimates of the de novo mutation rate and number of genes involved in ASD, the degree to which the data implicate SCN2A specifically is weaker, with a q-value (probability that the gene is not involved in ASD under various models) of around 0.03. Again, this may seem a bit counterintuitive: given that their data say that it’s significant that they saw the same gene twice, and they found only one such gene, how could that gene not be involved? But one actually has more power to validate the general model that de novo nonsense/splicing mutations are contributing to ASD than to implicate specific genes. This is why State’s assertion in the NYT that SCN2A was 99.9999% likely involved in ASD was pretty egregious – a q-value of 0.03 corresponds to roughly 97% confidence, so the claim is simply not consistent with their own data.

There are a few other things to note here.

First, the p-values and q-values they report are not specific to an individual gene – they are the average probability of observing a double hit in non-ASD genes and the average probability that a double-hit gene is not involved in ASD. But SCN2A is relatively large (2000 amino acids), and thus the observation of two mutations in this gene is somewhat weaker evidence for its involvement in ASD than it would be for a smaller gene. I haven’t done a full simulation, but given that SCN2A is 4-5x larger than average, and the chance of a double hit scales roughly with the square of gene size, it should be on the order of 20x more likely to be doubly hit by chance than a typical gene, and thus the average q-value reported is an underestimate. It would be easy, using the simulations the authors already have on hand, to ask what the false-discovery rate is when the doubly hit gene is 2000 amino acids or longer. I suspect it would no longer be significant.
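The "on the order of 20x" estimate is easy to check with a back-of-the-envelope sketch in Python. It assumes the chance of a single mutation landing on a gene is proportional to the gene's size, with an average gene getting 1 in 21,000 of the hits, and asks how much more likely a gene 4x or 5x the average size is to be hit at least twice among 15 mutations. The function name and the pure size-proportional model are my simplifications, not the authors' simulation.

```python
import math


def prob_gene_hit_twice(p_hit: float, n_mutations: int) -> float:
    """P(one particular gene receives >= 2 of n_mutations hits), where
    each mutation independently lands on that gene with prob. p_hit."""
    return sum(
        math.comb(n_mutations, j) * p_hit**j * (1.0 - p_hit) ** (n_mutations - j)
        for j in range(2, n_mutations + 1)
    )


N_MUTATIONS = 15
# ASSUMPTION: an average gene receives 1/21,000 of all random hits.
P_AVERAGE_GENE = 1.0 / 21_000

baseline = prob_gene_hit_twice(P_AVERAGE_GENE, N_MUTATIONS)
ratios = {}
for size_factor in (4, 5):
    big = prob_gene_hit_twice(size_factor * P_AVERAGE_GENE, N_MUTATIONS)
    ratios[size_factor] = big / baseline
    print(f"{size_factor}x gene: about {ratios[size_factor]:.0f}x more likely")
```

Because the double-hit probability goes roughly as the square of the per-mutation probability, a gene 4-5x the average size comes out somewhere between about 16x and 25x more likely to be doubly hit by chance.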

The model also fails to consider the possibility that such fairly significant mutations in many genes might be lethal, and thus would never be observed. Hard to get a great estimate of what fraction of genes this might be, and the number is probably small given that they’ll almost all be heterozygous, but, again, given that the observations are only marginally significant, this possibility seems worth considering.

Finally, the more I read the paper, the more uncomfortable I grew with the way that the paper moved back and forth from non-synonymous to nonsense/splicing mutations, depending on where they got statistical significance. They start out by arguing that there is a significant increase in the number of de novo non-synonymous mutations in ASD affected individuals. They get statistical significance here because there are a relatively large number of such mutations. They then look for cases where the same gene was hit twice, and find two. But this is not a significant observation – it fails to distinguish the possibility that a subset of ASD-involved genes were being hit from the null model of genes being hit randomly. However, for one of these pairs they noticed that there were two nonsense mutations. There wasn’t a significant enrichment of de novo nonsense mutations in cases (10) vs controls (5), so they added in the five splicing mutations from cases (there were none in controls) and got a marginally significant enrichment (p=.02). Then they looked at how likely it would be to find the same gene hit twice by nonsense/splicing mutations, and got a marginally significant result.
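The two case vs control comparisons in that chain can be sanity-checked with a simple one-sided binomial (sign) test, sketched here in Python. It assumes that, under the null, each de novo mutation is equally likely to turn up in the affected child or the unaffected sibling. I don't know that this is exactly the test the authors ran, so treat it as an illustration; the function name is mine.

```python
import math


def one_sided_binomial_p(n_cases: int, n_controls: int) -> float:
    """P(at least n_cases of the n_cases + n_controls mutations fall in
    cases) if each mutation is a fair coin flip between the two groups."""
    n = n_cases + n_controls
    tail = sum(math.comb(n, k) for k in range(n_cases, n + 1))
    return tail / 2**n


# Nonsense mutations alone: 10 in cases vs 5 in controls.
p_nonsense = one_sided_binomial_p(10, 5)
# Nonsense plus splicing: 15 in cases vs 5 in controls.
p_nonsense_splice = one_sided_binomial_p(15, 5)

print(f"nonsense only:       p = {p_nonsense:.3f}")
print(f"nonsense + splicing: p = {p_nonsense_splice:.3f}")
```

Under this test, nonsense mutations alone give a p-value of about 0.15, and adding the five splicing mutations brings it down to about 0.02, which lines up with the marginal enrichment described above.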

It’s possible to justify this path of analysis from first principles, as nonsense/splicing mutations are different from missense mutations – and maybe this was part of the analysis design from the beginning. But the way the paper was set out, it felt that they were hunting for significant p-values – which is a big no-no. What if they had observed that highly conserved amino acids in some gene had been hit by the same missense mutation in two families? Would they have pursued this result and evaluated its significance? This is a crucial question, because if they pursued the nonsense observation simply because it was what they observed, then their statistics are wrong, as they need to be corrected for all the other possible double-hit leads they would have pursued. This is not a subtle effect either – such a correction would almost certainly render the results insignificant.

I don’t know the details of how this experiment was planned. Maybe they always intended to do this exact analysis in the first place, and thus it’s completely kosher. But the scientific literature is filled with post facto statistical analyses, in which people do an experiment, make an observation, and then evaluate the significance of this observation under the assumption that this was always what they were looking for in the first place.

It’s sort of like how, in baseball broadcasts, the announcers are always saying things like “This batter has gotten hits in his first at bat in 20 straight games played on Sunday afternoon”. They say this because it sounds so improbable – and in some sense it is, as this specific streak is, indeed, improbable. But if you consider all the possible ways you can slice and dice a player’s past performance, it is inevitable that there would be some kind of anomaly like this – rendering it statistically uninteresting.

I’m not saying that something this extreme happened in this autism paper. But the way the data were presented in the paper definitely made it seem like they were looking for a statistically significant observation on which to sell their paper (to Nature and to the public).

And it’s a shame – the data in the paper are cool. But does it really make sense to make such a big deal out of what is, at best, a single marginally significant observation? What if they hadn’t chosen one of those two families for their study? Would the result be uninteresting? Of course not.

In the end, what this paper should have said was, we generated a lot of cool data, we found some evidence that de novo mutations are enriched in kids with ASD relative to their siblings, but we need more data – a lot of it – to really figure out what’s going on here. Unfortunately, in the world we live in, this would have been dismissed as kind of boring, and likely not worthy of a Nature paper (although far less interesting genome papers are published there all the time).

So the authors made a big deal out of an interesting single observation, when they should have waited for more data. And then, probably for the same reasons, they oversold the result to the press – and ended up expressing an indefensible 99.9999% confidence in SCN2A’s involvement in ASD to a reporter.

And I hope you understand now why it pissed me off.
