Identifying individuals in "anonymous" genetic studies

Most people who participate in genetic studies do so with the expectation that their participation – and more importantly their phenotype – will be anonymous. To preserve this anonymity, raw data (individuals’ genotypes and phenotypes) are not made publicly available. However, to enable validation and further research, pooled data – the average allele frequencies in cases and controls – have been made available through public databases like dbGaP.

But a really cool new paper in PLoS Genetics demonstrates that if you know an individual’s genotype you can actually figure out whether they participated in a particular study. This may seem counterintuitive, but if you think about it for a sec it makes sense. An individual’s inclusion in a dataset leaves a fingerprint: it shifts the pool’s allele frequencies slightly in the direction of their particular genotype. Obviously, with only a small number of genotyped markers that shift is too small to tell apart from noise. But with 500,000+ markers, as most modern genotyping platforms have, the authors show that you can essentially just count up the number of times the pool’s allele frequencies diverge from the expected allele frequencies (say, those of a matched reference population) in the direction of an individual’s genotype. If this number is significantly higher than expected by chance, it is very likely that the individual was part of the pool.
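To make the counting idea concrete, here is a minimal sketch in Python. It implements the simplified version described in this post (a sign count turned into a z-score against a 50/50 null), not the exact statistic from the paper. Everything in it is an assumption made for illustration: the simulated data, the pool size, the number of SNPs, and the helper names `genotype`, `concordance_z` and `simulate`.

```python
import math
import random


def genotype(rng, p):
    """One person's allele frequency (0, 0.5 or 1) at a SNP whose
    population allele frequency is p: two independent allele draws."""
    return ((rng.random() < p) + (rng.random() < p)) / 2


def concordance_z(person, pool, reference):
    """Count SNPs where the pool deviates from the reference in the same
    direction as the person does, and return a z-score against the null
    hypothesis that the directions agree half the time."""
    hits = total = 0
    for g, m, p in zip(person, pool, reference):
        if g == p or m == p:  # no direction to compare; skip
            continue
        total += 1
        hits += (g > p) == (m > p)
    return (hits - total / 2) / math.sqrt(total / 4)


def simulate(n_snps=10_000, pool_size=50, seed=0):
    """Simulated toy data (hypothetical sizes): returns the z-score for
    one pool member and for one person who is not in the pool."""
    rng = random.Random(seed)
    reference = [rng.uniform(0.1, 0.9) for _ in range(n_snps)]
    people = [[genotype(rng, p) for p in reference] for _ in range(pool_size)]
    # The only thing "published" is the per-SNP average over the pool.
    pool = [sum(column) / pool_size for column in zip(*people)]
    outsider = [genotype(rng, p) for p in reference]
    return (concordance_z(people[0], pool, reference),
            concordance_z(outsider, pool, reference))


member_z, outsider_z = simulate()
print(f"member z = {member_z:.1f}, outsider z = {outsider_z:.1f}")
```

In this toy setup the member's z-score should come out far above anything you'd expect by chance, while the outsider's should hover near zero. Real data are messier (linkage between nearby SNPs, mismatched reference populations, genotyping error), so treat this as intuition only.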

A cute trick, no? But are there any practical implications? Clearly yes. As more and more people get whole-genome genotyping from companies like 23andMe, Navigenics or DecodeMe (full disclosure – I am an advisor to 23andMe), and as many people start to share their genotypes – either intentionally or unintentionally – it would be theoretically possible for someone to take that person’s genotype and scan all existing genome-wide association studies to see if they participated. And if they haven’t had their genotype done, and someone else REALLY wanted to know if they participated in a study, that someone could steal a piece of hair and pay to have it genotyped. (It’s not discussed in the paper, but I bet you could use a sibling, parent or even perhaps a more distant relative and get a similar answer – although presumably with less certainty.)

Surprisingly, the paper has received little notice in the popular press. But it’s created quite a stir in the human genetics community. The National Institutes of Health immediately shut off public access to its genome-wide association data, and urged others with similar data to follow suit. This is a rather shocking reversal for a community that had been pushing the open availability of these data.

It’s really rather amazing that no one thought about this before. There are a lot of very bright people involved in human genetic mapping, yet none of them realized that individuals could be relatively easily “unpooled”. I bet there are a lot of quantitative geneticists kicking themselves. And I hope some of them are working on a way around this.

Interestingly, the authors seem more interested in the forensics angle here. They offer up their method as a way to tell if a particular individual was in a room, handled a weapon, or anything else where a lot of different people might have left their DNA in the same place. You can see where this is going – get an individual’s genotype and you can trace them all over the world.
