Blinded by Big Science: The lesson I learned from ENCODE is that projects like ENCODE are not a good idea

When the draft sequence of the human genome was finished in 2001, the accomplishment was heralded as marking the dawn of the age of “big biology”. The high-throughput techniques and automation developed to sequence DNA on a massive scale would be wielded to generate not just genomes, but reference data sets in all areas of biomedicine.

The NHGRI moved quickly to expand the universe of sequenced genomes, and to catalog variation within the human population with HapMap, HapMap 2 and 1000 Genomes. But they also began to dip their toe into the murkier waters of “functional genomics”, launching ENCODE, a grand effort to build an encyclopedia of functional elements in the human genome. The idea was to simultaneously annotate the human genome and provide basic and applied scientists working on human disease with reference data sets that they would otherwise have had to generate themselves. Instead of having to invest in expensive equipment and learn complex protocols, they would often be able to just download the results, thereby making everything they did faster and better.

Now, a decade and several hundred million dollars later, the winding down of ENCODE and the publication of dozens of papers describing its results offer us a vital opportunity to take stock of what we learned, whether it was worth it, and, most importantly, whether this kind of project makes sense moving forward. This is more than just an idle intellectual question. NHGRI is investing $130m in continuing the project, and NHGRI, and the NIH as a whole, have signalled their intention to do more projects like ENCODE in the future.

I feel I have a useful perspective on these issues. I served as a member of the National Advisory Committee for the ENCODE and related modENCODE projects throughout their lifespans. As a postdoc with Pat Brown and David Botstein in the late 90’s I was involved in the development of DNA microarrays and had seen firsthand the transformative potential of genome sequences and the experimental genomic techniques they enabled. I believed then, and still believe now, that looking at biology on a big scale is often very helpful, and that it can make sense to let people who are good at doing big projects, and who can take advantage of economies of scale, generate data for the community.

But the lesson I learned from ENCODE is that projects like ENCODE are not a good idea.

American biology research achieved greatness because we encouraged individual scientists to pursue the questions that intrigued them and the NIH, NSF and other agencies gave them the resources to do so. And ENCODE and projects like it are, ostensibly at least, meant to continue this tradition, empowering individual scientists by producing datasets of “higher quality and greater comprehensiveness than would otherwise emerge from the combined output of individual research projects”.

But I think it is now clear that big biology is not a boon for individual discovery-driven science. Ironically, and tragically, it is emerging as the greatest threat to its continued existence.

The most obvious conflict between little science and big science is money. In an era when grant funding is getting scarcer, it’s impossible not to view the $200m spent on ENCODE in terms of the ~125 R01’s it could have funded. It is impossible to score the value lost from these hundred or so unfunded small projects against the benefits of one big one. But an awful lot of amazing science comes out of R01’s, and it’s hard not to believe that at least one of these projects would have been transformative.

But, as bad as the loss of individual research grants is, I am far more concerned about the model of independent research upon which big science projects are based.

For a project like ENCODE to make sense, one has to assume that when a problem in my lab requires high-throughput data, someone – or really a committee of someones – who has no idea about my work will have predicted, years in advance, precisely the data that I would need and generated it for me. This made sense with genome sequences, which everyone already knew they needed to have. But for functional genomics this is nothing short of lunacy.

There are literally trillions of cells in the human body. Multiply that by life stage, genotype, environment and disease state, and the number of possible conditions to look at is effectively infinite. Is there any rational way to predict which ones are going to be essential for the community as a whole, let alone individual researchers? I can’t see how the answer is possibly yes. What’s more, many of the data generated by ENCODE were obsolete by the time they were collected. For example, if you were starting to map transcription factor binding sites today, you would almost certainly use some flavor of exonuclease ChIP, rather than the ChIP-seq techniques that dominate the ENCODE data.

I offer up an example from my own lab. We study Drosophila development. Several years ago a postdoc in my lab got interested in sex chromosome dosage compensation in the early fly embryo, and planned to use genome-wide mRNA abundance measurements in male and female embryos to study it. It just so happened that the modENCODE project was generating genome-wide mRNA abundance measurements in Drosophila embryos. Seems like a perfect match. But these data were all but useless to us, not because the data weren’t good – the experiment was beautifully executed – but because they could not answer the question we were pursuing. We needed sex-specific expression; they pooled males and females. We needed extremely precise time resolution (to within a few minutes); they looked at two-hour windows. There was no way they could have anticipated this – or any of the hundreds of other questions about developmental gene expression that came up in other labs.

We were fortunate. I have money from HHMI and was able to generate the data we needed. But a lot of people would not have been in my position, and in many ways would have been worse off because the existence of ENCODE/modENCODE makes it more difficult to get related genomics projects funded. At this point the evidence for such an effect is anecdotal – I have heard from many people that reviewers explicitly cited an ENCODE project as a reason not to fund their genomics proposal – but it’s naive to think that these big science projects will not affect the way that grants are allocated.

Think about it this way. If you’re an NIH agency looking to justify your massive investment in big science projects, you are inevitably going to look more favorably on proposals that use data that have already been, or are about to be, generated by expensive projects that feature in the institute’s portfolio. And the result will be a concentration of research effort on datasets of high technical quality, but little intrinsic value, with scientists wanting to pursue their own questions left out in the cold, and the most interesting and important questions at risk of never being answered, or even asked.

You can already see this mentality at play in discussions of the value of ENCODE. As I and many others have discussed, the media campaign around the recent ENCODE publications was, at best, unseemly. The empty and often misleading press releases and quotes from scientists were clearly masking the fact that, despite publishing 30 papers, they actually had very little of grand import to say, today, about what they found. The most pensive of them realized this, and went out of their way to emphasize that other people were already using the data, and that the true test was how much the data would be used over the coming years.

But this is the wrong measure. These data will be used. It is inevitable. And I’m sure this usage will be cited often to justify other big science projects ad infinitum. And we will soon have a generation of scientists for whom an experiment is figuring out what kinds of things they can do with data selected three years earlier by a committee sitting in a windowless Rockville hotel room. I don’t think this is the model of science anyone wants – but it is precisely where we are headed if the metastasis of big science is not halted.

I want to be clear that I am not criticizing the people who have carried out these projects. The staff at the NIH who ran ENCODE and the scientists who carried it out worked tirelessly to achieve its goals, and the organizational and technical feat they achieved is impressive. But that does not mean it is ultimately good for science.

When I have raised these concerns privately with my colleagues, the most common retort I get is that, in today’s political climate, Congress is more willing to fund big, ambitious-sounding projects like ENCODE than it is to simply fund the NIH extramural budget. I can see how this might be true. Maybe the NIH leadership is simply feeding Congress what it wants in order to preserve the NIH budget. And maybe this is why there’s been so little pushback from the general research community against the expansion of big biology.

But it will be a disaster if, in the name of protecting the NIH budget and our labs’ funding, we pursue big projects that, in the process, destroy investigator-driven science as we know it.
