This 100,000 word post on the ENCODE media bonanza will cure cancer

It is oddly fitting that the papers describing the results of the NIH’s massive $200m ENCODE project were published in the midst of political convention season. For this was no typical scientific publication, but a carefully orchestrated spectacle, meant to justify a massive, expensive undertaking, and to convince us that we are better off now than we were five years ago.

I’ll touch more on the details of the science, and the way it was carried out, in another, longer post. But I want to try to explain to people who were asking on Twitter why I found today’s media blitz to promote the ENCODE publications so off-putting. Because, as cynical as I am about this kind of thing, I still found myself incredibly disheartened by the degree to which the ENCODE press release and many of the interviews published today push a narrative about their results that is, at best, misleading.

The issues all stem, ultimately, from the press releases issued by the ENCODE team, one of which begins:

The hundreds of researchers working on the ENCODE project have revealed that much of what has been called ‘junk DNA’ in the human genome is actually a massive control panel with millions of switches regulating the activity of our genes. Without these switches, genes would not work – and mutations in these regions might lead to human disease. The new information delivered by ENCODE is so comprehensive and complex that it has given rise to a new publishing model in which electronic documents and datasets are interconnected.

The problems start before the first line ends. As the authors undoubtedly know, nobody actually thinks that non-coding DNA is ‘junk’ any more. It’s an idea that pretty much only appears in the popular press, and then only when someone announces that they have debunked it. Which is fairly often. And has been for at least the past decade. So it is more than just intellectually lazy to start the story of ENCODE this way. It is dishonest – nobody can credibly claim this to be a finding of ENCODE. Indeed it was a clear sense of the importance of non-coding DNA that led to the ENCODE project in the first place. And yet, each of the dozens of news stories I read on this topic parroted this absurd talking point – falsely crediting ENCODE with overturning an idea that didn’t need to be overturned.

But the deeper problem with the PR, and the main paper to some extent, is the way that they slip and slide around the extent and nature of the functions they have “discovered”. The pullquote from the press release is that the human genome is a “massive control panel with millions of switches regulating the activity of our genes”. So let’s untangle this a bit. It is true that the paper describes millions of sequences bound by transcription factors or prone to digestion by DNase. And it is true that many bona fide regulatory sequences will have these properties. But as even the authors admit, only some fraction of these sequences will actually turn out to be involved in gene regulation. So it is simply false to claim that the papers have identified millions of switches.

Ewan Birney, who led the data analysis for the entire ENCODE project, wrote an excellent, measured post on the topic today in which he makes it clear that when they claim that 80% of the genome is “functional”, they are simply referring to its having biochemical activity. And yet even his quotes in the press release play a bit fast and loose with this issue, repeating the millions of switches line. Surely it’s a sign of a toxic process when people let themselves be quoted saying something they don’t really believe.

The end result is some fairly disastrous coverage of the project in the popular press. Gina Kolata’s story on the topic in the New York Times is, sadly, riddled with mistakes. It’s commonplace amongst scientists to blame this kind of thing on reporters not knowing what they’re talking about. But in this case, at least, the central problems with her story trace directly back to the misleading way in which the results were presented by the authors.

The NYT piece is titled “Bits of Mystery DNA, Far From ‘Junk,’ Play Crucial Role” (wonder where they got that idea), and goes on to herald the “major medical and scientific breakthrough” that:

the human genome is packed with at least four million gene switches that reside in bits of DNA that once were dismissed as “junk” but that turn out to play critical roles in controlling how cells, organs and other tissues behave

This is complete crap. Yet it’s nothing more than a paraphrasing of the line the ENCODE team were promoting. Same thing with a statement later on that “At least 80 percent of this [junk] DNA is active and needed.” You can blame the reporter if you want for incorrectly mixing in the “needed” part there, which is not something the studies asserted. But this is actually a perfectly logical conclusion to reach from the 80% functional angle the authors were pitching.

I don’t mean to pick too harshly on the ENCODE team here. They didn’t invent the science paper PR machine, nor are they the first to traffic in various levels of misrepresentation to make their story seem sexier to journals and the press. But today’s activities may represent the apotheosis of the form. And it’s too bad – whatever one thinks about the wisdom of the whole endeavor, ENCODE has produced a tremendous amount of data, and both the research community and the interested public would have benefited from a more sober and realistic representation of what the project did and did not accomplish.


27 Comments

  1. Jonathan
    Posted September 6, 2012 at 3:41 am | Permalink

    “The problems start before the first line ends. As the authors undoubtedly know, nobody actually thinks that non-coding DNA is ‘junk’ any more.”

    Really? That’s *exactly* the claim lots of people have been making as a criticism of this paper.

  2. Jim Davis
    Posted September 6, 2012 at 5:00 am | Permalink

    Hey Mike, Remember in Biochem 61 I used to say (with no evidence, just a hunch) that junk DNA wasn’t junk; we just didn’t know what it did!
    Last year I was diagnosed with FSH Dystrophy, a non-fatal but nuisance form of MD. It’s caused by a deletion in Chromosome 4, but a second event is required to cause symptoms: the re-activation of a previously inactive gene in that so-called junk DNA region. It causes a slow weakening of various muscle groups, and interferes with or prevents muscle repair. I’m using a walker now, but still get around. I understand that a lot of the molecular biologists are excited about this, as the first example of such a re-activation causing an actual human disease. Do you know more about this?

    My son and I went to Johns Hopkins to participate in a research study. Won’t help me, but a good feeling to participate in something that might help get to the bottom of this disease. Interesting result: my son does not have the bad gene.

    I enjoy your posts; keep up the good work! jim

  3. Dave
    Posted September 6, 2012 at 5:36 am | Permalink

    I have some concerns with their ChIP-Seq data. For one of their favorite TFs, the data is not at all consistent with several other papers published last year by us and others. For this particular TF, they get a ton (one order of magnitude!) more peaks than in the other studies. It is to a point where they have ChIP-Seq peaks at almost every single gene. That is just not at all consistent with what we have seen with the same TF, and is not typical for your run-of-the-mill TF. It really skews their downstream analysis in my opinion.

  4. DrugMonkey
    Posted September 6, 2012 at 5:41 am | Permalink

    And the $200M? Money well spent?

  5. Jim Woodgett
    Posted September 6, 2012 at 6:42 am | Permalink

    Have a feeling that the many contributors to the Encode project are also likely dissatisfied with the banal PR distillation of their work to “millions of buttons”. Dig deeper and there is some intelligent discourse of the data out there and like most large scale projects, the fruits will likely not be realised for several years (and not result in any “great” revelation).

    Another rather questionable claim is that the project reached its “milestone”. Code for “we expect the next tranche of funding”?

  6. Posted September 6, 2012 at 6:48 am | Permalink

    Well said Mike! I put up a very short comment on the Nature news piece yesterday saying that it’s time to cut off this massive funding stream to ENCODE:
    http://www.nature.com/news/encode-the-human-encyclopaedia-1.11312
    (See the Comments at the bottom; mine is the first one.)

    You are right, this is massively over-hyped, and the so-called discoveries are, well, not new. You could have also mentioned that the data has not been shared – it was all embargoed, some of it for years, while these papers were being written. Yes, you could see the data, but you couldn’t publish on it.

    • THEMAYAN
      Posted March 1, 2013 at 5:27 am | Permalink

      Mr Salzberg, I saw a video of your criticism of ENCODE, and while I always think criticism can often be a good thing, I was a little surprised because I was under the assumption that the reason ENCODE was set up the way it was was to discourage grandstanding and to encourage the collection and analysis of data from all these different sources, and I thought everyone who participated in the project agreed to these terms and conditions beforehand.

      However, what I really found interesting was that after all this criticism concerning the over-hyping allegations, you admitted to reading only part of one paper out of the thirty-plus published. Again, I found this a little puzzling, even more so than the many errors I found in Dan Graur’s recent hit piece. Nothing personal of course. I just thought I would share my thoughts.

  7. Posted September 6, 2012 at 6:53 am | Permalink

    Very fine. We have raised some other points in our blog ‘The Mermaid’s Tale’, at EcoDevoEvo.blogspot.com.

    The hype is deplorable but has become standard fare. But thinking critically about the issues is put on the back burner, and not even all investigators are aware of the issues. Others, aware of them and willing privately to acknowledge that, say it’s bad for business to stress them. Like the always-included promise that we’ll never get sick again if we keep the pipeline open.

    Anyway, there are other issues such as how all of this evolves if it is all so important, and we commented on that, but it’s essentially absent from the tsunami of orchestrated ENCODE papers.

    Your post, and those of some others, raise the right questions about the way such things are now routinely presented.

  8. Dave
    Posted September 6, 2012 at 7:18 am | Permalink

    We just re-analyzed one of their ChIP-Seqs. As expected, we get 5% of the peaks they do. Many of their “peaks” are background. No wonder their genomes are so “active”.

  9. Posted September 6, 2012 at 7:38 am | Permalink

    Mike, you didn’t mention that you’re one of the ENCODE “external consultants”:
    http://www.genome.gov/12513392
    I hope you’ll be giving them some advice about what to do next.

  10. Posted September 6, 2012 at 8:06 am | Permalink

    Hopefully I’m not talking out of turn (not a scientist!), but to me this project seems to have a lot of parallels to robotic space probe missions. It takes a lot of people a lot of time and money to build the probe and its payload (e.g. surface rovers for landing missions on Mars, and all the equipment on the probe for flybys and such). You learn a lot just building all the equipment and tools (which hopefully you can reuse and improve for later missions, and you’ve learned stuff from previous missions that informs current projects).

    But the actual science is more about the data you collect once you have those tools and have shipped your probe off. This project’s current milestone seems to be a lot like they’ve built the probe, sent it off, successfully landed it on Mars and collected a lot of initial data. And *now* all the scientists can really take that data, analyze it, run more experiments, gather more data, etc. And unlike space probe missions, you don’t have to wait months or years for it to get somewhere before you can really learn anything – people have already been using the tools and data to find things out.

    Good analogy or bad? :)

    And assuming you could describe it this way, I’m sadly not sure how this would work — media often seems to want a (preferably easy to explain) big discovery or event.

  11. Georgi Marinov
    Posted September 6, 2012 at 12:06 pm | Permalink

    Dave
    Posted September 6, 2012 at 7:18 am | Permalink
    We just re-analyzed one of their ChIP-Seqs. As expected, we get 5% of the peaks they do. Many of their “peaks” are background. No wonder their genomes are so “active”.

    ===================

    Which sets of peaks did you look at? The ones that individual groups have posted and you can download from UCSC or the post-IDR sets of reproducible peaks that were used in the papers? There is a big difference between the two. Also, which factor are you referring to, which cell line, which peak caller, etc.

  12. Wendell Read
    Posted September 6, 2012 at 1:51 pm | Permalink

    “As the authors undoubtedly know, nobody actually thinks that non-coding DNA is ‘junk’ any more. It’s an idea that pretty much only appears in the popular press”

    Check out Larry Moran’s Sandwalk web site

    http://sandwalk.blogspot.com/

  13. Posted September 6, 2012 at 2:59 pm | Permalink

    Am I mistaken or isn’t the whole claim that GMO food is safe for human consumption based upon this claim that humans do not assimilate this “Junk DNA” into our body’s cellular DNA or assimilate these modified genes into our cellular makeup?
    Also this admission seems to also have implications for vaccine manufacturing since they also use this “Junk DNA” claim to suggest that these vaccines made from DNA manipulation are safe because of the same reasoning.

    • Posted September 6, 2012 at 3:03 pm | Permalink

      That’s not the argument for the safety of GMOs.

  14. Dave
    Posted September 6, 2012 at 5:03 pm | Permalink

    @Georgi Marinov
    We are interested in one of the papers in Genome Biology and have been analyzing the raw data they submitted to GEO. For the sake of anonymity, I won’t go into any more details here, but we are using a pretty standard Bowtie/MACS pipeline and an FDR <1%. Of course not all peak callers are created equal, but they should be in the same ball park!!!
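
    For readers who want to see what such a workflow looks like in practice, here is a minimal sketch of a generic Bowtie/MACS-style ChIP-seq run with a 1% FDR (q-value) cutoff. It is only an illustration of the kind of pipeline being described, not the commenter’s or ENCODE’s actual analysis; the genome index prefix and file names are placeholders, and it assumes bowtie2, samtools and MACS2 are installed.

    ```python
    """Minimal sketch of a generic ChIP-seq alignment + peak-calling run.
    Not ENCODE's pipeline or the commenter's exact workflow: the index prefix,
    file names and sample label below are placeholders."""
    import subprocess

    INDEX = "hg19_bowtie2_index"    # placeholder bowtie2 index prefix
    CHIP_FASTQ = "chip.fastq.gz"    # placeholder reads for the TF ChIP
    INPUT_FASTQ = "input.fastq.gz"  # placeholder input/control reads

    def align(fastq, out_bam):
        """Align reads with bowtie2, then coordinate-sort and index the BAM."""
        sam = out_bam.replace(".bam", ".sam")
        subprocess.run(["bowtie2", "-x", INDEX, "-U", fastq, "-S", sam], check=True)
        subprocess.run(["samtools", "sort", "-o", out_bam, sam], check=True)
        subprocess.run(["samtools", "index", out_bam], check=True)

    align(CHIP_FASTQ, "chip.bam")
    align(INPUT_FASTQ, "input.bam")

    # Call peaks against the input control at a q-value (FDR) cutoff of 1%.
    subprocess.run([
        "macs2", "callpeak",
        "-t", "chip.bam", "-c", "input.bam",
        "-f", "BAM", "-g", "hs",
        "-n", "my_tf", "-q", "0.01",
    ], check=True)
    ```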

  15. Georgi Marinov
    Posted September 7, 2012 at 2:25 am | Permalink

    Dave
    Posted September 6, 2012 at 5:03 pm | Permalink
    @Georgi Marinov
    We are interested in one of the papers in Genome Biology and have been analyzing the raw data they submitted to GEO. For the sake of anonymity, I won’t go into any more details here, but we are using a pretty standard Bowtie/MACS pipeline and an FDR <1%. Of course not all peak callers are created equal, but they should be in the same ball park!!!

    ==============================

    The way this worked was that individual groups submitted their own peak calls for each library, then everything was passed as replicates through the IDR pipeline for the integrative papers.

    I can assure you that there is no overcalling in the peak call sets generated by the IDR pipeline.

    However, what each group submitted individually is highly variable as different peak callers were used, different settings, etc. and some of those submissions were done long before there was such a thing as an IDR pipeline. I don't know what is currently public at UCSC, but there is a chance you downloaded a set that was indeed overcalled; that is not what went into the ENCODE papers though.

    P.S. You're probably aware of that, but MACS in its earlier incarnations was itself guilty of some very significant overcalling. And in general peak callers can be very unstable and produce wildly differing results depending on the specific of how they were designed and what settings you run them on – those 1% FDR settings are completely meaningless IMO.

    ENCODE has been quite open about all the issues with ChIP-seq, even if understandably this is not the kind of thing that would go into a press release; there is a whole paper in the package that's mostly about that:

    http://genome.cshlp.org/content/22/9/1813.full
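
    To make the reproducibility idea above a bit more concrete, here is a toy sketch of the intuition behind IDR-style filtering: rank each replicate’s peaks by signal and ask how consistently the top-ranked peaks agree as the cutoff grows. This is emphatically not the IDR statistic itself, just an illustration; it assumes peaks have already been matched across replicates so that shared identifiers are meaningful, and the example data are made up.

    ```python
    """Toy rank-agreement check between two replicate peak sets. Not the IDR
    method used by ENCODE, only the underlying intuition: strong peaks should
    be called consistently across replicates, weak ones less so. Assumes peaks
    have already been matched across replicates (shared identifiers)."""

    def top_n_agreement(rep1, rep2, step=100):
        """rep1/rep2: lists of (peak_id, signal). Returns (n, fraction of the
        top-n peaks shared between the two replicates) for increasing n."""
        r1 = [p for p, _ in sorted(rep1, key=lambda x: x[1], reverse=True)]
        r2 = [p for p, _ in sorted(rep2, key=lambda x: x[1], reverse=True)]
        out = []
        for n in range(step, min(len(r1), len(r2)) + 1, step):
            shared = len(set(r1[:n]) & set(r2[:n]))
            out.append((n, shared / n))
        return out

    # Made-up example: agreement is high among the strongest peaks and decays
    # as weaker, less reproducible peaks enter the lists.
    rep1 = [("p1", 90.0), ("p2", 75.0), ("p3", 40.0), ("p4", 12.0)]
    rep2 = [("p1", 88.0), ("p3", 50.0), ("p2", 47.0), ("p5", 10.0)]
    print(top_n_agreement(rep1, rep2, step=2))  # [(2, 0.5), (4, 0.75)]
    ```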

  16. Dave
    Posted September 7, 2012 at 7:32 am | Permalink

    Well, compared to several other manuscripts (including one by us), the number of peaks published by ENCODE in this particular paper is wildly, wildly high. I’m not talking about slight differences – 500 here, 500 there – I am talking about 5,000–10,000 additional peaks over previously published papers. We performed our analysis with MACS and several other commercial packages, with very similar results. Over-calling with MACS is therefore not the issue, and I understand very well all the problems with ChIP and ChIP-Seq. I just think the notion that a regular TF binds to the genome 25,000 times or more, and at practically every single gene, is incorrect. Over-calling inevitably leads one to this conclusion.
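
    One way to make this kind of discrepancy concrete is to ask what fraction of one peak set is recovered in the other. Below is a simplified sketch of such a comparison for two BED-style peak files; it is not the analysis from any of the papers being discussed, and the file names are hypothetical.

    ```python
    """Simplified comparison of two BED-style peak sets (chrom, start, end):
    how many peaks in one call set overlap at least one peak in the other?
    File names are hypothetical, not taken from any of the papers discussed."""
    from collections import defaultdict

    def read_bed(path):
        peaks = []
        with open(path) as fh:
            for line in fh:
                if not line.strip() or line.startswith(("#", "track")):
                    continue
                chrom, start, end = line.split()[:3]
                peaks.append((chrom, int(start), int(end)))
        return peaks

    def overlap_fraction(query, reference):
        """Fraction of query peaks overlapping at least one reference peak."""
        by_chrom = defaultdict(list)
        for chrom, start, end in reference:
            by_chrom[chrom].append((start, end))
        hits = 0
        for chrom, start, end in query:
            # Simple per-chromosome scan; fine for a one-off sanity check.
            if any(s < end and start < e for s, e in by_chrom[chrom]):
                hits += 1
        return hits / len(query) if query else 0.0

    set_a = read_bed("encode_tf_peaks.bed")     # hypothetical larger call set
    set_b = read_bed("published_tf_peaks.bed")  # hypothetical smaller call set
    print(len(set_a), len(set_b))
    print("fraction of set_b recovered in set_a:",
          round(overlap_fraction(set_b, set_a), 2))
    ```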

  17. Georgi Marinov
    Posted September 7, 2012 at 8:12 am | Permalink

    Well, without knowing which factor you’re talking about and in which cell line and by which group, I can’t comment further.

  18. Sven
    Posted September 7, 2012 at 10:46 am | Permalink

    Hi Dave,

    Just a comment on your notion that it’s unlikely that a particular TF binds to 25,000 sites or more: I think that really depends on the TF, as Georgi alluded to, and on its expression level. For example, we’ve ChIPped PU.1 in mouse macrophages about 30 times now, and get over 45,000 peaks at an FDR of 0.1%. On the other hand, in different B cell precursors, as well as a conditional cell line, which have lower concentrations of PU.1, we observe anywhere from 8,000 to ~45,000 peaks, depending on the nuclear concentration of the factor.

    On the other hand, DNA preparation and GC bias can be a major source of background peaks both in input and ChIP, see e.g. the input in Valouev, Nature Methods, 2008, which has 28,000 beautiful peaks just by itself, which also show up in the ChIPs in that paper, or the recent evaluation of modENCODE ChIP-Seq data sets by Chen, Nature Methods, 2012, which shows an amazing GC bias, leading to input and ChIP peaks at GC-rich regions (and no, this is *not* what all ChIP-Seq data looks like, we have looked at many others that don’t have these problems). Maybe the data set you looked at has similar biases, leading to erroneous peaks?
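
    A quick way to check for the kind of GC bias described here is to compare the GC content under the peaks with that of length-matched random windows from the same chromosomes. The sketch below is only a rough diagnostic along those lines; it assumes the pyfaidx package is installed, and the genome and peak file names are placeholders.

    ```python
    """Rough GC-bias diagnostic: GC content under peaks vs. length-matched
    random windows from the same chromosomes. File names are placeholders;
    assumes pyfaidx is installed (pip install pyfaidx)."""
    import random
    from pyfaidx import Fasta

    def gc_fraction(seq):
        seq = seq.upper()
        return (seq.count("G") + seq.count("C")) / max(len(seq), 1)

    genome = Fasta("genome.fa")           # placeholder genome FASTA
    peaks = []
    with open("peaks.bed") as fh:         # placeholder BED file of peak calls
        for line in fh:
            chrom, start, end = line.split()[:3]
            peaks.append((chrom, int(start), int(end)))

    peak_gc, background_gc = [], []
    for chrom, start, end in peaks:
        width = end - start
        peak_gc.append(gc_fraction(str(genome[chrom][start:end])))
        # Length-matched random window drawn from the same chromosome.
        rstart = random.randrange(0, len(genome[chrom]) - width)
        background_gc.append(gc_fraction(str(genome[chrom][rstart:rstart + width])))

    # A markedly higher mean GC in peaks than in background would be a red flag.
    print("mean GC in peaks:      %.3f" % (sum(peak_gc) / len(peak_gc)))
    print("mean GC in background: %.3f" % (sum(background_gc) / len(background_gc)))
    ```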

  19. Dave
    Posted September 7, 2012 at 2:08 pm | Permalink

    Hi Sven,

    Thanks for your comments. Of course I agree that the binding pattern of TFs is TF- and cell/tissue-dependent. The issue is that when one paper has 25,000 peaks and another has 3,000 peaks, and the ChIPs are performed in exactly the same cell line, there is a major, major problem somewhere.

  20. Georgi Marinov
    Posted September 7, 2012 at 3:49 pm | Permalink

    Was it the same dataset analyzed by both papers or different datasets?

    ChIP quality is highly variable.

  21. Sven
    Posted September 7, 2012 at 4:17 pm | Permalink

    @Dave,

    Oops, yeah, that does sound like there was a problem. Are the 3,000 peaks a subset of the 25,000, or more specifically, are they the “tallest” peaks only? Maybe the antibody lot was bad, or one of the wash buffers had much higher salt concentration than it should have, washing away most of the signal, leaving only the tallest peaks, and in a very clonal fashion (since then most of the IP’d DNA is gone). Yes, this has happened to us in the past. Frustrating.

  22. Silvia Onesti
    Posted September 9, 2012 at 6:26 am | Permalink

    Well, compared with the Higgs boson and Mars Curiosity hypes, this is not too bad… if this is the rule of the PR game today, I don’t see why basic biology should play it honest and end up left behind, so that all the public money then goes to high-energy physics or astrophysics or synthetic biology or some other “trendy” subject…

  23. Dave
    Posted September 11, 2012 at 4:19 pm | Permalink

    @Sven

    “….or one of the wash buffers had much higher salt concentration than it should have, washing away most of the signal, leaving only the tallest peaks, and in a very clonal fashion (since then most of the IP’d DNA is gone). Yes, this has happened to us in the past”

    LOL, really? Clutching at straws maybe?

    You did not mention the (perhaps more) obvious issue that maybe…..just maybe…..the paper reporting 25,000 peaks is incorrect or that a large number of their peaks are background. Perhaps their washing was not stringent enough? Perhaps their antibody was not good? Perhaps their analysis was off?

    I would favor the last one given that we have re-analyzed the exact same data using our pipeline and get data which is much more consistent with previously published work. Still think the wash buffer is to blame?

  24. THEMAYAN
    Posted September 14, 2012 at 3:13 pm | Permalink

    It seems so many scientists over the years have dug themselves into a hole by preaching this useless vestigial junk paradigm to students and to the general public, and now it has come back to bite them in the ass! While it may be true that in the earlier days there were a small handful who found at least some hints of function within this ncDNA, it is also true that their research was largely ignored by the status quo, in much the same way Barbara McClintock’s work on maize and transposons was ignored for many years, until she was finally redeemed decades later.

    It seems that for many, it was more important that we instead use this useless junk DNA paradigm as a poster child for bad design and as a tool in a culture war. Kenneth Miller and Richard Dawkins (just to name a few) both claimed it made sense from an evolutionary perspective and was predicted by evolutionary theory.

    I have heard many on other blogs cite Ryan Gregory and Larry Moran as the guys who are correcting everyone else, which is interesting, because on the one hand you have Gregory admonishing others who claim that we were surprised at the amount of function, when he rightfully claims that we have known this for years; yet at the same time, as I have said before, he refuses to acknowledge how much of this work was ignored by the status quo.

    Eddy/Rivas lab says…….”So when you read a Mike Eisen saying “those damn ENCODE people, we already knew noncoding DNA was functional”, and a Larry Moran saying “those damn ENCODE people, there is too a lot of junk DNA”, they aren’t contradicting each other. They’re talking about different (sometimes overlapping) fractions of human DNA. About 1% of it is coding. Something like 1-4% is currently expected to be regulatory noncoding DNA given what we know (and our knowledge about regulatory sites is especially incomplete)”

    I couldn’t find those specific quotes. So either the good people at Eddy/Rivas are lying, misinformed, or don’t know how to use quotation marks, or I’m too dumb to find these specific quotes myself. I’ll leave that as an open question, but I do have confidence the quotes below are accurate.

    Mike Eisen says …”The hundreds of researchers working on the ENCODE project have revealed that much of what has been called ‘junk DNA’ in the human genome is actually a massive control panel with millions of switches regulating the activity of our genes. Without these switches, genes would not work – and mutations in these regions might lead to human disease. The new information delivered by ENCODE is so comprehensive and complex that it has given rise to a new publishing model in which electronic documents and datasets are interconnected”

    Larry Moran’s response to this is as follows: “Here’s the interesting thing. Many of us are upset about the press releases and the PR because we don’t think the ENCODE data disproves junk DNA. Michael Eisen’s perspective is entirely different. He’s upset because, according to him, junk DNA was discredited years ago.”
    Moran also goes on to say: “Eisen is wrong, junk DNA is alive and well. In fact almost 90% of our genome is junk.”
    It seems even the science critics can’t agree on what they are arguing about. Maybe this is not as simple a black-and-white argument as many are trying to portray it. Maybe trying to use a “one size fits all” definition of function on the scale of the universe of complexity which resides within the genome is much too simple an approach and expectation.

  25. Claudio Slamovits
    Posted September 24, 2012 at 9:35 am | Permalink

    @themayan
    Eddy is right in his reading of Mike’s and Larry’s positions (ok, he didn’t use quotation marks properly, but I don’t think that’s relevant to the point). Larry’s estimation of the proportion of junk may be over the top, but he is essentially right. Same for Mike (in the opposite sense). Reasonable people can understand each other if they want to.

    I’m pretty sure that Mike KNOWS that not all the ncDNA is regulatory; he only cares about the regulatory part. Larry perhaps has a more integrative vision of the genome and so he visualises it in a way such that he is bothered by generalisations. The whole point is that the PR punchline was poorly chosen by ENCODE, and it ended up doing more harm than good.
