When a photo caption says it all

This is from a kind of silly article in the NYT about how people are generating too much DNA sequence data and we can’t really deal with the deluge. They get lots of smart people (many of whom are my friends) to talk about this problem – but I think they’re making a mountain out of a molehill. Other fields (anything involving the capture of lots of images – like cell biology or astronomy) are swimming in far bigger seas of data, and don’t make a big deal out of it in the NYT. But, whatever, a bit of harmless whining never hurt anyone.

I just thought it was particularly funny that an article that I felt just didn’t get it would be capped by a photo that demonstrated that they didn’t get it – that is not a picture of “some cells”, but rather a picture of an Illumina sequencing flow cell, which is only, you know, the whole subject of the article.

UPDATE: My friend Lior Pachter expressed my sentiments about the article perfectly on FB:

A compressed genome can be stored in a few Mb of data, something like the size of a decent quality photo taken with a new cell phone. For perspective I think that we are currently taking about 1/3 trillion photos per year. The fact that the data is currently being stored in its redundant raw format is either gross stupidity or purposeful scamming by charlatans trying to make a quick buck selling disks, it's hard to tell which. Frankly, unless the animal is endangered, shouldn't we just go back and sequence it in 3 years for the price of a USB key? Why is all this crappy Illumina sequence filled with errors being stored in the first place? It is true there are significant and interesting and challenging computational problems related to high-throughput sequencing, but they have to do with coupling relevant and statistically sound analyses with interesting biological experiments.
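
For rough perspective on those numbers, here is a back-of-envelope sketch in Python. The figures are ballpark assumptions rather than measurements: a human genome is roughly 3 billion bases, which packs into something like 750 MB at 2 bits per base, and if you keep only the few million positions where an individual differs from a reference, you are down to tens of MB even before real compression.

```python
# Back-of-envelope genome storage arithmetic.
# All constants are rough, illustrative assumptions -- not measurements.

GENOME_BASES = 3.1e9         # approximate length of the human genome
VARIANTS_PER_GENOME = 4.5e6  # rough count of sites differing from a reference
BYTES_PER_VARIANT = 8        # assumed cost of a simply-encoded variant record
RAW_RUN_BYTES = 200e9        # ~200 GB, a rough size for raw 30x short-read FASTQ

two_bit_packed = GENOME_BASES * 2 / 8                      # whole genome at 2 bits/base
reference_diff = VARIANTS_PER_GENOME * BYTES_PER_VARIANT   # store only the differences

def fmt(n):
    """Human-readable byte count."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if n < 1024:
            return f"{n:.1f} {unit}"
        n /= 1024
    return f"{n:.1f} PB"

print("raw reads (rough 30x FASTQ):   ", fmt(RAW_RUN_BYTES))
print("assembled genome, 2-bit packed:", fmt(two_bit_packed))
print("differences vs. a reference:   ", fmt(reference_diff))
```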


5 Comments

  1. Posted December 1, 2011 at 5:12 pm

    Next you’ll see that caption, “examining some cells,” under a picture of a scientist using Excel.

  2. Cedar
    Posted December 2, 2011 at 1:36 am

    I was assembling 150GB of Illumina reads just today, so I feel like I have enough expertise on this topic to take a position. Both the thrust of the NYT article (OMG, too much data!) and the Eisen/Pachter rebuttal (quit yer whining, you whiners!) have their flaws. But of the two, I find Eisen/Pachter to be the more ignorant and short-sighted, and frankly condescending. Ascribing our struggles to process this torrent of data to stupidity (or worse, apparently!) is willfully ungenerous, and the analogy to images is total crap. Both Eisen and the NYT seem to think it is a data storage issue. Not really. A stack of cheap 3TB drives with Illumina runs on them is adequate for storage, and it isn't really a big deal. The struggle is in processing these data in a way that extracts useful information. Computers that were adequate last year to deal with last year's Illumina runs are now inadequate. Two months ago we bought a 48-core machine with 256GB of RAM, and frankly I'd like more RAM, and I'd take more cores too. The ecosystem of software tools for assembling short reads is moving so fast and has so many competing and cooperating pieces that it is really hard to come up with a state-of-the-art pipeline that stays state of the art for more than a month. Even steeped in it, I feel like I barely have a handle on what is going on, and how to handle our data.

    Biologists will need to switch effort and money from sequencing to processing, but that doesn't mean processing these data is a trivial task. I think it is possible that a lot of smart biologists with a lot to offer will not be able to create in-house processing pipelines, and may fall behind. Bioinformatics may become a big-lab-only game, one that requires a staff of computer scientists and programmers. Maybe that isn't a bad thing, but it surely isn't a "molehill".

  3. Cedar
    Posted December 2, 2011 at 1:51 am

    BTW, I hated that photo too, but for a different reason. I’d bet that McCombie hasn’t been at a bench in years, and that the NYT photo-op may well be the first time he has ever held a flow cell in his life. That may be a fine photo for his website, but news photographs are supposed to inform first, and this photo fails that test (made worse by the ridiculous caption). A picture of a tech or a student with a flow cell would be a more realistic picture of how research actually happens.

  4. Posted December 7, 2011 at 3:59 pm

    Two months ago we bought a 48-core machine with 256GB of RAM, and frankly I'd like more RAM, and I'd take more cores too.

    Individual investigators are very foolish if they buy machines like this. Institutional high-performance computing clusters with hundreds of multi-core nodes and/or cloud computing infrastructure is the only way to go.

  5. Michael Eisen
    Posted December 7, 2011 at 8:12 pm

    Individual investigators are very foolish if they buy machines like this. Institutional high-performance computing clusters with hundreds of multi-core nodes and/or cloud computing infrastructure is the only way to go.

    I agree completely. Though not every institution has an HPCC available, and the economics of cloud computing don't always make sense. Cloud disk space is weirdly expensive relative to what it costs to buy and maintain the hardware locally, and since a lot of what we do requires large amounts of storage, that shifts the economics a bit.
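
    As a very rough illustration of that storage-economics point, here is a sketch in Python. The prices and sizes below are placeholder assumptions for the sake of the arithmetic, not actual vendor quotes, so plug in real numbers before drawing any conclusions.

    ```python
    # Toy comparison of cloud vs. local storage cost.
    # Every constant below is an illustrative placeholder, not real pricing.

    DATA_TB = 50                # assumed lab data footprint, in terabytes
    CLOUD_PER_GB_MONTH = 0.10   # assumed cloud storage price, $/GB/month
    DRIVE_TB = 3                # capacity of one commodity drive
    DRIVE_COST = 150.0          # assumed purchase price per drive, $
    LIFETIME_MONTHS = 36        # amortization period for local hardware
    OVERHEAD_FACTOR = 2.0       # assumed multiplier for power, admin, redundancy

    cloud_monthly = DATA_TB * 1000 * CLOUD_PER_GB_MONTH
    drives_needed = -(-DATA_TB // DRIVE_TB)  # ceiling division
    local_monthly = drives_needed * DRIVE_COST * OVERHEAD_FACTOR / LIFETIME_MONTHS

    print(f"cloud:  ~${cloud_monthly:,.0f}/month to hold {DATA_TB} TB")
    print(f"local:  ~${local_monthly:,.0f}/month to hold {DATA_TB} TB "
          f"({drives_needed} drives, amortized)")
    ```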
