Friday, February 11, 2011

The "Data Deluge" and DNA

The current issue of Science has a special series of articles on the "data deluge", a problem currently affecting numerous fields of science, including genomics. In short, the amount of data is outstripping our analytical capacity, due to a lack of both computational power and manpower.

Naturally there are articles about the data deluge and genomics in the issue.

One by Scott D. Kahn of Illumina entitled "On the Future of Genomic Data" [link] is in great part about the meaning of "raw data" in genomics. He explains that "raw data" in next-gen sequencing is being defined downstream of the actual raw data (the images off the machines, which are the ultimate true raw data): either as the sequence reads translated from those images or as the variations from the reference. He explains well that these definitions exist largely because the actual raw data amounts to an enormous volume of data that is by and large unnecessary.

Another, entitled "Will Computers Crash Genomics?" by Elizabeth Pennisi [link], discusses two major issues. First, it emphasizes Lincoln Stein's view that funding agencies have underfunded analysis in favor of data production, and that if this doesn't change we'll be in for some tough times because there will be far too much data for our bioinformatics infrastructure to support. Second, it discusses cloud computing as a potential solution to the genomics data deluge (while warning about the privacy issues that solution brings with it).

Both articles are well written and astute. I think together they emphasize a lot of the issues related to our data deluge problem in genomics. I think the Pennisi article in particular puts the focus in an important place: That bioinformatics as a field is behind our data production capacity and, thus far, does not appear to be catching up at an adequate rate. That may be good news for bioinformaticists in genomics like myself, but it's not good news for genomics as a field. (A humorous note: the word "bioinformaticists" comes up on my spell checker as not existing. So does "bioinformaticians". That speaks volumes.)

The Analytical Deluge

I think one thing that the articles hint at but don't really touch on in any depth is the advancement and dissemination of analytical approaches. Specifically, analytical approaches have advanced significantly, but not at a fast enough pace to keep up with data production. I estimate that the amount of sequence data in the world is currently growing exponentially, while analytical approaches have, in contrast, advanced at a plodding pace. Most of these advances stem from a handful of heavily funded institutes that produce the most data and thereby require the most robust analytical approaches.

We can look just over the past two years at how significantly alignment and variant calling have improved, for example. While these advances have been a major boon, they've also made current analyses nearly incomparable with old analyses. If we want to compare a genome we've just recently processed and analyzed with one from two years ago, it's not really a fair comparison unless we go re-analyze that two-year-old dataset using current tools. This is an issue we encountered with the U87MG genome. We compared our variant calls to the Watson and YanHuang genomes and ended up with a huge number of differences. But most of them can probably be attributed to each project using different sequencing platforms with different alignment algorithms and different variant calling algorithms with different settings. We can't be expected to go obtain all these huge data sets ourselves and re-analyze them to match our current projects. We have neither the infrastructure nor the manpower (read: funding) for that.
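To make that kind of comparison concrete, here is a minimal sketch of how two variant call sets can be tallied against each other. Everything in it is an assumption for illustration: the `Variant` tuple, the helper names, and the toy calls are made up, and this is not the pipeline we used for U87MG. Note that an exact-match comparison like this cannot tell a real biological difference from one introduced by a different aligner or variant caller, which is exactly the problem.

```python
from typing import Dict, NamedTuple, Set


class Variant(NamedTuple):
    """A single variant call relative to the reference (hypothetical minimal form)."""
    chrom: str
    pos: int
    ref: str
    alt: str


def compare_calls(a: Set[Variant], b: Set[Variant]) -> Dict[str, int]:
    """Count calls shared by both sets and calls unique to each set."""
    return {
        "shared": len(a & b),
        "only_a": len(a - b),
        "only_b": len(b - a),
    }


def concordance(a: Set[Variant], b: Set[Variant]) -> float:
    """Fraction of all distinct calls that appear in both sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Toy data, invented for this example only.
ours = {
    Variant("chr1", 100, "A", "G"),
    Variant("chr1", 200, "C", "T"),
    Variant("chr2", 50, "G", "A"),
}
theirs = {
    Variant("chr1", 100, "A", "G"),
    Variant("chr2", 50, "G", "C"),   # same site, different allele called
    Variant("chr3", 10, "T", "C"),
}

summary = compare_calls(ours, theirs)
print(summary)
print(f"concordance: {concordance(ours, theirs):.2f}")
```

A low concordance from a tally like this says nothing by itself about whose calls are right; only re-running both datasets through the same tools with the same settings would make the numbers comparable.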

I will say the community (or, particularly, the 1000 Genomes Project) has done a nice job pushing standards that will help us use larger portions of the world's genomic data. However, a bit of that is self-fulfilling. The 1000 Genomes Project is the largest source of genomic data in the world right now (though the Beijing Genomics Institute may outpace them in the future) and, no surprise, the alignment algorithm (BWA), variant caller (GATK) and even the formats of the data (SAM for alignments and VCF for variants) used by most genomic scientists today were created by them.

Are they the best ways of doing things? Certainly not. I think even the authors of said programs will admit there will be better ways of doing these analyses even in the not so distant future (and it's not unlikely that these same people may be the ones who develop them). But it takes a lot of work to develop programs of this sort and, honestly, there is neither enough funding nor enough people to get it done quickly.

And I would say that's the problem. Who's going to go back and bring the old data up to the current standard every time a new and better analysis comes along? Do we just leave that data to the back issues of Nature and proceed with new data? I think not.

So it comes full circle, really. We need to keep more than just a list of variants relative to the reference genome. That's not adequate for reanalysis. At this point, I think it's safe to say we won't be squeezing anything more useful out of the images off the machines, but the raw read data is probably as far as we can go for the foreseeable future if we want our data to stay relevant.

But the available resources for storing that data are not yet adequate. The Sequence Read Archive (SRA) is a good attempt, but it is difficult to use and navigate, and its future is likely limited given that it would need essentially unlimited capacity to expand. Clouds offer a cost-effective alternative, but storing personal genomic data on a company-owned computer system definitely rubs the medical and research community the wrong way.

So what do I offer as a solution? The answer is the same answer for nearly any problem of this sort:


Anyone who's applied for a grant in bioinformatics can tell you how insanely difficult it can be to get funding for such projects.

Just try getting funding for a project to establish a standard format for structural variation calling (because, let's face it, the current VCF attempt at it is not good enough). Try getting funding to write an assembly algorithm that doesn't take either three months or 96GB of RAM to run on a human genome (wait, does that even exist yet?). Or just write a grant about sequencing twenty cancers and do it anyway with that funding, because that's much more likely to get funded.

It's like Chris Ponting says in Elizabeth Pennisi's Science article: There needs to be a priority shift for funding in academia towards more bioinformatics.

Then again, we can see already that the big companies are scooping up as many bright, young bioinformaticists as they can. Perhaps we will be leaving these analyses to the corporate world. I can already see immense value in a company that solely develops the best software for specific bioinformatics needs--in fact, such companies already exist, including Novocraft, CLCBio, and many others.

But this leaves us with the issue of what to do with all our data. One of my fellow postdocs here at Stanford has about twenty 2TB external hard drives under his desk. I can tell you right now: That's no future for genomic data. Sure, snail mailing 10TB of data is faster (and, ironically, more secure) than sending it over the Internet, but even over USB 3.0, it takes a long time just to transfer that much data from the drive to an internal disk. Solid state drives are still prohibitively expensive, but ultimately that's what we want to be using (as a large portion of our data analysis time right now is spent reading and writing to disks!). Meanwhile, labs can't be expected to buy hard drives forever, nor to rely on cloud "solutions" that are potentially insecure and often rely on snail-mailing disks.
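To put rough numbers on those transfer times, here is a back-of-the-envelope sketch. The sustained rates are assumptions chosen purely for illustration (a 100 Mbit/s network link, and roughly 100 MB/s for an external hard drive, where the spinning disk rather than the USB bus is the bottleneck); they are not measurements, and real throughput varies a lot.

```python
def transfer_hours(size_tb: float, rate_mb_per_s: float) -> float:
    """Hours needed to move size_tb terabytes at a sustained rate in MB/s."""
    size_mb = size_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    return size_mb / rate_mb_per_s / 3600


# Assumed sustained rates in MB/s (illustrative only, not measured).
ASSUMED_RATES_MB_PER_S = {
    "100 Mbit/s network link": 12.5,
    "external hard drive over USB (disk-limited)": 100.0,
}

for label, rate in ASSUMED_RATES_MB_PER_S.items():
    hours = transfer_hours(10, rate)
    print(f"10TB via {label}: {hours:.1f} hours ({hours / 24:.1f} days)")
```

Under these assumptions, the network copy of 10TB takes over a week of continuous transfer, while the local copy off the drive still takes more than a day, and that is before any analysis has started.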

And then there's the problem I mentioned above: What about bringing old data up-to-date for the sake of comparisons? We have a ton of data already that isn't commonly being used because it's "too old", though in actuality there's nothing wrong with it and it could easily be brought up to date given the disk space, manpower and time to do it.

I'd love to see someone write a grant to the effect of: "We're going to take all the world's genome sequencing data and keep it up-to-date with the latest analytical techniques." I'd love to see that project exist and get funded. Maybe I'll write it.

Thanks for reading! So what do you think of this "data deluge" problem? Is it a problem at all in genomics?


  1. Kind of funny: I just found a link through Twitter where somebody is saying it looks like NIH may drop SRA altogether.

    Called it.

  2. This was indeed a prescient post! The SRA demise still lacks an official announcement, but my take is here;
    Whilst governments baulk at funding bioinformatics and the private sector engages in a venture-backed land grab, a third sector is busily at work; the open source collaborative development community historically gets a lot done with very little, e.g.