Tuesday, June 1, 2010

Genome Studies: Where Do We Go From Here?

As I prepare to write my PhD dissertation, I have been reflecting on the state of genomics, and particularly on publishing in genomics. A question I sometimes get asked, which surprises me every time, is: “What is the point of sequencing the whole genome?” I admit, the first time I was shocked. But I tried to think about it from this other biologist’s perspective. From his point of view, sequencing a whole genome was all it really took to publish in a major journal. Sequencing an entire genome is no simple task, but it is closer to “data production” than to experimentation. Someone like him needs to go through ten or more individual, unique experiments to establish what a particular mutation in a particular gene is doing in a mouse before he can publish. I think biological scientists generally prefer hypothesis-driven experimentation: answering a question by performing experiments. They see sequencing as just one big experiment. And maybe it is, but it also yields an enormous amount of information, which makes it very different from a single experiment of any other type.

In our sequencing of the U87MG cell line, we tried to derive some biological relevance from the sequence, and we did so by making general observations about the genome. To some biologists, general observations like these can feel lacking. Partly for this reason, much of the field is moving toward whole exome sequencing, which targets the low-hanging fruit across a large number of samples rather than exploring the whole genome. These biologists want to supplement their more traditional experimental approaches with next-gen sequencing, but they don’t see the point of whole genomes.

I think there’s a great deal of merit to that, but I also think there are important, biologically relevant questions that cannot be answered with whole exome sequencing alone, and therefore I do think there is a need to perform whole genome studies in some cases. It all depends on the question being asked.

A Whole Genome Sequencing Project

What were the observations we made about the U87MG sequence?

One thing we did not find was an abnormally large number of small variants. To the contrary, U87MG appeared quite consistent with the relatively few other genomes that had been published up to that point in terms of the sheer number of small variations. The vast majority of small variants were in the dbSNP database, and of those that weren’t, about 10% overlapped with the YanHuang (first Chinese) genome. A whopping 50% of the single nucleotide variants (SNVs, including those in dbSNP) were shared with YanHuang.
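For readers curious how overlap figures like these are computed, here is a minimal sketch in Python. It is not the pipeline we actually used: the function names, the tuple representation of a variant, and the toy data are all made up for illustration, and a real comparison would also have to handle genome build, allele normalization, and coverage differences.

```python
# Illustrative sketch only: variants are hypothetical (chrom, pos, ref, alt) tuples,
# and the toy call sets below are invented, not real U87MG, dbSNP, or YanHuang data.

def shared_fraction(sample_snvs, other_snvs):
    """Return the fraction of sample_snvs that also appear in other_snvs."""
    sample = set(sample_snvs)
    if not sample:
        return 0.0
    return len(sample & set(other_snvs)) / len(sample)


def novel_variants(sample_snvs, known_snvs):
    """Return the variants in sample_snvs that are absent from known_snvs."""
    return set(sample_snvs) - set(known_snvs)


# Toy call sets (assumed data for the example).
sample_calls = [
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 75, "G", "A"),
    ("chr3", 10, "T", "C"),
]
known_db = [("chr1", 100, "A", "G"), ("chr2", 75, "G", "A")]
other_genome = [("chr1", 100, "A", "G"), ("chr1", 250, "C", "T")]

# Fraction of all sample SNVs shared with the other genome.
overall_shared = shared_fraction(sample_calls, other_genome)

# Fraction of SNVs missing from the database that the other genome still shares.
novel = novel_variants(sample_calls, known_db)
novel_shared = shared_fraction(novel, other_genome)

print(f"shared overall: {overall_shared:.0%}, shared among novel: {novel_shared:.0%}")
```

Treating each variant as an exact (chromosome, position, reference, alternate) match keeps the comparison simple, at the cost of missing variants that are represented differently in the two call sets.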

U87MG being a cell line derived from a Caucasian glioma, I can say honestly that I was surprised not to see a significantly higher number of small variations in its genome. We were certainly hoping, with it being the first cancer cell line to be whole genome sequenced, that we would find something striking in terms of variation.

And then we did: structural variations. The genome of U87MG is well known for its cytogenetic aberrations, but the resolution of karyotyping, SKY, and microarrays is not high enough to accurately elucidate the complexity of those aberrations. Sequencing, on the other hand, can visualize these events at much higher resolution, and the picture we gained suggested a more complex mechanism for structural variation than merely large genomic breaking and rejoining events. For example, we saw small regions of one chromosome nestled between large fragments of other chromosomes, facilitating a major translocation.

But again, what did it mean biologically? With the sequence alone, it’s hard to say much about it. We described our observations: that structural variation appears to be the major mode of mutation in this cancer cell line, and that these structural variations are often more complex than previously thought. I do think there’s a great deal of value to these observations, but we definitely wanted to find a smoking gun: a new cancer gene like IDH1, for example.

Now, all of that said, U87MG’s genome is important for other reasons even without a “smoking gun”. Specifically, it is one of the most used cancer cell lines, and certainly the most used brain cancer cell line. It is very often used as an in vitro model for brain cancer. Tons of papers have been published using U87MG. For any of those papers, going back and looking at the sequence, which is now freely available, could very possibly reveal new and important insight into the genetic etiology of a given phenotype, such as drug response. As for the U87MG sequencing project itself, getting the sequencing done, aligning the data, calling and annotating variants, and validating them was a huge effort on its own, done by a fairly small group for a relatively small amount of money.

The sequencing data itself from our U87MG project is being used by other genomic scientists in their studies, as it’s an easily accessible, freely obtainable cancer sequence. The advantages of that could fill another blog post, but suffice it to say, it will continue to produce results for years after we published it, which makes the whole thing more than worthwhile.

The U87MG project and paper were not merely for the sake of sequencing a genome, however. They were a showcase of genomic technology and analytical techniques. That was the tail end of a time when every group doing genomic analysis used its own techniques and designed its own analytical tools to get everything done. For our part, we used the SOLiD system, and were the first group outside of Life Technologies itself to publish a whole genome using it. We used an alignment algorithm written by a member of the lab (BFAST, by Nils Homer), annotated variants using a database written by another member of the lab (SeqWare, by Brian O’Connor), and called structural variations using a program I wrote myself (Breakway).

Where Do We Go From Here?

So what is the next step? Some people, particularly from the larger genome sequencing groups, feel that publishing on single whole genomes is no longer worthwhile. Sequence ten whole genomes, or a hundred, or a thousand, and then publish. But I don’t see this as the only route for performing biologically meaningful whole genome studies.

In the case of cancer, for example, we have already shown in one cancer cell line sample that cytogenetic abnormalities are the major mutational mechanism. If this holds true for other cancer samples, whole exome sequencing alone isn’t going to be enough, because it doesn’t generally resolve those events. I think there is going to be a lot of power in paired sequencing of patient tumor and normal samples for the sake of discovering novel tumor mutations. Ironically, I see that as a major source of low-hanging fruit. By supplementing those experiments with other biological experiments (assays to test the effect of detected mutations in cell lines, for example), there is an excellent chance of detecting cancer mutations that would be very unlikely to be detected through other means.
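The core idea behind paired tumor/normal sequencing can be sketched in a few lines of Python: anything called in the tumor but absent from the same patient’s normal sample is a candidate tumor-specific mutation. This is only a toy illustration under simplifying assumptions. The function name, the variant tuples, and the data are hypothetical, and a real somatic caller would also weigh read depth, base quality, and tumor purity before trusting a call.

```python
# Illustrative sketch only: variants are hypothetical (chrom, pos, ref, alt) tuples.

def somatic_candidates(tumor_calls, normal_calls):
    """Return tumor variants not seen in the matched normal sample, sorted."""
    normal = set(normal_calls)
    return sorted(v for v in set(tumor_calls) if v not in normal)


# Toy call sets from one imaginary patient (assumed data for the example).
tumor = [
    ("chr2", 209113112, "C", "T"),
    ("chr7", 5000, "G", "A"),
    ("chr9", 120, "T", "C"),
]
matched_normal = [("chr7", 5000, "G", "A"), ("chr9", 120, "T", "C")]

# Variants shared with the normal sample are treated as inherited and dropped.
candidates = somatic_candidates(tumor, matched_normal)
print(candidates)
```

The matched normal does the filtering work that would otherwise require a large database of known germline variants, which is why a single pair can already be informative.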

As for other genetic disorders, whole genome sequencing is perhaps not practical yet, but I think it will ultimately be the way to go once it’s more affordable and analysis is streamlined. Autism, for example, may take thousands of whole genomes before we really understand its genetic causes. Fortunately for us, a time is coming soon when sequencing will actually be that affordable.

The value of a single genome sequence depends on the experiment at hand. It may only take a single paired tumor/normal sample being whole genome sequenced to detect the next major cancer gene. Doubtless it will take just a few individuals being sequenced to identify the cause of many Mendelian disorders. But it may take thousands of whole genomes to figure out autism or atherosclerosis. It’s likely we’ll be rewarded not by relying solely on whole genome sequencing for those experiments, but by combining whole genome studies with other omics techniques: analyzing metabolic flux, assessing the whole inflammatory response, ChIP-seq, RNA-seq, et cetera. It may be that analyzing the entire biological system through all these means, and putting it all in the context of whole genome sequences, yields the answers to many of these complex disorders.