Thursday, October 13, 2011

Exome Sequencing Q&A

In case you missed it, the most recent issue of Genome Biology is focused on exome-sequencing. Included among a number of exome-seq platform comparison papers (like this one and this one, which I intend to discuss in comparison with my own recent publication in Nature Biotech in a future post) is a fun Q&A article entitled "Exome sequencing: the expert view". The article asked three experts (Dr. Leslie G. Biesecker of NHGRI, Dr. Kevin V. Shianna of Duke University, and Dr. Jim C. Mullikin of NHGRI) a series of questions related to exome-seq ranging from what we've learned to whether exome-seq will be around in a couple years.

Since I like to think of myself as a bit of an expert on the topic (mostly because I dedicated the past year of my life to it), I thought it might be fun to answer the questions myself and discuss a bit what I think of the answers in the article.

How can exome sequencing contribute to our understanding of the dynamic nature of the genome?
Exome sequencing is quite literally the best way we currently have to assess the most interpretable part of the genome: the protein-coding sequence. While it accounts for only about 1% of the whole genome, that 1% is the part we understand the best, because it's the part that we can assess through the central dogma. Basically, when we see a mutation in an exon, we can then determine how the RNA will be affected, and subsequently how the protein will be mutated. We can determine precisely which amino acids will be mutated, and which amino acids they will turn into. And thanks to decades of work on proteins and amino acids, we can predict fairly reasonably how damaging a particular amino acid change will be. We can also predict nonsense mutations (which almost always result in loss of the protein), frameshifts (which typically lead to a nonsense mutation), splice site variations, and regulatory site mutations (assuming the UTRs/upstream and downstream regions are enriched).

I put it like this to someone not long ago: What would you look at if you sequenced someone's genome? Exons. That is literally the first thing, because it's the low-hanging fruit. If you find nothing there, you might zoom out and look at splice sites (which exome-seq generally captures), microRNAs (which exome-seq generally captures), and regulatory elements (which are becoming recognized as increasingly important--be ready for regulatory element/transcription factor binding site enrichment kits in the near future). As for the intergenic regions, they're such a mystery with regard to function that I feel secure saying we basically don't know what to do with variants in those regions.

At the very least, I have no doubt exome-sequencing will drive a revolution in Mendelian disorder research (it already is). We've already seen a number of solved cases published, and as an increasing number of samples are sequenced, that will only improve. Because Mendelians are most often caused by changes in the protein-coding region of genes, exome-seq is a prime way of solving them. The one major barrier I see for solving Mendelians is the (perhaps surprising) prevalence of dominant low-penetrance disorders. Mendelian disorders with this mode of inheritance are inherently difficult to solve, and will require a large number of samples sequenced. That said, this again primes exome-seq as a major method, and I feel secures it as a major technology for at least a few more years.

In the paper, Kevin Shianna mentions some shortcomings. In particular, he mentions that structural variations (SVs) are difficult to detect by exome-seq, which is certainly true. I would say there is some light at the end of that tunnel, at least with regard to copy-number variations (CNVs), as early work is already showing positive results in detecting CNVs from exome-seq data.

How much has exome sequencing been driven by cost alone?
Certainly I think the increasing prevalence of exome-seq has been almost completely driven by cost. Targeted sequencing (and subsequently, exome-seq) was developed precisely because labs wanted to assess particular regions by sequencing without paying for an entire genome sequence (most of whose data they wouldn't even look at). That said, were the exome-seq and WGS assays the same cost, there would still be significantly higher cost, in terms of both time and money, associated with the bioinformatics, analysis and storage of WGS data compared with exome-seq data. Even at equivalent assay cost, exome-seq is cheaper. Will that ever change? Unlikely, I think.

Consider the value of an exome versus the value of a whole genome currently. One could certainly argue that because the information in the exome is so much more meaningful (please don't write me letters for saying that), every exome base is worth significantly more than an intergenic base. In the paper, LGB and JCM make the point that an exome costs about 1/6th what a whole genome costs. That seems pretty accurate to me. And that does demonstrate how much more valuable we consider the exome compared to the other regions: we pay a sixth of the price for about 1% of the bases. Because of that, I would say exome-seq has been driven not by cost alone, but also by the fact that exome data is considered more valuable per base than whole genome data.

What are the major limitations of exome sequencing?
Obviously, any meaningful variation outside of the exome is missed by exome-seq, and that's a major drawback. Just because they're less interpretable does not mean that non-exonic variations are meaningless. Missing SVs is another large drawback, although I feel it's important to point out that SV detection does not require a deep whole genome sequencing experiment--a low depth WGS that is not adequate for small variant calling is more than adequate for SV calling. Therefore, exome-seq could be supplemented by low depth WGS for SV calling without going as far as doing a full WGS. I'd estimate the cost of an exome-seq plus low-depth SV-WGS to still be less than 1/4th the cost of a full WGS.

The other major issue, which is also brought up in the paper, is the fact that exome-seq misses almost everything it doesn't target. If the mutation causing a particular disorder is in an exon that doesn't happen to be on your target list, you're going to miss it. WGS, on the other hand, would likely catch it. That's a major limitation. However, if the exon is poorly annotated and that's why it's not on the exome-seq design, there's a good chance you'll miss it with WGS as well. So while this is a major limitation, it's as much a limitation of sequencing in general: mutational analysis is heavily reliant on annotation, which is incomplete at best.

What lessons from exome sequencing studies can be applied to whole genome sequencing studies?
Most of what we've learned from exome sequencing studies is directly applicable to whole genome sequencing studies. Currently, the field is in a state where exomic variation is highly detectable and interpretable, and that's due in no small part to exome-seq. Coding region annotation in particular has become significantly more powerful as a result of all the exome-seq work that's been done. We have a much stronger appreciation for the limitations of sequencing as well. Although we are very sensitively detecting variants, even at minuscule false positive rates we have trouble detecting disease-causing variants with single samples. As far as I'm aware, not a single Mendelian disorder has been solved by sequencing of a single sample (exome or otherwise), because we need some way to eliminate false positives, and we use family members and additional samples for that.

And that brings up the other major lesson we've learned that we'll need to apply to WGS in the future: Small sample groups are simply not adequate for most purposes. Even for monogenic conditions, numerous samples and multiple families are required. Much like GWAS and linkage studies, we're going to need to sequence a lot of people to find meaning. This has a lot to do with sensitivity and specificity. A false positive rate of 1% sounds okay in a lot of fields, but in genomics, that's 400 false positives in the exome and 30,000 in the whole genome. Good luck finding your one disease-causing mutation in there! But we're doing it through clever decision-making with regard to which family members to sequence, which individuals to sequence, et cetera. And that is perhaps the most useful lesson from exome-seq so far.
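To make the false positive arithmetic concrete, here is a minimal sketch. It reads the 1% figure as "1% of all variant calls are wrong", and the call counts (40,000 for an exome, 3,000,000 for a whole genome) are assumptions back-derived from the 400 and 30,000 figures above, not measured values.

```python
from fractions import Fraction

# Assumed call counts, back-derived from the figures in the text
# (400 / 1% and 30,000 / 1%). Illustrative only, not measured values.
ASSUMED_EXOME_CALLS = 40_000
ASSUMED_GENOME_CALLS = 3_000_000


def expected_false_calls(total_calls: int, error_fraction: Fraction) -> int:
    """Number of calls expected to be wrong if error_fraction of all calls are false."""
    return int(total_calls * error_fraction)


one_percent = Fraction(1, 100)
exome_false = expected_false_calls(ASSUMED_EXOME_CALLS, one_percent)
genome_false = expected_false_calls(ASSUMED_GENOME_CALLS, one_percent)

print(f"exome:  {exome_false:,} false calls hiding one causal variant")
print(f"genome: {genome_false:,} false calls hiding one causal variant")
```

Under the same assumed call counts, even a tenfold better error fraction of 0.1% would still leave 40 and 3,000 false calls respectively, which is why filtering against family members and additional samples matters so much.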

How do exome sequencing studies contribute to our mechanistic understanding of disease?
The most obvious answer is that exome-seq is very likely to be how we find the genetic etiology of the vast majority of Mendelian disorders. How this will translate to complex disease has yet to be seen, though it's certainly being pursued, particularly in the field of autism.

I would like to respond to a statement made by Kevin Shianna in the paper:
However, exome studies will have very limited power to identify causative variants in regulatory regions spread across the genome (transcription binding sites, enhancers, and so on). Implementing a WGS approach would allow detection of variants in these regions, thus increasing our knowledge of disease beyond the coding region of the genome.
While it is certainly true that exome-seq may be unable to identify mutations in the regulome (yes, regulome), WGS is not the only alternative to it. Why not instead supplement the exome-seq with a regulome-seq? I guarantee you companies are going to create such a thing, and even if you don't want to wait for one to become commercially available, you could create a custom target enrichment that covers the regulome as well.

Back to the cost issue, at about 1/6th the cost of WGS each, exome-seq + regulome-seq would still only come to about 1/3rd the cost of WGS. Outside the exome and regulome, is there much more in the genome that we can even comprehend by sequencing at this moment? Structural variation, perhaps, but as I've already mentioned, SVs can be detected with a low depth WGS. So for half the cost, we may be able to obtain the same biological meaning. That's something to seriously consider when deciding how to budget your sequencing. If you can squeeze out double the samples and obtain practically the same info by doing exome-seq + regulome-seq + low depth WGS, wouldn't you?
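The budget arithmetic can be sketched in a few lines. This is a rough illustration only: all costs are expressed as fractions of one full WGS, the regulome-seq price is the same 1/6th guess used above (no such kit is priced yet), and the low-depth WGS figure is an upper bound derived from the earlier estimate that exome-seq plus low-depth WGS comes to less than 1/4th of a full WGS.

```python
from fractions import Fraction

# All costs are relative to one full-depth WGS (= 1).
WGS = Fraction(1)
EXOME = Fraction(1, 6)      # figure quoted from the paper
REGULOME = Fraction(1, 6)   # assumption: priced like an exome capture

# Upper bound implied by "exome + low-depth WGS < 1/4 of a full WGS".
LOW_DEPTH_WGS_MAX = Fraction(1, 4) - EXOME

capture_total = EXOME + REGULOME                  # exome + regulome
combined_max = capture_total + LOW_DEPTH_WGS_MAX  # plus low-depth WGS
samples_per_wgs_budget = WGS / combined_max       # samples per one-WGS budget

print(f"exome + regulome:            {capture_total} of a WGS")
print(f"... plus low-depth WGS:      at most {combined_max} of a WGS")
print(f"samples per one-WGS budget:  at least {float(samples_per_wgs_budget):.1f}")
```

With these assumptions the three-assay combination costs at most 5/12 of a full WGS, so "half the cost" and "double the samples" are, if anything, slightly conservative.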

I would also add that while it may not be mechanistic, the rise of exome-seq has led to a realization that we are not yet adequately prepared from a policy standpoint. More than anything, I think it has led to a general realization that medical genetics, genetic counselors and DTC genomics are arriving before society is really ready for them. And that's an issue that I think is going to need increased attention as we researchers charge forward with sequencing everybody.

Does exome sequencing have a limited 'shelf life'?
Not as limited as many people think. Even in the $1000 genome world, exome-seq will still have a place. I'll discuss this a bit more shortly, but there are genomic elements that are more sensitively determined by exome-seq. I feel strongly that once sequencing is sufficiently cheap, exome-seq will become the de facto standard for genomic information much as microarrays were for a good decade. This is due to a combination of factors--the majority of the genome being uninterpretable, the storage cost of genomes versus exomes, the bioinformatic challenges of whole genome versus whole exome, et cetera. The article goes into good depth on many of those issues. The one I'd focus on, however, is that until WGS is actually cheaper than exome-seq, exome-seq will always have a place in our field.

I have to disagree strongly with the following statement:
Yes, as soon as the difference in cost between exome and whole genome diminishes (which will be soon) and issues with data management and storage are resolved, whole genome sequencing will be the method of choice.
First of all, he knows as well as the rest of us that "issues with data management and storage" are not trivial to resolve. But beyond that, this is the stance that says, basically, "what's a thousand dollars for a whole genome compared to $300 for just the exome?" I state these numbers because that's where we're heading. At some point exome-seq will stop getting cheaper because the enrichment assay will always cost something. Same for WGS. And my answer to that question is simple: You'll still be able to get three exomes for the price of one whole genome.

The real thing that will limit exome-seq's lifespan is the interpretability of the rest of the genome. If we can utilize the intergenic data to a greater degree, then those bases become more valuable. Then WGS becomes more appealing. Until then, it's all exome (and regulome--I swear, it is coming soon!).

How much do you think that future research will be restricted by the IT-related costs of the analysis?
A great deal. Clouds are not cheap. I'm not convinced they're even the answer at this point. But are clusters? I'm not so sure there, either. I'm not convinced we have the IT solution to this yet because, frankly, I don't think the hardware manufacturers have realized what a lucrative market it's going to be.

I had a conversation about this with somebody not long ago. I said, we've gone from producing gigabytes to terabytes to petabytes over the span of four years in the sequencing world. By now I think there's a good chance we're at or nearing a worldwide exabyte of genomic information. As we start sequencing more and more people, we're going to start hitting the limit for storage capacity worldwide. Sounds crazy, right? But is it?
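As a rough sanity check on the scale, here is a back-of-envelope sketch of the absolute minimum needed to store one genome per person. It assumes about 3.1 billion bases per haploid genome, a bare 2 bits per base (no reads, quality scores, or metadata), and a world population of about 7 billion. All three are simplifying assumptions, and real sequencing output is far larger per person.

```python
# Back-of-envelope lower bound; every constant here is a simplifying assumption.
GENOME_BASES = 3_100_000_000     # approx. haploid human genome length
BITS_PER_BASE = 2                # A/C/G/T only, no quality scores or reads
WORLD_POPULATION = 7_000_000_000 # approx. 2011 population
BYTES_PER_EXABYTE = 10**18       # decimal exabyte

bytes_per_genome = GENOME_BASES * BITS_PER_BASE // 8
total_bytes = bytes_per_genome * WORLD_POPULATION
total_exabytes = total_bytes / BYTES_PER_EXABYTE

print(f"per genome: {bytes_per_genome / 10**6:.0f} MB")
print(f"everyone:   {total_exabytes:.1f} EB")
```

Even this stripped-down encoding lands in exabyte territory, before counting raw reads, quality scores, or redundant copies.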

I also had another rather humorous thought that evolved from that one. How much energy will the number of hard drives needed to store the entire human population's genomic data require? How much heat will those drives produce? In the future, could sequencing be a major cause of global warming? Think about it!

Frankly, I've started to take the issue seriously. One of the tasks on my plate is a thorough assessment of alternative storage formats for genomic data. But that's a post for another day.

Are there any advantages of whole exome over whole genome sequencing?
Yes. Let me say that in less uncertain terms:
And by that I don't mean what Leslie Biesecker and Jim Mullikin said in their response in the paper. There are literally exonic regions that are better resolved by exome-seq than WGS. We demonstrate that in our recent Nature Biotech paper. A typical WGS will miss a small but meaningful number of exonic variations that are detected by exome-seq. To be fair, the opposite is also true: WGS will detect some exonic variations missed by exome-seq. To me, that says one thing: To be truly comprehensive at this point in time, we need to do both. Naturally, budgets prevent such a thing, but it's important to recognize that targeted enrichment can allow sequencing of regions missed by WGS, and that this is an advantage of exome-seq.


Biesecker LG, Shianna KV, Mullikin JC. Exome sequencing: the expert view. Genome Biol. 2011 Sep 14; 12(9):128 [Epub]