Thursday, October 13, 2011

Exome Sequencing Q&A

In case you missed it, the most recent issue of Genome Biology is focused on exome sequencing. Included among a number of exome-seq platform comparison papers (like this one and this one, which I intend to discuss alongside my own recent publication in Nature Biotech in a future post) is a fun Q&A article entitled "Exome sequencing: the expert view". The article asked three experts (Dr. Leslie G. Biesecker of NHGRI, Dr. Kevin V. Shianna of Duke University, and Dr. Jim C. Mullikin of NHGRI) a series of questions related to exome-seq, ranging from what we've learned to whether exome-seq will still be around in a couple of years.

Since I like to think of myself as a bit of an expert on the topic (mostly because I dedicated the past year of my life to it), I thought it might be fun to answer the questions myself and discuss a bit what I think of the answers in the article.

How can exome sequencing contribute to our understanding of the dynamic nature of the genome?
Exome sequencing is quite literally the best way we currently have to assess the most interpretable part of the genome: the protein-coding sequence. While it accounts for only about 1% of the whole genome, that 1% is the part we understand best, because it's the part we can interpret through the central dogma. Basically, when we see a mutation in an exon, we can determine how the RNA will be affected, and subsequently how the protein will be mutated. We can determine precisely which amino acids will be changed, and what they will change into. And thanks to decades of work on proteins and amino acids, we can predict fairly reasonably how damaging a particular amino acid change will be. We can also identify nonsense mutations (which almost always result in loss of the protein), frameshifts (which typically lead to a premature stop), splice-site variants, and regulatory-site mutations (assuming the UTRs and upstream/downstream regions are enriched).
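To make that concrete, here's a minimal sketch of how a single-base change in a codon translates directly into a protein-level call. The codon table is the standard genetic code; the `classify_snv` helper is my own illustrative function, not anyone's pipeline:

```python
from itertools import product

# Standard genetic code, built from the canonical TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def classify_snv(codon, pos, alt):
    """Classify a single-base substitution at position pos (0-2) of a codon."""
    ref_aa = CODON_TABLE[codon]
    mutant = codon[:pos] + alt + codon[pos + 1:]
    alt_aa = CODON_TABLE[mutant]
    if alt_aa == ref_aa:
        return f"synonymous ({ref_aa})"
    if alt_aa == "*":
        return f"nonsense ({ref_aa}->stop)"
    return f"missense ({ref_aa}->{alt_aa})"

print(classify_snv("TGG", 1, "A"))  # TGG (Trp) -> TAG (stop): nonsense
print(classify_snv("GAG", 0, "A"))  # GAG (Glu) -> AAG (Lys): missense
```

This is the whole point about interpretability: no other 1% of the genome lets you go from a single base change to a mechanistic consequence with a lookup table.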

I put it like this to someone not long ago: What would you look at if you sequenced someone's genome? Exons. That is literally the first thing, because it's the low-hanging fruit. If you find nothing there, you might zoom out and look at splice sites (which exome-seq generally captures), microRNAs (which exome-seq generally captures), and regulatory elements (which are becoming recognized as increasingly important--be ready for regulatory element/transcription factor binding site enrichment kits in the near future). As for the intergenic regions, they're such a mystery with regard to function that I feel secure saying we basically don't know what to do with variants in those regions.

At the very least, I have no doubt exome sequencing will drive a revolution in Mendelian disorder research (it already is). We've already seen a number of Mendelian disorders solved this way, and as an increasing number of samples are sequenced, that will only improve. Because Mendelian disorders are most often caused by changes in the protein-coding regions of genes, exome-seq is a prime way of solving them. The one major barrier I see is the (perhaps surprising) prevalence of dominant, low-penetrance disorders. Mendelian disorders with this mode of inheritance are inherently difficult to solve and will require a large number of sequenced samples. That said, this again makes exome-seq a prime method, and I feel it secures exome-seq as a major technology for at least a few more years.

In the paper, Kevin Shianna mentions some shortcomings. In particular, he notes that structural variations (SVs) are difficult to detect by exome-seq, which is certainly true. I would say there is some light at the end of that tunnel, at least with regard to copy-number variations (CNVs), as early work is already showing promising results for detecting CNVs from exome-seq data.

How much has exome sequencing been driven by cost alone?
Certainly I think the increasing prevalence of exome-seq has been almost completely driven by cost. Targeted sequencing (and subsequently exome-seq) was developed precisely because labs wanted to assess particular regions by sequencing without paying for an entire genome sequence, most of which they wouldn't even look at. That said, even if the exome-seq and WGS assays cost the same, the bioinformatics, analysis, and storage of WGS data would still cost significantly more, in both time and money, than for exome-seq data. Even at equivalent assay cost, exome-seq is cheaper. Will that ever change? Unlikely, I think.

Consider the value of an exome versus the value of a whole genome currently. One could certainly argue that because the information in the exome is so much more meaningful (please don't write me letters for saying that), every exome base is worth significantly more than an intergenic base. In the paper, LGB and JCM make the point that an exome costs about 1/6th what a whole genome costs. That seems pretty accurate to me. And that does demonstrate how much more valuable we consider the exome compared to the other regions. So I would not say that exome-seq has been driven by cost alone, but also by the fact that the exome is considered more valuable per base than the genome as a whole.

What are the major limitations of exome sequencing?
Obviously, any meaningful variation outside of the exome is missed by exome-seq, and that's a major drawback. Just because they're less interpretable does not mean that non-exonic variations are meaningless. Missing SVs is another large drawback, although I feel it's important to point out that SV detection does not require a deep whole-genome sequencing experiment--a low-depth WGS that is not adequate for small variant calling is more than adequate for SV calling. Therefore, exome-seq could be supplemented by low-depth WGS for SV calling without going as far as doing a full WGS. I'd estimate the cost of an exome-seq plus a low-depth SV-WGS to still be less than 1/4th the cost of a full WGS.

The other major issue, which is also brought up in the paper, is the fact that exome-seq misses almost everything it doesn't target. If the mutation causing a particular disorder is in an exon that doesn't happen to be on your target list, you're going to miss it. WGS, on the other hand, would likely catch it. That's a major limitation. However, if the exon is poorly annotated and that's why it's not on the exome-seq design, there's a good chance you'll miss it with WGS as well. So while this is a major limitation of exome-seq, it's just as much a limitation of sequencing in general: mutational analysis is heavily reliant on annotation, which is incomplete at best.

What lessons from exome sequencing studies can be applied to whole genome sequencing studies?
Most of what we've learned from exome sequencing studies is directly applicable to whole genome sequencing studies. Currently, the field is in a state where exomic variation is highly detectable and interpretable, and that's due in no small part to exome-seq. Coding region annotation in particular has become significantly more powerful as a result of all the exome-seq work that's been done. We also have a much stronger appreciation for the limitations of sequencing. Although we can detect variants very sensitively, even at minuscule false positive rates we have trouble pinpointing disease-causing variants in single samples. As far as I'm aware, not a single Mendelian disorder has been solved by sequencing of a single sample (exome or otherwise). We need some way to eliminate false positives, and we use family members and additional samples for that.

And that brings up the other major lesson we've learned that we'll need to apply to WGS in the future: small sample groups are simply not adequate for most purposes. Even for monogenic conditions, numerous samples and multiple families are required. Much like GWAS and linkage studies, we're going to need to sequence a lot of people to find meaning. This has a lot to do with sensitivity and specificity. A false positive rate of 1% sounds okay in a lot of fields, but in genomics, that's 400 false positives in the exome and 30,000 in the whole genome. Good luck finding your one disease-causing mutation in there! But we're doing it through clever decision-making about which family members to sequence, which individuals to sequence, et cetera. And that is perhaps the most useful lesson from exome-seq so far.
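For the curious, the arithmetic behind those false positive numbers is simple. The variant counts below are round assumptions of my own (~40,000 calls per exome, ~3 million per genome), not measurements:

```python
def expected_false_positives(n_calls, fp_rate):
    """Expected number of spurious variant calls at a given per-call FP rate."""
    return round(n_calls * fp_rate)

# Round assumptions, not measurements: typical call counts per individual.
EXOME_CALLS = 40_000       # roughly the number of variants called in an exome
GENOME_CALLS = 3_000_000   # roughly the number called in a whole genome

exome_fp = expected_false_positives(EXOME_CALLS, 0.01)    # 400
genome_fp = expected_false_positives(GENOME_CALLS, 0.01)  # 30,000
```

Note that even a "good" 1% error rate leaves you with a haystack either way; the family members are what shrink the haystack.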

How do exome sequencing studies contribute to our mechanistic understanding of disease?
The most obvious answer is that exome-seq is very likely to be how we find the genetic etiology of the vast majority of Mendelian disorders. How this will translate to complex disease has yet to be seen, though it's certainly being pursued, particularly in the field of autism.

I would like to respond to a statement made by Kevin Shianna in the paper:
However, exome studies will have very limited power to identify causative variants in regulatory regions spread across the genome (transcription binding sites, enhancers, and so on). Implementing a WGS approach would allow detection of variants in these regions, thus increasing our knowledge of disease beyond the coding region of the genome.
While it is certainly true that exome-seq may be unable to identify mutations in the regulome (yes, regulome), WGS is not the only alternative to it. Why not instead supplement the exome-seq with a regulome-seq? I guarantee you companies are going to create such a thing, and even if you don't want to wait for them to be commercially available, one could create a custom target enrichment that covers the regulome as well.

Back to the cost issue, at about 1/6th the cost of WGS each, exome-seq + regulome-seq would still only come to about 1/3rd the cost of WGS. Outside the exome and regulome, is there much more in the genome that we can even comprehend by sequencing at this moment? Structural variation, perhaps, but as I've already mentioned, SVs can be detected with a low depth WGS. So for half the cost, we may be able to obtain the same biological meaning. That's something to seriously consider when deciding how to budget your sequencing. If you can squeeze out double the samples and obtain practically the same info by doing exome-seq + regulome-seq + low depth WGS, wouldn't you?
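As a rough sketch of that budgeting argument: the 1/6th exome figure comes from the article, while the regulome and low-depth WGS fractions are my own assumptions for illustration only:

```python
# Relative assay costs, normalizing a full-depth WGS to 1.0. The 1/6th
# exome figure is from the article; the regulome and low-depth WGS
# fractions are rough assumptions.
WGS = 1.0
EXOME = WGS / 6
REGULOME = WGS / 6        # hypothetical regulome enrichment kit
LOW_DEPTH_WGS = WGS / 6   # enough depth for SV/CNV calling only

combined = EXOME + REGULOME + LOW_DEPTH_WGS   # half the cost of a full WGS
samples_per_budget = WGS / combined           # two samples per WGS budget
```

Under these assumptions you get roughly double the samples for the same spend, which is exactly the trade-off posed above.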

I would also add that while it may not be mechanistic, the rise of exome-seq has led to a realization that we are not yet adequately prepared from a policy standpoint. More than anything, I think it has led to a general realization that medical genetics, genetic counseling, and DTC genomics are arriving before society is really ready for them. And that's an issue that I think is going to need increased attention as we researchers charge forward with sequencing everybody.

Does exome sequencing have a limited 'shelf life'?
Not as limited as many people think. Even in the $1000 genome world, exome-seq will still have a place. I'll discuss this a bit more shortly, but there are genomic elements that are more sensitively determined by exome-seq. I feel strongly that once sequencing is sufficiently cheap, exome-seq will become the de facto standard for genomic information, much as microarrays were for a good decade. This is due to a combination of factors--the majority of the genome being uninterpretable, the storage cost of genomes versus exomes, the bioinformatic challenges of whole genome versus whole exome, et cetera. The article goes into good depth on many of those issues. The one I'd focus on, however, is that until WGS is actually cheaper than exome-seq, exome-seq will always have a place in our field.

I have to disagree strongly with the following statement:
Yes, as soon as the difference in cost between exome and whole genome diminishes (which will be soon) and issues with data management and storage are resolved, whole genome sequencing will be the method of choice.
First of all, he knows as well as the rest of us that "issues with data management and storage" are not trivial to resolve. But beyond that, this is the stance that says, basically, "what's a thousand dollars for a whole genome compared to $300 for just the exome"? I state these numbers because that's where we're heading. At some point exome-seq will stop getting cheaper because the enrichment assay will always cost something. Same for WGS. And my answer to that question is simple: You'll still be able to get three exomes for the price of one whole genome.

The real thing that will limit exome-seq's lifespan is the interpretability of the rest of the genome. If we can utilize the intergenic data to a greater degree, then those bases become more valuable. Then WGS becomes more appealing. Until then, it's all exome (and regulome--I swear, it is coming soon!).

How much do you think that future research will be restricted by the IT-related costs of the analysis?
A great deal. Clouds are not cheap. I'm not convinced they're even the answer at this point. But are clusters? I'm not so sure there, either. I'm not convinced we have the IT solution to this yet because, frankly, I don't think the hardware manufacturers have realized what a lucrative market it's going to be.

I had a conversation about this with somebody not long ago. I said, we've gone from producing gigabytes to terabytes to petabytes over the span of four years in the sequencing world. By now I think there's a good chance we're at or nearing a worldwide exabyte of genomic information. As we start sequencing more and more people, we're going to start hitting the limit for storage capacity worldwide. Sounds crazy, right? But is it?
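To put a number on "crazy", here's the back-of-envelope version. Both figures are rough assumptions of mine (roughly 100 GB of aligned read data per genome, and sequencing everyone on Earth), not measurements:

```python
# Back-of-envelope only: both numbers below are rough assumptions.
BYTES_PER_GENOME = 100e9   # ~100 GB of aligned read data per deep genome
POPULATION = 7e9           # world population, circa 2011

total_bytes = BYTES_PER_GENOME * POPULATION
total_exabytes = total_bytes / 1e18
print(f"{total_exabytes:.0f} exabytes")  # 700 exabytes
```

Hundreds of exabytes just for the raw alignments of one genome each. Suddenly worldwide storage capacity doesn't sound like such a crazy constraint.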

I also had another rather humorous thought that evolved from that one. How much energy will the number of hard drives needed to store the entire human population's genomic data require? How much heat will those drives produce? In the future, could sequencing be a major cause of global warming? Think about it!

Frankly, I've started to take the issue seriously. One of the tasks on my plate is a thorough assessment of alternative storage formats for genomic data. But that's a post for another day.

Are there any advantages of whole exome over whole genome sequencing?
Yes. Let me say that in no uncertain terms: yes.
And by that I don't mean what Leslie Biesecker and Jim Mullikin said in their response in the paper. There are literally exonic regions that are better resolved by exome-seq than WGS. We demonstrate that in our recent Nature Biotech paper. A typical WGS will miss a small but meaningful number of exonic variations that are detected by exome-seq. To be fair, the opposite is also true: WGS will detect some exonic variations missed by exome-seq. To me, that says one thing: To be truly comprehensive at this point in time, we need to do both. Naturally, budgets prevent such a thing, but it's important to recognize that targeted enrichment can allow sequencing of regions missed by WGS, and that this is an advantage of exome-seq.


Biesecker LG, Shianna KV, Mullikin JC. Exome sequencing: the expert view. Genome Biol. 2011 Sep 14;12(9):128. [Epub]


  1. Nice post, Michael. I look forward to your discussion of your recent Nature Biotech paper.

    Re: "As far as I'm aware, not a single Mendelian disorder has been solved by sequencing of a single sample (exome or otherwise)". I think this may have happened (or be happening), with the caveat that this approach is less likely to work outside of consanguineous families.

    See "Reducing the exome search space for Mendelian diseases using genetic linkage analysis of exome genotypes" by Smith et al. (Genome Biol. 2011 Sep 14;12(9):R85) for an example of how analysis of a single sample may be feasible in these consanguineous families.

  2. Thanks for your comment, Pete.

    Even that paper does not solve a Mendelian with a single sample as far as I can see. They get down to an admirable 65 nonsynonymous exonic variants in a linkage peak in one of the recessive individuals, 605 in the other single-individual recessive and 2,478 in the two-person dominant. This is certainly not a laughable feat by any means--it's quite impressive. But still, without another individual, how do you go from that list of 65 to the one true causative mutation?

    I think this certainly brings up an important point regarding study design, though: should we just sequence every individual we get our hands on? Isn't there usually a much better study design than that?

    In the case of this study, for example, would anything be gained from sequencing a parent? Probably not much.

    But if there were another affected cousin? Or another individual with the disorder anywhere? Sequencing those people would probably solve this disorder.

    So yes, sequencing a single individual and applying linkage strategies narrowed down the candidates a great deal. But how much more would having a second cousin with the disorder exome-sequenced have helped? I've been dealing with this myself. When you aren't working on the offspring of first cousins, the number of samples needed ramps up quite quickly. Especially if the disorder may be caused by different genes in different families, or if it's low penetrance dominant, etc.

  3. Actually, on second thought, sequencing another cousin might not gain you much. But sequencing an unrelated individual with the same disorder may.

  4. I should have added a disclaimer that I am a colleague of the authors of the paper I linked to. I'm not certain, but the causal variant may have been identified in one or more of these families and may be awaiting a separate publication (but don't hold me to that).

    In studies such as these (recessive disease, consanguineous family) I agree that sequencing another family member may not add much power. This is particularly true if there is a linkage region because apart from a few private mutations (and sequencing errors/artefacts), any putative variants found in the linkage region will be common to both affected individuals. So it's probably only going to reduce your list of variants by a small amount.

    On the other hand, an unrelated individual with the same disorder can be very helpful, however there are caveats to that approach too (as we've learnt the hard way). We've had cases of allelic heterogeneity (by which I mean different mutations in the same gene across families) and locus heterogeneity (by which I mean mutations in different genes across families). These are a real challenge and probably require some sort of analysis that takes into account gene-pathways or gene-networks.

    To return briefly to the point of WES vs. WGS, we have several pedigrees that have undergone WES but no causal variant has (yet) been identified. In many cases I don't think we would necessarily have found the variant had we done WGS (ignoring the cost considerations for the time being). I think it's often a case of needing more samples rather than more sequencing data for these projects.

    You make a great point about some of the current benefits of WES over WGS, specifically that it's really only variants in the exons (+ a few other bits) that we have much of a clue about. At the moment I'd definitely prefer to have additional samples with WES rather than fewer samples with WGS (of course, I work in the "Mendelian" world).

  5. You know, your comments about linkage and its continuing importance are very astute. I've actually been considering how we want to go about expanding a study on a dominant Mendelian we have here and which samples we really need to study. The problem is, with a low penetrance dominant, the families aren't inbred at all, and the number of candidates remains quite high.

    Still, I have to admit I haven't done linkage on a number of my Mendelian families yet. I absolutely should!

    The issues of allelic and locus heterogeneity are a major problem going forward with some of these. I think these issues are a bit under-recognized currently. But frankly, I think they're a major reason exome-seq is the solution to so many Mendelians. The Ng et al. Kabuki syndrome paper is a good example of that.

  6. I think that the problem of storing data is overhyped. If we just declare a reference genome forever (and we could do that right now--the reference doesn't need to be, and can never be, correct, just something everyone can reference and has available), then you just report differences from the reference. If you compress and record quality or p-values, this reduces to literally tens of KB of data.

  7. Michael, do you have a reference for that?
    My impression was that while we can compress down to about 40% of current size by that method, retaining quality scores for the reads and any meta information still leaves them quite massive. 40% is nothing to shrug at, but it's not "10s of KB of data" when you're starting from gigabytes.

  8. Oh, that said, I'm open to suggestions about formats that will take my gigabytes of data and convert them to kilobytes. ;)

    So far I've been looking at CRAM and Goby and those will certainly save quite a bit of space (by doing precisely what you suggested), but not on the order of making them one one-millionth the size.