Friday, May 11, 2012

ANNOVAR Patching (May 11, 2012)

Making Annovar Work

Add support for Feb 1kG release



It turns out I had made a typo in the command and the Feb 2012 variants work fine! Just be very careful to have it as "1000g2012feb_all" in the annotation part.

Kai, the author of Annovar, explained to me that the 1kG is going to start releasing variants in ethnic subgroups in the future, thus the addition of the "_all" to these datasets now. Works for me! (see below)

Fix to 1kG annotation confusion

Also something to note is that the Annovar website is a bit unclear (in my opinion) about the naming scheme it uses for the databases from 1kG. To download them, one would use the naming scheme on the downloads page (e.g. "1000g2011may"). 

But line 136 in the annotate_variation.pl script enforces a naming scheme that doesn't match this. Rather, one would need to use "1000g2011may_all" or else the db isn't recognized as native and is treated as a "generic" db. 

Rather than patching a "fix" to that, I simply changed my queries to add the "_all" onto all of the 1kG annotations.

TFBS memory problem

I don't know if I'm experiencing this because of using a new cluster system (I think that's why) or if it's the new version of Annovar, but I was actually having annotation against transcription factor binding sites ("tfbs" in Annovar). Thankfully, there's now memory functions built into Annovar that seem to help this. 

I simply added "--memtotal 100000000" and it started working.

Not sure what happens if it actually tries to use more than that much memory--does it dump to a tmp file or crash? Not sure. But so far it's working with this much room.

What's kind of weird is no other database has this issue so far. Just TFBS of all things.

Monday, March 12, 2012

Cliff Reid on CG vs Illumina

Recently saw a post on the Complete Genomics website from Cliff Reid discussing our Nature Biotech paper (Lam et al, Dec 2011) in which we compared sequencing the same individual to high depth on both Complete Genomics and Illumina and compared them.

I think some of it is fair, but I do want to go a bit into it because it’s more complicated than just breaking it down by the Sanger validation rate.

Accuracy/Sensitivity

First off, Cliff addresses the fact that we found CG was more accurate than Illumina when it comes to SNP detection rate. To determine this, we used three small Sanger validation sets. We took a set of 20 SNPs detected by both platforms, a set of 15 detected only by Illumina and a set of 18 detected only by CG and Sanger sequenced them. Here’s how it broke down:


Number testedValidatedValidation rate
Both platforms2020100%
Illumina-specific15213.3%
CG-specific181794.4%


This is certainly pretty cut-and-dried—obviously the CG-specific variants that we were able to validate were accurate at a higher rate than the Illumina-specific ones. However, I think a fair criticism of this very experiment was the relatively small number of variants that were validated by Sanger.

The Complexities of the Experiment

I think we need to consider a bit more the problem of the very small number of SNPs that we validated in the Sanger sequencing and, perhaps more importantly, what those SNPs were and why we went ahead and used the SureSelect data instead to draw our conclusions.

The SNPs that were selected for Sanger validation included SNPs at every quality level that passed thresholds. We selected twenty variants from each set (concordant, Illumina-specific, and CG-specific) and tried to validate them by Sanger.

Please note the “number tested” in the above table doesn’t match with the 20/20/20 that I just said we designed. This is because for each of the platform-specific sets, we were unable to design primers that amplified product across seven of the SNPs, and those were indeed consistently “low quality” relative to the mean quality score for concordant variants (although, again, they passed threshold).

This is why we went forward with using the Agilent SureSelect targeted sequencing data for validation. Of course, we fully realized that such an assay would be potentially biased towards Illumina because the validation is being done on the same machine as the whole genome sequencing. That was, in fact, the reason we initially went for Sanger. But after sixty Sangers and realizing it would be hundreds more (think of that in terms of time, man-power and monetary cost) before we could generate anything really meaningful with Sanger, we decided the few we did had accomplished our goal of at least demonstrating that concordant SNPs are highly accurate but non-concordant are not and moved on.

In Cliff’s post, he extrapolates the number of platform-specific SNPs that would validate if the Sanger rates were correct across the board and concludes that CG is, in fact, more sensitive than Illumina. I caution against using this Sanger data this way because in the paper, we clearly utilized the SureSelect capture validation to make up for an inadequate Sanger experiment.

Recalculating

Regarding experimental design and how it gets conveyed in the literature. We tried to make it clear that the Sanger was only suggestive, not conclusive. I hope that got conveyed, but I’ll reiterate here that the Sanger data in that paper is limited in its usefulness because there simply isn’t enough of it and because of the lower mean quality score across those positions resulting.

We do, in fact, have a figure (Supplementary Figure 1) that demonstrates lower quality scores for the platform-specific SNPs from both platforms, and that played out in the validation. But the problem is that the sentence describing that is four paragraphs up from the Sanger validation paragraph and it isn’t linked in the text.

Here’s a quote from Cliff’s post that I want to respond to:
“The paper also points to the magnitude of the problem caused by validating the Illumina platform with the Illumina platform. The Sanger validation data can be used to estimate the confidence in the results of the target enrichment validation data. If the Illumina unique SNPs really were 64.3% true SNPs as reported, then the likelihood of getting the Sanger validation results (2 of 15 validated SNPs) is less than 1 in 10,000. While the exact Illumina SNP validation rate is unknown, the Sanger data tells us that we can be more than 99.99% confident that it is less than the 64.3% calculated by this biased validation approach. For these reasons, we believe 64.3% is not the correct number to use in calculating the sensitivity of the Illumina platform in this study.”

I want to stress the importance of Table 2 in the paper and how it  shows perhaps the most important information in the entire paper. Here’s a summary of it:


ValidatedInvalidatedNot validatedValidation rate
Concordant81.0%6.4%12.6%92.7%
Illumina-specific50.5%28.0%21.4%64.3%
CG-specific53.9%33.2%12.9%61.9%

“Validated” means those that were present in the whole genome sequencing and observed in the SureSelect data (true positives). “Invalidated” are those that were present in the WGS, but observed as false in the SureSelect data (false positives). And finally “not validated” are those that could not be detected adequately in the SureSelect data. The “validation rate” was determined by removing those “not validated” from the targeted count and then determining what percent of those were validated (not invalidated).

This is key. Those “not validated” do include quite a few of those “low quality” candidates that wouldn’t validate by Sanger either, of course. And those make up a relatively high proportion of the Illumina-specific SNPs (21.4% compared with CG-specific SNPs at 12.9%).

 Now go back to the total counts and extrapolate that value to actual variants counts. If 21.4% of Illumina-specific variants are of this ilk, that brings the Illumina-specific count down to 271,248 SNPs. If 12.9% of CG are of this ilk (which seems reasonable given both Sanger and SureSelect), the CG–specific count goes down to 86,732 SNPs.

If you’re following me, we’re now at variant counts that we can now attach our “validation rate” to determine the actual number of true positives in a way Cliff might approve of (but without using the inadequate Sanger data). Here are the results:


TotalExtrapolated "good" callsExtrapolated validation
Concordant3,295,0232,879,8502,669,621
Illumina-specific345,100271,248174,412
CG-specific99,57886,73253,687

This being the case, I think it’s clear why we say “Illumina was more sensitive.” I feel confident these are the numbers to use, and Illumina detected quite a few more total SNPs. However, it clearly has a higher error rate as well, so that can affect things downstream, as it’s not trivial to differentiate all those errors from the total.

As for the criticism regarding the chances of getting a 2/15 validation rate if the actual validation rate is 64.3%, for one thing I think we should use different numbers—in this case, 2/20 (the total number of those tested and validated by Sanger) and 50.5% (the total number tested and validated by SureSelect). Still, that detail aside, you’re still going to get a very small probability (e.g. a hypergeometric p of 0.0001).

But I can also look at it another way. What’s the probability of the CG-specific result doing the same thing? It’s okay, but it’s not that likely (p=0.11). Yet you’re talking about 53.9% and 13/20 there.

There’s two reasons for that:
1) 20 is a small number. With a ratio around 50-55%, unless you get 10 or 11 out of 20, you’re deviating pretty dramatically. In fact, the range for p > 0.05 with 20 pulls and a 53.9% ratio is only from 7 to 15. This is why I said we “would have had to do hundreds of Sangers”.
2) Sanger sequencing is different from SureSelect target enrichment sequencing anyway. It’s not a true subset in the first place, and is susceptible to sources of error that don’t affect next-gen sequencing.

Anyway, I don’t think that criticism is overly fair. Really, to me, it only supports that our Sanger data should not be used this way in the first place.

Finally, I don’t want to seem like I’m bashing CG here. To the contrary—CG performed exceptionally well in our paper. It is without a doubt from what I saw more accurate.

(Maybe I should write that paper where I add SOLiD into the mix…)

Friday, February 17, 2012

Exome Annotations

 I just posted a thread on 23andMe about which annotations I use for my exome data. Here's what I said:

I currently use Annovar for annotating VCF files. The output from Annovar is not particularly intuitive, so I wrote a perl script that generates a VCF-based report. I thought I would share the annotations I've been using and ones I plan to add, and see if anyone else has any other annotation ideas. These could be useful for us to annotate our own genomes (and potentially for 23andMe to provide in the future).

The annotations I've been including are:

  • Gene annotation (type of mutation--exonic, intronic, splicing, etc.)
  • Gene name
  • Mutational description (i.e. specific amino acid change, etc.)
  • dbSNP130
  • dbSNP135
  • WashU Exome Variant DB (EVS)
  • Transcription Factor Binding Site (TFBS)
  • SIFT score
  • PolyPhen 2 score (PP2)
  • GWAS presence
  • Segmental duplication

(The reason I include both dbSNP130 and 135 is that 135 contains quite a few SNPs that are potentially meaningful from a disease and trait standpoint while 130 is mostly markers not directly affecting diseases and traits. 130 is a subset of 135. Also, the EVS is potentially more useful than either of them as a filtering device.)

Ones that I would like to include in the future:
  • VAAST
  • MIE sites/scores (Mendelian inheritance errors)
  • 23andMe annotations (anything from 23andMe's SNP databases--can 23andMe help with that?)

Any other ideas for great annotations that should be included?

The idea behind these types of annotations is to give us a way to sift through the data and extract biologically meaningful results. For example, we are most interested in mutations that actually cause a protein coding change, that are uncommon in the population, and that are predicted to have a dramatic effect on function.

So far these types of annotations have allowed me to narrow very long lists of results in exomes (think on the order of 30-50,000 mutations)  down to just a handful (1-20) candidate mutations for particular Mendelian disorders.

Anything I missed?

Friday, February 10, 2012

Sequencing My Exome: Why?

“Why do you want to do this?”
My wife, immediately after I tell her I'm going to sequence my own exome.

There are a few times in life where you want to do something so badly, but find it difficult to convey to others why. This was one of those times. Such a simple question, but so many different answers. And each answer as valid as all the others. All of them coming together to explain why I would want to do something so, well, unusual.

I could frame a whole dissertation on the reasons behind wanting to sequence my own exome. (And I will.) But first, the simple answer:

I’m curious about myself.

I want to see if I can figure out why I am the way I am. And by that I mean both physically and mentally. For some people, this isn’t something they’ll ever think about. For others, they might see very clearly that they are this way because God made them this way, or because their parents raised them like this, or because they had bad luck. They may simply accept that they have specific features that make them who they are and aren’t concerned with why.

For me, those answers are not good enough.

I know that there are mysteries to be solved in my genetic code. It comes with the territory of being a geneticist. That said, though, almost everyone thinks this way, usually without even realizing it. We can all look at our own families as a proxy for genetics. If your mom had type 2 diabetes and your sister has type 2 diabetes and your uncle has type 2 diabetes, you’re pretty sure you have a higher chance of getting type 2 diabetes. You’ll hear all sorts of people saying that—“guess I got my mom’s bad genes” and “guess he took after his father” and so forth. If you’ve ever known somebody who got old enough, you might have heard her tell you about her mother lived to a ripe, old age and her mother before her, and so on. That’s what I call thinking genetic.

The difference for me is that I’m thinking genetic at a different level. I actually want to look at my genetics to try to explain these types of things. I don’t believe in fate without reason. If my mom lives to be 90, and my grandmother lived to be 90, I want to know if I got the mutations that helped them get there. Could I just shrug my shoulders, say, “I probably did,” and move on? Absolutely. But that’s just not good enough for me.

And really, it’s not good enough for anyone. We’re no longer entering the era where we can do better than that. We’re already there. Exome sequencing represents that first major step into the era.

I think, to make a case for exome sequencing (and by the way, whole genome sequencing is basically just an expansion and improvement upon exome sequencing—more on that later), I first need to explain what we can learn from it.  And to do that, first you’ll need to know what an exome is. For those readers who already know all about this, feel free to skip down.

What is an exome, anyway?

To understand what the exome is, you first have to understand what the genome is. There are massive tomes on the details of the subject, but to describe the genome succinctly:

The genome is the blueprint for every cell in your body.

Every single protein in every one of your cells is encoded on this massive blueprint. In order to create and maintain a cell (and therefore, your body and very being), your cell quite literally reads the genome and generates certain amounts and types of various proteins to fit the particular cell it’s trying to become or to fulfill a particular function.

The exome is a subset of the genome that contains the instruction to create the proteins themselves. The exome makes up about 1-2% of the whole genome. If the genome is the blueprint, then:

The exome is the instructions for making every protein in your body.

Therefore, being able to read those instructions means we can figure out if differences in them will result in different protein structures.

What can I learn from the exome?

Identifying variations in the exome that lead to differences in proteins (which we call mutations) can give us a direct way of determining if a protein might have altered function in us compared to other people. Significant protein mutations will manifest themselves as traits. To bring up an example from a previous post, the earwax trait is a result of a variation in the exome that leads to a mutation in a protein that results in determining if your earwax will be wet or dry. But it goes far beyond that type of “interesting” trait. Mendelian disorders (which this blog derives its title from) are disorders resulting from mutations in a single gene, which we can detect in the exome (and, in fact, quite a few Mendelian disorders have been “solved” through exome sequencing now).

By sequencing the exome, we can directly assess every line of the “instructions” and identify those lines that differ from the norm.

But that’s not the only way to use this information. We can hunt for mutations that damage our proteins, and that is the first obvious thing to do when looking at the exome. But we don’t know how every mutation will affect a person. To the contrary—there are very few mutations for which we understand the effect.
In fact, the current standard in personal genomics testing (such as that from DTC companies like 23andMe or through-physician companies like Navigenics) is actually an approach that dominated the field for about a decade before next-generation sequencing really became a reality. Using microarray technology, these approaches measure specific sites known to harbor variants in the genome that are associated with a trait or disease but typically not causative for the trait or disease.

For example, right now if you were to do a standard 23andMe test, you’d have genetic variations assessed at about a million sites across your genome. These variations would then be compared to a database that tells how strongly particular variations associate with particular traits or diseases. So 23andMe can tell me that I have a collection of variants that associate with type 2 diabetes, and it can calculate how that increases my risk of getting the disease compared to the average person.

This is more of a science than people often think. This is thinking genetic at a slightly more advanced level. I could simply turn to my family history and guess that I’m at an increased risk for type 2 diabetes. However, the fact that my genetics confirm the increased risk makes it much more “real” to me. Not only do I have a family history, I actually inherited some of those genetic factors. My risk is real.

Knowing my exome sequence takes that to the next level. Rather than simply having associations, I may be actually able to go into the regions of association and identify mutations causing these problems.

Moreover, as more and more information regarding the genetic causes of various traits and diseases are discovered, my exome sequence will always be at hand for me to cross-reference. Imagine that tomorrow a study is released identifying a gene that tells you with complete confidence whether or not you’ll get type 2 diabetes. I would check that gene in my own exome for mutations immediately!

That may sound unrealistic, but when it comes to conditions like cancer, these kinds of studies come out all the time. I may identify a random mutation in a gene that pre-disposes people to getting a particular type of cancer in my own genome, and then I will know that I need to have my doctor monitor for that. Having worked closely on brain cancer for a few years, it struck me that the reason it’s the deadliest type of cancer is because by the time we detect it, it’s already at a very advanced stage. But if we have a gene or set of genes that we know predisposes people to get malignant brain tumors, we could look in our own exomes for mutations in those genes and then get ourselves MRIs starting at a particular age to try to detect them earlier and hopefully allow effective, long-term treatement.

I think anyone can see how powerful that type of diagnostic and predictive tool can be.

And that brings up a major reason to sequence one’s genome: This information is immutable. Your exome is not changing. On the day you die, you’ve got pretty much the same exome and genome you had when you were born. If a major discovery is made tomorrow, I’ll have my exome to look at for it. If another discovery is made in ten years, I can take that same exome sequence and look for it. There’s no “expiration date” on that information.

And that’s what really sold me on the whole thing, actually. My intimate knowledge that my exome is always going to be a part of me, and that our understanding of genetics and diseases will always be expanding. That means my investment now is going to pay off for my whole life. Or at least until I sequence my whole genome.

I hope that conveys my major reasoning behind why I would want to do this. Of course there are other factors as well. For one thing, I am a geneticist. Genetics is not just my job, it’s my hobby. I love it. And over the years I’ve become increasingly interested in my own genetics. But that’s honestly not the only reason. At this point, I see it as a choice that will help me keep myself healthy throughout my life. 

I think there will be a shift generally towards that thinking in the medical community at large in the very near future as well. It may only be a couple years before your doctor suggests you get your exome sequenced as well. In a society where I feel most of us already think genetic, I think it's only a matter of time before we stop simply guessing that it's genetic and instead actually prove it. And beyond that, we actually figure out that there's something we can do about it. That is empowering right there.

Thursday, February 9, 2012

Enter the Exome





My Exome kit from 23andMe has arrived! In a few short weeks, I will have my exome sequence in hand and ready to analyze. I've looked at hundreds of exomes over the past year, but only now, when I'm about to look at my own, have I started to really think about how to extract meaning from a healthy individual's exome. All of the work I've done has either been to assess exome sequencing as a science or to hunt for mutations causing specific conditions (Mendelian Disorders and novel genetic syndromes).

Now I'm going to have my own sequence in hand and have a very basic yet exceedingly complex question to answer: What does this all mean?

And with that question comes other questions:

Why do I care about my own exome?

What can I learn from it?

What justifies the cost?

Is it safe?


In the coming days, I will be posting about my answers to these questions and more. I came to a realization a couple of days ago (right after I ordered this kit) that even questions that seem simple to me as a geneticist are not so simple for most people.

"Why do you want to do this?" is harder to answer than people may think. Off the cuff I might say, "I'm a geneticist, it's what I do!" but that isn't at all the whole answer. So I am going to make it a goal to explain why anyone would want to have his or her exome (or genome) sequenced in terms that hopefully anyone can understand.

Friday, December 9, 2011

GATK's available annotations

Perhaps because it changes too often, GATK's available annotations for VCF files does not seem to be online anywhere that I've seen. The GATK site says to run GATK with the "--list" parameter to list them. Doing that requires putting in valid input files and such. Basically, it's a pain.

So here's the list from GATK v1.3-21-gcb284ee


Available annotations for the VCF INFO field:
    ChromosomeCounts
    IndelType
    HardyWeinberg
    SpanningDeletions
    NBaseCount
    AlleleBalance
    MappingQualityZero
    LowMQ
    BaseCounts
    MVLikelihoodRatio
    InbreedingCoeff
    RMSMappingQuality
    TechnologyComposition
    HaplotypeScore
    SampleList
    QualByDepth
    FisherStrand
    SnpEff
    HomopolymerRun
    DepthOfCoverage
    MappingQualityZeroFraction
    GCContent
    MappingQualityRankSumTest
    ReadPosRankSumTest
    BaseQualityRankSumTest


Available annotations for the VCF FORMAT field:
    ReadDepthAndAllelicFractionBySample
    AlleleBalanceBySample
    DepthPerAlleleBySample
    MappingQualityZeroBySample


Available classes/groups of annotations:
    RodRequiringAnnotation
    StandardAnnotation
    WorkInProgressAnnotation
    ExperimentalAnnotation
    RankSumTest


No promises about how accurate this is for any other version.

Wednesday, November 16, 2011

Stop SOPA

I'm not a huge politics guy, so I don't want to go on a tirade about the Stop Online Piracy Act. Sufficed to say, it's a huge censorship bill parading as a bill to protect intellectual property. While I think the majority of us support protecting IP, I can't imagine the best way to do so is to monitor everything we do and censor websites based on some government-backed list of sensitive content.

As an example, if someone were to post copyrighted material in the comments section of my blog without my notice, my blog could potentially be shut down (censored) because of it. If I were to link to a site that had, somewhere on it, shared copyrighted material (even if I had no idea it was there and didn't intend for anyone to go there and see it or download it), my blog could be shut down (censored).

Moreover, the bill basically forces both providers and hosting services to strongly monitor content and shut down sites that potentially "infringe" on protected IP. Ever posted a picture of something that was copyrighted? Ever shared a link to a YouTube video with a copyrighted song in the background? That could be enough to get you shut down (because your host or provider doesn't want to be sued).

My wife and I often talk about the situation in Japan (she's Japanese) regarding the Fukushima nuclear reactor situation and how censorship and control of the media in Japan is so strong that the general public has no idea how dire the situation is. We even see that censorship and media control bleed over to here in America, where the general public is under the impression the nuclear situation isn't as bad as it is, in large part because the government and big media have an incentive to see nuclear power as an industry succeed.

Now here's the scary thing: If I were to start posting excerpts from copyrighted articles about that topic to respond to, if SOPA were to pass, my blog could potentially be shut down. I could personally be denied access. It's unclear to me exactly how much power this bill would give big media and the government. And that's the major problem.

In our field of genomics, a lot of us utilize freedom of sharing information and media to rapidly advance the science. I understand that this bill is meant to limit piracy of software and other digital media, but it represents a foot-in-the-door to all sorts of censorship. Could SEQanswers, for example, be sued for having a post up that contains the Illumina adaptor sequences? It certainly has been threatened in the past based on such things, but with SOPA passed, SEQanswers very well could have been shut down for that. What a detriment that would have been to the genomics and bioinformatics community.

Anyway, I just wanted to share this on my outlet to the world, as it is a very important issue generally and to our field in particular.

If you are an American and you do not support SOPA, please send a notice to your congresspeople telling them not to support it, either.