Beating #Election2012 on Election Day in the US. Human genetics has made it, people. Step aside, Charlie Sheen.

On Election Day 2012, what's the number two trending hashtag? #ASHG2012 baby! We've arrived! twitter.com/mjcgenetics/st…
— Michael James Clark (@mjcgenetics) November 7, 2012
Tuesday, November 6, 2012
Trending
Monday, November 5, 2012
Save the Square Watermelons
Also, I just realized I've had over 25,000 hits on this blog. Thanks for reading! I hope it's been useful and interesting!

NO on 37! Think about it, Californians. Are you telling me you can't figure this one out, @carighttoknow? #prop37 twitter.com/mjcgenetics/st…
— Michael James Clark (@mjcgenetics) November 6, 2012
Thursday, November 1, 2012
"Why I Don’t Want to Know My Genome Sequence"
"Why I Don't Want to Know My Genome Sequence"
It's interesting to read this as someone who has had his own exome sequenced and has learned a lot from it, especially in contrast to my own post explaining why I wanted to sequence myself.
It ends with a funny statement:
An osteoarthritis mutation manifested itself as an inability to play an F chord at age 33. A p53 mutation and then another that bloomed in response to years of orthodontia X-rays gave me thyroid cancer a few years after I gave up the guitar. And I don’t need a genetic test to know I didn’t inherit my father and grandfather’s psychotic depression.
Ron Crystal, even though he’s among the sequenced, has the right idea: don’t smoke, exercise, eat a healthy diet, and don’t worry about DNA sequences. That’s good enough for me – at least for now.

This is one of those statements used to make oneself feel better about not doing something he or she wants to do. That's my feeling, at least.
To be more specific, let's use myself.
Sure, I could have lived my life thinking my borderline migraines were from caffeine withdrawal and my mother, and not bothered to find out the exact mutations likely to cause them and therefore which drugs may actually have a beneficial effect on me. I could have. But I didn't.
I could have lived my life thinking my intestinal issues were due to a bad diet or food allergies. Because clearly a diet of copious salad, low fat, low carb, and low salt is a bad diet, right? And clearly my wife, who eats the same things and has zero gastrointestinal issues, is just lucky. Oh no, it must be food allergies. It couldn't possibly be that I inherited two mutations, one from each parent, that damage a particular gene already known to be causative for gastrointestinal problems (among a number of other things that I happen to have that most general practitioners wouldn't link together).
I could have survived without knowing I have asthma. Maybe.
I mean, go ahead and live in ignorance if that's your thing. If you're not interested in your own genetics, then fair enough. You're not alone. But it was damn reassuring to me, personally, to figure out exactly what genes are mutated and how those are causing conditions that negatively affect me.
And I would say not only am I healthier, but I'm also more aware of my health. And that's a good thing.
And that genetic information isn't going to expire.
Anyway, let's not devalue our genetics this way. The fact that we don't know everything yet doesn't make the data itself less valuable. Yes, it will take effort to understand but then again, so does nearly everything about your health and life.
I see it as a great thing.
Wednesday, October 31, 2012
ASHG2012 Fail
#ASHG2012 : Encouraging cloning by putting multiple highly interesting talks I want to see in concurrent sessions. twitter.com/mjcgenetics/st…
— Michael James Clark (@mjcgenetics) October 31, 2012
Monday, October 22, 2012
Misunderstood: Genetic tests and the people who don't understand them
On Friday, I was made aware of a story about a boy in my town (Palo Alto, California) who was being transferred from his school to another school against his and his parents' wishes. He didn’t do anything wrong and he didn’t want to leave. Instead, the administrators at his school made this decision based on genetic information they were given by the parents. Genetic information that they apparently did not understand fully.
The Story
The original story made a bit of a splash, with articles on major news sources. Here are a few links explaining the fiasco:
The original article that exposed what happened.
An official response from the superintendent of the school district.
An SFGate article that exposes important facts about this case.
The article I first heard about the story in.
The basic story goes like this:
A boy named Colman Chadam is being forced to transfer out of his current school in the middle of the school year because he carries mutations that cause cystic fibrosis (CF). CF is a relatively common yet fairly severe genetic condition. Kids with CF should not be in close proximity to one another because they can easily infect each other with respiratory infections. In other words, a school would not want two kids with CF in the same class, and would probably be justified in keeping them separated for their own health (though it’s unclear if forcing them to go to separate schools is even necessary).
CF can be predicted genetically—we are aware of major CF mutations (in the CFTR gene) in the population, and standard genetic tests can identify them. But CF is typically diagnosed through a non-genetic test called a sweat test, and is otherwise a rather self-evident disorder because it is a serious condition. Roughly one in twenty-five Caucasians carries a CF mutation. Moreover, the genetic tests are not necessarily conclusive—one could carry mutations for the disease yet not have it.
The thing is, Colman was apparently never diagnosed with CF. He has never displayed any symptoms of the disorder. Instead, according to the SFGate article, Colman was found to have mutations in CFTR that could potentially cause CF. But they haven’t for him.
For whatever reason, Colman had a genetic test including CFTR eleven years ago. It’s unclear in the articles I’ve found, but I’ve been told he likely tested positive for a sign of CF at birth and then had the genetic testing to confirm, though the disease never manifested. So despite not having cystic fibrosis, his parents are aware that he carries mutations in the CFTR gene.
And, probably without thinking that there could be serious ramifications for Colman, his parents told the school that he carries mutations for CFTR when they were asked to report any pertinent genetic information.
The SFGate article discusses the progression from an innocuous bit of irrelevant “medical” information to Colman being transferred out of his school:
“A few weeks into the school year at Jordan Middle School, school officials took note of Colman's medical history, information that eventually was shared with another Jordan parent whose two children have classic cystic fibrosis and are predisposed to chronic lung infections.”
Can anyone else see some major problems with this little statement? First off, school officials were looking into this child’s medical history why, exactly? He does not have cystic fibrosis, so why was it being followed up so heavily by the administration? Second, his information should never have been shared with the other parent who has two kids with CF.
Moreover, with one in twenty-five Caucasians carrying mutations for CF, there are likely to be numerous children at the same school carrying CF mutations who are not being transferred because they never had any symptoms and never had a genetic test. That’s one of the things that makes this action by the school so unreasonable.
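To put that one-in-twenty-five figure in perspective, the expected number of carriers at a school is easy to estimate. (A sketch only: the enrollment figure below is hypothetical, not a reported number for this school.)

```python
carrier_rate = 1 / 25   # ~1 in 25 Caucasians carries a CFTR mutation
enrollment = 1_000      # hypothetical middle-school enrollment

expected_carriers = carrier_rate * enrollment
print(round(expected_carriers))  # → 40
```

Even at a modest school size, dozens of undiagnosed carriers would be expected, none of whom would be flagged without a genetic test.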
So I tried to think about how this could have happened.
I think what probably occurred was that Colman’s parents reported his CFTR mutations just in the interest of being fully compliant with school rules that say, “Tell us everything (or else),” which most parents are probably familiar with.
Then a nurse or administrator or someone (who hopefully had permission to see Colman’s records) who was familiar with cystic fibrosis (at least at the level of understanding that two kids with CF shouldn’t be in the same class) noted that Colman had CF mutations and misunderstood that to mean he has cystic fibrosis.
I’m not sure whether the other parents were really informed or not (at least one article suggests they were—if they were, they shouldn’t have been). But what I can imagine is that in the interest of avoiding any potential problems (imagined or otherwise), they decided to transfer Colman out without really understanding that he does not have cystic fibrosis.
There’s some pretty strong evidence of this in a letter from the superintendent that was posted on PaloAltoPatch. I’m pasting it below for posterity:
“The Palo Alto Unified School District strives to meet the needs of all students and to ensure that a safe and welcoming learning environment is provided for all our students and staff. To be clear, this commitment includes caring for our students with medical needs.
“At this time, our District serves students who are diagnosed with Cystic Fibrosis (CF), a serious medical condition that creates a need for our staff to observe strict protocols on cleanliness in order to help protect these students' fragile condition. I am proud that our staff energetically and compassionately works to observe the necessary protocols for these students. Further, I am proud that our staff thoughtfully embraces these, and all, students.
"A recent story that appeared in the San Francisco Chronicle addresses the distance -- the literal physical distance -- that must be constantly maintained between children with CF in order to avoid bacterial cross contamination. I want to stress that medical authorities clarify that CF is not a health threat to the general public (a theme that was touched on in the story), but it is a topic of concern for non-sibling children with CF.
“As background, at the beginning of this school year, we became aware that students (not related) with CF were on the same middle school campus. Based on the advice of medical experts, who stress the need for non-sibling CF patients to constantly maintain a specific distance from each other, we thought it best to place the students at different campuses. Again, based on the advice of medical experts, this is the zero risk option, and most certainly helps our District deliver on its commitment to provide safe learning environments.
“We asked the family of the student who is new to our District if they would consider moving their child to Terman Middle School. We asked because Board policy gives priority site choice to students who are established at a site, and the other students have been with our District for years. Sadly, it was this request that caused the controversy, and the media coverage.
“I want to stress a few critical points.
“First, at both of these schools the dedicated staff are stepping up and working to provide these students with excellent learning environments. We are grateful for their professionalism and compassion.
“Second, we are education professionals, not medical experts. I assure you that we will take action based on recommendations from medical authorities. Expert advice will be balanced by the PAUSD Board policy that strictly instructs our District to ensure we provide a safe learning environment for all students. To be clear, our Board of Education places the highest priority on student safety, in all situations and in protecting the privacy interests of all students.
"Finally, we must be honest about the fact that we are talking about public schools, not medical facilities. We will strive to implement all health and safety protocols to help protect the health of these children; however, the harsh reality of a busy middle school campus, where students ranging in ages from 12 to 15 share a cafeteria, restrooms, the gym and locker room, a library and other settings, is that it might be virtually impossible to maintain a specified separation and sanitation protocols at all times. This reality is what made us gravitate to the separate campus option. CF is a life-threatening condition, and in this context, based on the information provided to us at the time by all parties, we believed that zero risk was the best course of action. We hoped that these families would agree.
“This topic is understandably emotional for the families involved, and we will work with them to meet the needs of their children. However, I must again stress that we will follow the advice of medical experts, who we trust will help us understand the option that eliminates, if possible, all risks.”
Sincerely,
Charles F. Young, Ed.D.
Associate Superintendent Education Services
Palo Alto Unified School District
I feel that this letter clearly shows that at the very least, Dr. Young here was under the impression that Colman has cystic fibrosis. But he apparently does not. He carries mutations in CFTR, but he doesn’t have CF. Here’s another piece of evidence that Dr. Young and the other administrators are misinformed, thinking that carrying CFTR mutations is equivalent to having CF:
“The administrators sought medical advice, Silverman said Thursday, which resulted in a recommendation from Dr. Carlos Milla, of Lucile Packard Children's Hospital, saying that, ideally, children with cystic fibrosis would attend separate schools.
The recommendation was not based on knowledge of Colman's specific medical history, his parents and their attorney said prior to Friday's court appearance.”
Why else would they seek medical advice about whether children with cystic fibrosis should attend the same school unless they thought that Colman actually has cystic fibrosis? Rather than speaking with a medical doctor, they should have consulted a board-certified genetic counselor (of which Stanford, where they received consultation on this issue, has a number). At the very least they should have spoken with a medical geneticist, or even just a geneticist like yours truly. But they didn't know to do that, because they collectively did not realize that genetics makes a huge difference here.
The Lesson
I honestly do think this is all a big misunderstanding. I hope that the administration is willing to admit a mistake was made and rescind their decision to transfer this kid over a misunderstanding about how cystic fibrosis works.

But to me, it’s also a troubling sign. I’ve sequenced my own exome. I know genetic facts about myself, and if Colman’s case is any indication, I would have no immediate recourse if that information were used against me. What’s happening to Colman has yet to be ruled on, but regardless, his life is being thrown to the wind while these decisions get made based on misunderstanding and ignorance of how genetics and disease work.
Moreover, what of my own future children? Or my current relatives? Information that comes out about me is shared, at least in part, by my relatives. Could it be used against them by people in positions of power who don’t understand genetics?
These are major reasons I’ve yet to share my genetic information with others openly. I would absolutely love to make my data public, to reap the benefits of the community at large studying my genome. I had plans to do just that, but I have yet to enact them for these reasons. Because the current protections do not consider all possible problems—like this issue with Colman Chadam.
Solutions
Currently, I can’t stress enough that people have to protect themselves. I love the idea of living in a perfect world where we can share our genetic data and not be subject to prejudice (intentional or unintentional) as a result. But we honestly don’t live in that world yet, so those of us who have been sequenced should seriously consider whether sharing our data is in our best interest.

That includes not sharing genetic information with, for example, your kid’s school district. Of course we think it’s pertinent information they might need, but they do not need to know that your child carries CFTR mutations (only whether or not the child actually has CF).
As for Colman and his family, I think they need to push this. At the personal level, what’s happened to Colman is certainly unjust and possibly illegal. And the PAUSD and its administrators need to learn a lesson here: that they need to fully understand what’s going on with students medically before making blunt decisions about what to do with them, and that if they are confronted with genetic information, they need to consult a genetic specialist to interpret it correctly.
On a more general level, it needs to be recognized that genetic information cannot be used to discriminate against people even in the most innocent-seeming way, even when it feels like the right thing is being done. Until then, we who have genetic information about ourselves need to be careful with it and how we share it.
This issue is starting to spring up as genetic testing becomes more common. Imagine if a young person has genetic predisposition to asthma, but has yet to have an asthma attack. Do you stop that child from playing soccer or do you give her an inhaler just in case? The first one is, in my opinion, genetic discrimination. The second one is just common sense (or at least it should be).
Saturday, August 11, 2012
Mac OSX Xcode/Command Line Tools
So it looks like Xcode 4.1.1 no longer includes the command line tools now that it's on the Mac App Store. Terrible.
The good news is, after going through multiple Apple login pages and hunting around, I finally got to the one that gets me command line tools:
https://developer.apple.com/downloads/index.action
Go there, log in, and in addition to Xcode itself there is a SEPARATE download for the command line tools. Wonderful.
I guess I should be happy they still have them somewhere to download. Too bad they removed them from the general Xcode download (or, alternatively, too bad they haven't made them another option on the App store if that's the direction they're heading).
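A quick sanity check after installing, as a sketch (this assumes a standard shell; `xcode-select` ships with the developer tools on OS X):

```shell
# Print the active developer directory if xcode-select is available;
# otherwise note that the tools (or OS X itself) are missing.
if command -v xcode-select >/dev/null 2>&1; then
    xcode-select -print-path
else
    echo "xcode-select not found -- command line tools not installed"
fi

# A working install should also put the compiler on the PATH:
command -v gcc || echo "gcc not on PATH"
```

If `gcc` resolves and the developer directory prints, configure/make-style builds should work again.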
Friday, May 11, 2012
ANNOVAR Patching (May 11, 2012)
Making Annovar Work
Add support for Feb 1kG release
It turns out I had made a typo in the command and the Feb 2012 variants work fine! Just be very careful to have it as "1000g2012feb_all" in the annotation part.
Kai, the author of Annovar, explained to me that the 1kG is going to start releasing variants in ethnic subgroups in the future, thus the addition of the "_all" to these datasets now. Works for me! (see below)
Fix to 1kG annotation confusion
Also something to note is that the Annovar website is a bit unclear (in my opinion) about the naming scheme it uses for the databases from 1kG. To download them, one would use the naming scheme on the downloads page (e.g. "1000g2011may").
But line 136 in the annotate_variation.pl script enforces a naming scheme that doesn't match this. Rather, one would need to use "1000g2011may_all" or else the db isn't recognized as native and is treated as a "generic" db.
Rather than patching a "fix" to that, I simply changed my queries to add the "_all" onto all of the 1kG annotations.
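For instance, the asymmetry looks something like this (the input file name is a placeholder; the key point is that the download step takes the short name while the filter step wants the "_all" form):

```
# Download uses the short name from the ANNOVAR downloads page:
perl annotate_variation.pl -downdb -buildver hg19 1000g2012feb humandb/

# But filtering must use the "_all" form, or the db is treated as "generic":
perl annotate_variation.pl -filter -dbtype 1000g2012feb_all -buildver hg19 \
    myvariants.avinput humandb/
```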
TFBS memory problem
I don't know if I'm experiencing this because of using a new cluster system (I think that's why) or if it's the new version of Annovar, but I was actually having memory failures when annotating against transcription factor binding sites ("tfbs" in Annovar). Thankfully, there are now memory options built into Annovar that seem to help with this.
I simply added "--memtotal 100000000" and it started working.
Not sure what happens if it actually tries to use more than that much memory--does it dump to a tmp file or crash? Not sure. But so far it's working with this much room.
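For reference, the invocation that worked for me looks something like this (file names are placeholders; tfbs is a region annotation, hence -regionanno):

```
perl annotate_variation.pl -regionanno -dbtype tfbs -buildver hg19 \
    --memtotal 100000000 myvariants.avinput humandb/
```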
What's kind of weird is no other database has this issue so far. Just TFBS of all things.
Monday, March 12, 2012
Cliff Reid on CG vs Illumina
Recently I saw a post on the Complete Genomics website from Cliff Reid discussing our Nature Biotech paper (Lam et al., Dec 2011), in which we sequenced the same individual to high depth on both Complete Genomics and Illumina and compared the results.
I think some of it is fair, but I do want to go a bit into it because it’s more complicated than just breaking it down by the Sanger validation rate.
Accuracy/Sensitivity
First off, Cliff addresses the fact that we found CG was more accurate than Illumina when it comes to SNP detection rate. To determine this, we used three small Sanger validation sets. We took a set of 20 SNPs detected by both platforms, a set of 15 detected only by Illumina and a set of 18 detected only by CG and Sanger sequenced them. Here’s how it broke down:

|  | Number tested | Validated | Validation rate |
| --- | --- | --- | --- |
| Both platforms | 20 | 20 | 100% |
| Illumina-specific | 15 | 2 | 13.3% |
| CG-specific | 18 | 17 | 94.4% |
This is certainly pretty cut-and-dried—obviously the CG-specific variants that we were able to validate were accurate at a higher rate than the Illumina-specific ones. However, I think a fair criticism of this very experiment was the relatively small number of variants that were validated by Sanger.
The Complexities of the Experiment
I think we need to consider a bit more the problem of the very small number of SNPs that we validated in the Sanger sequencing and, perhaps more importantly, what those SNPs were and why we went ahead and used the SureSelect data instead to draw our conclusions.
The SNPs that were selected for Sanger validation included SNPs at every quality level that passed thresholds. We selected twenty variants from each set (concordant, Illumina-specific, and CG-specific) and tried to validate them by Sanger.
Please note the “number tested” in the above table doesn’t match the 20/20/20 that I just said we designed. This is because across the two platform-specific sets, we were unable to design primers that amplified product for a total of seven of the SNPs (five Illumina-specific, two CG-specific), and those seven were indeed consistently “low quality” relative to the mean quality score for concordant variants (although, again, they passed threshold).
This is why we went forward with using the Agilent SureSelect targeted sequencing data for validation. Of course, we fully realized that such an assay would be potentially biased towards Illumina because the validation is being done on the same machine as the whole genome sequencing. That was, in fact, the reason we initially went for Sanger. But after sixty Sangers, and realizing it would take hundreds more (think of that in terms of time, manpower and monetary cost) before we could generate anything really meaningful, we decided the few we did had accomplished our goal of at least demonstrating that concordant SNPs are highly accurate while non-concordant ones are much less so, and moved on.
In Cliff’s post, he extrapolates the number of platform-specific SNPs that would validate if the Sanger rates were correct across the board and concludes that CG is, in fact, more sensitive than Illumina. I caution against using this Sanger data this way because in the paper, we clearly utilized the SureSelect capture validation to make up for an inadequate Sanger experiment.
Recalculating
A note on experimental design and how it gets conveyed in the literature: we tried to make it clear that the Sanger was only suggestive, not conclusive. I hope that got conveyed, but I’ll reiterate here that the Sanger data in that paper is limited in its usefulness because there simply isn’t enough of it and because of the resulting lower mean quality score across those positions.
We do, in fact, have a figure (Supplementary Figure 1) that demonstrates lower quality scores for the platform-specific SNPs from both platforms, and that played out in the validation. But the problem is that the sentence describing that is four paragraphs up from the Sanger validation paragraph and it isn’t linked in the text.
Here’s a quote from Cliff’s post that I want to respond to:

“The paper also points to the magnitude of the problem caused by validating the Illumina platform with the Illumina platform. The Sanger validation data can be used to estimate the confidence in the results of the target enrichment validation data. If the Illumina unique SNPs really were 64.3% true SNPs as reported, then the likelihood of getting the Sanger validation results (2 of 15 validated SNPs) is less than 1 in 10,000. While the exact Illumina SNP validation rate is unknown, the Sanger data tells us that we can be more than 99.99% confident that it is less than the 64.3% calculated by this biased validation approach. For these reasons, we believe 64.3% is not the correct number to use in calculating the sensitivity of the Illumina platform in this study.”
I want to stress the importance of Table 2 in the paper and how it shows perhaps the most important information in the entire paper. Here’s a summary of it:
“Validated” means those that were present in the whole genome sequencing and observed in the SureSelect data (true positives). “Invalidated” are those that were present in the WGS, but observed as false in the SureSelect data (false positives). And finally “not validated” are those that could not be detected adequately in the SureSelect data. The “validation rate” was determined by removing those “not validated” from the targeted count and then determining what percent of those were validated (not invalidated).
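As a sketch of that bookkeeping (the counts below are hypothetical, not the paper's):

```python
def validation_rate(validated, invalidated):
    """Validation rate over assayable sites only: "not validated" sites
    (those not adequately covered in the SureSelect data) are excluded
    before this calculation, so the denominator is validated + invalidated."""
    return validated / (validated + invalidated)

# Hypothetical counts for illustration:
print(f"{validation_rate(101, 99):.1%}")  # → 50.5%
```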
This is key. Those “not validated” do include quite a few of those “low quality” candidates that wouldn’t validate by Sanger either, of course. And those make up a relatively high proportion of the Illumina-specific SNPs (21.4% compared with CG-specific SNPs at 12.9%).
Now go back to the total counts and extrapolate that value to actual variant counts. If 21.4% of Illumina-specific variants are of this ilk, that brings the Illumina-specific count down to 271,248 SNPs. If 12.9% of CG are of this ilk (which seems reasonable given both Sanger and SureSelect), the CG-specific count goes down to 86,732 SNPs.
If you’re following me, we’re now at variant counts that we can now attach our “validation rate” to determine the actual number of true positives in a way Cliff might approve of (but without using the inadequate Sanger data). Here are the results:
This being the case, I think it’s clear why we say “Illumina was more sensitive.” I feel confident these are the numbers to use, and Illumina detected quite a few more total SNPs. However, it clearly has a higher error rate as well, so that can affect things downstream, as it’s not trivial to differentiate all those errors from the total.
As for the criticism regarding the chances of getting a 2/15 validation rate if the actual validation rate is 64.3%, for one thing I think we should use different numbers—in this case, 2/20 (the total number of those tested and validated by Sanger) and 50.5% (the total number tested and validated by SureSelect). Still, that detail aside, you’re still going to get a very small probability (e.g. a hypergeometric p of 0.0001).
But I can also look at it another way. What’s the probability of the CG-specific result doing the same thing? It's more probable, but still not that likely (p = 0.11). Yet you’re talking about 53.9% and 13/20 there.
There are two reasons for that:
1) 20 is a small number. With a ratio around 50-55%, unless you get 10 or 11 out of 20, you’re deviating pretty dramatically. In fact, the range for p > 0.05 with 20 pulls and a 53.9% ratio is only from 7 to 15. This is why I said we “would have had to do hundreds of Sangers”.
2) Sanger sequencing is different from SureSelect target enrichment sequencing anyway. It’s not a true subset in the first place, and is susceptible to sources of error that don’t affect next-gen sequencing.
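To make reason (1) concrete, here is a quick sketch of my own. (The post's p-values came from a hypergeometric test; the simpler binomial tail used here behaves similarly at these sample sizes, so the numbers are illustrative rather than exact.)

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# Illumina-specific: 2 of 20 validated, against the 50.5% SureSelect rate
# (lower tail) -- vanishingly unlikely by chance.
p_illumina = binom_cdf(2, 20, 0.505)

# CG-specific: 13 of 20 validated, against the 53.9% rate (upper tail)
# -- a noticeable but unremarkable deviation.
p_cg = 1 - binom_cdf(12, 20, 0.539)

print(f"P(<=2 of 20 | 50.5%) = {p_illumina:.5f}")
print(f"P(>=13 of 20 | 53.9%) = {p_cg:.3f}")
```

With only 20 draws, even a perfectly calibrated ~54% rate routinely produces counts several off from the expectation of ~11, which is exactly why hundreds of Sangers would have been needed for a conclusive estimate.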
Anyway, I don’t think that criticism is entirely fair. Really, to me, it only supports that our Sanger data should not be used this way in the first place.
Finally, I don’t want to seem like I’m bashing CG here. To the contrary—CG performed exceptionally well in our paper. From what I saw, it is without a doubt the more accurate platform.
(Maybe I should write that paper where I add SOLiD into the mix…)
I think some of it is fair, but I do want to go a bit into it because it’s more complicated than just breaking it down by the Sanger validation rate.
Accuracy/Sensitivity
First off, Cliff addresses the fact that we found CG was more accurate than Illumina when it comes to SNP detection rate. To determine this, we used three small Sanger validation sets. We took a set of 20 SNPs detected by both platforms, a set of 15 detected only by Illumina and a set of 18 detected only by CG and Sanger sequenced them. Here’s how it broke down:
Number tested | Validated | Validation rate | |
Both platforms | 20 | 20 | 100% |
Illumina-specific | 15 | 2 | 13.3% |
CG-specific | 18 | 17 | 94.4% |
This is certainly pretty cut-and-dried—obviously the CG-specific variants that we were able to validate were accurate at a higher rate than the Illumina-specific ones. However, I think a fair criticism of this very experiment was the relatively small number of variants that were validated by Sanger.
The Complexities of the Experiment
I think we need to consider a bit more the problem of the very small number of SNPs that we validated in the Sanger sequencing and, perhaps more importantly, what those SNPs were and why we went ahead and used the SureSelect data instead to draw our conclusions.
The SNPs that were selected for Sanger validation included SNPs at every quality level that passed thresholds. We selected twenty variants from each set (concordant, Illumina-specific, and CG-specific) and tried to validate them by Sanger.
Please note the “number tested” in the above table doesn’t match with the 20/20/20 that I just said we designed. This is because for each of the platform-specific sets, we were unable to design primers that amplified product across seven of the SNPs, and those were indeed consistently “low quality” relative to the mean quality score for concordant variants (although, again, they passed threshold).
This is why we went forward with using the Agilent SureSelect targeted sequencing data for validation. Of course, we fully realized that such an assay would be potentially biased towards Illumina because the validation is being done on the same machine as the whole genome sequencing. That was, in fact, the reason we initially went for Sanger. But after sixty Sangers and realizing it would be hundreds more (think of that in terms of time, man-power and monetary cost) before we could generate anything really meaningful with Sanger, we decided the few we did had accomplished our goal of at least demonstrating that concordant SNPs are highly accurate but non-concordant are not and moved on.
In Cliff’s post, he extrapolates the number of platform-specific SNPs that would validate if the Sanger rates were correct across the board and concludes that CG is, in fact, more sensitive than Illumina. I caution against using this Sanger data this way because in the paper, we clearly utilized the SureSelect capture validation to make up for an inadequate Sanger experiment.
Recalculating
A note on experimental design and how it gets conveyed in the literature: we tried to make it clear that the Sanger data was only suggestive, not conclusive. I hope that came through, but I'll reiterate here that the Sanger data in that paper is limited in its usefulness, both because there simply isn't enough of it and because of the lower mean quality scores across those positions.
We do, in fact, have a figure (Supplementary Figure 1) that demonstrates lower quality scores for the platform-specific SNPs from both platforms, and that played out in the validation. But the problem is that the sentence describing that is four paragraphs up from the Sanger validation paragraph and it isn’t linked in the text.
Here’s a quote from Cliff’s post that I want to respond to:
“The paper also points to the magnitude of the problem caused by validating the Illumina platform with the Illumina platform. The Sanger validation data can be used to estimate the confidence in the results of the target enrichment validation data. If the Illumina unique SNPs really were 64.3% true SNPs as reported, then the likelihood of getting the Sanger validation results (2 of 15 validated SNPs) is less than 1 in 10,000. While the exact Illumina SNP validation rate is unknown, the Sanger data tells us that we can be more than 99.99% confident that it is less than the 64.3% calculated by this biased validation approach. For these reasons, we believe 64.3% is not the correct number to use in calculating the sensitivity of the Illumina platform in this study.”
I want to stress the importance of Table 2 in the paper and how it shows perhaps the most important information in the entire paper. Here’s a summary of it:
| | Validated | Invalidated | Not validated | Validation rate |
Concordant | 81.0% | 6.4% | 12.6% | 92.7% |
Illumina-specific | 50.5% | 28.0% | 21.4% | 64.3% |
CG-specific | 53.9% | 33.2% | 12.9% | 61.9% |
“Validated” means those that were present in the whole genome sequencing and observed in the SureSelect data (true positives). “Invalidated” are those that were present in the WGS, but observed as false in the SureSelect data (false positives). And finally “not validated” are those that could not be detected adequately in the SureSelect data. The “validation rate” was determined by removing those “not validated” from the targeted count and then determining what percent of those were validated (not invalidated).
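As a quick sanity check, the validation rates in that table can be reproduced from the validated and invalidated percentages alone, since the "not validated" fraction is dropped from the denominator. A minimal sketch:

```python
# Validation rate = validated / (validated + invalidated);
# the "not validated" fraction is excluded from the denominator.
# Percentages are taken from the Table 2 summary above.
rows = {
    "Concordant":        (81.0, 6.4),
    "Illumina-specific": (50.5, 28.0),
    "CG-specific":       (53.9, 33.2),
}

for name, (validated, invalidated) in rows.items():
    rate = 100 * validated / (validated + invalidated)
    print(f"{name}: {rate:.1f}%")
```

Running this reproduces the 92.7%, 64.3%, and 61.9% figures in the table.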
This is key. Those “not validated” do include quite a few of those “low quality” candidates that wouldn’t validate by Sanger either, of course. And those make up a relatively high proportion of the Illumina-specific SNPs (21.4% compared with CG-specific SNPs at 12.9%).
Now go back to the total counts and extrapolate those percentages to actual variant counts. If 21.4% of Illumina-specific variants are of this ilk, that brings the Illumina-specific count down to 271,248 SNPs. If 12.9% of CG-specific variants are (which seems reasonable given both the Sanger and SureSelect results), the CG-specific count goes down to 86,732 SNPs.
If you're following me, we're now at variant counts to which we can attach our "validation rate" to determine the actual number of true positives in a way Cliff might approve of (but without using the inadequate Sanger data). Here are the results:
| | Total | Extrapolated "good" calls | Extrapolated validation |
Concordant | 3,295,023 | 2,879,850 | 2,669,621 |
Illumina-specific | 345,100 | 271,248 | 174,412 |
CG-specific | 99,578 | 86,732 | 53,687 |
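These extrapolations follow mechanically from the totals, the "not validated" fractions, and the validation rates. A short sketch (values from the tables above, with fractions truncated as in the table):

```python
# For each set: total SNP count, fraction "not validated", validation rate.
# Values come from the tables above.
data = {
    "Concordant":        (3_295_023, 0.126, 0.927),
    "Illumina-specific": (345_100,   0.214, 0.643),
    "CG-specific":       (99_578,    0.129, 0.619),
}

for name, (total, not_validated, rate) in data.items():
    good = total * (1 - not_validated)   # extrapolated "good" calls
    true_pos = good * rate               # extrapolated validated SNPs
    # int() truncates the fractional SNP, matching the table's figures
    print(f"{name}: {int(good):,} good calls, {int(true_pos):,} validated")
```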
This being the case, I think it’s clear why we say “Illumina was more sensitive.” I feel confident these are the numbers to use, and Illumina detected quite a few more total SNPs. However, it clearly has a higher error rate as well, so that can affect things downstream, as it’s not trivial to differentiate all those errors from the total.
As for the criticism regarding the chances of getting a 2/15 validation rate if the actual validation rate is 64.3%: for one thing, I think we should use different numbers, namely 2/20 (the total number of those designed and validated by Sanger) and 50.5% (the fraction tested and validated by SureSelect). Still, that detail aside, you're going to get a very small probability either way (e.g. a hypergeometric p of 0.0001).
But I can also look at it another way: what's the probability of the CG-specific result coming out the way it did? It's plausible, but not that likely either (p=0.11), and there you're talking about 13/20 against a rate of 53.9%.
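Both of those figures are easy to sanity-check with a simple binomial point probability. This is a sketch, not the exact hypergeometric test used in the analysis, so treat the numbers as approximations:

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n draws at success rate p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Illumina-specific: 2 of 20 validated vs. a SureSelect rate of 50.5%
p_illumina = binom_pmf(2, 20, 0.505)
print(f"P(exactly 2/20 at 50.5%): {p_illumina:.5f}")   # on the order of 1e-4

# CG-specific: 13 of 20 validated vs. a SureSelect rate of 53.9%
p_cg = binom_pmf(13, 20, 0.539)
print(f"P(exactly 13/20 at 53.9%): {p_cg:.3f}")        # roughly 0.11
```

The binomial model lands in the same ballpark as the quoted p-values: vanishingly small for the Illumina-specific result, around 0.11 for the CG-specific one.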
There are two reasons for that:
1) Twenty is a small number. With a ratio around 50-55%, unless you get 10 or 11 out of 20, you're deviating pretty dramatically. In fact, with 20 draws and a 53.9% ratio, the range of outcomes with p > 0.05 is only 7 to 15. This is why I said we "would have had to do hundreds of Sangers."
2) Sanger sequencing is different from SureSelect target enrichment sequencing anyway. It’s not a true subset in the first place, and is susceptible to sources of error that don’t affect next-gen sequencing.
Anyway, I don't think that criticism is entirely fair. To me, it really just reinforces that our Sanger data should not be used this way in the first place.
Finally, I don't want to seem like I'm bashing CG here. To the contrary: CG performed exceptionally well in our paper, and from what I saw it is without a doubt the more accurate platform.
(Maybe I should write that paper where I add SOLiD into the mix…)
Friday, February 17, 2012
Exome Annotations
I just posted a thread on 23andMe about which annotations I use for my exome data. Here's what I said:
I currently use Annovar for annotating VCF files. The output from Annovar is not particularly intuitive, so I wrote a Perl script that generates a VCF-based report. I thought I would share the annotations I've been using and the ones I plan to add, and see if anyone else has any other annotation ideas. These could be useful for us to annotate our own genomes (and potentially for 23andMe to provide in the future).
The annotations I've been including are:
- Gene annotation (type of mutation--exonic, intronic, splicing, etc.)
- Gene name
- Mutational description (e.g. the specific amino acid change)
- dbSNP130
- dbSNP135
- WashU Exome Variant DB (EVS)
- Transcription Factor Binding Site (TFBS)
- SIFT score
- PolyPhen 2 score (PP2)
- GWAS presence
- Segmental duplication
(The reason I include both dbSNP130 and 135 is that 135 contains quite a few SNPs that are potentially meaningful from a disease and trait standpoint while 130 is mostly markers not directly affecting diseases and traits. 130 is a subset of 135. Also, the EVS is potentially more useful than either of them as a filtering device.)
Ones that I would like to include in the future:
- VAAST
- MIE sites/scores (Mendelian inheritance errors)
- 23andMe annotations (anything from 23andMe's SNP databases--can 23andMe help with that?)
Any other ideas for great annotations that should be included?
The idea behind these types of annotations is to give us a way to sift through the data and extract biologically meaningful results. For example, we are most interested in mutations that actually cause a protein coding change, that are uncommon in the population, and that are predicted to have a dramatic effect on function.
So far these types of annotations have allowed me to narrow very long lists of results in exomes (think on the order of 30-50,000 mutations) down to just a handful (1-20) candidate mutations for particular Mendelian disorders.
Anything I missed?
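To make the filtering idea concrete, here is a rough sketch of the kind of annotation-driven cascade I'm describing. Note that the field names, thresholds, and record layout here are hypothetical illustrations, not Annovar's actual output format:

```python
# Hypothetical annotated variant records; real Annovar output uses
# different column names and formats.
variants = [
    {"gene": "GENE_A", "type": "exonic",   "evs_freq": 0.32,  "sift": 0.61},
    {"gene": "GENE_B", "type": "exonic",   "evs_freq": 0.001, "sift": 0.01},
    {"gene": "GENE_C", "type": "intronic", "evs_freq": None,  "sift": None},
]

def is_candidate(v):
    """Keep protein-affecting variants that are rare and predicted damaging."""
    return (
        v["type"] in ("exonic", "splicing")                 # affects protein coding
        and (v["evs_freq"] is None or v["evs_freq"] < 0.01)  # rare (or absent) in EVS
        and v["sift"] is not None and v["sift"] < 0.05       # SIFT: low score = damaging
    )

candidates = [v for v in variants if is_candidate(v)]
print([v["gene"] for v in candidates])  # only the rare, damaging exonic variant survives
```

Each annotation adds a filter, and together they can take a list of tens of thousands of variants down to a handful of candidates.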
Friday, February 10, 2012
Sequencing My Exome: Why?
"Why do you want to do this?"
— My wife, immediately after I tell her I'm going to sequence my own exome.
There are a few times in life where you want to do something so badly, but find it difficult to convey to others why. This was one of those times. Such a simple question, but so many different answers. And each answer as valid as all the others. All of them coming together to explain why I would want to do something so, well, unusual.
I could frame a whole dissertation on the reasons behind wanting to sequence my own exome. (And I will.) But first, the simple answer:

I'm curious about myself.
I want to see if I can figure out why I am the way I am. And by that I mean both physically and mentally. For some people, this isn't something they'll ever think about. For others, they might see very clearly that they are this way because God made them this way, or because their parents raised them like this, or because they had bad luck. They may simply accept that they have specific features that make them who they are and aren't concerned with why.
For me, those answers are not good enough.
I know that there are mysteries to be solved in my genetic code. It comes with the territory of being a geneticist. That said, though, almost everyone thinks this way, usually without even realizing it. We can all look at our own families as a proxy for genetics. If your mom had type 2 diabetes and your sister has type 2 diabetes and your uncle has type 2 diabetes, you're pretty sure you have a higher chance of getting type 2 diabetes. You'll hear all sorts of people saying that: "guess I got my mom's bad genes" and "guess he took after his father" and so forth. If you've ever known somebody who got old enough, you might have heard her tell you about how her mother lived to a ripe old age, and her mother before her, and so on. That's what I call thinking genetic.
The difference for me is that I'm thinking genetic at a different level. I actually want to look at my genetics to try to explain these types of things. I don't believe in fate without reason. If my mom lives to be 90, and my grandmother lived to be 90, I want to know if I got the mutations that helped them get there. Could I just shrug my shoulders, say, "I probably did," and move on? Absolutely. But that's just not good enough for me.
And really, it's not good enough for anyone. We're no longer entering the era where we can do better than that; we're already there. Exome sequencing represents the first major step into that era.
I think, to make a case for exome sequencing (and by the way, whole genome sequencing is basically just an expansion and improvement upon exome sequencing; more on that later), I first need to explain what we can learn from it. And to do that, first you'll need to know what an exome is. For those readers who already know all about this, feel free to skip down.
What is an exome, anyway?
To understand what the exome is, you first have to understand what the genome is. There are massive tomes on the details of the subject, but to describe the genome succinctly:

The genome is the blueprint for every cell in your body.
Every single protein in every one of your cells is encoded on this massive blueprint. In order to create and maintain a cell (and therefore, your body and very being), your cell quite literally reads the genome and generates certain amounts and types of various proteins to fit the particular cell it's trying to become or to fulfill a particular function.
The exome is the subset of the genome that contains the instructions for creating the proteins themselves. The exome makes up about 1-2% of the whole genome. If the genome is the blueprint, then:

The exome is the instructions for making every protein in your body.
Therefore, being able to read those instructions means we can figure out if differences in them will result in different protein structures.
What can I learn from the exome?
Identifying variations in the exome that lead to differences in proteins (which we call mutations) gives us a direct way of determining whether a protein might have altered function in us compared to other people. Significant protein mutations manifest themselves as traits. To bring up an example from a previous post, the earwax trait results from a variation in the exome that alters a protein, which in turn determines whether your earwax is wet or dry. But it goes far beyond that type of "interesting" trait. Mendelian disorders (from which this blog derives its title) are disorders resulting from mutations in a single gene, which we can detect in the exome (and, in fact, quite a few Mendelian disorders have been "solved" through exome sequencing now).

By sequencing the exome, we can directly assess every line of the "instructions" and identify those lines that differ from the norm.
But that's not the only way to use this information. We can hunt for mutations that damage our proteins, and that is the first obvious thing to do when looking at the exome. But we don't know how every mutation will affect a person. To the contrary: there are very few mutations for which we understand the effect.
In fact, the current standard in personal genomics testing (such as that from DTC companies like 23andMe or through-physician companies like Navigenics) is actually an approach that dominated the field for about a decade before next-generation sequencing really became a reality. Using microarray technology, these approaches measure specific sites known to harbor variants in the genome that are associated with a trait or disease but typically not causative for the trait or disease.
For example, right now if you were to do a standard 23andMe test, you'd have genetic variations assessed at about a million sites across your genome. These variations would then be compared to a database that tells how strongly particular variations associate with particular traits or diseases. So 23andMe can tell me that I have a collection of variants that associate with type 2 diabetes, and it can calculate how that increases my risk of getting the disease compared to the average person.
This is more of a science than people often think. This is thinking genetic at a slightly more advanced level. I could simply turn to my family history and guess that I'm at an increased risk for type 2 diabetes. However, the fact that my genetics confirm the increased risk makes it much more "real" to me. Not only do I have a family history, I actually inherited some of those genetic factors. My risk is real.
Knowing my exome sequence takes that to the next level. Rather than simply having associations, I may actually be able to go into the regions of association and identify the mutations causing these problems.
Moreover, as more and more information regarding the genetic causes of various traits and diseases is discovered, my exome sequence will always be at hand for me to cross-reference. Imagine that tomorrow a study is released identifying a gene that tells you with complete confidence whether or not you'll get type 2 diabetes. I would check that gene in my own exome for mutations immediately!
That may sound unrealistic, but when it comes to conditions like cancer, these kinds of studies come out all the time. I may identify in my own genome a mutation in a gene that predisposes people to a particular type of cancer, and then I will know that I need to have my doctor monitor for it. Having worked closely on brain cancer for a few years, it struck me that the reason it's the deadliest type of cancer is that by the time we detect it, it's already at a very advanced stage. But if we have a gene or set of genes that we know predisposes people to malignant brain tumors, we could look in our own exomes for mutations in those genes and then get ourselves MRIs starting at a particular age, to try to detect tumors earlier and hopefully allow effective, long-term treatment.
I think anyone can see how powerful that type of diagnostic and predictive tool can be.
And that brings up a major reason to sequence one's genome: this information is immutable. Your exome is not changing. On the day you die, you've got pretty much the same exome and genome you had when you were born. If a major discovery is made tomorrow, I'll have my exome to look at for it. If another discovery is made in ten years, I can take that same exome sequence and look for it. There's no "expiration date" on that information.
And that's what really sold me on the whole thing, actually: my intimate knowledge that my exome is always going to be a part of me, and that our understanding of genetics and diseases will always be expanding. That means my investment now is going to pay off for my whole life. Or at least until I sequence my whole genome.
I hope that conveys my major reasoning behind why I would want to do this. Of course there are other factors as well. For one thing, I am a geneticist. Genetics is not just my job, it's my hobby. I love it. And over the years I've become increasingly interested in my own genetics. But that's honestly not the only reason. At this point, I see it as a choice that will help me keep myself healthy throughout my life.
I think there will be a shift toward that kind of thinking in the medical community at large in the very near future as well. It may only be a couple of years before your doctor suggests you get your exome sequenced. In a society where I feel most of us already think genetic, I think it's only a matter of time before we stop simply guessing that it's genetic and instead actually prove it. And beyond that, we actually figure out that there's something we can do about it. That is empowering right there.
Thursday, February 9, 2012
Enter the Exome
My Exome kit from 23andMe has arrived! In a few short weeks, I will have my exome sequence in hand and ready to analyze. I've looked at hundreds of exomes over the past year, but only now, when I'm about to look at my own, have I started to really think about how to extract meaning from a healthy individual's exome. All of the work I've done has either been to assess exome sequencing as a science or to hunt for mutations causing specific conditions (Mendelian Disorders and novel genetic syndromes).
Now I'm going to have my own sequence in hand and have a very basic yet exceedingly complex question to answer: What does this all mean?
And with that question comes other questions:
Why do I care about my own exome?
What can I learn from it?
What justifies the cost?
Is it safe?
In the coming days, I will be posting about my answers to these questions and more. I came to a realization a couple of days ago (right after I ordered this kit) that even questions that seem simple to me as a geneticist are not so simple for most people.
"Why do you want to do this?" is harder to answer than people may think. Off the cuff I might say, "I'm a geneticist, it's what I do!" but that isn't at all the whole answer. So I am going to make it a goal to explain why anyone would want to have his or her exome (or genome) sequenced in terms that hopefully anyone can understand.