Mendelian Disorder

Why we didn't sequence HeLa

2014-02-17T16:36:00.001-08:00

Recently, another paper about HeLa's genome sequence was published in Nature. Along with that came an article explaining how the ethical ramifications of publishing the sequence of HeLa were handled before this recent paper's release.

My friend Dr. Cuiping Pan asked why, when we were deciding on a cell line to whole-genome sequence and publish back in 2008, we chose U87MG rather than HeLa. Here is my response:

At the time, we had a big argument about whether to do HeLa. We all knew it would be higher profile to do HeLa, but even in 2007/8 we were aware that the cell line had been derived without Henrietta Lacks' informed consent. Meanwhile, HeLa is obviously the most commonly used cell line, so we knew it would be more broadly useful than U87MG, which we identified as the most commonly used glioma cell line but only around the twelfth most commonly used cell line overall. [A quick PubMed search finds 1,110 papers with the term "U87MG" in them, compared with 78,343 for HeLa!]

So it came down to an ethical choice versus a utilitarian choice, and it happened that the ethical choice would also be less "rewarded" with references, high impact journals, et cetera. In fact, we spoke with Nature and were basically told it would be harder to publish U87MG than HeLa there (but it's a moot point because we also wanted to support open access and PLoS, which is why we went for PLoS Genetics at the time).

Anyway, looking back I think we made the right choice. Back then, Francis Collins was not head of NIH and the book on Henrietta Lacks had not been released yet. The [European] group who subsequently sequenced HeLa got in a lot of hot water.

As for U87MG, we looked into its history. It was an anonymously donated sample that had been consented as far as we could tell in Germany. [Note: This is all off the top of my head, so don't use that as a primary reference, please.] The particular individual it was derived from was not publicly known that we could see, though his identity is still likely in the medical records of the hospital where he was treated. We did not try to find out who he was, because keeping him anonymous and deidentified was the ethical choice.

Also, on a practical level, doing a male genome seemed interesting since it would get us Y chromosome, but that was a minor and not very compelling reason.

I am of the opinion that as geneticists and genomic scientists, we must do things above the board and be very careful about informed consent and protecting people's identities. There is literally nothing as identifying as the genome, so we must be cautious and considerate in all cases. This is why, before ever considering sequencing something, we always need to consider the ethical implications.

HeLa, the cell line, is something we simply cannot give up on now. It is too late, it has done too much good, and it has too much potential to do good in the future for all of us. But even with that in mind, I still feel that using it, and even sequencing it, is questionable. It's a hard feeling to shake, and it's rooted in the fact that the person from whom the cell line was derived never consented. I think, over the past few years, the dialogue between health science and the Lacks family has been a very good change, and I'm very happy the Lacks family has basically consented to use of HeLa generally.

Really, it goes to show why it is so important for us to do as much as possible to behave ethically from the start, so that we avoid these types of situations in the future.

Where have I been?

2013-01-21T21:35:00.002-08:00

I was never someone who blogged a great deal. But once or twice a month, I would put hand to keyboard and come up with something to talk about that is relevant to my favorite field of study: Genetics.

I have every intention to continue doing that. But last April I started working for a small start-up company, and I've been very busy. And I do mean busy. I thought I was busy before in graduate school and my post-doc, but with the exception of grant time or paper writing, I can honestly say I have never been as busy as I have been the past nine months.

That's right, I've been working in industry for nine months. I'm not sure if I mentioned that before. Admittedly, I'm working at a start-up, so it's not quite like joining Massive Dynamic and getting lost in the army of scientists doing this or that. It's much more fun than that. I would say it isn't quite at the level of the Internet start-up era in the late 90s/early 00s, but it's interesting.

The genomics field is really heating up. It's great to see how rapidly it is advancing. And the competition is fierce. Every little company competing in this space is exceptional with exceptional people, exceptional goals, and (hopefully) exceptional funding.

There was a time when I was gung ho about staying in academia. I still love academia and research. But I began to see the writing on the wall about two years ago, when I realized that, frankly, the way academia has handled genomics has resulted in a few massive centers for genomics and sequencing and then a lot of little centers that amount to core facilities servicing their attached universities. I took note of what opportunities there were, and there seemed to be a lack in academia. Perhaps it's because of the funding situation with the NIH and the way it has funded massive centers over smaller sequencing projects.

Meanwhile, at the same time, I saw a number of start-ups in this space popping up. Places that had money and drive to advance the science of genomic analysis--which is exactly what I had wanted to do in my academic career.

I tried writing a grant once where I had planned out ten years of work with the goal of advancing genomic analysis and after reading it through a few times I realized I would have had far better chances securing VC funding with that grant than obtaining an R01 from the NIH. That's the moment it struck me that perhaps the most rapid and impressive advancement in this area will take place in industry for the time being.

I basically feel validated when I see the Biobase mark on the ANNOVAR website, or the Appistry logo on the GATK website.

That's not to say that I would never return to academia. I would love to, and if there's a shift in the wind and it pushes me back in that direction, I would go there. But for now, I would just say that I am having a very good time in the industry, learning new skills but also flexing my strengths.

Anyway, that's where I've been. I'm still around. I'm just working longer hours.

Save the Square Watermelons

2012-11-05T18:23:00.001-08:00

NO on 37! Think about it, Californians. Are you telling me you can't figure this one out, @carighttoknow? #prop37 twitter.com/mjcgenetics/st…
— Michael James Clark (@mjcgenetics) November 6, 2012

Also, I just realized I've had over 25,000 hits on this blog. Thanks for reading! I hope it's been useful and interesting!

"Why I Don’t Want to Know My Genome Sequence"

2012-11-01T12:08:00.002-07:00

"Why I Don't Want to Know My Genome Sequence"

It's kind of interesting to read this as someone who has at least had his own exome sequenced and has learned a lot from it, especially in contrast to my own post explaining why I wanted to sequence myself.

It ends with a funny statement:

An osteoarthritis mutation manifested itself as an inability to play an F chord at age 33. A p53 mutation and then another that bloomed in response to years of orthodontia X-rays gave me thyroid cancer a few years after I gave up the guitar. And I don’t need a genetic test to know I didn’t inherit my father and grandfather’s psychotic depression.

Ron Crystal, even though he’s among the sequenced, has the right idea: don’t smoke, exercise, eat a healthy diet, and don’t worry about DNA sequences. That’s good enough for me – at least for now.

This is one of those statements used to make one's self feel better about not doing something he or she wants to do. That's my feeling, at least.

To be more specific, let's use myself.

Sure, I could have lived my life thinking my borderline migraines were from caffeine withdrawal and my mother, and not bothered to find out the exact mutations likely to cause them and therefore which drugs may actually have a beneficial effect on me. I could have. But I didn't.

I could have lived my life thinking my intestinal issues were due to a bad diet or food allergies. Because clearly copious amounts of salad, low fat, low carb, low salt is a bad diet, right? And clearly my wife who eats the same things and has zero gastrointestinal issues is just lucky. Oh no, it must be food allergies. It couldn't possibly be that I inherited two mutations, one from each parent, that damage a particular gene already known to be causative for gastrointestinal problems (among a number of other things that I happen to have that most general practitioners wouldn't link together).

I could have survived without knowing I have asthma. Maybe.

I mean, go ahead and live in ignorance if that's your thing. If you're not interested in your own genetics, then fair enough. You're not alone. But it was damn reassuring to me, personally, to figure out exactly what genes are mutated and how those are causing conditions that negatively affect me.

And I would say not only am I healthier, but I'm also more aware of my health. And that's a good thing.

And that genetic information isn't going to expire.

Anyway, let's not devalue our genetics this way. The fact that we don't know everything yet doesn't make the data itself less valuable. Yes, it will take effort to understand but then again, so does nearly everything about your health and life.

I see it as a great thing.

ASHG2012 Fail

2012-10-31T11:58:00.001-07:00

#ASHG2012 : Encouraging cloning by putting multiple highly interesting talks I want to see in concurrent sessions. twitter.com/mjcgenetics/st…
— Michael James Clark (@mjcgenetics) October 31, 2012

Misunderstood: Genetic tests and the people who don't understand them

2012-10-22T00:30:00.000-07:00

On Friday, I was made aware of a story about a boy in my town (Palo Alto, California) who was being transferred from his school to another school against his and his parents’ wishes. He didn’t do anything wrong and he didn’t want to leave. Instead, the administrators at his school made this decision based on genetic information they were given by the parents. Genetic information that they apparently did not understand fully.

The Story

The original story made a bit of a splash, with articles on major news sources. Here are a few links explaining the fiasco:
The original article that exposed what happened.
An official response from the superintendent of the school district.
An SFGate article that exposes important facts about this case.
The article I first heard about the story in.

The basic story goes like this:

A boy named Colman Chadam is being forced to transfer out of his current school in the middle of the school year because he carries mutations that cause cystic fibrosis (CF). CF is a relatively common yet fairly severe genetic condition. Kids with CF should not be in close proximity to one another because they can easily infect each other with respiratory infections. In other words, a school would not want two kids with CF in the same class, and would probably be justified in keeping them separated for their own health (though it’s unclear if forcing them to go to separate schools is even necessary).

CF can be predicted genetically: we are aware of major CF mutations (in the CFTR gene) in the population, and standard genetic tests can identify them. But CF is typically diagnosed through a non-genetic test called a sweat test, and is otherwise a rather self-evident disorder because it is a serious condition. About one in twenty-five Caucasians carries a CF mutation. Moreover, the genetic tests are not necessarily conclusive; one could carry mutations for the disease yet not have it.

The thing is, Colman was apparently never diagnosed with CF. He has never displayed any symptoms of the disorder. Instead, according to the SFGate article, Colman was found to have mutations in CFTR that could potentially cause CF. But they haven’t for him.

For whatever reason Colman had a genetic test including CFTR eleven years ago. It’s unclear in the articles I’ve found, but I’ve been told he’s likely to have tested positive for a sign of CF at birth and then had the genetic testing to confirm, but the disease never manifested later. So despite not having cystic fibrosis, his parents are aware that he carries mutations in the CFTR gene.

And, probably without thinking that there could be serious ramifications for Colman, his parents told the school that he carries mutations in CFTR when they were asked to report any pertinent genetic information.

The SFGate article discusses the progression from an innocuous bit of irrelevant “medical” information to Colman being transferred out of his school:

“A few weeks into the school year at Jordan Middle School, school officials took note of Colman's medical history, information that eventually was shared with another Jordan parent whose two children have classic cystic fibrosis and are predisposed to chronic lung infections.”

Can anyone else see some major problems with this little statement? First off, school officials were looking into this child’s medical history why, exactly? He does not have cystic fibrosis, so why was it being followed up so heavily by the administration? Second, his information should never have been shared with the other parent who has two kids with CF.

Moreover, with one in twenty-five Caucasians carrying mutations for CF, there are likely to be numerous children at the same school carrying CF mutations who are not being transferred because they never had any symptoms and they never had a genetic test. That’s one of the things that makes this action by the school so unreasonable.
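To put rough numbers on that, here is a minimal sketch. The enrollment figure is purely hypothetical (I don't know the school's actual size), and it assumes the one-in-twenty-five carrier frequency applies to the whole student body and that a simple Hardy-Weinberg approximation holds, neither of which is guaranteed.

```python
from fractions import Fraction

# Carrier frequency quoted in the post: about 1 in 25.
CARRIER_FREQ = Fraction(1, 25)


def expected_carriers(enrollment: int, carrier_freq: Fraction = CARRIER_FREQ) -> float:
    """Expected number of carriers among `enrollment` students."""
    return float(enrollment * carrier_freq)


def affected_frequency(carrier_freq: Fraction = CARRIER_FREQ) -> Fraction:
    """Rough frequency of people with two mutated copies.

    Hardy-Weinberg approximation (an assumption): carrier frequency is
    about 2q for a rare allele, so q is about carrier_freq / 2 and the
    affected frequency is about q squared.
    """
    q = carrier_freq / 2
    return q * q


SCHOOL_SIZE = 1000  # hypothetical enrollment, for illustration only

print(expected_carriers(SCHOOL_SIZE))  # 40.0
print(affected_frequency())            # 1/2500
```

Under those assumptions, carriers would be expected by the dozens at a school of that size, while people who actually have CF would be far rarer. That gap is the whole point: carrying a mutation and having the disease are very different things.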

So I tried to think about how this could have happened.

I think what probably occurred was that Colman’s parents reported his CFTR mutations just in the interest of being fully compliant with school rules that say, “Tell us everything (or else),” which most parents are probably familiar with.

Then a nurse or administrator or someone (who hopefully had permission to see Colman’s records) who was familiar with cystic fibrosis (at least at the level of understanding that two kids with CF shouldn’t be in the same class) noted that Colman had CF mutations and misunderstood that to mean he has cystic fibrosis.

I’m not sure whether the other parents were really informed or not (at least one article suggests they were—if they were, they shouldn’t have been). But what I can imagine is that in the interest of avoiding any potential problems (imagined or otherwise), they decided to transfer Colman out without really understanding that he does not have cystic fibrosis.

There’s some pretty strong evidence of this in a letter from the superintendent that was posted on PaloAltoPatch. I’m pasting it below for posterity:

“The Palo Alto Unified School District strives to meet the needs of all students and to ensure that a safe and welcoming learning environment is provided for all our students and staff. To be clear, this commitment includes caring for our students with medical needs.
“At this time, our District serves students who are diagnosed with Cystic Fibrosis (CF), a serious medical condition that creates a need for our staff to observe strict protocols on cleanliness in order to help protect these students' fragile condition. I am proud that our staff energetically and compassionately works to observe the necessary protocols for these students. Further, I am proud that our staff thoughtfully embraces these, and all, students.
"A recent story that appeared in the San Francisco Chronicle addresses the distance -- the literal physical distance -- that must be constantly maintained between children with CF in order to avoid bacterial cross contamination. I want to stress that medical authorities clarify that CF is not a health threat to the general public (a theme that was touched on in the story), but it is a topic of concern for non-sibling children with CF.
“As background, at the beginning of this school year, we became aware that students (not related) with CF were on the same middle school campus. Based on the advice of medical experts, who stress the need for non-sibling CF patients to constantly maintain a specific distance from each other, we thought it best to place the students at different campuses. Again, based on the advice of medical experts, this is the zero risk option, and most certainly helps our District deliver on its commitment to provide safe learning environments.
“We asked the family of the student who is new to our District if they would consider moving their child to Terman Middle School. We asked because Board policy gives priority site choice to students who are established at a site, and the other students have been with our District for years. Sadly, it was this request that caused the controversy, and the media coverage.
“I want to stress a few critical points.
“First, at both of these schools the dedicated staff are stepping up and working to provide these students with excellent learning environments. We are grateful for their professionalism and compassion.
“Second, we are education professionals, not medical experts. I assure you that we will take action based on recommendations from medical authorities. Expert advice will be balanced by the PAUSD Board policy that strictly instructs our District to ensure we provide a safe learning environment for all students. To be clear, our Board of Education places the highest priority on student safety, in all situations and in protecting the privacy interests of all students.
"Finally, we must be honest about the fact that we are talking about public schools, not medical facilities. We will strive to implement all health and safety protocols to help protect the health of these children; however, the harsh reality of a busy middle school campus, where students ranging in ages from 12 to 15 share a cafeteria, restrooms, the gym and locker room, a library and other settings, is that it might be virtually impossible to maintain a specified separation and sanitation protocols at all times. This reality is what made us gravitate to the separate campus option. CF is a life-threatening condition, and in this context, based on the information provided to us at the time by all parties, we believed that zero risk was the best course of action. We hoped that these families would agree.
“This topic is understandably emotional for the families involved, and we will work with them to meet the needs of their children. However, I must again stress that we will follow the advice of medical experts, who we trust will help us understand the option that eliminates, if possible, all risks.”
Sincerely,
Charles F. Young, Ed.D.
Associate Superintendent Education Services
Palo Alto Unified School District

I feel that this letter clearly shows that at the very least, Dr. Young here was under the impression that Colman has cystic fibrosis. But he apparently does not. He carries mutations in CFTR, but he doesn’t have CF. Here’s another piece of evidence that Dr. Young and the other administrators are misinformed, thinking that carrying CFTR mutations is equivalent to having CF:

“The administrators sought medical advice, Silverman said Thursday, which resulted in a recommendation from Dr. Carlos Milla, of Lucile Packard Children's Hospital, saying that, ideally, children with cystic fibrosis would attend separate schools.
The recommendation was not based on knowledge of Colman's specific medical history, his parents and their attorney said prior to Friday's court appearance.”

Why else would they seek medical advice about whether children with cystic fibrosis should attend the same school unless they thought that Colman actually has cystic fibrosis? Rather than speaking with a medical doctor, they should have consulted with a board certified genetic counselor (of which Stanford, where they sought advice on this issue, has a number). At the very least they should have spoken with a medical geneticist or even just a geneticist like yours truly. They didn’t even know to do that, though, because they collectively did not even realize genetics makes a huge difference here.

The Lesson

I honestly do think this is all a big misunderstanding. I hope that the administration is willing to admit a mistake was made and rescind their decision to transfer this kid over a misunderstanding about how cystic fibrosis works.

But to me, it’s also a troubling sign. I’ve sequenced my own exome. I know genetic facts about myself, and if Colman’s case is any indication, I would have no immediate recourse if that information were to be used against me in a legal way. What’s happening to Colman has yet to be ruled on, but regardless, his life is still being thrown to the wind while these decisions get made based on misunderstanding and ignorance about how genetics and disease work.

Moreover, what of my own future children? Or my current relatives? Information comes out about me and that information is shared by my relatives, at least in part. Could it be used against them by people in positions of power who don’t understand genetics?

These are major reasons I’ve yet to share my genetic information with others openly. I would absolutely love to make my data public, to reap the benefits of the community at large studying my genome. I had plans to do just that, but I have yet to enact them for these reasons. Because the current protections do not consider all possible problems—like this issue with Colman Chadam.

Solutions

Currently, I can’t stress enough to people that they have to protect themselves. I love the idea of living in a perfect world where we can share our genetic data and not be subject to prejudice (intentional or unintentional) as a result. But we honestly don’t live in that world yet, so those of us who have been sequenced should seriously consider whether sharing our data is in our best interest.

That includes not sharing genetic information with, for example, your kid’s school district. Of course we think it’s pertinent information they might need, but they do not need to know that your child carries CFTR mutations (only whether or not the child actually has CF).

As for Colman and his family, I think they need to push this. At the personal level, what’s happened to Colman is certainly unjust and possibly illegal. And for the PAUSD and its administrators, they need to learn a lesson here—that they need to fully comprehend what’s going on with students medically before making blunt-ended decisions about what to do with them. That if they are confronted with genetic information, they need to see a genetic specialist to interpret it correctly.

On a more general level, it needs to be recognized that genetic information cannot be used to discriminate against people even in the most innocent-seeming way, even when it feels like the right thing is being done. Until then, we who have genetic information about ourselves need to be careful with it and how we share it.

This issue is starting to spring up as genetic testing becomes more common. Imagine if a young person has genetic predisposition to asthma, but has yet to have an asthma attack. Do you stop that child from playing soccer or do you give her an inhaler just in case? The first one is, in my opinion, genetic discrimination. The second one is just common sense (or at least it should be).

Mac OS X Xcode/Command Line Tools

2012-08-11T18:35:00.001-07:00

So it looks like Xcode 4.1.1 no longer includes command line tools now that it's on the Mac App Store. Terrible.

The good news is, after going through multiple Apple login pages and hunting around, I finally got to the one that gets me command line tools:

https://developer.apple.com/downloads/index.action

Go there, log in, and in addition to Xcode itself there is a SEPARATE download for command line tools. Wonderful.

I guess I should be happy they still have them somewhere to download. Too bad they removed them from the general Xcode download (or, alternatively, too bad they haven't made them another option on the App store if that's the direction they're heading).

ANNOVAR Patching (May 11, 2012)

2012-05-11T14:47:00.001-07:00

Making Annovar Work

Add support for Feb 1kG release

It turns out I had made a typo in the command and the Feb 2012 variants work fine! Just be very careful to have it as "1000g2012feb_all" in the annotation part.

Kai, the author of Annovar, explained to me that the 1kG is going to start releasing variants in ethnic subgroups in the future, thus the addition of the "_all" to these datasets now. Works for me! (see below)

Fix to 1kG annotation confusion

Also something to note is that the Annovar website is a bit unclear (in my opinion) about the naming scheme it uses for the databases from 1kG. To download them, one would use the naming scheme on the downloads page (e.g. "1000g2011may").

But line 136 in the annotate_variation.pl script enforces a naming scheme that doesn't match this. Rather, one would need to use "1000g2011may_all" or else the db isn't recognized as native and is treated as a "generic" db.

Rather than patching a "fix" to that, I simply changed my queries to add the "_all" onto all of the 1kG annotations.
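Here is a minimal sketch of that workaround as a tiny helper. The function name and the rule it applies are my own illustration, not anything shipped with Annovar; it just captures the "download name plus _all" convention described above.

```python
def to_annotation_dbtype(download_name: str, population: str = "all") -> str:
    """Hypothetical helper: turn a 1kG download name into the name used
    when annotating.

    Example: "1000g2011may" (download) -> "1000g2011may_all" (annotation).
    Names that already carry a suffix, and non-1kG databases, are
    returned unchanged.
    """
    is_1kg = download_name.startswith("1000g")
    has_suffix = "_" in download_name
    if is_1kg and not has_suffix:
        return f"{download_name}_{population}"
    return download_name


for name in ["1000g2011may", "1000g2012feb", "1000g2012feb_all", "tfbs"]:
    print(name, "->", to_annotation_dbtype(name))
```

The `population` parameter is there only because the 1kG is expected to release ethnic subgroups later; which suffixes will actually exist is not something this sketch knows.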

TFBS memory problem

I don't know if I'm experiencing this because of using a new cluster system (I think that's why) or if it's the new version of Annovar, but I was actually having memory problems when annotating against transcription factor binding sites ("tfbs" in Annovar). Thankfully, there are now memory options built into Annovar that seem to help with this.

I simply added "--memtotal 100000000" and it started working.

Not sure what happens if it actually tries to use more than that much memory--does it dump to a tmp file or crash? Not sure. But so far it's working with this much room.

What's kind of weird is no other database has this issue so far. Just TFBS of all things.
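For the record, this is roughly how I would assemble the call from a wrapper script. Treat it as a sketch: the input file name and database directory are placeholders, and the `-regionanno` and `-dbtype tfbs` arguments reflect how I understand region annotation is invoked, so check them against your Annovar version. Only the `--memtotal 100000000` part comes straight from the fix above.

```python
def build_tfbs_command(avinput: str, dbdir: str, memtotal: int = 100000000) -> list:
    """Build (but do not run) an argument list for TFBS region annotation.

    avinput and dbdir are placeholder paths supplied by the caller.
    """
    return [
        "annotate_variation.pl",
        "-regionanno",                # assumed flag for region-based annotation
        "-dbtype", "tfbs",            # the TFBS database name used in this post
        "--memtotal", str(memtotal),  # the memory setting that fixed it for me
        avinput,
        dbdir,
    ]


cmd = build_tfbs_command("sample.avinput", "humandb/")
print(" ".join(cmd))
```

Keeping the command as a list makes it easy to hand to `subprocess.run` later without worrying about shell quoting.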

Cliff Reid on CG vs Illumina

2012-03-12T19:23:00.002-07:00

Recently saw a post on the Complete Genomics website from Cliff Reid discussing our Nature Biotech paper (Lam et al, Dec 2011), in which we sequenced the same individual to high depth on both Complete Genomics and Illumina and compared the results.

I think some of it is fair, but I do want to go a bit into it because it’s more complicated than just breaking it down by the Sanger validation rate.

Accuracy/Sensitivity

First off, Cliff addresses the fact that we found CG was more accurate than Illumina when it comes to SNP detection rate. To determine this, we used three small Sanger validation sets. We took a set of 20 SNPs detected by both platforms, a set of 15 detected only by Illumina and a set of 18 detected only by CG and Sanger sequenced them. Here’s how it broke down:

	Number tested	Validated	Validation rate
Both platforms	20	20	100%
Illumina-specific	15	2	13.3%
CG-specific	18	17	94.4%

This is certainly pretty cut-and-dried—obviously the CG-specific variants that we were able to validate were accurate at a higher rate than the Illumina-specific ones. However, I think a fair criticism of this very experiment was the relatively small number of variants that were validated by Sanger.

The Complexities of the Experiment

I think we need to consider a bit more the problem of the very small number of SNPs that we validated in the Sanger sequencing and, perhaps more importantly, what those SNPs were and why we went ahead and used the SureSelect data instead to draw our conclusions.

The SNPs that were selected for Sanger validation included SNPs at every quality level that passed thresholds. We selected twenty variants from each set (concordant, Illumina-specific, and CG-specific) and tried to validate them by Sanger.

Please note the “number tested” in the above table doesn’t match the 20/20/20 that I just said we designed. This is because, across the two platform-specific sets, we were unable to design primers that amplified product for seven of the SNPs (five Illumina-specific and two CG-specific), and those were indeed consistently “low quality” relative to the mean quality score for concordant variants (although, again, they passed threshold).

This is why we went forward with using the Agilent SureSelect targeted sequencing data for validation. Of course, we fully realized that such an assay would be potentially biased towards Illumina because the validation is being done on the same machine as the whole genome sequencing. That was, in fact, the reason we initially went for Sanger. But after sixty Sangers and realizing it would be hundreds more (think of that in terms of time, man-power and monetary cost) before we could generate anything really meaningful with Sanger, we decided the few we did had accomplished our goal of at least demonstrating that concordant SNPs are highly accurate but non-concordant are not and moved on.

In Cliff’s post, he extrapolates the number of platform-specific SNPs that would validate if the Sanger rates were correct across the board and concludes that CG is, in fact, more sensitive than Illumina. I caution against using this Sanger data this way because in the paper, we clearly utilized the SureSelect capture validation to make up for an inadequate Sanger experiment.

Recalculating

A note on experimental design and how it gets conveyed in the literature: we tried to make it clear that the Sanger was only suggestive, not conclusive. I hope that got conveyed, but I’ll reiterate here that the Sanger data in that paper is limited in its usefulness because there simply isn’t enough of it and because of the lower mean quality scores at those positions.

We do, in fact, have a figure (Supplementary Figure 1) that demonstrates lower quality scores for the platform-specific SNPs from both platforms, and that played out in the validation. But the problem is that the sentence describing that is four paragraphs up from the Sanger validation paragraph and it isn’t linked in the text.

Here’s a quote from Cliff’s post that I want to respond to:

“The paper also points to the magnitude of the problem caused by validating the Illumina platform with the Illumina platform. The Sanger validation data can be used to estimate the confidence in the results of the target enrichment validation data. If the Illumina unique SNPs really were 64.3% true SNPs as reported, then the likelihood of getting the Sanger validation results (2 of 15 validated SNPs) is less than 1 in 10,000. While the exact Illumina SNP validation rate is unknown, the Sanger data tells us that we can be more than 99.99% confident that it is less than the 64.3% calculated by this biased validation approach. For these reasons, we believe 64.3% is not the correct number to use in calculating the sensitivity of the Illumina platform in this study.”

I want to stress the importance of Table 2 in the paper and how it shows perhaps the most important information in the entire paper. Here’s a summary of it:

	Validated	Invalidated	Not validated	Validation rate
Concordant	81.0%	6.4%	12.6%	92.7%
Illumina-specific	50.5%	28.0%	21.4%	64.3%
CG-specific	53.9%	33.2%	12.9%	61.9%

“Validated” means those that were present in the whole genome sequencing and observed in the SureSelect data (true positives). “Invalidated” are those that were present in the WGS, but observed as false in the SureSelect data (false positives). And finally “not validated” are those that could not be detected adequately in the SureSelect data. The “validation rate” was determined by removing those “not validated” from the targeted count and then determining what percent of those were validated (not invalidated).

This is key. Those “not validated” do include quite a few of those “low quality” candidates that wouldn’t validate by Sanger either, of course. And those make up a relatively high proportion of the Illumina-specific SNPs (21.4% compared with CG-specific SNPs at 12.9%).

Now go back to the total counts and extrapolate those percentages to actual variant counts. If 21.4% of Illumina-specific variants are of this ilk, that brings the Illumina-specific count down from 345,100 to 271,248 SNPs. If 12.9% of CG-specific variants are of this ilk (which seems reasonable given both Sanger and SureSelect), the CG-specific count goes down from 99,578 to 86,732 SNPs.

If you’re following me, we’re now at variant counts to which we can apply our “validation rate” to determine the actual number of true positives in a way Cliff might approve of (but without using the inadequate Sanger data). Here are the results:

	Total	Extrapolated "good" calls	Extrapolated validation
Concordant	3,295,023	2,879,850	2,669,621
Illumina-specific	345,100	271,248	174,412
CG-specific	99,578	86,732	53,687
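
The arithmetic behind that table can be sketched in a few lines of Python. This is not the script used for the paper, just a check under two assumptions: the percentages are taken as the rounded values quoted above, and results are truncated rather than rounded to whole SNPs. The function and variable names are made up for illustration.

```python
from decimal import Decimal

# Values copied from the tables in this post; Decimal keeps the arithmetic exact.
rows = {
    # category: (total SNPs, "not validated" fraction, validation rate)
    "Concordant": (3295023, Decimal("0.126"), Decimal("0.927")),
    "Illumina-specific": (345100, Decimal("0.214"), Decimal("0.643")),
    "CG-specific": (99578, Decimal("0.129"), Decimal("0.619")),
}

def extrapolate(total, not_validated, validation_rate):
    """Return (extrapolated "good" calls, extrapolated validated calls)."""
    good = total * (1 - not_validated)   # drop the "not validated" share
    validated = good * validation_rate   # apply the validation rate to the rest
    return int(good), int(validated)     # int() truncates toward zero (assumption)

results = {name: extrapolate(*values) for name, values in rows.items()}

for name, (good, validated) in results.items():
    print(f"{name}\t{good:,}\t{validated:,}")
```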

This being the case, I think it’s clear why we say “Illumina was more sensitive.” I feel confident these are the numbers to use, and Illumina detected quite a few more total SNPs. However, it clearly has a higher error rate as well, so that can affect things downstream, as it’s not trivial to differentiate all those errors from the total.

As for the criticism regarding the chances of getting a 2/15 validation rate if the actual validation rate is 64.3%, for one thing I think we should use different numbers: in this case, 2/20 (the total number of those tested and validated by Sanger) and 50.5% (the total number tested and validated by SureSelect). That detail aside, you’re still going to get a very small probability (e.g. a hypergeometric p of 0.0001).

But I can also look at it another way. What’s the probability of the CG-specific result doing the same thing? It’s okay, but it’s not that likely (p=0.11). Yet you’re talking about 53.9% and 13/20 there.

There are two reasons for that:
1) 20 is a small number. With a ratio around 50-55%, unless you get 10 or 11 out of 20, you’re deviating pretty dramatically. In fact, the range for p > 0.05 with 20 pulls and a 53.9% ratio is only from 7 to 15. This is why I said we “would have had to do hundreds of Sangers”.
2) Sanger sequencing is different from SureSelect target enrichment sequencing anyway. It’s not a true subset in the first place, and is susceptible to sources of error that don’t affect next-gen sequencing.
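
To get a feel for how touchy these probabilities are at n = 15 or 20, here is a rough Python sketch. Note the swap: it uses a plain binomial model rather than the hypergeometric calculation mentioned above, because a hypergeometric test needs population sizes that aren't given here. Treat the numbers as ballpark figures only; the function names are made up for illustration.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binom_cdf(k, n, p):
    """Probability of k or fewer successes in n independent trials."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))

# Cliff's scenario: 2 or fewer of 15 Sanger validations if the true rate were 64.3%.
p_cliff = binom_cdf(2, 15, 0.643)

# The alternative numbers suggested above: 2 or fewer of 20 at a 50.5% rate.
p_illumina = binom_cdf(2, 20, 0.505)

# CG-specific: exactly 13 of 20 at a 53.9% rate.
p_cg = binom_pmf(13, 20, 0.539)

print(f"2/15 at 64.3%:  {p_cliff:.2e}")
print(f"2/20 at 50.5%:  {p_illumina:.2e}")
print(f"13/20 at 53.9%: {p_cg:.3f}")
```

Even under this simpler model, a result only a few counts away from the expected value already has a low probability with 20 draws, which is the point of reason 1 above.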

Anyway, I don’t think that criticism is overly fair. Really, to me, it only supports that our Sanger data should not be used this way in the first place.

Finally, I don’t want to seem like I’m bashing CG here. To the contrary, CG performed exceptionally well in our paper. From what I saw, it is without a doubt more accurate.

(Maybe I should write that paper where I add SOLiD into the mix…)

Exome Annotations

2012-02-17T15:48:00.000-08:00

I just posted a thread on 23andMe about which annotations I use for my exome data. Here's what I said:

I currently use Annovar for annotating VCF files. The output from Annovar is not particularly intuitive, so I wrote a Perl script that generates a VCF-based report. I thought I would share the annotations I've been using and ones I plan to add, and see if anyone else has any other annotation ideas. These could be useful for us to annotate our own genomes (and potentially for 23andMe to provide in the future).

The annotations I've been including are:

Gene annotation (type of mutation--exonic, intronic, splicing, etc.)

Gene name

Mutational description (i.e. specific amino acid change, etc.)

dbSNP130

dbSNP135

Exome Variant Server (EVS), hosted at the University of Washington

Transcription Factor Binding Site (TFBS)

SIFT score

PolyPhen 2 score (PP2)

GWAS presence

Segmental duplication

(The reason I include both dbSNP130 and 135 is that 135 contains quite a few SNPs that are potentially meaningful from a disease and trait standpoint while 130 is mostly markers not directly affecting diseases and traits. 130 is a subset of 135. Also, the EVS is potentially more useful than either of them as a filtering device.)

Ones that I would like to include in the future:

VAAST

MIE sites/scores (Mendelian inheritance errors)

23andMe annotations (anything from 23andMe's SNP databases--can 23andMe help with that?)

Any other ideas for great annotations that should be included?

The idea behind these types of annotations is to give us a way to sift through the data and extract biologically meaningful results. For example, we are most interested in mutations that actually cause a protein coding change, that are uncommon in the population, and that are predicted to have a dramatic effect on function.

So far these types of annotations have allowed me to narrow very long lists of results in exomes (think on the order of 30,000-50,000 mutations) down to just a handful (1-20) of candidate mutations for particular Mendelian disorders.
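
To make the filtering idea concrete, here is a minimal Python sketch of that kind of narrowing. It is not my Perl script, and everything in it is hypothetical: the record layout, the gene names, the field names, and the cutoffs are all assumptions chosen for illustration (real Annovar output uses different columns). The only facts relied on are that lower SIFT scores and higher PolyPhen 2 scores predict more damage.

```python
# Hypothetical, simplified variant records; real Annovar output looks different.
variants = [
    {"gene": "GENE_A", "effect": "nonsynonymous", "evs_freq": 0.0, "sift": 0.01, "pp2": 0.98},
    {"gene": "GENE_B", "effect": "synonymous", "evs_freq": 0.21, "sift": 0.60, "pp2": 0.05},
    {"gene": "GENE_C", "effect": "nonsynonymous", "evs_freq": 0.12, "sift": 0.02, "pp2": 0.91},
    {"gene": "GENE_D", "effect": "stopgain", "evs_freq": 0.0, "sift": None, "pp2": None},
]

PROTEIN_CHANGING = {"nonsynonymous", "stopgain", "frameshift", "splicing"}
MAX_POPULATION_FREQ = 0.01  # assumed cutoff for "uncommon in the population"
SIFT_DAMAGING_MAX = 0.05    # assumed cutoff; lower SIFT scores predict more damage
PP2_DAMAGING_MIN = 0.85     # assumed cutoff; higher PP2 scores predict more damage

def predicted_damaging(variant):
    """Non-missense protein changes pass outright; missense needs SIFT or PP2 support."""
    if variant["effect"] != "nonsynonymous":
        return True
    sift_hit = variant["sift"] is not None and variant["sift"] <= SIFT_DAMAGING_MAX
    pp2_hit = variant["pp2"] is not None and variant["pp2"] >= PP2_DAMAGING_MIN
    return sift_hit or pp2_hit

def is_candidate(variant):
    """Protein-changing, rare, and predicted to be damaging."""
    return (
        variant["effect"] in PROTEIN_CHANGING
        and variant["evs_freq"] <= MAX_POPULATION_FREQ
        and predicted_damaging(variant)
    )

candidates = [v["gene"] for v in variants if is_candidate(v)]
print(candidates)
```

In practice each filter would be tuned to the disorder in question; a recessive condition, for example, tolerates a higher population frequency than a dominant one.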

Anything I missed?

Sequencing My Exome: Why?

2012-02-10T17:22:00.000-08:00

“Why do you want to do this?”

My wife, immediately after I tell her I'm going to sequence my own exome.

There are a few times in life where you want to do something so badly, but find it difficult to convey to others why. This was one of those times. Such a simple question, but so many different answers. And each answer as valid as all the others. All of them coming together to explain why I would want to do something so, well, unusual.

I could frame a whole dissertation on the reasons behind wanting to sequence my own exome. (And I will.) But first, the simple answer:

I’m curious about myself.

I want to see if I can figure out why I am the way I am. And by that I mean both physically and mentally. For some people, this isn’t something they’ll ever think about. For others, they might see very clearly that they are this way because God made them this way, or because their parents raised them like this, or because they had bad luck. They may simply accept that they have specific features that make them who they are and aren’t concerned with why.

For me, those answers are not good enough.

I know that there are mysteries to be solved in my genetic code. It comes with the territory of being a geneticist. That said, though, almost everyone thinks this way, usually without even realizing it. We can all look at our own families as a proxy for genetics. If your mom had type 2 diabetes and your sister has type 2 diabetes and your uncle has type 2 diabetes, you’re pretty sure you have a higher chance of getting type 2 diabetes. You’ll hear all sorts of people saying that: “guess I got my mom’s bad genes” and “guess he took after his father” and so forth. If you’ve ever known somebody who got old enough, you might have heard her tell you about how her mother lived to a ripe old age, and her mother before her, and so on. That’s what I call thinking genetic.

The difference for me is that I’m thinking genetic at a different level. I actually want to look at my genetics to try to explain these types of things. I don’t believe in fate without reason. If my mom lives to be 90, and my grandmother lived to be 90, I want to know if I got the mutations that helped them get there. Could I just shrug my shoulders, say, “I probably did,” and move on? Absolutely. But that’s just not good enough for me.

And really, it’s not good enough for anyone. We’re no longer entering the era where we can do better than that. We’re already there. Exome sequencing represents that first major step into the era.

I think, to make a case for exome sequencing (and by the way, whole genome sequencing is basically just an expansion and improvement upon exome sequencing—more on that later), I first need to explain what we can learn from it. And to do that, first you’ll need to know what an exome is. For those readers who already know all about this, feel free to skip down.

What is an exome, anyway?

To understand what the exome is, you first have to understand what the genome is. There are massive tomes on the details of the subject, but to describe the genome succinctly:

The genome is the blueprint for every cell in your body.

Every single protein in every one of your cells is encoded on this massive blueprint. In order to create and maintain a cell (and therefore, your body and very being), your cell quite literally reads the genome and generates certain amounts and types of various proteins to fit the particular cell it’s trying to become or to fulfill a particular function.

The exome is a subset of the genome that contains the instructions to create the proteins themselves. The exome makes up about 1-2% of the whole genome. If the genome is the blueprint, then:

The exome is the instructions for making every protein in your body.

Therefore, being able to read those instructions means we can figure out if differences in them will result in different protein structures.

What can I learn from the exome?

Identifying variations in the exome that lead to differences in proteins (which we call mutations) can give us a direct way of determining if a protein might have altered function in us compared to other people. Significant protein mutations will manifest themselves as traits. To bring up an example from a previous post, the earwax trait is a result of a variation in the exome that leads to a mutation in a protein that results in determining if your earwax will be wet or dry. But it goes far beyond that type of “interesting” trait. Mendelian disorders (which this blog derives its title from) are disorders resulting from mutations in a single gene, which we can detect in the exome (and, in fact, quite a few Mendelian disorders have been “solved” through exome sequencing now).

By sequencing the exome, we can directly assess every line of the “instructions” and identify those lines that differ from the norm.

But that’s not the only way to use this information. We can hunt for mutations that damage our proteins, and that is the first obvious thing to do when looking at the exome. But we don’t know how every mutation will affect a person. To the contrary—there are very few mutations for which we understand the effect.

In fact, the current standard in personal genomics testing (such as that from DTC companies like 23andMe or through-physician companies like Navigenics) is actually an approach that dominated the field for about a decade before next-generation sequencing really became a reality. Using microarray technology, these approaches measure specific sites known to harbor variants in the genome that are associated with a trait or disease but typically not causative for the trait or disease.

For example, right now if you were to do a standard 23andMe test, you’d have genetic variations assessed at about a million sites across your genome. These variations would then be compared to a database that tells how strongly particular variations associate with particular traits or diseases. So 23andMe can tell me that I have a collection of variants that associate with type 2 diabetes, and it can calculate how that increases my risk of getting the disease compared to the average person.

This is more of a science than people often think. This is thinking genetic at a slightly more advanced level. I could simply turn to my family history and guess that I’m at an increased risk for type 2 diabetes. However, the fact that my genetics confirm the increased risk makes it much more “real” to me. Not only do I have a family history, I actually inherited some of those genetic factors. My risk is real.

Knowing my exome sequence takes that to the next level. Rather than simply having associations, I may be actually able to go into the regions of association and identify mutations causing these problems.

Moreover, as more and more information regarding the genetic causes of various traits and diseases is discovered, my exome sequence will always be at hand for me to cross-reference. Imagine that tomorrow a study is released identifying a gene that tells you with complete confidence whether or not you’ll get type 2 diabetes. I would check that gene in my own exome for mutations immediately!

That may sound unrealistic, but when it comes to conditions like cancer, these kinds of studies come out all the time. I may identify a random mutation in my own genome in a gene that predisposes people to getting a particular type of cancer, and then I will know that I need to have my doctor monitor for that. Having worked closely on brain cancer for a few years, it struck me that the reason it’s the deadliest type of cancer is that by the time we detect it, it’s already at a very advanced stage. But if we have a gene or set of genes that we know predisposes people to get malignant brain tumors, we could look in our own exomes for mutations in those genes and then get ourselves MRIs starting at a particular age to try to detect them earlier and hopefully allow effective, long-term treatment.

I think anyone can see how powerful that type of diagnostic and predictive tool can be.

And that brings up a major reason to sequence one’s genome: This information is immutable. Your exome is not changing. On the day you die, you’ve got pretty much the same exome and genome you had when you were born. If a major discovery is made tomorrow, I’ll have my exome to look at for it. If another discovery is made in ten years, I can take that same exome sequence and look for it. There’s no “expiration date” on that information.

And that’s what really sold me on the whole thing, actually. My intimate knowledge that my exome is always going to be a part of me, and that our understanding of genetics and diseases will always be expanding. That means my investment now is going to pay off for my whole life. Or at least until I sequence my whole genome.

I hope that conveys my major reasoning behind why I would want to do this. Of course there are other factors as well. For one thing, I am a geneticist. Genetics is not just my job, it’s my hobby. I love it. And over the years I’ve become increasingly interested in my own genetics. But that’s honestly not the only reason. At this point, I see it as a choice that will help me keep myself healthy throughout my life.

I think there will be a shift generally towards that thinking in the medical community at large in the very near future as well. It may only be a couple years before your doctor suggests you get your exome sequenced as well. In a society where I feel most of us already think genetic, I think it's only a matter of time before we stop simply guessing that it's genetic and instead actually prove it. And beyond that, we actually figure out that there's something we can do about it. That is empowering right there.

Enter the Exome

2012-02-09T19:15:00.000-08:00

My Exome kit from 23andMe has arrived! In a few short weeks, I will have my exome sequence in hand and ready to analyze. I've looked at hundreds of exomes over the past year, but only now, when I'm about to look at my own, have I started to really think about how to extract meaning from a healthy individual's exome. All of the work I've done has either been to assess exome sequencing as a science or to hunt for mutations causing specific conditions (Mendelian Disorders and novel genetic syndromes).

Now I'm going to have my own sequence in hand and have a very basic yet exceedingly complex question to answer: What does this all mean?

And with that question comes other questions:

Why do I care about my own exome?

What can I learn from it?

What justifies the cost?

Is it safe?

In the coming days, I will be posting about my answers to these questions and more. I came to a realization a couple of days ago (right after I ordered this kit) that even questions that seem simple to me as a geneticist are not so simple for most people.

"Why do you want to do this?" is harder to answer than people may think. Off the cuff I might say, "I'm a geneticist, it's what I do!" but that isn't at all the whole answer. So I am going to make it a goal to explain why anyone would want to have his or her exome (or genome) sequenced in terms that hopefully anyone can understand.

GATK's available annotations

2011-12-09T16:33:00.001-08:00

Perhaps because it changes too often, the list of GATK's available annotations for VCF files does not seem to be online anywhere that I've seen. The GATK site says to run GATK with the "--list" parameter to list them. Doing that requires putting in valid input files and such. Basically, it's a pain.

So here's the list from GATK v1.3-21-gcb284ee:

Available annotations for the VCF INFO field:
    ChromosomeCounts
    IndelType
    HardyWeinberg
    SpanningDeletions
    NBaseCount
    AlleleBalance
    MappingQualityZero
    LowMQ
    BaseCounts
    MVLikelihoodRatio
    InbreedingCoeff
    RMSMappingQuality
    TechnologyComposition
    HaplotypeScore
    SampleList
    QualByDepth
    FisherStrand
    SnpEff
    HomopolymerRun
    DepthOfCoverage
    MappingQualityZeroFraction
    GCContent
    MappingQualityRankSumTest
    ReadPosRankSumTest
    BaseQualityRankSumTest

Available annotations for the VCF FORMAT field:
    ReadDepthAndAllelicFractionBySample
    AlleleBalanceBySample
    DepthPerAlleleBySample
    MappingQualityZeroBySample

Available classes/groups of annotations:
    RodRequiringAnnotation
    StandardAnnotation
    WorkInProgressAnnotation
    ExperimentalAnnotation
    RankSumTest

No promises about how accurate this is for any other version.

Stop SOPA

2011-11-16T15:56:00.001-08:00

I'm not a huge politics guy, so I don't want to go on a tirade about the Stop Online Piracy Act. Suffice it to say, it's a huge censorship bill parading as a bill to protect intellectual property. While I think the majority of us support protecting IP, I can't imagine the best way to do so is to monitor everything we do and censor websites based on some government-backed list of sensitive content.

As an example, if someone were to post copyrighted material in the comments section of my blog without my notice, my blog could potentially be shut down (censored) because of it. If I were to link to a site that had, somewhere on it, shared copyrighted material (even if I had no idea it was there and didn't intend for anyone to go there and see it or download it), my blog could be shut down (censored).

Moreover, the bill basically forces both providers and hosting services to strongly monitor content and shut down sites that potentially "infringe" on protected IP. Ever posted a picture of something that was copyrighted? Ever shared a link to a YouTube video with a copyrighted song in the background? That could be enough to get you shut down (because your host or provider doesn't want to be sued).

My wife and I often talk about the situation in Japan (she's Japanese) regarding the Fukushima nuclear reactor situation and how censorship and control of the media in Japan is so strong that the general public has no idea how dire the situation is. We even see that censorship and media control bleed over to here in America, where the general public is under the impression the nuclear situation isn't as bad as it is, in large part because the government and big media have an incentive to see nuclear power as an industry succeed.

Now here's the scary thing: If I were to start posting excerpts from copyrighted articles about that topic to respond to, if SOPA were to pass, my blog could potentially be shut down. I could personally be denied access. It's unclear to me exactly how much power this bill would give big media and the government. And that's the major problem.

In our field of genomics, a lot of us utilize freedom of sharing information and media to rapidly advance the science. I understand that this bill is meant to limit piracy of software and other digital media, but it represents a foot-in-the-door to all sorts of censorship. Could SEQanswers, for example, be sued for having a post up that contains the Illumina adaptor sequences? It certainly has been threatened in the past based on such things, but with SOPA passed, SEQanswers very well could have been shut down for that. What a detriment that would have been to the genomics and bioinformatics community.

Anyway, I just wanted to share this on my outlet to the world, as it is a very important issue generally and to our field in particular.

If you are an American and you do not support SOPA, please send a notice to your congresspeople telling them not to support it, either.

Drobo

2011-11-16T15:34:00.001-08:00

A secure, offline storage solution with automagical backing up. Forget the cloud for sensitive data.

#lovetech

There's something funny about this quality score

2011-11-03T13:15:00.001-07:00

@COLUMBO:3:1:1653:950#0/1
NGCCGCGATATCGGATCCAACAGATCGGAAGAGCTC
+COLUMBO:3:1:1653:950#0/1
BOOOOTTTTTYYYYY__________b____[T[[[_

Exome Sequencing Q&A

2011-10-13T11:16:00.000-07:00

In case you missed it, the most recent issue of Genome Biology is focused on exome-sequencing. Included among a number of exome-seq platform comparison papers (like this one and this one, which I intend to discuss in comparison with my own recent publication in Nature Biotech in a future post) is a fun Q&A article entitled "Exome sequencing: the expert view". The article asked three experts (Dr. Leslie G. Biesecker of NHGRI, Dr. Kevin V. Shianna of Duke University, and Dr. Jim C. Mullikin of NHGRI) a series of questions related to exome-seq ranging from what we've learned to whether exome-seq will be around in a couple years.

Since I like to think of myself as a bit of an expert on the topic (mostly because I dedicated the past year of my life to it), I thought it might be fun to answer the questions myself and discuss a bit what I think of the answers in the article.

How can exome sequencing contribute to our understanding of the dynamic nature of the genome?
Exome sequencing is quite literally the best way we currently have to assess the most interpretable part of the genome: the protein-coding sequence. While it accounts for only about 1% of the whole genome, that 1% is the part we understand the best, because it's the part that we can assess through central dogma. Basically, when we see a mutation in an exon, we can then determine how the RNA will be affected, and subsequently how the protein will be mutated. We can determine precisely which amino acids will be mutated, and which amino acids they will turn into. And thanks to decades of work on proteins and amino acids, we can predict fairly reasonably how damaging a particular amino acid change will be. We can also predict nonsense mutations (which almost always result in loss of the protein), as well as frameshifts (which typically lead to a nonsense mutation), splice site variations, and regulatory site mutations (assuming the UTRs/upstream and downstream regions are enriched).

I put it like this to someone not long ago: What would you look at if you sequenced someone's genome? Exons. That is literally the first thing, because it's the low hanging fruit. If you find nothing there, you might zoom out and look at splice sites (which exome-seq generally captures), micro RNAs (which exome-seq generally captures), and regulatory elements (which are becoming recognized as increasingly important--be ready for regulatory element/transcription factor binding site enrichment kits in the near future). As for the intergenic regions, they're such a mystery with regards to function that I feel secure saying we basically don't know what to do with variants in those regions.

At the very least, I have no doubt exome-sequencing will drive a revolution in Mendelian disorder research (it already is). We've already seen a number of solved disorders published, and as an increasing number of samples are sequenced, that will only improve. Because Mendelians are most often caused by changes in the protein coding region of genes, exome-seq is a prime way of solving them. The one major barrier I see for solving Mendelians is the (perhaps surprising) prevalence of dominant low-penetrance disorders. Mendelian disorders with this mode of inheritance are inherently difficult to solve, and will require a large number of samples sequenced. That said, this again primes exome-seq as a major method, and I feel secures it as a major technology for at least a few more years.

In the paper, Kevin Shianna mentions some shortcomings. In particular, he mentions that structural variations (SVs) are difficult to detect by exome-seq, which is certainly true. I would say there is some light at the end of that tunnel, at least with regards to copy-number variations (CNVs), as some successful work is already showing positive results with detecting CNVs in exome-seq.

How much has exome sequencing been driven by cost alone?
Certainly I think the increasing prevalence of exome-seq has been almost completely driven by cost. Targeted sequencing (and subsequently, exome-seq) were developed precisely because labs wanted to assess particular regions by sequencing without paying for an entire genome sequence (most of which they wouldn't even look at the data from). That said, were the exome-seq and WGS assays the same cost, there would still be significantly higher cost both in terms of time and money associated with the bioinformatics, analysis and storage of WGS data compared with exome-seq data. Even at equivalent assay cost, exome-seq is cheaper. Will that ever change? Unlikely, I think.

Consider the value of an exome versus the value of a whole genome currently. One could certainly argue that because the information in the exome is so much more meaningful (please don't write me letters for saying that), every exome base is worth significantly more than intergenic bases. In the paper, LGB and JCM make the point that an exome costs about 1/6th what a whole genome costs. That seems pretty accurate to me. And that does demonstrate how much more valuable we consider the exome compared to the other regions. Because of that I would not say that exome-seq has been driven by cost alone, but also because it is considered more valuable per base than whole genome.

What are the major limitations of exome sequencing?
Obviously, any meaningful variation outside of the exome is missed by exome-seq, and that's a major drawback. Just because they're less interpretable does not mean that non-exonic variations are meaningless. Missing SVs is another large drawback, although I feel it's important to point out that SV detection does not require a deep whole genome sequencing experiment--a low depth WGS that is not adequate for small variant calling is more than adequate for SV calling. Therefore, exome-seq could be supplemented by low depth WGS for SV calling without going as far as doing a full WGS. I'd estimate the cost of an exome-seq plus low-depth SV-WGS to still be less than 1/4th the cost of a full WGS.

The other major issue, which is also brought up in the paper, is the fact that exome-seq misses almost everything it doesn't target. If the mutation causing a particular disorder is in an exon that doesn't happen to be on your target list, you're going to miss it. WGS, on the other hand, would likely catch it. That's a major limitation. However, if the exon is poorly annotated and that's why it's not on the exome-seq design, there's a good chance you'll miss it with WGS as well. So while this is a major limitation, it's as much a limitation of sequencing in general: mutational analysis is heavily reliant on annotation, which is incomplete at best.

What lessons from exome sequencing studies can be applied to whole genome sequencing studies?
Most of what we've learned from exome sequencing studies is directly applicable to whole genome sequencing studies. Currently, the field is in a state where exomic variation is highly detectable and interpretable, and that's due in no small part to exome-seq. Coding region annotation in particular has become significantly more powerful as a result of all the exome-seq work that's been done. We have a much stronger appreciation for the limitations of sequencing as well. Although we are very sensitively detecting variants, even at minuscule false positive rates we have trouble detecting disease-causing variants with single samples. As far as I'm aware, not a single Mendelian disorder has been solved by sequencing of a single sample (exome or otherwise). That's because we need some way to eliminate false positives, and we use family members and additional samples for that.

And that brings up the other major lesson we've learned that we'll need to apply to WGS in the future: Small sample groups are simply not adequate for most purposes. Even for monogenic conditions, numerous samples and multiple families are required. Much like GWAS and linkage studies, we're going to need to sequence a lot of people to find meaning. This has a lot to do with sensitivity and specificity. A false positive rate of 1% sounds okay in a lot of fields, but in genomics, that's 400 false positives in the exome and 30,000 in the whole genome. Good luck finding your one disease causing mutation in there! But we're doing it through clever decision-making with regards to which family members to sequence, which individuals to sequence, et cetera. And that is perhaps the most useful lesson from exome-seq so far.
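
The false positive figures above are simple to reproduce. In this small Python sketch the call counts are assumptions, back-solved from the numbers in the paragraph (400 and 30,000 at a 1% rate); they are not measurements from any particular sample.

```python
# Assumed call counts, back-solved from the figures in the paragraph above;
# they are illustrative, not measurements.
FALSE_POSITIVE_RATE = 0.01
assumed_calls = {"exome": 40_000, "whole genome": 3_000_000}

expected_false_positives = {
    assay: round(n_calls * FALSE_POSITIVE_RATE)
    for assay, n_calls in assumed_calls.items()
}

for assay, n_false in expected_false_positives.items():
    # One causal mutation has to be picked out from this many wrong calls.
    print(f"{assay}: about {n_false:,} false positives")
```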

How do exome sequencing studies contribute to our mechanistic understanding of disease?
The most obvious answer is that exome-seq is very likely to be how we find the genetic etiology of the vast majority of Mendelian disorders. How this will translate to complex disease has yet to be seen, though it's certainly being pursued, particularly in the field of autism.

I would like to respond to a statement made by Kevin Shianna in the paper:

However, exome studies will have very limited power to identify causative variants in regulatory regions spread across the genome (transcription binding sites, enhancers, and so on). Implementing a WGS approach would allow detection of variants in these regions, thus increasing our knowledge of disease beyond the coding region of the genome.

While it is certainly true that exome-seq may be unable to identify mutations in the regulome (yes, regulome), WGS is not the only alternative to it. Why not instead supplement the exome-seq with a regulome-seq? I guarantee you companies are going to create such a thing, and even if you don't want to wait for them to be commercially available, one could create a custom target enrichment that covers the regulome as well.

Back to the cost issue, at about 1/6th the cost of WGS each, exome-seq + regulome-seq would still only come to about 1/3rd the cost of WGS. Outside the exome and regulome, is there much more in the genome that we can even comprehend by sequencing at this moment? Structural variation, perhaps, but as I've already mentioned, SVs can be detected with a low depth WGS. So for half the cost, we may be able to obtain the same biological meaning. That's something to seriously consider when deciding how to budget your sequencing. If you can squeeze out double the samples and obtain practically the same info by doing exome-seq + regulome-seq + low depth WGS, wouldn't you?
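
The budget arithmetic can be written out explicitly. In this Python sketch the 1/6 figures come from the text above; the low depth WGS cost of 1/12 is an assumption, taken as the upper bound implied by the earlier estimate that exome-seq plus low depth WGS comes to less than 1/4 of a full WGS.

```python
from fractions import Fraction

WGS = Fraction(1)                # one full-depth whole genome, used as the cost unit
exome = Fraction(1, 6)           # figure quoted from the article
regulome = Fraction(1, 6)        # assumed equal to the exome, as in the text above
low_depth_wgs = Fraction(1, 12)  # assumption: upper bound implied by "less than 1/4"

combo = exome + regulome + low_depth_wgs
samples_per_wgs_budget = WGS / combo

print(f"exome + regulome: {exome + regulome} of a WGS")
print(f"with low depth WGS: {combo} of a WGS")
print(f"samples per WGS budget: {float(samples_per_wgs_budget):.1f}")
```

Under these assumptions the combination stays under half the cost of a WGS, so the same budget covers a bit more than twice as many samples.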

I would also add that while it may not be mechanistic, the rise of exome-seq has led to a realization that we are not adequately prepared from a policy standpoint yet. More than anything, I think it has led to a general realization that medical genetics, genetic counselors and DTC genomics are arriving before society is really ready for them. And that's an issue that I think is going to need increased attention as we researchers charge forward with sequencing everybody.

Does exome sequencing have a limited 'shelf life'?
Not as limited as many people think. Even in the $1000 genome world, exome-seq will still have a place. I'll discuss this a bit more shortly, but there are genomic elements that are more sensitively determined by exome-seq. I feel strongly that once sequencing is sufficiently cheap, exome-seq will become the de facto standard for genomic information much as microarrays were for a good decade. This is due to a combination of factors--the majority of the genome being uninterpretable, the storage cost of genomes versus exomes, the bioinformatic challenges of whole genome versus whole exome, et cetera. The article goes into good depth on many of those issues. The one I'd focus on, however, is that until WGS is actually cheaper than exome-seq, exome-seq will always have a place in our field.

I have to disagree strongly with the following statement:

Yes, as soon as the difference in cost between exome and whole genome diminishes (which will be soon) and issues with data management and storage are resolved, whole genome sequencing will be the method of choice.

First of all, he knows as well as the rest of us that "issues with data management and storage" are not trivial to resolve. But beyond that, this is the stance that says, basically, "what's a thousand dollars for a whole genome compared to $300 for just the exome"? I state these numbers because that's where we're heading. At some point exome-seq will stop getting cheaper because the enrichment assay will always cost something. Same for WGS. And my answer to that question is simple: You'll still be able to get three exomes for the price of one whole genome.

The real thing that will limit exome-seq's lifespan is the interpretability of the rest of the genome. If we can utilize the intergenic data to a greater degree, then those bases become more valuable. Then WGS becomes more appealing. Until then, it's all exome (and regulome--I swear, it is coming soon!).

How much do you think that future research will be restricted by the IT-related costs of the analysis?
A great deal. Clouds are not cheap. I'm not convinced they're even the answer at this point. But are clusters? I'm not so sure there, either. I'm not convinced we have the IT solution to this yet because, frankly, I don't think the hardware manufacturers have realized what a lucrative market it's going to be.

I had a conversation about this with somebody not long ago. I said, we've gone from producing gigabytes to terabytes to petabytes over the span of four years in the sequencing world. By now I think there's a good chance we're at or nearing a worldwide exabyte of genomic information. As we start sequencing more and more people, we're going to start hitting the limit for storage capacity worldwide. Sounds crazy, right? But is it?

I also had another rather humorous thought that evolved from that one. How much energy will the number of hard drives needed to store the entire human population's genomic data require? How much heat will those drives produce? In the future, could sequencing be a major cause of global warming? Think about it!

Frankly, I've started to take the issue seriously. One of the tasks on my plate is a thorough assessment of alternative storage formats for genomic data. But that's a post for another day.

Are there any advantages of whole exome over whole genome sequencing?
Yes. Let me say that in less uncertain terms:
YES!
And by that I don't mean what Leslie Biesecker and Jim Mullikin said in their response in the paper. There are literally exonic regions that are better resolved by exome-seq than WGS. We demonstrate that in our recent Nature Biotech paper. A typical WGS will miss a small but meaningful number of exonic variations that are detected by exome-seq. To be fair, the opposite is also true: WGS will detect some exonic variations missed by exome-seq. To me, that says one thing: To be truly comprehensive at this point in time, we need to do both. Naturally, budgets prevent such a thing, but it's important to recognize that targeted enrichment can allow sequencing of regions missed by WGS, and that this is an advantage of exome-seq.

Reference:

Biesecker LG, Shianna KV, Mullikin JC. Exome sequencing: the expert view. 2011. Genome Biol. Sep 14; 12(9):128 [Epub]

And he's back with a site redesign!

2011-10-04T17:06:00.000-07:00

You may have been wondering: Where have I been?

Choose from the following:

Answer:
If you checked any of the above boxes, you're correct! Congratulations.

Yes, it's been a very busy time in my life, but I am finally back and promise to update more often than ever before.

Also, I hope readers like the site redesign. This is a new offering from Blogger: a dynamic theme that allows you to choose how you'll view it from the dropdown menu on the left. I'm a big fan of the "Magazine" look, so that's what I've set as the default.

Also, here's a cool site: [http://opensnp.org/] Basically, they let you sign up and will host your DTC SNP chip data with them for free (with all the consequences* that come with that).

Finally, this offer from 23andMe is interesting. $999 for an 80x human exome. That's raw data only, folks. No analysis. I've got a long post about this whole thing in mind that I'll put up in the coming days. Still, it's a great opening salvo in the DTC/PGM era. Exome-seq DTC is truly here. Finally.

*Consequences unspecified. You should be protected by GINA regardless and, honestly, if someone wanted to know your genotypes so bad they could just pick up that coffee cup you threw away last week and do it themselves. But whatever, you still better ask dad before you post half of his DNA on the web.

$50 off 23andMe Coupon Code

2011-08-02T12:04:00.000-07:00

Hello friends!

23andMe just sent me an email with a coupon code in it for $50 off! Apparently it's share-able, so if you've been waiting for a genomic deal, here you go!

To use this coupon, visit our online store and add an order to your cart. Click "I have a discount code" and enter the code below.

$50 Off

Coupon code: 4DPGQP

Share with your friends!

(Valid for new customers only)

If you do use it, please share your data with me!

Intersecting Indels with VCFtools

2011-07-28T20:29:00.000-07:00

Indel detection is not what I'd call accurate at this point in our history. I, along with probably every other bioinformatician and genomicist looking at next-gen data, have noticed that the same indel frequently gets called as two separate, immediately adjacent events. The calls differ only because of sequence context and the nature of our variant callers, but they are really the same variant.

A band-aid approach is to simply look for overlap in indel calls within a window. Even a tiny window can make a big difference to small indels.

To do this, I currently use VCFtools, which makes it very simple. Specifically, use the vcf-isec command with the -w parameter.

If I compare two libraries sequenced from the same individual that had indels called independently (using the same method), I end up with a few thousand overlapping indels that would have been assessed as independent from one another if I looked for exact overlap.

Exact overlap:

vcf-isec -f -n =2 -o indels1.vcf.gz indels2.vcf.gz | wc -l
468136

Overlap +/- 5b:

vcf-isec -f -n =2 -o -w 5 indels1.vcf.gz indels2.vcf.gz | wc -l
471047

I realize it's not that astounding a difference, but keep in mind this is looking at two different libraries from the same individual. If you're comparing calls from two completely different sequencing platforms or variant callers, these numbers jump quite a bit.
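To make the windowing idea concrete, here is a minimal Python sketch. It is not how vcf-isec is implemented; it only illustrates the logic of treating two calls as the same event when they sit within a few bases of each other. The function name count_matches, the reduction of each indel to a (chromosome, position) pair, and the toy data are all assumptions made for illustration.

```python
from bisect import bisect_left
from collections import defaultdict


def count_matches(calls_a, calls_b, window=0):
    """Count calls in calls_a that have at least one call in calls_b on the
    same chromosome whose position is within `window` bases."""
    # Index the second call set: chromosome -> sorted list of positions.
    positions_b = defaultdict(list)
    for chrom, pos in calls_b:
        positions_b[chrom].append(pos)
    for positions in positions_b.values():
        positions.sort()

    matched = 0
    for chrom, pos in calls_a:
        positions = positions_b.get(chrom, [])
        # Index of the first position in B that is >= pos - window.
        i = bisect_left(positions, pos - window)
        # It is a match if that position is also <= pos + window.
        if i < len(positions) and positions[i] <= pos + window:
            matched += 1
    return matched


# Toy data (made up): the first indel is reported 2 bases apart in the
# two libraries, the second at the same position, the third far apart.
lib1 = [("chr1", 1000), ("chr1", 5000), ("chr2", 300)]
lib2 = [("chr1", 1002), ("chr1", 5000), ("chr2", 900)]

print(count_matches(lib1, lib2, window=0))  # exact position only
print(count_matches(lib1, lib2, window=5))  # +/- 5 bases
```

On the toy data the exact comparison finds one shared indel and the 5-base window finds two, which is the same effect as the jump in the counts above, just at a much smaller scale.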

The Ion Torrent Paper (Nature)

2011-07-26T19:48:00.000-07:00

An integrated semiconductor device enabling non-optical genome sequencing

I'm just going to discuss my thoughts and comments on the paper, their findings, and how they relate to claimed specs for the IonTorrent PGM.

And just for some comparisons later on, the current Ion Torrent PGM product sheet is also attached [click].

I'm going to try to step around some of the issues with the paper that have been well covered at Daniel MacArthur's blog Genetic Future. I think he is pretty fair to the paper in criticizing its "validation rate" and so forth. Suffice it to say, a year and a half to two years ago, perhaps a 15x human genome would have been considered adequate, but in a paper coming out of LifeTech, manufacturer of the SOLiD sequencer, if you're going to use the whole genome sequence off the SOLiD as validation, let's go for at least 30x coverage.

Comments

"..there is a desire to continue to drop the cost of sequencing at an exponential rate consistent with the semiconductor industry's Moore's Law..."

They bring up Moore's law repeatedly, and they sequenced Moore himself in the paper. But wait a second... sequencing costs are dropping significantly faster than Moore's law! I suppose it's a minor complaint, but let's give sequencing the credit it's due--per-base cost of sequencing is dropping much faster than Moore's law!

Also, let me complain very briefly about the use of a few buzz terms:

"To overcome these limitations and further democratize the practice of sequencing, a paradigm shift based on non-optical sequencing on newly developed integrated circuits was pursued."

If there is any term that is going to supplant "paradigm shift" as the default excessively pretentious term in scientific papers, it has to be "democratize". Look, unless this device is offering $100 genomes, it's not democratizing sequencing. Can we leave sensationalist buzz words for the advertisements and stick to reality for the Nature papers? (Wait, what am I saying?) I appreciate that we're talking about a non-light system here, but the observation of protons rather than photons released upon base incorporation isn't really a paradigm shift. Once we're taking pictures of DNA with electron microscopes and reading the entire genome instantaneously in one shot, then we can start talking about paradigm shifts.

Scalability

Okay, let's look at the scalability based on their data, and what's being touted in their product sheet.

A typical 2-h run using an ion chip with 1.2M sensors generates approximately 25 million bases.

That pretty much throws out the Ion 314 (1.3M wells) for human genome sequencing. A 30x diploid human genome would require 640 days on the 1.2M sensor chip in the paper. Even just 1x coverage would take three weeks. Yikes.
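For anyone who wants to check the arithmetic, here is a rough Python sketch. The 25 million bases per 2-hour run comes from the paper's quote above; the 6.4 Gb diploid genome size is my assumption (it is the figure that reproduces the numbers here), as is the idea of one machine running back-to-back with no failed runs. The function name is made up for this example.

```python
def days_for_coverage(bases_per_run, coverage,
                      genome_bases=6.4e9, hours_per_run=2.0):
    """Days of continuous sequencing on one machine to reach `coverage`.

    genome_bases defaults to 6.4e9, an assumed diploid human genome size.
    """
    runs_needed = coverage * genome_bases / bases_per_run
    return runs_needed * hours_per_run / 24.0


# 1.2M-sensor chip from the paper: ~25 million bases per 2-hour run.
print(days_for_coverage(25e6, coverage=30))  # days for a 30x genome
print(days_for_coverage(25e6, coverage=1))   # days for 1x
```

This works out to 640 days for 30x and a little over 21 days (three weeks) for 1x. Swapping in a different per-run yield is a one-argument change.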

Later there is a rather astounding statement:

"At present, 20-40% of the sensors in a given run yield mappable reads."

Room for improvement there, methinks.

In Table 1, they test the 11M chip with E. coli and Human. The E. coli yields 273.9Mb of sequence off that chip. At about 20-40% of sensors yielding mappable reads, that gives an average read length of 62b-125b. This is consistent with their finding that 2.6M reads are >=21b and that 1.8M are >= 100b. Also with Figure S15, where it appears the majority of read lengths are around 110b-120b. So at least the read lengths are not disappointing.
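The read length estimate is just yield divided by the number of productive sensors. A quick Python sketch, assuming one read per productive sensor, exactly 11 million sensors on the chip, and that the 273.9Mb figure is made up entirely of mappable reads (the function name is hypothetical):

```python
def mean_read_length(total_bases, sensors, mappable_fraction):
    """Average read length if each productive sensor yields one read."""
    reads = sensors * mappable_fraction
    return total_bases / reads


# 273.9 Mb off the 11M chip, with 20-40% of sensors giving mappable reads.
for fraction in (0.40, 0.20):
    length = mean_read_length(273.9e6, 11e6, fraction)
    print(f"{fraction:.0%} of sensors -> ~{length:.0f}b average read length")
```

The two ends of the 20-40% range bracket the estimate at roughly 62b and 125b.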

Their Ion 318 is the 12 million wells chip. I think this is similar to their 11M chip in the paper. Going back to Table 1, they got 273.9Mb off the 11M chip. At issue is the promised "[starting] 1 Gb of high-quality sequence" off the Ion 318 chip in the Ion Torrent product sheet. Now, I completely believe that advancements have been made since the paper's acceptance on May 26th, 2011, but four times the yield? Not so sure about that claim. I'm not doubting it can get there, but I'll put it this way: This is a paper from the company that makes the product--if anyone can make it work optimally, it should be them. And their optimal report here has it at significantly lower than what they're advertising. Oh, and the specs sheet has small print next to the Ion 318 chip entry that says "the content provided herein [...] is subject to change without notice". Let's just say I'm skeptical.

Anyway, the rest of the paper is pretty vague. One issue on everyone's mind is the homopolymer issue, which is addressed in a single sentence stating the accuracy of 5-base homopolymers (97.328%--not terrible, but not overly good either) and that it's "better than pyrosequencing-based methods" (read: 454). How about longer ones? No idea. Figure S16b also only goes out to 5b, and the curve isn't looking too encouraging.

Apparently the Ion 316, their 6.3M well chip, is currently available, and they claim at least 100Mb of sequence per run. This is consistent with their mapped bases in the paper (169.6Mb off a 6.1M ion chip). With this chip, you're talking about 3 days on one machine for 1x diploid human coverage, and about 90 days for 30x coverage. Better, but still not there when it comes to human sequencing. Still, it's at the level of completing entire bacterial genomes in 2 hours. If you're into that sort of thing (and don't have access to a HiSeq or something else...).

Not Quite "Post-Light"

You know, there's a lot of "post-light" jibber jabber in the paper. It's Ion Torrent's favorite buzz phrase, and I'm a fan, actually (much more so than I am of "democratize" and "paradigm shift"). But at this point, with this performance, I'm not sure we're "post-light" yet. The technology is there, but it isn't scaled up enough yet. There's a claim made at the end of the paper that's interesting:

"The G. Moore genome sequence required on the order of a thousand individual ion chips comprising about one billion sensors. ...our work suggests that readily available CMOS nodes should enable the production of one-billion-sensor ion chips and low-cost routine human genome sequencing."

Doubtless this is long in the works already, and I hope it is a reality. Because making the leap from the things in this paper to a functional 1B sensor chip would make a huge difference.

I'd say I was a bit disappointed with this paper. It felt half done. I'm confused about the way the comparison to SOLiD was done--why wasn't the SOLiD WGS of G. Moore done to an adequate depth? I'm a bit annoyed at the lack of comprehensive information, as well. The homopolymer issue is known--why hide behind homopolymers of 5b or smaller? Just give the whole story in your paper--it's an article in Nature, not an advertisement.

Anyway, to quote a very smart man I know, "it is what it is." Ion Torrent is here to stay and it's only going to improve. I certainly hope it does--I'd love to see it pumping out 1Gb/2hrs with long read lengths.

Mendel Google Doodle

2011-07-20T18:40:00.000-07:00

Google's current doodle is a very neat nod to the father of genetics, Gregor Mendel! They're celebrating his 189th birthday with a representation of his original experiment--crossing green and yellow pea plants and tracing the color trait.

Accurate genome-wide read depth calculation (how-to)

2011-07-04T00:11:00.000-07:00

I'm currently working ferociously on revisions to a paper and need to calculate mean genome-wide read depth as a fine point in the paper. My first inclination was genomeCoverageBed from BEDtools. Trying it out on chromosome 22 first, I noted a huge number (>30%) of the bases had 0 coverage. Of course, this must be because genomeCoverageBed is including the massive centromere (chr22 is acrocentric--the entire p-arm is unmappable heterochromatin). I kind of already knew genomeCoverageBed wasn't meant for this purpose anyway, but I was hoping to stumble upon something.

I decided to Google "genomeCoverageBed centromere" and "genomeCoverageBed not include centromere" and came up with bunk (well, not quite bunk, I came across a blog post about genomeCoverageBed that happens to have been posted tomorrow... yes, tomorrow!).

As I explained in a comment there, I find genomeCoverageBed's approach lacking in meaning. Is it fair to include unmappable regions in a calculation of coverage? That's what led me to asking myself if "genome-wide coverage" really has a meaningful use as a statistic. We all know you're going to have a ton of bases with 0 coverage in the centromere because they're unmappable, but that says nothing about how your sequencing performed or aligner worked. All it says is that those bases are missing from the reference assembly. And that confounds any other meaning you might get from the number, really.

Not to say that genomeCoverageBed is useless. To the contrary, it does exactly what it's supposed to: generates a histogram of genome-wide coverage. But I do not think it's overly useful beyond that histogram. A lot of people like to see the "mean coverage" or "mean read depth" statistic when you talk about a project, and you'd certainly be selling yourself short if you're generating that number with genomeCoverageBed.

I think the author of BEDtools (Hi Aaron if you ever read this!) would be the first to say, "do not use genomeCoverageBed for calculating mean read depth". But BEDtools can, fortunately, give us a very strong way of doing just that through intersectBed and coverageBed.

My solution is quite simple. I take the "gaps" track from UCSC (Tables->All Tracks->Gap) and create a BED file of all regions that are NOT in gaps. You can easily generate this file by obtaining the gap track from UCSC as a BED file and then using BEDtools subtractBed to subtract those regions from a whole genome BED file.

$ subtractBed -a hg19.bed -b ~/resources/hg19_gaps.bed > hg19.gapless.bed

Then, take your gap-less whole genome, run intersectBed with your input BAM file and pipe it to coverageBed again using the gap-less whole genome bed file as your -b. Make sure to use the -hist option in coverageBed.

$ intersectBed -abam in.bam -b hg19.gapless.bed | coverageBed -abam stdin -b hg19.gapless.bed -hist > coverageHist.bed

If you want to save yourself some trouble, you can pipe that to grep and grep out "all"--that's your histogram of genomic coverage across the gap-less genome. Actually, you can calculate mean read depth across this gapless genome in one line without generating any output with some awk magic:

$ intersectBed -abam in.bam -b hg19.gapless.bed | coverageBed -abam stdin -b hg19.gapless.bed -hist | grep all | awk '{NUM+=$2*$3; DEN+=$3} END {print NUM/DEN}'

Not everything outside the gaps is mappable, but at least those bases are present in the reference genome.
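If awk isn't your thing, the same weighted mean can be sketched in Python. This assumes the "all" rows of the -hist output are tab-separated with the depth in the second column and the number of bases at that depth in the third, which is the layout the awk one-liner above relies on. The function name and the toy histogram are made up for illustration.

```python
def mean_depth_from_hist(lines):
    """Weighted mean depth from the 'all' rows of a coverage histogram.

    Assumed row layout (tab-separated): all, depth, bases_at_depth, ...
    """
    weighted_sum = 0
    total_bases = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "all":
            continue  # skip the per-region rows
        depth = int(fields[1])
        bases_at_depth = int(fields[2])
        weighted_sum += depth * bases_at_depth
        total_bases += bases_at_depth
    if total_bases == 0:
        raise ValueError("no 'all' histogram rows found")
    return weighted_sum / total_bases


# Toy histogram (made up): 10 bases at 0x, 30 at 1x, 60 at 2x.
toy = [
    "chr22\t0\t100\t0\t10\t100\t0.1000000",
    "all\t0\t10\t100\t0.1000000",
    "all\t1\t30\t100\t0.3000000",
    "all\t2\t60\t100\t0.6000000",
]
print(mean_depth_from_hist(toy))
```

One small difference from the grep version: this matches "all" only as the exact first field, so a contig whose name happens to contain "all" won't sneak into the sum. On real data you would feed it the lines of coverageHist.bed instead of the toy list.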

Here's an example from a real WGS experiment restricted to chr22:

genomeCoverageBed mean read depth: 0.68
gap-less genome coverageBed mean read depth: 30.637

And yes, calculated by other means, average read depth was right around 30x. For the sake of accuracy, I call it "mean read depth of the reference genome assembly". Doesn't roll off the tongue, but that's what it is, and it means a lot more than "genomic read depth" or "genomic coverage", in my opinion.

(All of that said, genomeCoverageBed will run a lot faster than the way above, but the info it reports is different and really serves a different purpose.)

Gists

2011-07-01T13:16:00.000-07:00

I've been super busy writing a paper lately, so I apologize for the lack of updates. I do intend to comment on the recent 23andMe paper soon™.

A new feature on the blog is my Gist feed and a link to my Gist page. Gist is basically a quick-and-dirty code-sharing site from Github. Since I'm a quick-and-dirty bioinformatics programmer, I'll do my best to keep random code snippets that might have general application for others up there and publicly available. Also, please feel free to fork my code, fix it, et cetera.

My first Gist is an interesting one: A shell script for calculating mean heterozygous allele balance from a VCF4 file (basically GATK output with the "AB" INFO field). Very simplistic, but fast and easy to use for people wondering what the overall reference bias is in their variant calls.
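For a feel of what that calculation involves, here is a rough Python sketch of the same idea. To be clear, this is not the Gist itself. It assumes a single-sample VCF where each record of interest carries a numeric "AB" key in the INFO column, that AB is ref/(ref+alt) so a mean above 0.5 points to reference bias, and that heterozygous sites can be picked out from the GT field of the first sample. The function name and toy records are made up for illustration.

```python
def mean_het_allele_balance(vcf_lines):
    """Mean of the INFO 'AB' value over heterozygous records.

    Assumes a single-sample VCF; returns None if no usable record is found.
    """
    total = 0.0
    count = 0
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # no genotype column to check
        info, fmt, sample = fields[7], fields[8], fields[9]
        keys = fmt.split(":")
        if "GT" not in keys:
            continue
        genotype = sample.split(":")[keys.index("GT")]
        alleles = genotype.replace("|", "/").split("/")
        # Keep only called diploid genotypes with two different alleles.
        if len(alleles) != 2 or "." in alleles or alleles[0] == alleles[1]:
            continue
        for entry in info.split(";"):
            if entry.startswith("AB="):
                # Assumption: take the first value if several are listed.
                total += float(entry[3:].split(",")[0])
                count += 1
                break
    return total / count if count else None


# Toy records (made up): two hets with AB, one homozygous call without.
toy_vcf = [
    "##fileformat=VCFv4.0",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1",
    "chr1\t100\t.\tA\tG\t50\tPASS\tAB=0.50;DP=20\tGT:DP\t0/1:20",
    "chr1\t200\t.\tC\tT\t50\tPASS\tAB=0.70;DP=10\tGT:DP\t0/1:10",
    "chr1\t300\t.\tG\tA\t50\tPASS\tDP=12\tGT:DP\t1/1:12",
]
print(mean_het_allele_balance(toy_vcf))
```

On the toy records this averages the two heterozygous sites (0.50 and 0.70) and ignores the homozygous one.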