Saturday, June 18, 2011

Complete Genomics User Conference 2011 Day 2 Recap

Yesterday was a "half-day" at the CG User Conference and my netbook was inconveniently out of juice, so I didn't get to live blog. However, I took plenty of juicy notes at the interesting and useful talks of the day. The rundown included:

  1. Dr. Sek Won Kong, Children's Hospital Boston, about a downstream analysis and annotation pipeline for WGS/exome-seq called gKnome.
  2. Dr. Jay Kasberger of the Gallo Research Center at UCSF about their tools and scripts for analyzing CG data that they will soon make available.
  3. Dr. Stephan Sanders from Matthew State's lab at Yale talking about identification of de novo variants in CG/WGS data.
  4. Dr. Andrew Stubbs, Erasmus Medical Centre, about HuVariome, a database of variations common to the human genome for use as a filtering/informative resource. 
  5. A quite interesting panel discussion that dug into issues around making sequence data and variants public, and how to filter data.
I'll summarize each of the talks and some of my thoughts below.



Dr. Sek Won Kong. gKnome: An analysis and annotation pipeline for whole-genome/exome sequencing. 

Dr. Kong presented a sequence analysis and annotation pipeline Michael Hsing and others in his group have developed for variant analysis. Although it certainly takes CG data, it looks like it'll take Illumina data as well.

It looked very nice. Behind the scenes, I believe, is a MySQL- and Python-based annotation pipeline. It utilizes the CG diversity panel (69 public genomes released by CG) to filter out systematic CG variants. In fact, filtering out systematic artifacts was a major theme at the conference that I'll return to in the panel discussion section.

In its first version, the gKnome pipeline will be able to annotate from RefSeq, CCDS, Ensembl and UCSC known genes.

The other cool part is the web front-end for the whole thing. Their system auto-generates a number of reports, including box plots of the number of rare/nonsynonymous variants per genome, which they demonstrated is closely tied to ethnicity. It also has built-in pathway analysis and disease risk summaries (including utilizing HGMD if one has a subscription). Finally, they showed nice auto-generated R-based plots of CNV results.

There was also a quick slide of hypervariable genes that became a point of much conversation generally. Basically, everyone agreed there's a set of specific genes and gene families that always end up with variants in them. Dr. Kong's list included HYDIN, PDE4DIP, MUC6, AHNAK2, HRNR, PRIM2, and ZNF806. I've seen most of these pop up in my many exome-seq experiments as well. I've even had PDE4DIP, AHNAK2, PRIM2, and many of the ZNF and MUC genes look like disease-causing candidates before.
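Since this keeps coming up, here's a minimal sketch of how you might flag variants in those recurrently variant genes before prioritizing candidates. The gene list is from Dr. Kong's slide; the variant records are a made-up minimal dict format, not any real pipeline's schema.

```python
# Genes from Dr. Kong's hypervariable slide -- variants here recur in
# nearly every exome-seq experiment and rarely validate.
HYPERVARIABLE_GENES = {
    "HYDIN", "PDE4DIP", "MUC6", "AHNAK2", "HRNR", "PRIM2", "ZNF806",
}

def flag_hypervariable(variants):
    """Split variant records into (kept, flagged) by gene symbol."""
    kept, flagged = [], []
    for v in variants:
        (flagged if v["gene"] in HYPERVARIABLE_GENES else kept).append(v)
    return kept, flagged

# Hypothetical example records (gene symbol + position only).
calls = [
    {"gene": "BRCA2", "pos": 32906729},
    {"gene": "PDE4DIP", "pos": 1450710},
]
kept, flagged = flag_hypervariable(calls)
```

I wouldn't throw the flagged set away outright, just deprioritize it; these genes do occasionally harbor real hits.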

So where can you get it? Well, it's not available yet, but should be completed by September. In the meantime, you can check out what they have going for them at: gknome.genomicevidence.org.

Let me just throw out there that this looks superior to the alternative currently used by most people, which is Annovar. Annovar is a great tool and getting better all the time, but with its rather clunky input and output formats, lack of any downstream stats or visuals, and some notable bugs in its conversion scripts, gKnome is looking pretty nice.


Dr. Jay Kasberger. Integrating tools and methods into analytical workflows for Complete Genomics data.


This one was pretty nice because it was essentially an overview of tools used with CG data and how a lab with a lot of CG data might implement them. In particular, Dr. Kasberger presented the way they assess the variants and CNVs provided by CG.

There was a tool for auto-comparison with Illumina Omni SNP chip data, which is worth a look. One caveat: you have to deal with the Illumina SNP chip quirks yourself, like the ambiguous calls at some percentage of spots that don't tell you the correct ref/alt alleles. Frankly, I'm not sure which spots those are myself; I usually just compare heterozygous calls between the platforms for validation.
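That heterozygous-call comparison is simple enough to sketch. This is my own illustration of the idea, not their tool: genotype calls keyed by (chrom, pos) are a hypothetical simplification of both platforms' real output formats.

```python
# Sketch: concordance of heterozygous SNP-chip calls vs. sequencing calls
# at shared sites. Restricting to chip hets sidesteps the chip's
# ambiguous-strand homozygous calls mentioned above.

def het_concordance(seq_calls, chip_calls):
    """Fraction of shared chip-heterozygous sites where sequencing agrees.

    Both inputs map (chrom, pos) -> genotype string like "AG".
    """
    shared_hets = [
        site for site, gt in chip_calls.items()
        if len(set(gt)) == 2 and site in seq_calls  # het on the chip
    ]
    if not shared_hets:
        return 0.0
    agree = sum(
        1 for site in shared_hets
        if sorted(seq_calls[site]) == sorted(chip_calls[site])  # order-free
    )
    return agree / len(shared_hets)

# Hypothetical example: two shared het sites, one concordant.
chip = {("1", 100): "AG", ("1", 200): "CC", ("1", 300): "CT"}
seq = {("1", 100): "GA", ("1", 300): "CC"}
rate = het_concordance(seq, chip)  # 1 of 2 shared hets agree
```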

He also demonstrated a tool that converts the CG masterVar format to PLINK format for downstream IBD estimates.
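The conversion idea can be roughed out like this. To be clear, this is not their tool: the input here is a hypothetical, heavily simplified tab-layout of (chromosome, position, allele1, allele2); the real masterVar format has many more columns and call types. Only biallelic SNP lines in PLINK's transposed (.tped) layout are sketched.

```python
# Sketch: emit PLINK .tped lines (chrom, snp-id, genetic distance, bp
# position, then two alleles per sample) from simplified variant tuples,
# so PLINK can run IBD estimates downstream.

def simplified_to_tped(rows, sample_count=1):
    """Yield one .tped line per biallelic SNP for sample_count genomes."""
    for chrom, pos, a1, a2 in rows:
        snp_id = f"{chrom}:{pos}"                 # no rsID, so use coords
        alleles = " ".join([a1, a2] * sample_count)
        # PLINK expects chromosome codes without the "chr" prefix.
        yield f"{chrom.lstrip('chr')} {snp_id} 0 {pos} {alleles}"

# Hypothetical single-genome example.
lines = list(simplified_to_tped([("chr1", "10583", "G", "A")]))
```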

Finally, they have a number of Circos-generating scripts for plotting variant density, coverage (using heatmap tracks), etc. And we all know I love Circos, so cool on them for that.

These tools are available from them at: sequencing.galloresearch.org (you'll need a login, which requires being a collaborator/member of their group, so contact them for access).

This talk was interesting because it showed off tools that probably every genomics lab with CG or WGS data has developed for themselves. This stuff needs to be packaged up, published, and shared openly, in my opinion. For example, I don't like that there's this gateway through their page for it all. Put it up on GitHub or SourceForge or something, guys!


Dr. Stephan Sanders. Identifying de novo variants in Complete Genomics data.

Dr. Sanders focused on assessing de novo variants. By this he does not mean what we typically call novel variants. Rather, he's talking about variants not inherited from the parents (which, coincidentally, are not likely to be known variants either). He claimed (based on the Roach et al. paper) that de novo variants are extremely rare, somewhere around a 1x10^-8 chance per base, or about 0.5 disruptive events per exome.

That equates to fewer than 100 de novo variants per genome. But when you actually assess the number from real data with standard filters, you end up with around 20,000 candidates. That's a problem.
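The back-of-the-envelope version of that expectation, using the rate from the talk; the genome/exome sizes and coding fraction are my own round-number assumptions, not figures Dr. Sanders gave.

```python
# Expected de novo count from the quoted per-base rate.
de_novo_rate = 1e-8        # per base per generation (Roach et al., as cited)
diploid_bases = 6e9        # ~2 x 3 Gb haploid genome (assumption)
coding_fraction = 0.01     # ~1% of the genome is coding (assumption)

expected_genome = de_novo_rate * diploid_bases        # a few dozen per genome
expected_coding = expected_genome * coding_fraction   # under one per exome
```

That lands at roughly 60 events per genome, comfortably "fewer than 100," and well under one coding event per exome, which is the same ballpark as the ~0.5 disruptive events quoted.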

He demonstrated that the rare de novo candidate variants (the 20,000) have a much lower distribution of quality scores than true variants.

He then went into a fairly extensive discussion of how to estimate the specificity needed to narrow those candidates down to the expected number, which was great. BUT, the bottom line is really that you just have to raise the quality cut-off. At a high enough level, the de novo candidate total drops off and the very few leftovers are highly enriched for true positives. He showed that he narrowed down to ~70 candidates, a little more than thirty of which validated. This matched well with the expected number he calculated earlier.

Cool talk, but the take-home is that all you have to do is sacrifice sensitivity for specificity and you'll throw away the vast majority of false-positive de novo variants. So for the ~20k or so candidates, apply a more stringent filter and voila!

So, it's a pretty cool idea for what to do with alleles that don't follow Mendelian inheritance: just apply a stringent base quality and read depth filter to them and highly enrich for true variants. Kind of an obvious conclusion, but not something many have been doing, I'd bet.
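The take-home above can be sketched in a few lines. The thresholds and record fields here are illustrative assumptions on my part, not values from Dr. Sanders's talk.

```python
# Sketch: trade sensitivity for specificity on de novo candidates by
# requiring high call quality and solid read depth in the whole trio.

def filter_de_novo_candidates(candidates, min_qual=60, min_depth=20):
    """Keep only Mendelian-inconsistent calls passing stringent cutoffs."""
    return [
        c for c in candidates
        if c["qual"] >= min_qual
        and c["depth_child"] >= min_depth
        and c["depth_mother"] >= min_depth
        and c["depth_father"] >= min_depth
    ]

# Hypothetical candidates: one well-supported, one low-quality artifact.
candidates = [
    {"qual": 95, "depth_child": 40, "depth_mother": 35, "depth_father": 38},
    {"qual": 30, "depth_child": 12, "depth_mother": 50, "depth_father": 45},
]
passed = filter_de_novo_candidates(candidates)
```

Requiring depth in the parents matters as much as in the child: an apparent de novo call is often just an undercalled inherited allele in one parent.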


Andrew Stubbs. HuVariome: A high quality human variation study and resource for rare variant detection and validation.

This one is going to be pretty brief, though it was a good talk especially for the conversation points it brought up.

HuVariome is a database they're putting together of well-annotated known variants. The goal is to use it as an alternative to dbSNP for filtering (which honestly shouldn't be used for filtering in the first place). It will be available at: huvariome.erasmusmc.nl. Definitely a possible alternative to using dbSNP, especially if it's as well annotated as they suggest it will be.


Panel Discussion
Worth mentioning because the same themes kept coming up. Specifically, there was quite a bit of discussion about everyone developing their own list of false-positive variants/genes and how to filter SNPs adequately. Generally, it's agreed that (1) there is a subset of variants and genes that show up in nearly every sequencing experiment and are therefore more likely false positives than anything else (they don't tend to validate, either), and (2) dbSNP is too inclusive of disease samples and lacks adequate phenotypic info to be used as a blind filter. I've personally always told people to use dbSNP as a guide. Never dismiss a variant just because it's in dbSNP. You can look at the non-dbSNP variants first if you want, since those might be the "jackpot" spots, but if you find nothing there, try looking at the ones in dbSNP.
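The "guide, not filter" advice amounts to ranking rather than discarding, sketched here with a hypothetical in_dbsnp annotation flag on each record (not any particular annotator's field name).

```python
# Sketch: prioritize non-dbSNP variants first instead of deleting known
# ones -- dbSNP membership demotes a candidate but never removes it.

def prioritize_by_dbsnp(variants):
    """Return non-dbSNP ("jackpot") variants first, dbSNP variants after."""
    novel = [v for v in variants if not v["in_dbsnp"]]
    known = [v for v in variants if v["in_dbsnp"]]
    return novel + known

# Hypothetical example records.
variants = [
    {"id": "rs12345", "in_dbsnp": True},
    {"id": "chr7:117199644", "in_dbsnp": False},
]
ranked = prioritize_by_dbsnp(variants)
```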
That then leads to the need for things like HuVariome and lists of "bad" genes/variants (like the hypervariable genes mentioned by Dr. Kong). But the problem then becomes how to share variants from protected samples publicly, even if they're artifacts, because consent wasn't given to make any variants publicly available.
Personally, I see that as an issue that will solve itself, but of course, we want it sooner rather than later. Solutions such as projects specifically intended to produce these lists are possible, though (and some in the audience said they were doing just that).

That's a Wrap
Anyway, that's a wrap on this conference. I hope it was informative for everyone. Also, props to Complete Genomics for putting on a pretty decent corporate conference. I didn't think it was overly biased, I found it useful and interesting, and the venue and food were quite good.