Tuesday, June 21, 2011

RealTimeGenomics Goes Free, Provides Alternative for CG Users

I've mentioned RealTimeGenomics (RTG) in the past, and Joke Reumers mentioned their software recently in her talk at the Complete Genomics user group meeting. Today, RTG announced that they were going free to individual researchers with their RTG Investigator 2.2 package. I kind of knew this was going to happen after some chats with them about the pricing models they were considering and the hard sell it would be to academia.

The Hard Sell
I think that it's a hard sell to get academic researchers to pay for something they can get for free. I've been in a lab that preferred to make its own Taq polymerase rather than pay for a commercial enzyme even when that meant using it at a 1:1 ratio with the rest of the PCR reaction (and no, I didn't stay in that lab for too long, but the point stands).

In a world where BWA, TopHat, Samtools, SoapSNP, and GATK are free and fairly well documented, selling an aligner and variant caller is going to be difficult unless it does something particularly special. Plus, a major focus of RTG's strategy is providing an alternative to Complete Genomics' own analysis that comes "free" when you buy a whole genome from them. Very hard sell.

So, again, unless your software does something particularly special, like being more sensitive and/or specific, like being faster, like being significantly easier to use, like including a bunch of bells and whistles in the form of visualization tools or fancy reports, you're going to have trouble selling your product.

But how, as a company, do you prove that your software has something like this to offer? Traditionally trial licences have been the way, but that's with software that doesn't have a strong free alternative. A company lets you try the software and see if you like it, then you buy it if you do. But most sequencing labs have their pipelines done already. And comparing and contrasting two software packages isn't really worth the time unless the claims have been substantiated by other groups.

That's where this foot-in-the-door approach comes in. Basically, you give the software away to academia and let them do your leg work for you. If your software offers something special and academia can prove it, you'll start to be able to sell to the bigger corporate entities and sequencing cores. So the solution to the hard sell is to not sell at all! Brilliant!

So, how is the software?
I tested RTG's software on Illumina data because at the time that's all I was using. My findings were that it was easy to use (in a native, parallel environment), ran fast, mapped about 80% of reads (similar to Novoalign/BWA), and found a similar number of variants to GATK. Basically, it worked and seemed to work pretty well. I admit I have yet to go too in depth on comparing the findings. However, when I have some time, I intend to do a more comprehensive assessment of its performance.

I also ran it in a mode that combined the Complete Genomics and Illumina data I had from the same patient. I found this to be a pretty cool option that I enjoyed using.

Really, if you're dealing with Complete Genomics data, this is your only option (as far as I know, let me know if this isn't true) for an alternative alignment and variant caller to theirs. You could also align using the RTG mapper and then try variant calling with YFA (your favorite algorithm).

What's unique?

They have a cool program called "mapx" that does a translated nucleotide alignment against protein databases. You can then take that and use their "similarity" tool to basically create phylogenetic clusters based on your reads alone. Very cool for metagenomics. I'm planning to try this out with a whole genome sample I have derived from saliva.

Why does this matter?

Well, frankly, there's the chance they just may be on to something. They make a lot of claims about their sensitivity and specificity. They have some killer ROC curves. They have that cool metagenomics tool that I honestly haven't heard about from anywhere else.

And now there's no fear of losing access to it when the trial licence expires.

I fully admit it: I wanted this to be free. Because I am one of those people who likes trying new programs and seeing if I can squeeze a bit more information out of my data set. I was just thinking today about how I should go back to our old U87MG dataset and call variants using GATK and the new SV pipeline we have.

Finally, I think it really has implications for users of Complete Genomics. Joke Reumers showed that CG variants detected by RTG as well were highly accurate. That's key as an in silico validation step. Plus, it empowers us to analyze the data ourselves. I love CG, but I also want the ability to adjust my alignment and variant calling settings myself. I also want to be able to update my analyses to be compatible with each other without having to pay a couple thousand dollars more on top of my original investment.
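For what it's worth, the in silico validation idea boils down to intersecting two call sets. Here's a minimal sketch, assuming each caller's output has already been reduced to (chrom, pos, ref, alt) tuples. The function name and the toy calls are made up for illustration; this is not RTG's or CG's actual tooling.

```python
def cross_validate(primary_calls, second_calls):
    """Split the primary call set by whether a second caller agrees.

    Both arguments are sets of (chrom, pos, ref, alt) tuples
    (a hypothetical, simplified representation of a variant call).
    """
    supported = primary_calls & second_calls      # seen by both callers
    unsupported = primary_calls - second_calls    # seen by the primary caller only
    return supported, unsupported


# Toy data, invented for this example.
cg_calls = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 300, "G", "A")}
rtg_calls = {("chr1", 100, "A", "G"), ("chr2", 300, "G", "A"), ("chr3", 400, "T", "C")}

supported, unsupported = cross_validate(cg_calls, rtg_calls)
print(f"{len(supported)} of {len(cg_calls)} CG calls also found by the second caller")
```

The unsupported calls aren't necessarily wrong; they're just the ones I'd want to look at harder before trusting.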

I do wonder how it's going to pan out. I hope, of course, that it ends up helping them out. As I tell all my corporate friends: I want them to succeed, because their success is my success.

At the very least, the software is now out in the wild. It's now on the users to figure out if it's worth using. I'll be doing my part over the coming months and I promise to share!

Saturday, June 18, 2011

Complete Genomics User Conference 2011 Day 2 Recap

Yesterday was a "half-day" at the CG User Conference and my netbook was inconveniently out of juice, so I didn't get to live blog. However, I took plenty of juicy notes at the interesting and useful talks of the day. The rundown included:

  1. Dr. Sek Won Kong, Children's Hospital Boston, about a downstream analysis and annotation pipeline for WGS/exome-seq called gKnome.
  2. Dr. Jay Kasberger of the Gallo Research Center at UCSF about their tools and scripts for analyzing CG data that they will soon make available.
  3. Dr. Stephan Sanders from Matthew State's lab at Yale talking about identification of de novo variants in CG/WGS data.
  4. Dr. Andrew Stubbs, Erasmus Medical Centre, about HuVariome, a database of variations common to the human genome for use as a filtering/informative resource. 
  5. A panel discussion, which was quite interesting, especially on issues related to making sequence data and variants public and how to filter data.
I'll summarize each of the talks and some of my thoughts below.

Dr. Sek Won Kong. gKnome: An analysis and annotation pipeline for whole-genome/exome sequencing. 

Dr. Kong presented a sequence analysis and annotation pipeline Michael Hsing and others in his group have developed for variant analysis. Although it certainly takes CG data, it looks like it'll take Illumina data as well.

Looked very nice. I think behind the scenes was a MySQL and Python based annotation pipeline. It utilizes the CG diversity panel (69 public genomes released by CG) to filter out systematic CG variants. Actually, this was a major theme at the conference that I'll go into at the panel discussion part.

In its first version, the gKnome pipeline will be able to annotate from RefSeq, CCDS, Ensembl and UCSC known genes.

The other cool part is the web front-end for the whole thing. Their system auto-generates a number of reports including box plots of #rare/NS variants/genome, which they demonstrated is closely tied to ethnicity. It also has built-in pathway analysis and disease risk summaries (including utilizing HGMD if one has a subscription). Finally they showed a nice R-based plot of CNV results that are auto-generated.

There was also a quick slide of hypervariable genes shown that was a point of much conversation generally. Basically, everyone agreed there's a set of specific genes and gene families that always end up with variants in them. Dr. Kong showed the list to include HYDIN, PDE4DIP, MUC6, AHNAK2, HRNR, PRIM2, and ZNF806. I've seen most of these pop up in my many exome-seq experiments as well. I've even had PDE4DIP, AHNAK2, PRIM2 and many of the ZNF and MUC genes look like disease-causing variants before.
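If you want to act on a list like this, the simplest thing is to flag (not silently delete) any variant that lands in one of these genes. A rough sketch, assuming your variants are dicts with a "gene" key; that record layout and the function name are my own invention, and only the gene names come from the slide.

```python
# Hypervariable genes from Dr. Kong's slide.
HYPERVARIABLE_GENES = frozenset(
    {"HYDIN", "PDE4DIP", "MUC6", "AHNAK2", "HRNR", "PRIM2", "ZNF806"}
)


def split_hypervariable(variants, gene_list=HYPERVARIABLE_GENES):
    """Return (kept, flagged) lists; flagged variants fall in listed genes."""
    kept, flagged = [], []
    for variant in variants:
        target = flagged if variant["gene"] in gene_list else kept
        target.append(variant)
    return kept, flagged


# Toy variants, invented for this example.
variants = [
    {"gene": "PDE4DIP", "chrom": "chr1", "pos": 1000},
    {"gene": "BRCA2", "chrom": "chr13", "pos": 2000},
    {"gene": "MUC6", "chrom": "chr11", "pos": 3000},
]
kept, flagged = split_hypervariable(variants)
print(f"kept {len(kept)}, flagged {len(flagged)} for a second look")
```

Keeping the flagged list around means you can still go back to it if the clean set turns up nothing.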

So where can you get it? Well, it's not available yet, but should be completed by September. In the meantime, you can check out what they have going for them at: gknome.genomicevidence.org.

Let me just throw out there that this looks superior to the alternative currently used by most people, which is Annovar. Annovar is a great tool and getting better all the time, but with its rather clunky input and output formats, lack of any downstream stats or visuals, and some notable bugs in its conversion scripts, gKnome is looking pretty nice.

Dr. Jay Kasberger. Integrating tools and methods into analytical workflows for Complete Genomics data.

This one was pretty nice because it was kind of an overview of tools used with CG data and how a lab with a lot of CG data might implement them. Dr. Kasberger especially presented the way they assess the variants and CNVs provided by CG.

There was a tool for auto-comparison with Illumina Omni SNP chip data, which is worth a look (although I should note that you have to deal with the Illumina SNP chip problems like the ambiguous calls at some percent of spots that don't tell you the correct ref/alt alleles, etc. yourself... and frankly, I'm not sure which ones those are myself--I usually just compare heterozygous calls between the platforms for validation).
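My heterozygous-only comparison looks roughly like the sketch below. It assumes both platforms have been boiled down to a dict mapping (chrom, pos) to a pair of alleles on the same strand, which is a simplification; the function name and toy genotypes are hypothetical, and this is not the Gallo group's tool.

```python
def het_concordance(seq_genotypes, chip_genotypes):
    """Count chip-heterozygous sites where sequencing gives the same genotype.

    Sites that are homozygous on the chip, or that have no sequencing
    call at all, are skipped rather than counted as discordant.
    """
    compared = 0
    matched = 0
    for site, chip_gt in chip_genotypes.items():
        if len(set(chip_gt)) != 2:
            continue  # not heterozygous on the chip
        seq_gt = seq_genotypes.get(site)
        if seq_gt is None:
            continue  # no sequencing call at this site
        compared += 1
        if sorted(seq_gt) == sorted(chip_gt):
            matched += 1  # same two alleles, order ignored
    return matched, compared


# Toy genotypes, invented for this example.
chip = {
    ("chr1", 100): ("A", "G"),
    ("chr1", 200): ("C", "C"),
    ("chr1", 300): ("C", "T"),
    ("chr1", 400): ("A", "T"),
}
seq = {("chr1", 100): ("G", "A"), ("chr1", 300): ("C", "C")}

matched, compared = het_concordance(seq, chip)
print(f"{matched}/{compared} heterozygous chip calls confirmed by sequencing")
```

Whether a missing sequencing call should count against you depends on how your pipeline reports reference calls, so treat the skip as a design choice, not a rule.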

It also demonstrated a tool that goes from the CG masterVAR format to PLINK format for downstream IBD estimates.

Finally, they have a number of Circos-generating scripts for variant density, coverage (using heatmap tracks), etc. And we all know I love Circos, so cool on them for that.

These tools are available from them at: sequencing.galloresearch.org (but you'll need a login by being a collaborator/member of their group, so you'll have to contact them for access).

This talk was interesting because it showed off tools that probably every genomics lab with CG or WGS data has developed for themselves. This stuff needs to be packaged up, published and shared openly in my opinion. For example, I don't like that there's this gateway through their page for it all. Put it up on GitHub or SourceForge or something, guys!

Dr. Stephan Sanders. Identifying de novo variants in Complete Genomics data.

Dr. Sanders focused on assessing de novo variants. By this he does not mean what we typically call novel variants. Rather, he's talking about variants not inherited from parents (which coincidentally are not likely to be known variants either). He claimed that (based on the Roach et al. paper), de novo variants are extremely rare, somewhere around 1x10^-8 chance per base, or about 0.5 disruptive events per exome.

That equates to fewer than 100 de novo variants per genome. But when you actually assess the number from real data with standard filters, you end up with around 20,000 candidates. That's a problem.
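To see where those numbers come from, here's a back-of-envelope sketch. The per-base rate is the one quoted in the talk; the genome and exome sizes are round numbers I'm assuming, not figures from the slides.

```python
# Rate quoted in the talk; sizes below are assumed round numbers.
MUTATION_RATE_PER_BASE = 1e-8   # de novo events per base per generation
HAPLOID_GENOME_BASES = 3.0e9    # assumption: ~3 Gb haploid genome
HAPLOID_EXOME_BASES = 3.0e7     # assumption: ~30 Mb of coding sequence
RAW_CANDIDATES = 20_000         # candidates seen with standard filters


def expected_de_novo(haploid_bases, rate=MUTATION_RATE_PER_BASE, ploidy=2):
    """Expected number of new mutations across both inherited copies."""
    return haploid_bases * ploidy * rate


genome_expected = expected_de_novo(HAPLOID_GENOME_BASES)
exome_expected = expected_de_novo(HAPLOID_EXOME_BASES)

print(f"Expected de novo variants per genome: {genome_expected:.0f}")
print(f"Expected de novo variants per exome:  {exome_expected:.1f}")
print(f"Raw candidates exceed expectation by ~{RAW_CANDIDATES / genome_expected:.0f}x")
```

Under these assumptions you get about 60 per genome (hence "fewer than 100") and about 0.6 coding events per exome, of which only a subset would be disruptive. Either way, 20,000 candidates is hundreds of times too many.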

He demonstrated that the rare de novo candidate variants (the 20,000) have a much lower distribution of quality scores than true variants.

He then went into a fairly extensive discussion of how to estimate the specificity needed to narrow those candidates down to the correct number, which was great. BUT, the bottom line is really that you just have to move up the quality cut-off. At a high enough level, the de novo candidate total drops off and the very few leftovers are highly enriched for true positives. He showed that he narrowed down to ~70 candidates and that they validated a little more than thirty of them. This matched well with the expected number he calculated earlier.

Cool talk, but the take-home is that all you have to do is sacrifice sensitivity for specificity and you'll throw away the vast majority of false-positive de novo variants. So for the ~20k or so candidates, apply a more stringent filter and voila!

So, pretty cool idea for what to do with alleles that show up as Mendelian inheritance errors. Just apply a stringent base quality and read depth filter to them and highly enrich for true variants. Kind of an obvious conclusion, but not something many have been doing, I'd bet.
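In code, the approach is about as simple as it sounds. This is a minimal sketch under my own assumptions: calls are reduced to a small record, parents are sets of (chrom, pos, alt) sites, and the thresholds are placeholders, not the cut-offs Dr. Sanders actually used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    """A simplified variant call (hypothetical record layout)."""
    chrom: str
    pos: int
    alt: str
    quality: float
    depth: int


def de_novo_candidates(child_calls, mother_sites, father_sites,
                       min_quality=0.0, min_depth=0):
    """Child calls absent from both parents that pass the given cut-offs."""
    parental = mother_sites | father_sites
    return [
        call for call in child_calls
        if (call.chrom, call.pos, call.alt) not in parental
        and call.quality >= min_quality
        and call.depth >= min_depth
    ]


# Toy trio, invented for this example.
child = [
    Call("chr1", 100, "A", 95.0, 40),
    Call("chr1", 200, "T", 30.0, 12),
    Call("chr2", 300, "G", 90.0, 35),
    Call("chr3", 400, "C", 99.0, 8),
]
mother = {("chr2", 300, "G")}
father = set()

lenient = de_novo_candidates(child, mother, father)
stringent = de_novo_candidates(child, mother, father, min_quality=80.0, min_depth=20)
print(f"{len(lenient)} candidates with no cut-off, {len(stringent)} after the stringent one")
```

The trade-off is the one from the talk: raising the cut-offs throws away some real de novo variants along with the junk.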

Andrew Stubbs. HuVariome: A high quality human variation study and resource for rare variant detection and validation.

This one is going to be pretty brief, though it was a good talk especially for the conversation points it brought up.

HuVariome is a database they're putting together of well-annotated known variants. The goal is to use it as an alternative to dbSNP for filtering (which honestly shouldn't be used for filtering in the first place). It will be available at: huvariome.erasmusmc.nl. Definitely a possible alternative to using dbSNP, especially if it's as well annotated as they suggest it will be.

Panel Discussion
Worth mentioning because the same themes kept coming up. Specifically, there was quite a bit about everyone developing their own list of false positive variants/genes and how to filter SNPs adequately. Generally, it's agreed that (1) there is a subset of variants and genes that show up in nearly every sequencing experiment and therefore are more likely false positives than anything else (they don't tend to validate, either) and (2) that dbSNP is too inclusive of disease samples and lacking adequate phenomic info to be used as a blind filter. I've personally always told people to use dbSNP as a guide. Never dismiss a variant just because it's in dbSNP. You can look at the non-dbSNP variants first if you want, since those might be the "jackpot" spots, but if you find nothing there, try looking at those in dbSNP.
That then leads to the need for things like HuVariome and lists of "bad" genes/variants (like hypervariable genes mentioned by Dr. Kong). But the problem then becomes how to share variants publicly that are present in protected samples, even if they're artifacts, because consent wasn't given to make any variants publicly available.
Personally, I see that as an issue that will solve itself, but of course, we want it sooner rather than later. Solutions such as projects specifically intended to produce these lists are possible, though (and some in the audience said they were doing just that).

That's a Wrap
Anyway, that's a wrap on this conference. I hope it was informative for everyone. Also, props to Complete Genomics for putting on a pretty decent corporate conference. I didn't think it was overly biased, I found it useful and interesting, and the venue and food were quite good.

Thursday, June 16, 2011

Complete Genomics Community

Apparently they just released this publicly:

Pretty cool, actually. It has a forum, a wiki-ish knowledgebase (or at least, hopefully it'll be like a wiki in the future) and a tools section.

Currently the tools section has CGAtools and some other scripts written by Steve Lincoln's team at CG. Seems like in the future it will include community-derived data as well.

Also, I'm just gonna throw this out there: It seems like the community is really not satisfied with annotation tools. And by that I do mean Annovar and the Ensembl annotation tools. So I think that's really an open area for development for an ambitious bioinformaticist or two out there.

Cancer Grant Program
Will be officially announced week of June 20th. Basically, it'll be an abstract plus some simple questions. There will be two winners in US and Europe. Applications will be due July 29th, winners decided by August 12th and samples will need to be submitted by September 16th.
Bonus: All applicants will get a future discount regardless.

Pretty cool... time to find some cancer samples to sequence.

Live Polling
Kinda cool (and totally unrelated to genomics/CG... oh well). Never seen this before. Hosted by PollEverywhere.com. Basically, you can host a live poll during a PowerPoint presentation. People can text message in a response to the poll and it'll update on the fly within the presentation.

Simon Lin Talk

Sequencing Outsourcing: Northwestern Experience

An overview of the needs and costs of running your own sequencing rather than simply outsourcing.

In-house: Numerous costs, need of additional staff, need of storage and analysis computers, et cetera.

Outsource: lower cost, access to bioinformatics, etc.

Need to consider cost, quality, and the impact to "internal culture/image/students"

Lose at least some of the control by outsourcing. (However, re-analysis of the data itself is obviously possible.) But tweaks and changes can't really be made with outsourced analyses.

They have a 454 (four-fifty-four... or is it four-five-four? I've always called it four-five-four.)

Zachary Hunter Talk

Paired Whole Genome Sequencing Studies in Waldenstrom's Macroglobulinemia

This type of lymphoma is pretty rare and also not much is known about it.

Approach: 10 paired genomes plus 20 unpaired

For normal they used CD19 depleted PBMCs and also took buccal cells as a backup.

They actually did do exome sequencing as well. 

They found a large number of strong candidates (>7?)... one was present in 100% of the 10 paired and 87% (26/30) in all 30 patients. Strangely, it was the same exact SNP in all individuals, but it's a very good functional candidate.

Interestingly, with Sanger sequencing they saw a very small peak of the same variant in the trace from an individual with a very weak case (one of the four that didn't have it). Perhaps all patients with the disease have this variant?

Detailed circos plots. Mapped out zygosity, CN, allele balance, CGI coverage levels and then testing results. At this detailed level, there were many notable CN/zygosity regions and UPD regions. (Everyone LOVES Circos.)

They made their own wrapper for Annovar because the input/output for Annovar is something they "don't like". (Hey, guess what? You're not alone! I made just such a wrapper myself!)

Nice talk. I loved the Circos. Again, I have to ask whether a lot of these findings couldn't have been done solely through exome-seq, though...

View from the CG user conference

Fantastic view and great San Francisco weather today!

Joke Reumers Talk

Two major experiments covered:
tumor/normal ovarian cancer
monozygotic twins (schizophrenia)

Even at the low error rate of CG data (1/100,000), that's ~30k errors per genome, which is too much for twins.
Detected 2.7M shared variants, but 46k discordant variants between twins

Individual filters for quality/genomic complexity/bioinformatic errors used...
Quality: low read depth, low variation score, snp clusters, indel proximity to snp (5bp from snp)
Complexity: simple repeats, segdups, homopolymer stretches
Bioinformatic errors: Collaboration with RealTimeGenomics, re-analysis
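Here's roughly how I picture the quality and complexity filters working on a twin pair. It's a sketch under my own assumptions: the record layout, the thresholds, and the choice to drop the SNP (rather than the indel) when they sit within 5 bp are all mine, and the SNP-cluster and RTG re-analysis steps are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """A simplified call (hypothetical record layout)."""
    chrom: str
    pos: int
    kind: str                        # "snp" or "indel"
    depth: int
    score: float
    in_complex_region: bool = False  # repeat, segdup or homopolymer


def apply_filters(variants, min_depth=20, min_score=40.0, indel_window=5):
    """Drop low-quality calls, complex-region calls and SNPs next to indels."""
    indel_positions = {}
    for v in variants:
        if v.kind == "indel":
            indel_positions.setdefault(v.chrom, []).append(v.pos)
    kept = []
    for v in variants:
        if v.depth < min_depth or v.score < min_score:
            continue  # quality filter (placeholder thresholds)
        if v.in_complex_region:
            continue  # complexity filter
        nearby = indel_positions.get(v.chrom, [])
        if v.kind == "snp" and any(abs(v.pos - p) <= indel_window for p in nearby):
            continue  # SNP within the window of an indel
        kept.append(v)
    return kept


def discordant_sites(calls_a, calls_b):
    """Sites called in exactly one twin (alleles ignored for brevity)."""
    def sites(calls):
        return {(v.chrom, v.pos, v.kind) for v in calls}
    return sites(calls_a) ^ sites(calls_b)


# Toy twin pair, invented for this example.
twin_a = [
    Variant("chr1", 100, "snp", 35, 80.0),
    Variant("chr1", 203, "snp", 40, 90.0),
    Variant("chr1", 205, "indel", 30, 70.0),
    Variant("chr1", 500, "snp", 8, 85.0),
    Variant("chr1", 900, "snp", 30, 75.0, True),
    Variant("chr2", 50, "snp", 33, 88.0),
]
twin_b = [Variant("chr1", 100, "snp", 31, 79.0), Variant("chr1", 205, "indel", 28, 65.0)]

raw = discordant_sites(twin_a, twin_b)
filtered = discordant_sites(apply_filters(twin_a), apply_filters(twin_b))
print(f"discordant before filtering: {len(raw)}, after: {len(filtered)}")
```

Note that filtering each twin separately can itself create discordance if a shared variant passes in one twin and fails in the other, so a real pipeline would need to handle that case.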

1.7M shared variants, 846 discordant variants

So basically swung the error rate from type 1 to type 2.

All 846 discordances here were tested by Sanger sequencing.

Also 2 of the shared variants were found to actually be discordant.

Reduced error rate down to 4.3x10^-7 (from 1.79x10^-4).

Of the 846, 541 were false positives.

NA19240, 1000 genomes Illumina sequencing versus CG

Before filtering, CG had more false negatives, Illumina had more false positives.

After filtering, they were both down to about 1% error rates.

As for tumor/normal, adding the filtering made a little difference... from 437 down to 21. But of course, this kills some true positives.

Summary: Very good talk, I liked this one. I was pleasantly surprised to see RealTimeGenomics get a shout out as one of their filter approaches. I used their software myself, it's very good and I hope to collaborate again with them, especially after seeing it helping other groups with their filters. Also, I think there's a lot to be said about the different error rates with CG versus BWA/GATK et cetera. I'm leaning toward combined approaches... for example, why not do exome-seq on Illumina as a validation of CG and to adjust error rates?

Bonus: Hilarious pic of a desk covered in hard drives. In our lab's case, I think they're stuffed under people's desks. Someone needs to do something about the drive overload.

Complete Genomics User Conference 2011

Today I'm attending the Complete Genomics User Conference in San Francisco at the Fairmont. Nice venue, particularly for the guests that had to come from far away (actually, being from Palo Alto, I would have preferred if they had it down in Mountain View).

First talk I saw was from Dr. Kevin Jacobs of the National Cancer Institute. Thought it was very nice, and I'll provide a run-down in a few minutes.