Monday, March 12, 2012

Cliff Reid on CG vs Illumina

I recently saw a post on the Complete Genomics website from Cliff Reid discussing our Nature Biotech paper (Lam et al., Dec 2011), in which we sequenced the same individual to high depth on both Complete Genomics and Illumina and compared the results.

I think some of it is fair, but I do want to go a bit into it because it’s more complicated than just breaking it down by the Sanger validation rate.


First off, Cliff addresses the fact that we found CG was more accurate than Illumina when it comes to SNP detection rate. To determine this, we used three small Sanger validation sets. We took a set of 20 SNPs detected by both platforms, a set of 15 detected only by Illumina and a set of 18 detected only by CG and Sanger sequenced them. Here’s how it broke down:

                  Number tested   Validated   Validation rate
Both platforms    20              20          100%

This is certainly pretty cut-and-dried: obviously the CG-specific variants that we were able to validate were accurate at a higher rate than the Illumina-specific ones. However, I think a fair criticism of this experiment is the relatively small number of variants that were validated by Sanger.

The Complexities of the Experiment

I think we need to look more closely at the very small number of SNPs that we validated by Sanger sequencing and, perhaps more importantly, at what those SNPs were and why we went ahead and used the SureSelect data to draw our conclusions instead.

The SNPs that were selected for Sanger validation included SNPs at every quality level that passed thresholds. We selected twenty variants from each set (concordant, Illumina-specific, and CG-specific) and tried to validate them by Sanger.

Please note that the “number tested” in the table above doesn’t match the 20/20/20 design I just described. This is because, across the two platform-specific sets, we were unable to design primers that amplified product for seven of the SNPs (five Illumina-specific and two CG-specific), and those were indeed consistently “low quality” relative to the mean quality score for concordant variants (although, again, they passed threshold).

This is why we went forward with using the Agilent SureSelect targeted sequencing data for validation. Of course, we fully realized that such an assay would be potentially biased towards Illumina because the validation is being done on the same machine as the whole genome sequencing. That was, in fact, the reason we initially went for Sanger. But after sixty Sangers and realizing it would be hundreds more (think of that in terms of time, man-power and monetary cost) before we could generate anything really meaningful with Sanger, we decided the few we did had accomplished our goal of at least demonstrating that concordant SNPs are highly accurate but non-concordant are not and moved on.

In Cliff’s post, he extrapolates the number of platform-specific SNPs that would validate if the Sanger rates were correct across the board and concludes that CG is, in fact, more sensitive than Illumina. I caution against using this Sanger data this way because in the paper, we clearly utilized the SureSelect capture validation to make up for an inadequate Sanger experiment.


A word on experimental design and how it gets conveyed in the literature: we tried to make it clear that the Sanger data was only suggestive, not conclusive. I hope that got conveyed, but I’ll reiterate here that the Sanger data in that paper is limited in its usefulness because there simply isn’t enough of it and because of the lower mean quality scores at those positions.

We do, in fact, have a figure (Supplementary Figure 1) that demonstrates lower quality scores for the platform-specific SNPs from both platforms, and that played out in the validation. But the problem is that the sentence describing that is four paragraphs up from the Sanger validation paragraph and it isn’t linked in the text.

Here’s a quote from Cliff’s post that I want to respond to:
“The paper also points to the magnitude of the problem caused by validating the Illumina platform with the Illumina platform. The Sanger validation data can be used to estimate the confidence in the results of the target enrichment validation data. If the Illumina unique SNPs really were 64.3% true SNPs as reported, then the likelihood of getting the Sanger validation results (2 of 15 validated SNPs) is less than 1 in 10,000. While the exact Illumina SNP validation rate is unknown, the Sanger data tells us that we can be more than 99.99% confident that it is less than the 64.3% calculated by this biased validation approach. For these reasons, we believe 64.3% is not the correct number to use in calculating the sensitivity of the Illumina platform in this study.”

I want to stress the importance of Table 2 in the paper, which shows perhaps the most important information in the entire paper. Here’s a summary of it:

Validated   Invalidated   Not validated   Validation rate

“Validated” means those that were present in the whole genome sequencing and observed in the SureSelect data (true positives). “Invalidated” are those that were present in the WGS, but observed as false in the SureSelect data (false positives). And finally “not validated” are those that could not be detected adequately in the SureSelect data. The “validation rate” was determined by removing those “not validated” from the targeted count and then determining what percent of those were validated (not invalidated).
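As a minimal sketch of that calculation: the function name and the counts below are hypothetical (the counts are per 10,000 targeted SNPs and were chosen only so that they reproduce the Illumina-specific percentages quoted in this post, not taken from Table 2).

```python
def validation_summary(validated, invalidated, not_validated):
    """Summarize a targeted validation experiment.

    Returns a tuple of:
      - validation rate: validated / (validated + invalidated),
        i.e. "not validated" sites are removed from the denominator
      - fraction of targeted sites that were "not validated"
      - fraction of ALL targeted sites that validated
    """
    targeted = validated + invalidated + not_validated
    assessable = validated + invalidated
    return (
        validated / assessable,
        not_validated / targeted,
        validated / targeted,
    )


# HYPOTHETICAL counts per 10,000 targeted SNPs (illustration only).
rate, not_validated_fraction, overall = validation_summary(
    validated=5054, invalidated=2806, not_validated=2140
)
print(f"validation rate:        {rate:.1%}")
print(f"not validated fraction: {not_validated_fraction:.1%}")
print(f"validated of all sites: {overall:.1%}")
```

The third number is the validation rate with the "not validated" sites left in the denominator, which is why it is lower than the first.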

This is key. Those “not validated” do include quite a few of those “low quality” candidates that wouldn’t validate by Sanger either, of course. And those make up a relatively high proportion of the Illumina-specific SNPs (21.4% compared with CG-specific SNPs at 12.9%).

Now go back to the total counts and extrapolate that value to actual variant counts. If 21.4% of Illumina-specific variants are of this ilk, that brings the Illumina-specific count down to 271,248 SNPs. If 12.9% of CG-specific variants are of this ilk (which seems reasonable given both Sanger and SureSelect), the CG-specific count goes down to 86,732 SNPs.

If you’re following me, we now have variant counts to which we can apply our “validation rate” to determine the actual number of true positives in a way Cliff might approve of (but without using the inadequate Sanger data). Here are the results:
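The arithmetic behind that extrapolation can be sketched as below. The "good" call counts and the 64.3% Illumina rate are the figures quoted in this post. The CG rate is an assumption: it is derived here by treating 53.9% as the CG counterpart of Illumina's 50.5% (validated out of all targeted sites) and dividing by the 87.1% of CG-specific sites that could be assessed. The function name is hypothetical.

```python
def extrapolated_true_positives(good_calls, validation_rate):
    """Scale a count of assessable ("good") calls by a validation rate."""
    return good_calls * validation_rate


# Platform-specific counts after removing the "not validated" fraction,
# as quoted in the text.
illumina_good = 271_248
cg_good = 86_732

# Illumina-specific SureSelect validation rate, as quoted in the text.
illumina_rate = 0.643

# ASSUMPTION: CG-specific rate back-calculated from 53.9% of all targeted
# sites validating and 12.9% of sites being "not validated".
cg_rate = 0.539 / (1 - 0.129)

illumina_tp = extrapolated_true_positives(illumina_good, illumina_rate)
cg_tp = extrapolated_true_positives(cg_good, cg_rate)

print(f"Illumina-specific extrapolated true positives: {illumina_tp:,.0f}")
print(f"CG-specific extrapolated true positives:       {cg_tp:,.0f}")
```

These are rough estimates that are only as good as the rates fed in. Under these assumptions the Illumina-specific count stays well above the CG-specific one.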

Total   Extrapolated "good" calls   Extrapolated validation

This being the case, I think it’s clear why we say “Illumina was more sensitive.” I feel confident these are the numbers to use, and Illumina detected quite a few more total SNPs. However, it clearly has a higher error rate as well, so that can affect things downstream, as it’s not trivial to differentiate all those errors from the total.

As for the criticism regarding the chances of getting a 2/15 validation rate if the actual validation rate is 64.3%: for one thing, I think we should use different numbers, in this case 2/20 (the total number of those tested and validated by Sanger) and 50.5% (the total number tested and validated by SureSelect). That detail aside, you’re still going to get a very small probability (e.g. a hypergeometric p of 0.0001).

But I can also look at it another way. What’s the probability of the CG-specific result doing the same thing? It’s okay, but it’s not that likely (p=0.11). Yet you’re talking about 53.9% and 13/20 there.

There are two reasons for that:
1) 20 is a small number. With a ratio around 50-55%, unless you get 10 or 11 out of 20, you’re deviating pretty dramatically. In fact, the range for p > 0.05 with 20 pulls and a 53.9% ratio is only from 7 to 15. This is why I said we “would have had to do hundreds of Sangers”.
2) Sanger sequencing is different from SureSelect target enrichment sequencing anyway. It’s not a true subset in the first place, and is susceptible to sources of error that don’t affect next-gen sequencing.
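The probabilities discussed above can be roughly reproduced with a simple binomial model. This is a sketch under an assumption: the hypergeometric test mentioned earlier needs a population size that isn't given here, so the binomial (sampling with replacement) stands in for it, and the helper names are hypothetical.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_cdf(k, n, p):
    """Probability of k or fewer successes in n independent trials."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))


# Cliff's framing: 2 or fewer of 15 validate if the true rate is 64.3%.
cliff_tail = binom_cdf(2, 15, 0.643)

# The alternative framing from this post: 2 or fewer of 20 at 50.5%.
post_tail = binom_cdf(2, 20, 0.505)

# CG-specific: exactly 13 of 20 validate if the true rate is 53.9%.
cg_point = binom_pmf(13, 20, 0.539)

print(f"P(X <= 2 | n=15, p=0.643) = {cliff_tail:.2e}")
print(f"P(X <= 2 | n=20, p=0.505) = {post_tail:.2e}")
print(f"P(X = 13 | n=20, p=0.539) = {cg_point:.3f}")
```

Under this model the first value falls below 1 in 10,000, in line with Cliff's statement, and the CG point probability comes out close to the p=0.11 quoted above. With only 20 draws, even the single most likely outcome has a modest probability, which is the small-sample point made in reason 1.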

Anyway, I don’t think that criticism is overly fair. Really, to me, it only supports that our Sanger data should not be used this way in the first place.

Finally, I don’t want to seem like I’m bashing CG here. To the contrary, CG performed exceptionally well in our paper. From what I saw, it is without a doubt more accurate.

(Maybe I should write that paper where I add SOLiD into the mix…)


  1. A confidence level of 95% and a confidence interval of 5% for each of the platform-specific call sets requires a minimum sample size of ~380. Any further estimation based on a statistically insignificant set is inconclusive. That's why we went on to SureSelect at a larger scale, which gives us a statistically significant result.

    As mentioned in the paper, SureSelect may have a potential bias since it was followed by Illumina sequencing. But if there were a strong bias towards Illumina due to systematic errors, the invalidation rate for Illumina itself probably wouldn't be as high as that for Complete.

    Let's take the existing Sanger numbers and calculate it once again with their possible errors. With the same 95% confidence level mentioned above, the best possible validation rate for Illumina is 30% and the worst for Complete is 83%, which convert into 104K and 83K true positives in their specific call sets, respectively. That said, Illumina still has the higher sensitivity, whereas Complete is more accurate (lower FDR).

    If it looks unfair, that's the problem of extrapolating on a set with big error bars. One thing that is true is that we can do a larger scale of Sanger sequencing on the specific calls, then we can have a better sense of the potential ground truth which will be less controversial.

    Until then, we gotta believe that they both have their goods and bads, and performed very well overall.

    1. Thanks Hugo.

      I do think an alternative for this kind of comparison is to validate on a completely different technology. For example, 454 or Ion Torrent or even SOLiD.

      That said, I'm certain (from having a look at such data) that our conclusions will not change after doing so.

  2. Yea, I think 454 makes sense too. Not sure if Ion Torrent and SOLiD are good for validation with their relatively higher error rates.