Two major experiments covered:
tumor/normal ovarian cancer
monozygotic twins (schizophrenia)
Even at CG data's low error rate (1/10,000), that's ~30k errors per genome, which is far too many for a twin comparison.
They detected 2.7M shared variants, but also 46k discordant variants between the twins.
Individual filters were used for quality, genomic complexity, and bioinformatic errors:
Quality: low read depth, low variation score, SNP clusters, indel proximity (indel within 5 bp of the SNP)
Complexity: simple repeats, segdups, homopolymer stretches
Bioinformatic errors: re-analysis in collaboration with RealTimeGenomics
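The per-site quality filters above could be sketched roughly like this. This is a minimal Python sketch of my own; the record format and all thresholds except the 5 bp indel-proximity window are assumptions, not the values used in the talk:

```python
# Hypothetical sketch of the per-site quality filters described above.
# Only the 5 bp indel window comes from the talk; min_depth, min_score,
# and cluster_window are placeholder values I made up.

def passes_quality_filters(variant, nearby_snps, nearby_indels,
                           min_depth=20, min_score=40,
                           cluster_window=10, indel_window=5):
    """Return True if the variant survives the quality filters."""
    if variant["depth"] < min_depth:   # low read depth
        return False
    if variant["score"] < min_score:   # low variation score
        return False
    # SNP cluster: another SNP within cluster_window bp of this one
    if any(abs(pos - variant["pos"]) <= cluster_window for pos in nearby_snps):
        return False
    # Indel proximity: an indel within indel_window bp of the SNP
    if any(abs(pos - variant["pos"]) <= indel_window for pos in nearby_indels):
        return False
    return True

v = {"pos": 1000, "depth": 35, "score": 60}
print(passes_quality_filters(v, nearby_snps=[], nearby_indels=[]))     # True
print(passes_quality_filters(v, nearby_snps=[], nearby_indels=[997]))  # False
```

In a real pipeline these checks would run over a VCF, with the complexity filters (simple repeats, segdups, homopolymers) applied as region-based masks rather than per-site tests.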
After filtering: 1.7M shared variants, 846 discordant variants.
So basically they swung the error profile from type 1 (false positives) to type 2 (false negatives).
All 846 discordancies were then validated by Sanger sequencing.
Also 2 of the shared variants were found to actually be discordant.
This reduced the error rate to 4.3x10^-7 (from 1.79x10^-4).
Of the 846, 541 were false positives.
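Some back-of-the-envelope arithmetic on those numbers (my own calculation, not from the slides):

```python
# Error rates quoted in the talk
before = 1.79e-4  # discordance rate before filtering and validation
after = 4.3e-7    # after filtering and Sanger validation

print(round(before / after))  # ~416-fold reduction in error rate

# Fraction of the 846 discordant calls that were false positives
print(round(541 / 846, 2))  # ~0.64
```

So roughly two-thirds of the surviving discordant calls were still false positives until Sanger validation cleaned them up.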
NA19240: 1000 Genomes Illumina sequencing versus CG
Before filtering, CG had more false negatives, Illumina had more false positives.
After filtering, both were down to about 1% error rates.
As for the tumor/normal comparison, filtering made quite a difference: from 437 calls down to 21. But of course, this also kills some true positives.
Summary: Very good talk; I liked this one. I was pleasantly surprised to see RealTimeGenomics get a shout-out as one of their filter approaches. I've used their software myself, it's very good, and I hope to collaborate with them again, especially after seeing it help other groups with their filters. Also, I think there's a lot to be said about the different error profiles of CG versus BWA/GATK et cetera. I'm leaning toward combined approaches... for example, why not do exome-seq on Illumina as a validation of CG and to adjust error rates?
Bonus: Hilarious pic of a desk covered in hard drives. In our lab's case, I think they're stuffed under people's desks. Someone needs to do something about the drive overload.