Thursday, June 16, 2011

Joke Reumers Talk

Two major experiments covered:
tumor/normal ovarian cancer
monozygotic twins (schizophrenia)

Even at the low error rate of CG data (1/10,000), that's on the order of 300k errors across a 3 Gb genome, which is far too much for twins.
Detected 2.7M shared variants, but 46k discordant variants between twins

Individual filters were used for quality, genomic complexity, and bioinformatic errors:
Quality: low read depth, low variation score, SNP clusters, indel proximity to a SNP (within 5 bp)
Complexity: simple repeats, segdups, homopolymer stretches
Bioinformatic errors: collaboration with RealTimeGenomics, re-analysis of the data
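A rough sketch of what the quality filters might look like in code. The indel-within-5-bp rule is from the talk; the read-depth cutoff and the SNP-cluster definition (5 SNPs within 20 bases, per the discussion in the comments) are my guesses, not her exact values:

```python
# Hypothetical re-implementation of the quality filters described above.
# MIN_DEPTH and the cluster definition are assumptions, not the talk's exact values.

MIN_DEPTH = 10        # "low read depth" cutoff (assumed)
CLUSTER_WINDOW = 20   # bases; SNP-cluster window (assumed)
CLUSTER_COUNT = 5     # SNPs within the window that trigger the cluster filter (assumed)
INDEL_DISTANCE = 5    # SNP within 5 bp of an indel is filtered (from the talk)

def snp_cluster(pos, all_snp_positions):
    """True if >= CLUSTER_COUNT SNPs fall within CLUSTER_WINDOW bases of pos."""
    nearby = [p for p in all_snp_positions if abs(p - pos) <= CLUSTER_WINDOW]
    return len(nearby) >= CLUSTER_COUNT

def near_indel(pos, indel_positions):
    """True if any indel lies within INDEL_DISTANCE bp of the SNP."""
    return any(abs(p - pos) <= INDEL_DISTANCE for p in indel_positions)

def passes_quality(variant, all_snp_positions, indel_positions):
    """Apply the depth, SNP-cluster, and indel-proximity criteria to one SNP call."""
    return (variant["depth"] >= MIN_DEPTH
            and not snp_cluster(variant["pos"], all_snp_positions)
            and not near_indel(variant["pos"], indel_positions))
```

The complexity filter would then just be a lookup of the variant position against repeat/segdup intervals on top of this.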

1.7M shared variants, 846 discordant variants

So they basically swung the error profile from type I errors (false positives) to type II (false negatives).

All 846 discordances were validated by Sanger sequencing.

Also, 2 of the shared variants were found to actually be discordant.

This reduced the error rate to 4.3x10^-7 (from 1.79x10^-4).

Of the 846, 541 were false positives.
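For what it's worth, the quoted rates work out to roughly a 400-fold reduction. The per-base denominator over a ~3 Gb genome is my assumption; the talk didn't spell out what the rates were measured against:

```python
# Error rates as quoted in the talk; the rest is simple arithmetic.
before = 1.79e-4
after = 4.3e-7

fold_reduction = before / after
print(f"~{fold_reduction:.0f}-fold reduction in error rate")

# If those were per-base rates over a ~3 Gb genome (my assumption),
# the expected error counts per genome would be:
genome_size = 3e9
print(f"before: ~{before * genome_size:,.0f} errors, "
      f"after: ~{after * genome_size:,.0f} errors")
```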

NA19240: 1000 Genomes Illumina sequencing versus CG

Before filtering, CG had more false negatives, Illumina had more false positives.

After filtering, they were both down to about 1% error rates.
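A comparison like the NA19240 one boils down to set operations on the two call sets, with each platform's private calls being the candidate errors. A minimal sketch (the variant keys and counts here are invented for illustration, not their data):

```python
# Toy comparison of two call sets keyed by (chrom, pos, alt).
# All variants below are made up for illustration.
cg_calls = {("chr1", 100, "A"), ("chr1", 200, "T"), ("chr2", 50, "G")}
illumina_calls = {("chr1", 100, "A"), ("chr2", 50, "G"), ("chr2", 75, "C")}

shared = cg_calls & illumina_calls          # concordant between platforms
cg_only = cg_calls - illumina_calls         # candidate CG FPs / Illumina FNs
illumina_only = illumina_calls - cg_calls   # candidate Illumina FPs / CG FNs

print(len(shared), len(cg_only), len(illumina_only))
```

Without an orthogonal truth set you can't tell which platform is wrong for the private calls, which is presumably why they Sanger-validated the discordances in the twin study.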

As for tumor/normal, adding the filtering made quite a difference... from 437 calls down to 21. But of course, this kills some true positives.

Summary: Very good talk, I liked this one. I was pleasantly surprised to see RealTimeGenomics get a shout out as one of their filter approaches. I used their software myself, it's very good and I hope to collaborate again with them, especially after seeing it helping other groups with their filters. Also, I think there's a lot to be said about the different error rates with CG versus BWA/GATK et cetera. I'm leaning toward combined approaches... for example, why not do exome-seq on Illumina as a validation of CG and to adjust error rates?

Bonus: Hilarious pic of a desk covered in hard drives. In our lab's case, I think they're stuffed under people's desks. Someone needs to do something about the drive overload.


  1. Michael, do you know where one might find more specific information on the exact filtering? Like what were they using for their segdup and homopolymer bounds?

  2. Joke was not too specific, but it seemed the filtering was purposely made very stringent.

    Their "quality" filter (in addition to low read depth and variation score) included:
    SNP cluster: I believe it was 5 SNPs within 20 bases of each other (though I admit that may have been me, not her...)
    Indel proximity to SNP: An indel within 5bp of a SNP.

    Complexity: Seems like they simply excluded variants in repeat elements (simple repeats, microsatellites, etc.), segmental duplications, and homopolymer stretches (not certain of the exact definition for their homopolymer bounds). I think it might have been based on UCSC's tracks, but that's a guess.

    Bioinformatics: They re-analyzed the data using RealTimeGenomics and I'm guessing they kept variants called by both CGI and RTG.