Friday, December 9, 2011

GATK's available annotations

Perhaps because it changes so often, the list of GATK's available annotations for VCF files doesn't seem to be posted anywhere online that I've seen. The GATK site says to run GATK with the "--list" parameter to print them, but doing that requires supplying valid input files and such. Basically, it's a pain.

So here's the list from GATK v1.3-21-gcb284ee

Available annotations for the VCF INFO field:

Available annotations for the VCF FORMAT field:

Available classes/groups of annotations:

No promises about how accurate this is for any other version.

Wednesday, November 16, 2011


I'm not a huge politics guy, so I don't want to go on a tirade about the Stop Online Piracy Act. Suffice it to say, it's a huge censorship bill parading as a bill to protect intellectual property. While I think the majority of us support protecting IP, I can't imagine the best way to do so is to monitor everything we do and censor websites based on some government-backed list of sensitive content.

As an example, if someone were to post copyrighted material in the comments section of my blog without my notice, my blog could potentially be shut down (censored) because of it. If I were to link to a site that had, somewhere on it, shared copyrighted material (even if I had no idea it was there and didn't intend for anyone to go there and see it or download it), my blog could be shut down (censored).

Moreover, the bill basically forces both providers and hosting services to strongly monitor content and shut down sites that potentially "infringe" on protected IP. Ever posted a picture of something that was copyrighted? Ever shared a link to a YouTube video with a copyrighted song in the background? That could be enough to get you shut down (because your host or provider doesn't want to be sued).

My wife and I often talk about the situation in Japan (she's Japanese) regarding the Fukushima nuclear reactor, and how censorship and control of the media in Japan are so strong that the general public has no idea how dire the situation is. We even see that censorship and media control bleed over here to America, where the general public is under the impression the nuclear situation isn't as bad as it is, in large part because the government and big media have an incentive to see the nuclear power industry succeed.

Now here's the scary thing: If I were to start posting excerpts from copyrighted articles about that topic to respond to, if SOPA were to pass, my blog could potentially be shut down. I could personally be denied access. It's unclear to me exactly how much power this bill would give big media and the government. And that's the major problem.

In our field of genomics, a lot of us rely on the free sharing of information and media to rapidly advance the science. I understand that this bill is meant to limit piracy of software and other digital media, but it represents a foot in the door for all sorts of censorship. Could SEQanswers, for example, be sued for hosting a post that contains the Illumina adaptor sequences? It has certainly been threatened over such things in the past, but with SOPA passed, SEQanswers could very well be shut down for that. What a detriment that would be to the genomics and bioinformatics community.

Anyway, I just wanted to share this on my outlet to the world, as it is a very important issue generally and to our field in particular.

If you are an American and you do not support SOPA, please send a notice to your congresspeople telling them not to support it, either.


A secure, offline storage solution with automagical backing up. Forget the cloud for sensitive data.


Thursday, November 3, 2011

There's something funny about this quality score


Thursday, October 13, 2011

Exome Sequencing Q&A

In case you missed it, the most recent issue of Genome Biology is focused on exome-sequencing. Included among a number of exome-seq platform comparison papers (like this one and this one, which I intend to discuss in comparison with my own recent publication in Nature Biotech in a future post) is a fun Q&A article entitled "Exome sequencing: the expert view". The article asked three experts (Dr. Leslie G. Biesecker of NHGRI, Dr. Kevin V. Shianna of Duke University, and Dr. Jim C. Mullikin of NHGRI) a series of questions related to exome-seq ranging from what we've learned to whether exome-seq will be around in a couple years.

Since I like to think of myself as a bit of an expert on the topic (mostly because I dedicated the past year of my life to it), I thought it might be fun to answer the questions myself and discuss a bit what I think of the answers in the article.

How can exome sequencing contribute to our understanding of the dynamic nature of the genome?
Exome sequencing is quite literally the best way we currently have to assess the most interpretable part of the genome: the protein-coding sequence. While it accounts for only about 1% of the whole genome, that 1% is the part we understand best, because it's the part we can interpret through the central dogma. Basically, when we see a mutation in an exon, we can determine how the RNA will be affected, and subsequently how the protein will be mutated. We can determine precisely which amino acids will be changed, and what they will turn into. And thanks to decades of work on proteins and amino acids, we can predict fairly reasonably how damaging a particular amino acid change will be. We can also identify nonsense mutations, which almost always result in loss of the protein, as well as frameshifts (which typically lead to a premature stop), splice-site variants, and regulatory-site mutations (assuming the UTRs/upstream and downstream regions are enriched).
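That central-dogma step is mechanical enough to sketch in a few lines of awk. This is just a toy, with a three-entry stub of the 64-entry codon table and a hard-coded example variant, not any actual annotation tool:

```shell
# Toy codon lookup: classify a single-base change by what it does
# to the encoded amino acid. Three-entry stub of the codon table.
call=$(awk 'BEGIN {
  aa["CAG"] = "Gln"; aa["TAG"] = "Stop"; aa["CAC"] = "His"
  ref = "CAG"; alt = "TAG"        # a C>T change at codon position 1
  type = (aa[alt] == "Stop") ? "nonsense" : "missense"
  printf "%s(%s) -> %s(%s): %s", ref, aa[ref], alt, aa[alt], type
}')
echo "$call"    # CAG(Gln) -> TAG(Stop): nonsense
```

Real annotation tools do essentially this lookup at scale, plus all the transcript and reading-frame bookkeeping.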

I put it like this to someone not long ago: What would you look at if you sequenced someone's genome? Exons. That is literally the first thing, because they're the low-hanging fruit. If you find nothing there, you might zoom out and look at splice sites (which exome-seq generally captures), microRNAs (which exome-seq generally captures), and regulatory elements (which are becoming recognized as increasingly important--be ready for regulatory element/transcription factor binding site enrichment kits in the near future). As for the intergenic regions, they're such a mystery with regard to function that I feel secure saying we basically don't know what to do with variants in those regions.

At the very least, I have no doubt exome sequencing will drive a revolution in Mendelian disorder research (it already is). We've already seen a number of solved disorders published, and as an increasing number of samples are sequenced, that will only improve. Because Mendelian disorders are most often caused by changes in the protein-coding regions of genes, exome-seq is a prime way of solving them. The one major barrier I see is the (perhaps surprising) prevalence of dominant low-penetrance disorders. Mendelian disorders with this mode of inheritance are inherently difficult to solve and will require a large number of samples. That said, this again positions exome-seq as a major method, and I feel it secures exome-seq as a major technology for at least a few more years.

In the paper, Kevin Shianna mentions some shortcomings. In particular, he mentions that structural variations (SVs) are difficult to detect by exome-seq, which is certainly true. I would say there is some light at the end of that tunnel, at least with regards to copy-number variations (CNVs), as some successful work is already showing positive results with detecting CNVs in exome-seq.

How much has exome sequencing been driven by cost alone?
Certainly I think the increasing prevalence of exome-seq has been almost completely driven by cost. Targeted sequencing (and subsequently exome-seq) was developed precisely because labs wanted to assess particular regions by sequencing without paying for an entire genome sequence (most of which they wouldn't even look at). That said, even if the exome-seq and WGS assays cost the same, there would still be significantly higher costs, in both time and money, associated with the bioinformatics, analysis, and storage of WGS data compared with exome-seq data. Even at equivalent assay cost, exome-seq is cheaper. Will that ever change? Unlikely, I think.

Consider the value of an exome versus the value of a whole genome currently. One could certainly argue that because the information in the exome is so much more meaningful (please don't write me letters for saying that), every exome base is worth significantly more than intergenic bases. In the paper, LGB and JCM make the point that an exome costs about 1/6th what a whole genome costs. That seems pretty accurate to me. And that does demonstrate how much more valuable we consider the exome compared to the other regions. Because of that, I would not say that exome-seq has been driven by cost alone; it has also been driven by the exome being considered more valuable per base than the rest of the genome.

What are the major limitations of exome sequencing?
Obviously, any meaningful variation outside of the exome is missed by exome-seq, and that's a major drawback. Just because they're less interpretable does not mean that non-exonic variations are meaningless. Missing SVs is another large drawback, although I feel it's important to point out that SV detection does not require a deep whole genome sequencing experiment--a low depth WGS that is not adequate for small variant calling is more than adequate for SV calling. Therefore, exome-seq could be supplemented by low depth WGS for SV calling without going as far as doing a full WGS. I'd estimate the cost of an exome-seq plus low-depth SV-WGS to still be less than 1/4th the cost of a full WGS.

The other major issue, which is also brought up in the paper, is the fact that exome-seq misses almost everything it doesn't target. If the mutation causing a particular disorder is in an exon that doesn't happen to be on your target list, you're going to miss it. WGS, on the other hand, would likely catch it. That's a major limitation. However, if the exon is poorly annotated and that's why it's not on the exome-seq design, there's a good chance you'll miss it with WGS as well. So while this is a major limitation of exome-seq, it's as much a limitation of sequencing in general: mutational analysis is heavily reliant on annotation, which is incomplete at best.

What lessons from exome sequencing studies can be applied to whole genome sequencing studies?
Most of what we've learned from exome sequencing studies is directly applicable to whole genome sequencing studies. Currently, the field is in a state where exomic variation is highly detectable and interpretable, and that's due in no small part to exome-seq. Coding region annotation in particular has become significantly more powerful as a result of all the exome-seq work that's been done. We have a much stronger appreciation for the limitations of sequencing as well. Although we are very sensitively detecting variants, even at minuscule false positive rates we have trouble detecting disease-causing variants with single samples. As far as I'm aware, not a single Mendelian disorder has been solved by sequencing a single sample (exome or otherwise): we need some way to eliminate false positives, and we use family members and additional samples for that.

And that brings up the other major lesson we've learned that we'll need to apply to WGS in the future: Small sample groups are simply not adequate for most purposes. Even for monogenic conditions, numerous samples and multiple families are required. Much like GWAS and linkage studies, we're going to need to sequence a lot of people to find meaning. This has a lot to do with sensitivity and specificity. A false positive rate of 1% sounds okay in a lot of fields, but in genomics, that's 400 false positives in the exome and 30,000 in the whole genome. Good luck finding your one disease-causing mutation in there! But we're doing it through clever decision-making with regard to which family members to sequence, which individuals to sequence, et cetera. And that is perhaps the most useful lesson from exome-seq so far.
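For concreteness, here's the back-of-the-envelope arithmetic behind those false positive counts. The per-individual variant counts (~40,000 in the exome, ~3,000,000 genome-wide) are my own rough ballpark figures, not from the article:

```shell
# 1% false positive rate applied to rough per-individual variant counts.
fp=$(awk 'BEGIN {
  rate = 0.01                  # 1% FP rate
  printf "exome: %d, genome: %d", 40000 * rate, 3000000 * rate
}')
echo "$fp"    # exome: 400, genome: 30000
```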

How do exome sequencing studies contribute to our mechanistic understanding of disease?
The most obvious answer is that exome-seq is very likely to be how we find the genetic etiology of the vast majority of Mendelian disorders. How this will translate to complex disease has yet to be seen, though it's certainly being pursued, particularly in the field of autism.

I would like to respond to a statement made by Kevin Shianna in the paper:
However, exome studies will have very limited power to identify causative variants in regulatory regions spread across the genome (transcription binding sites, enhancers, and so on). Implementing a WGS approach would allow detection of variants in these regions, thus increasing our knowledge of disease beyond the coding region of the genome.
While it is certainly true that exome-seq may be unable to identify mutations in the regulome (yes, regulome), WGS is not the only alternative to it. Why not instead supplement the exome-seq with a regulome-seq? I guarantee you companies are going to create such a thing, and even if you don't want to wait for them to be commercially available, one could create a custom target enrichment that covers the regulome as well.

Back to the cost issue, at about 1/6th the cost of WGS each, exome-seq + regulome-seq would still only come to about 1/3rd the cost of WGS. Outside the exome and regulome, is there much more in the genome that we can even comprehend by sequencing at this moment? Structural variation, perhaps, but as I've already mentioned, SVs can be detected with a low depth WGS. So for half the cost, we may be able to obtain the same biological meaning. That's something to seriously consider when deciding how to budget your sequencing. If you can squeeze out double the samples and obtain practically the same info by doing exome-seq + regulome-seq + low depth WGS, wouldn't you?

I would also add that while it may not be mechanistic, the rise of exome-seq has led to a realization that we are not adequately prepared from a policy standpoint yet. More than anything, I think it has led to a general realization that medical genetics, genetic counseling and DTC genomics are arriving before society is really ready for them. And that's an issue that I think is going to need increased attention as we researchers charge forward with sequencing everybody.

Does exome sequencing have a limited 'shelf life'?
Not as limited as many people think. Even in the $1000 genome world, exome-seq will still have a place. I'll discuss this a bit more shortly, but there are genomic elements that are more sensitively determined by exome-seq. I feel strongly that once sequencing is sufficiently cheap, exome-seq will become the de facto standard for genomic information, much as microarrays were for a good decade. This is due to a combination of factors--the majority of the genome being uninterpretable, the storage cost of genomes versus exomes, the bioinformatic challenges of whole genome versus whole exome, et cetera. The article goes into good depth on many of those issues. The one I'd focus on, however, is that until WGS is actually cheaper than exome-seq, exome-seq will always have a place in our field.

I have to disagree strongly with the following statement:
Yes, as soon as the difference in cost between exome and whole genome diminishes (which will be soon) and issues with data management and storage are resolved, whole genome sequencing will be the method of choice.
First of all, he knows as well as the rest of us that "issues with data management and storage" are not trivial to resolve. But beyond that, this is the stance that basically says, "what's a thousand dollars for a whole genome compared to $300 for just the exome?" I state those numbers because that's where we're heading: at some point exome-seq will stop getting cheaper because the enrichment assay will always cost something, and the same goes for WGS. And my answer to that question is simple: You'll still be able to get three exomes for the price of one whole genome.

The real thing that will limit exome-seq's lifespan is the interpretability of the rest of the genome. If we can utilize the intergenic data to a greater degree, then those bases become more valuable. Then WGS becomes more appealing. Until then, it's all exome (and regulome--I swear, it is coming soon!).

How much do you think that future research will be restricted by the IT-related costs of the analysis?
A great deal. Clouds are not cheap. I'm not convinced they're even the answer at this point. But are clusters? I'm not so sure there, either. I'm not convinced we have the IT solution to this yet because, frankly, I don't think the hardware manufacturers have realized what a lucrative market it's going to be.

I had a conversation about this with somebody not long ago. I said, we've gone from producing gigabytes to terabytes to petabytes over the span of four years in the sequencing world. By now I think there's a good chance we're at or nearing a worldwide exabyte of genomic information. As we start sequencing more and more people, we're going to start hitting the limit for storage capacity worldwide. Sounds crazy, right? But is it?

I also had another rather humorous thought that evolved from that one. How much energy will the number of hard drives needed to store the entire human population's genomic data require? How much heat will those drives produce? In the future, could sequencing be a major cause of global warming? Think about it!

Frankly, I've started to take the issue seriously. One of the tasks on my plate is a thorough assessment of alternative storage formats for genomic data. But that's a post for another day.

Are there any advantages of whole exome over whole genome sequencing?
Yes. Let me say that in no uncertain terms: yes.
And by that I don't mean what Leslie Biesecker and Jim Mullikin said in their response in the paper. There are literally exonic regions that are better resolved by exome-seq than WGS. We demonstrate that in our recent Nature Biotech paper. A typical WGS will miss a small but meaningful number of exonic variations that are detected by exome-seq. To be fair, the opposite is also true: WGS will detect some exonic variations missed by exome-seq. To me, that says one thing: To be truly comprehensive at this point in time, we need to do both. Naturally, budgets prevent such a thing, but it's important to recognize that targeted enrichment can allow sequencing of regions missed by WGS, and that this is an advantage of exome-seq.


Biesecker LG, Shianna KV, Mullikin JC. Exome sequencing: the expert view. Genome Biol. 2011 Sep 14;12(9):128. [Epub]

Tuesday, October 4, 2011

And he's back with a site redesign!

You may have been wondering: Where have I been?

Choose from the following:

Getting married.
Driving my wife's family from Japan around the Bay Area.
On my honeymoon.
Finishing up a huge paper.
Sick with meningitis.
Adopting a cat.

If you checked any of the above boxes, you're correct! Congratulations.

Yes, it's been a very busy time in my life, but I am finally back and promise to update more often than ever before.

Also, I hope readers like the site redesign. It's a new offering from Blogger: a dynamic theme that lets you choose how you'll view the site from the dropdown menu on the left. I'm a big fan of the "Magazine" look, so that's what I've set as the default.

Also, here's a cool site: [] Basically, they let you sign up and will host your DTC SNP chip data with them free (with all the consequences* that comes with).

Finally, this offer from 23andMe is interesting. $999 for an 80x human exome. That's raw data only, folks. No analysis. I've got a long post about this whole thing in mind that I'll put up in the coming days. Still, it's a great opening salvo in the DTC/PGM era. Exome-seq DTC is truly here. Finally.

*Consequences unspecified. You should be protected by GINA regardless and, honestly, if someone wanted to know your genotypes so bad they could just pick up that coffee cup you threw away last week and do it themselves. But whatever, you still better ask dad before you post half of his DNA on the web.

Tuesday, August 2, 2011

$50 off 23andMe Coupon Code

Hello friends!

23andMe just sent me an email with a coupon code in it for $50 off! Apparently it's share-able, so if you've been waiting for a genomic deal, here you go!

To use this coupon, visit our online store and add an order to your cart. Click "I have a discount code" and enter the code below.
$50 Off
Coupon code: 4DPGQP
Share with your friends!
(Valid for new customers only) 

If you do use it, please share your data with me!

Thursday, July 28, 2011

Intersecting Indels with VCFtools

Indel detection is not what I'd call accurate at this point in our history. I, along with probably every other bioinformatician and genomicist looking at next-gen data, have noticed that immediately adjacent indels are constantly called as separate events when they're really the same variant, just called differently due to sequence context and the nature of our variant callers.

A band-aid approach is to simply look for overlap in indel calls within a window. Even a tiny window can make a big difference to small indels.

To do this, I currently use VCFtools, which makes it very simple. Specifically, use the vcf-isec command with the -w parameter.

If I compare two libraries sequenced from the same individual that had indels called independently (using the same method), I end up with a few thousand overlapping indels that would have been assessed as independent from one another if I looked for exact overlap.

Exact overlap:
$ vcf-isec -f -n =2 -o indels1.vcf.gz indels2.vcf.gz | wc -l
Overlap +/- 5b:
$ vcf-isec -f -n =2 -o -w 5 indels1.vcf.gz indels2.vcf.gz | wc -l
I realize it's not that astounding a difference, but keep in mind this is looking at two different libraries from the same individual. If you're comparing calls from two completely different sequencing platforms or variant callers, these numbers jump quite a bit.
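If you're curious what the windowed matching is actually doing, here's a toy awk sketch of the same idea. This is not vcf-isec; a.txt and b.txt are made-up stand-ins for the CHROM/POS columns of two indel call sets:

```shell
# Report positions in a.txt that have a match in b.txt on the same
# chromosome within +/- 5 bp (what a -w 5 window buys over exact match).
printf 'chr1 1000\nchr1 5000\n' > a.txt
printf 'chr1 1003\nchr1 6000\n' > b.txt
hits=$(awk 'NR == FNR { chrom[NR] = $1; pos[NR] = $2; n = NR; next }
  { for (i = 1; i <= n; i++) {
      d = pos[i] - $2
      if (chrom[i] == $1 && d <= 5 && -d <= 5) { print $1, $2; break }
  } }' b.txt a.txt)
echo "$hits"    # chr1 1000
```

Only chr1:1000 matches (1003 is 3 bp away); chr1:5000 has no partner within the window.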

Tuesday, July 26, 2011

The Ion Torrent Paper (Nature)

An integrated semiconductor device enabling non-optical genome sequencing

I'm just going to discuss my thoughts and comments on the paper, their findings, and how they relate to claimed specs for the IonTorrent PGM.

And just for some comparisons later on, the current Ion Torrent PGM product sheet is also attached [click].

I'm going to try to step around some of the issues with the paper that have been well covered at Daniel MacArthur's blog Genetic Future. I think he is pretty fair to the paper in criticizing its "validation rate" and so forth. Suffice it to say, a year and a half to two years ago a 15x human genome would perhaps have been considered adequate, but in a paper coming out of LifeTech, manufacturer of the SOLiD sequencer, if you're going to use a SOLiD whole genome sequence as validation, let's go for at least 30x coverage.


"..there is a desire to continue to drop the cost of sequencing at an exponential rate consistent with the semiconductor industry's Moore's Law..."
They bring up Moore's law repeatedly, and they even sequenced Moore himself in the paper. But wait a second... the per-base cost of sequencing is dropping significantly faster than Moore's law! I suppose it's a minor complaint, but let's give sequencing the credit it's due.

Also, let me complain very briefly about the use of a few buzz terms:
"To overcome these limitations and further democratize the practice of sequencing, a paradigm shift based on non-optical sequencing on newly developed integrated circuits was pursued."
If there is any term that is going to supplant "paradigm shift" as the default excessively pretentious term in scientific papers, it has to be "democratize". Look, unless this device is offering $100 genomes, it's not democratizing sequencing. Can we leave sensationalist buzz words for the advertisements and stick to reality for the Nature papers? (Wait, what am I saying?) I appreciate that we're talking about a non-light system here, but the observation of protons rather than photons released upon base incorporation isn't really a paradigm shift. Once we're taking pictures of DNA with electron microscopes and reading the entire genome instantaneously in one shot, then we can start talking about paradigm shifts.


Okay, let's look at the scalability based on their data, and what's being touted in their product sheet.

A typical 2-h run using an ion chip with 1.2M sensors generates approximately 25 million bases.

That pretty much throws out the Ion 314 (1.3M wells) for human genome sequencing. A 30x diploid human genome would require 640 days on the 1.2M sensor chip in the paper. Even just 1x coverage would take three weeks. Yikes.
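For the curious, here's where the 640-day figure comes from, taking a diploid human genome as ~6.4Gb of sequence needed per 1x:

```shell
# Runs and days to reach 30x on a 25 Mb-per-run, 2-hour-per-run chip.
est=$(awk 'BEGIN {
  runs = 6.4e9 * 30 / 25e6     # bases needed for 30x / bases per run
  days = runs * 2 / 24         # 2 hours per run
  printf "%d runs, %d days", runs, days
}')
echo "$est"    # 7680 runs, 640 days
```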

Later there is a rather astounding statement:
"At present, 20-40% of the sensors in a given run yield mappable reads."
Room for improvement there, methinks.

In Table 1, they test the 11M chip with E. coli and human. The E. coli run yields 273.9Mb of sequence off that chip. At 20-40% of sensors yielding mappable reads, that gives an average read length of 62b-125b. This is consistent with their finding that 2.6M reads are >=21b and that 1.8M are >=100b, and with Figure S15, where the majority of read lengths appear to be around 110b-120b. So at least the read lengths are not disappointing.
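Backing that read-length range out of the numbers (273.9Mb mapped, 11M sensors, 20-40% of them yielding reads):

```shell
# Average read length = mapped bases / number of reads, at each
# end of the 20-40% sensor-yield range (%d truncates, so the top
# end, 124.5b, shows as 124).
len=$(awk 'BEGIN {
  printf "%d-%d b", 273.9e6 / (11e6 * 0.40), 273.9e6 / (11e6 * 0.20)
}')
echo "$len"    # 62-124 b
```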

Their Ion 318 is the 12-million-well chip. I think this is similar to the 11M chip in the paper. Going back to Table 1, they got 273.9Mb off the 11M chip. At issue is the promised "[starting] 1 Gb of high-quality sequence" off the Ion 318 chip in the Ion Torrent product sheet. Now, I completely believe that advancements have been made since the paper's acceptance on May 26th, 2011, but four times the yield? Not so sure about that claim. I'm not doubting it can get there, but I'll put it this way: This is a paper from the company that makes the product--if anyone can make it work optimally, it should be them. And their optimal report here is significantly lower than what they're advertising. Oh, and the spec sheet has small print next to the Ion 318 chip entry that says "the content provided herein [...] is subject to change without notice". Let's just say I'm skeptical.

Anyway, the rest of the paper is pretty vague. One issue on everyone's mind is the homopolymer problem, which is addressed in a single sentence stating the accuracy of 5-base homopolymers (97.328%--not terrible, but not overly good either) and that it's "better than pyrosequencing-based methods" (read: 454). How about longer ones? No idea; Figure S16b also only goes out to 5b, and the curve isn't looking too encouraging.

Apparently the Ion 316, their 6.3M-well chip, is currently available, and they claim at least 100Mb of sequence per run. This is consistent with their mapped bases in the paper (169.6Mb off a 6.1M ion chip). With this chip, you're talking about 3 days on one machine for 1x diploid human coverage, and about 90 days for 30x coverage. Better, but still not there when it comes to human sequencing. Still, it's at the level of completing entire bacterial genomes in 2 hours, if you're into that sort of thing (and don't have access to a HiSeq or something else...).

Not Quite "Post-Light"

You know, there's a lot of "post-light" jibber jabber in the paper. It's Ion Torrent's favorite buzz phrase, and I'm a fan, actually (much more so than I am of "democratize" and "paradigm shift"). But at this point, with this performance, I'm not sure we're "post-light" yet. The technology is there, but it isn't scaled up enough yet. There's an interesting claim made at the end of the paper:
"The G. Moore genome sequence required on the order of a thousand individual ion chips comprising about one billion sensors. ...our work suggests that readily available CMOS nodes should enable the production of one-billion-sensor ion chips and low-cost routine human genome sequencing."
Doubtless this is long in the works already, and I hope it is a reality. Because making the leap from the things in this paper to a functional 1B sensor chip would make a huge difference.

I'd say I was a bit disappointed with this paper. It felt half done. I'm confused about the way the comparison to SOLiD was done--why wasn't the SOLiD WGS of G. Moore done to an adequate depth? I'm a bit annoyed at the lack of comprehensive information, as well. The homopolymer issue is known--why hide behind homopolymers of 5b or smaller? Just give the whole story in your paper--it's an article in Nature, not an advertisement.

Anyway, to quote a very smart man I know, "it is what it is." Ion Torrent is here to stay and it's only going to improve. I certainly hope it does--I'd love to see it pumping out 1Gb/2hrs with long read lengths.

Wednesday, July 20, 2011

Mendel Google Doodle

Google's current doodle is a very neat nod to the father of genetics, Gregor Mendel! They're celebrating his 189th birthday with a representation of his original experiment--crossing green and yellow pea plants and tracing the color trait.

Monday, July 4, 2011

Accurate genome-wide read depth calculation (how-to)

I'm currently working ferociously on revisions to a paper and need to calculate mean genome-wide read depth as a fine point in the paper. My first inclination was genomeCoverageBed from BEDtools. Trying it out on chromosome 22 first, I noted a huge number (>30%) of the bases had 0 coverage. Of course, this must be because genomeCoverageBed is including the massive centromere (chr22 is acrocentric--the entire p-arm is unmappable heterochromatin). I kind of already knew genomeCoverageBed wasn't meant for this purpose anyway, but I was hoping to stumble upon something.

I decided to Google "genomeCoverageBed centromere" and "genomeCoverageBed not include centromere" and came up with bunk (well, not quite bunk, I came across a blog post about genomeCoverageBed that happens to have been posted tomorrow... yes, tomorrow!).

As I explained in a comment there, I find genomeCoverageBed's approach lacking in meaning. Is it fair to include unmappable regions in a calculation of coverage? That's what led me to ask myself whether "genome-wide coverage" really has a meaningful use as a statistic. We all know you're going to have a ton of bases with 0 coverage in the centromere because they're unmappable, but that says nothing about how your sequencing performed or how your aligner worked. All it says is that those bases are missing from the reference assembly. And that confounds any other meaning you might get from the number, really.

Not to say that genomeCoverageBed is useless. To the contrary, it does exactly what it's supposed to: generate a histogram of genome-wide coverage. But I do not think it's overly useful beyond that histogram. A lot of people like to see a "mean coverage" or "mean read depth" statistic when you talk about a project, and you'd certainly be selling yourself short if you're generating that number with genomeCoverageBed.

I think the author of BEDtools (Hi Aaron if you ever read this!) would be the first to say, "do not use genomeCoverageBed for calculating mean read depth". But BEDtools can, fortunately, give us a very strong way of doing just that through intersectBed and coverageBed.

My solution is quite simple. I take the "gaps" track from UCSC (Tables->All Tracks->Gap) and create a BED file of all regions that are NOT in gaps. You can easily generate this file by obtaining the gap track from UCSC as a BED file and then using BEDtools subtractBed to subtract those regions from a whole genome BED file.

$ subtractBed -a hg19.bed -b ~/resources/hg19_gaps.bed > hg19.gapless.bed

Then, take your gap-less whole genome, run intersectBed with your input BAM file and pipe it to coverageBed again using the gap-less whole genome bed file as your -b. Make sure to use the -hist option in coverageBed.

$ intersectBed -abam in.bam -b hg19.gapless.bed | coverageBed -abam stdin -b hg19.gapless.bed -hist > coverageHist.bed

If you want to save yourself some trouble, you can pipe that to grep and grep out "all"--that's your histogram of genomic coverage across the gap-less genome. Actually, you can calculate mean read depth across this gapless genome in one line without generating any output with some awk magic:

$ intersectBed -abam in.bam -b hg19.gapless.bed | coverageBed -abam stdin -b hg19.gapless.bed -hist | grep all | awk '{NUM+=$2*$3; DEN+=$3} END {print NUM/DEN}'
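To see why that awk works: the "all" lines of the -hist output have the form `all <depth> <#bases at depth> <total size> <fraction>`, so summing column 2 times column 3 and dividing by the total base count gives the mean. Here's a toy example with made-up numbers run through the same awk:

```shell
# Toy coverageBed -hist "all" lines (made-up numbers).
# Mean depth = sum(depth * count) / sum(count) = (0*10 + 30*90)/100 = 27.
printf 'all\t0\t10\t100\t0.1\nall\t30\t90\t100\t0.9\n' \
  | awk '{NUM+=$2*$3; DEN+=$3} END {print NUM/DEN}'
```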

Not everything outside the gaps is mappable, but at least those bases are present in the reference genome.

Here's an example from a real WGS experiment restricted to chr22:

genomeCoverageBed mean read depth: 0.68
gap-less genome coverageBed mean read depth: 30.637

And yes, calculated by other means, average read depth was right around 30x. For the sake of accuracy, I call it "mean read depth of the reference genome assembly". Doesn't roll off the tongue, but that's what it is, and it means a lot more than "genomic read depth" or "genomic coverage", in my opinion.

(All of that said, genomeCoverageBed will run a lot faster than the way above, but the info it reports is different and really serves a different purpose.)

Friday, July 1, 2011


I've been super busy writing a paper lately, so I apologize for the lack of updates. I do intend to comment on the recent 23andMe paper soon™.

A new feature on the blog is my Gist feed and a link to my Gist page. Gist is basically a quick-and-dirty code-sharing site from Github. Since I'm a quick-and-dirty bioinformatics programmer, I'll do my best to keep random code snippets that might have general application for others up there and publicly available. Also, please feel free to fork my code, fix it, et cetera.

My first Gist is an interesting one: A shell script for calculating mean heterozygous allele balance from a VCF4 file (basically GATK output with the "AB" INFO field). Very simplistic, but fast and easy to use for people wondering what the overall reference bias is in their variant calls.
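The Gist itself has the real script; as a minimal sketch of the same idea (assuming GATK-style records carrying an "AB=" tag in the INFO column, and a hypothetical calls.vcf), it looks something like this:

```shell
# Average the AB (allele balance) INFO values across all records that
# carry one; GATK annotates AB on heterozygous sites. Filename assumed.
grep -v '^#' calls.vcf \
  | awk -F'\t' '{
      info = ";" $8                        # prepend ";" so ;AB= always works
      if (match(info, /;AB=[0-9.]+/)) {
        split(substr(info, RSTART, RLENGTH), a, "=")
        sum += a[2]; n++
      }
    } END { if (n) print sum / n }'
```

A mean well below 0.5 across heterozygous sites would suggest reference bias in the calls.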

Tuesday, June 21, 2011

RealTimeGenomics Goes Free, Provides Alternative for CG Users

I've mentioned RealTimeGenomics (RTG) in the past, and Joke Reumers mentioned their software recently in her talk at the Complete Genomics user group meeting. Today, RTG announced that their RTG Investigator 2.2 package is now free to individual researchers. I kind of knew this was going to happen after some chats with them about the pricing models they were considering and the hard sell it would be to academia.

The Hard Sell
I think that it's a hard sell to get academic researchers to pay for something they can get for free. I've been in a lab that preferred to make its own Taq polymerase rather than pay for a commercial enzyme even when that meant using it at a 1:1 ratio with the rest of the PCR reaction (and no, I didn't stay in that lab for too long, but the point stands).

In a world where BWA, TopHat, Samtools, SoapSNP, and GATK are free and fairly well documented, selling an aligner and variant caller is going to be difficult unless it does something particularly special. Plus, a major focus of RTG's strategy is providing an alternative to Complete Genomics' own analysis, which comes "free" when you buy a whole genome from them. Very hard sell.

So, again, unless your software does something particularly special (being more sensitive and/or specific, faster, significantly easier to use, or packed with bells and whistles like visualization tools and fancy reports), you're going to have trouble selling your product.

But how, as a company, do you prove that your software has something like this to offer? Traditionally, trial licenses have been the way, but that works for software without a strong free alternative: a company lets you try the software and see if you like it, then you buy it if you do. But most sequencing labs have their pipelines built already, and comparing two software packages isn't really worth the time unless the claims have been substantiated by other groups.

That's where this foot-in-the-door approach comes in. Basically, you give the software away to academia and let them do your leg work for you. If your software offers something special and academia can prove it, you'll start to be able to sell to the bigger corporate entities and sequencing cores. So the solution to the hard sell is to not sell at all! Brilliant!

So, how is the software?
I tested RTG's software on Illumina data because at the time that's all I was using. My findings were that it was easy to use (in a native, parallel environment), ran fast, mapped about 80% of reads (similar to Novoalign/BWA), and found a similar number of variants to GATK. Basically, it worked and seemed to work pretty well. I admit I have yet to go too in depth on comparing the findings. However, when I have some time, I intend to do a more comprehensive assessment of its performance.

I also ran it in a mode that combined the Complete Genomics and Illumina data I had from the same patient. I found this to be a pretty cool option that I enjoyed using.

Really, if you're dealing with Complete Genomics data, this is your only option (as far as I know, let me know if this isn't true) for an alternative alignment and variant caller to theirs. You could also align using the RTG mapper and then try variant calling with YFA (your favorite algorithm).

What's unique?

They have a cool program called "mapx" that does a translated nucleotide alignment against protein databases. You can then take that and use their "similarity" tool to basically create phylogenetic clusters based on your reads alone. Very cool for metagenomics. I'm planning to try this out with a whole genome sample I have derived from saliva.

Why does this matter?

Well, frankly, there's the chance they just may be on to something. They make a lot of claims about their sensitivity and specificity. They have some killer ROC curves. They have that cool metagenomics tool that I honestly haven't heard about from anywhere else.

And now there's no fear of losing access to it when a trial license expires.

I fully admit it: I wanted this to be free. Because I am one of those people who likes trying new programs and seeing if I can squeeze a bit more information out of my data set. I was just thinking today about how I should go back to our old U87MG dataset and call variants using GATK and the new SV pipeline we have.

Finally, I think it really has implications for users of Complete Genomics. Joke Reumers showed that CG variants also detected by RTG were highly accurate. That's key as an in silico validation step. Plus, it empowers us to analyze the data ourselves. I love CG, but I also want the ability to adjust my alignment and variant calling settings myself. I also want to be able to update my analyses to be compatible with each other without having to pay a couple thousand dollars more on top of my original investment.

I do wonder how it's going to pan out. I hope, of course, that it ends up helping them out. As I tell all my corporate friends: I want them to succeed, because their success is my success.

At the very least, the software is now out in the wild. It's now on the users to figure out if it's worth using. I'll be doing my part over the coming months and I promise to share!

Saturday, June 18, 2011

Complete Genomics User Conference 2011 Day 2 Recap

Yesterday was a "half-day" at the CG User Conference and my netbook was inconveniently out of juice, so I didn't get to live blog. However, I took plenty of juicy notes at the interesting and useful talks of the day. The rundown included:

  1. Dr. Sek Won Kong, Children's Hospital Boston, about a downstream analysis and annotation pipeline for WGS/exome-seq called gKnome.
  2. Dr. Jay Kasberger of the Gallo Research Center at UCSF about their tools and scripts for analyzing CG data that they will soon make available.
  3. Dr. Stephan Sanders from Matthew State's lab at Yale talking about identification of de novo variants in CG/WGS data.
  4. Dr. Andrew Stubbs, Erasmus Medical Centre, about HuVariome, a database of variations common to the human genome for use as a filtering/informative resource. 
  5. A quite interesting panel discussion that went into issues related to making sequence data and variants public and how to filter data.
I'll summarize each of the talks and some of my thoughts below.

Dr. Sek Won Kong. gKnome: An analysis and annotation pipeline for whole-genome/exome sequencing. 

Dr. Kong presented a sequence analysis and annotation pipeline Michael Hsing and others in his group have developed for variant analysis. Although it certainly takes CG data, it looks like it'll take Illumina data as well.

It looked very nice. I believe that behind the scenes it's a MySQL- and Python-based annotation pipeline. It utilizes the CG diversity panel (69 public genomes released by CG) to filter out systematic CG variants. Actually, this was a major theme at the conference that I'll go into more in the panel discussion section.

In its first version, the gKnome pipeline will be able to annotate from RefSeq, CCDS, Ensembl and UCSC known genes.

The other cool part is the web front-end for the whole thing. Their system auto-generates a number of reports, including box plots of the number of rare/nonsynonymous variants per genome, which they demonstrated is closely tied to ethnicity. It also has built-in pathway analysis and disease risk summaries (including utilizing HGMD if one has a subscription). Finally, they showed a nice auto-generated R-based plot of CNV results.

There was also a quick slide of hypervariable genes that sparked a lot of conversation. Basically, everyone agreed there's a set of specific genes and gene families that always end up with variants in them. Dr. Kong showed the list to include HYDIN, PDE4DIP, MUC6, AHNAK2, HRNR, PRIM2, and ZNF806. I've seen most of these pop up in my many exome-seq experiments as well. I've even had PDE4DIP, AHNAK2, PRIM2 and many of the ZNF and MUC genes look like disease-causing variants before.

So where can you get it? Well, it's not available yet, but should be completed by September. In the meantime, you can check out what they have going for them at:

Let me just throw out there that this looks superior to the alternative currently used by most people, which is Annovar. Annovar is a great tool and getting better all the time, but with its rather clunky input and output formats, lack of any downstream stats or visuals, and some notable bugs in its conversion scripts, gKnome is looking pretty nice.

Dr. Jay Kasberger. Integrating tools and methods into analytical workflows for Complete Genomics data.

This one was pretty nice because it was kind of an overview of tools used with CG data and how a lab with a lot of CG data might implement them. Dr. Kasberger especially presented the way they assess the variants and CNVs provided by CG.

There was a tool for auto-comparison with Illumina Omni SNP chip data, which is worth a look. (Note that you have to deal with the Illumina SNP chip's quirks yourself, like the ambiguous calls at some percentage of spots that don't tell you the correct ref/alt alleles. Frankly, I'm not sure which ones those are myself; I usually just compare heterozygous calls between the platforms for validation.)

It also demonstrated a tool that goes from the CG masterVAR format to PLINK format for downstream IBD estimates.

Finally, they have a number of Circos-generating scripts for plotting variant density, coverage (using heatmap tracks), etc. And we all know I love Circos, so cool on them for that.

These tools are available from them at: (but you'll need a login by being a collaborator/member of their group, so you'll have to contact them for access).

This talk was interesting because it showed off tools that probably every genomics lab with CG or WGS data has developed for themselves. This stuff needs to be packaged up, published and shared openly, in my opinion. For example, I don't like that there's this gateway through their page for it all. Put it up on GitHub or SourceForge or something, guys!

Dr. Stephan Sanders. Identifying de novo variants in Complete Genomics data.

Dr. Sanders focused on assessing de novo variants. By this he does not mean what we typically call novel variants. Rather, he's talking about variants not inherited from the parents (which, coincidentally, are not likely to be known variants either). He claimed that (based on the Roach et al. paper) de novo variants are extremely rare, somewhere around a 1x10^-8 chance per base, or about 0.5 disruptive events per exome.

That equates to fewer than 100 de novo variants per genome. But when you actually assess the number from real data with standard filters, you end up with around 20,000 candidates. That's a problem.
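As a quick sanity check on those numbers (my arithmetic, not his): a 1x10^-8 per-base rate over a roughly 6 Gb diploid genome lands in the tens of events, comfortably under 100.

```shell
# ~1e-8 de novo mutations per base over ~6e9 diploid bases per genome.
awk 'BEGIN { printf "%.0f\n", 1e-8 * 6e9 }'
```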

He demonstrated that the rare de novo candidate variants (the 20,000) have a much lower distribution of quality scores than true variants.

He then went into a fairly extensive discussion of how to estimate the specificity needed to narrow those candidates down to the correct number, which was great. BUT, the bottom line is really that you just have to move up the quality cut-off. At a high enough level, the de novo candidate total drops off and the very few leftovers are highly enriched for true positives. He showed that he narrowed down to ~70 candidates and that they validated a little more than thirty of them. This matched well with the expected number he calculated earlier.

Cool talk, but the take-home is that all you have to do is sacrifice sensitivity for specificity and you'll throw away the vast majority of false-positive de novo variants. So for the ~20k candidates, apply a more stringent filter and voila!

So, pretty cool idea for what to do with calls that violate Mendelian inheritance. Just apply a stringent base quality and read depth filter to them and highly enrich for true variants. Kind of an obvious conclusion, but not something many have been doing, I'd bet.
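A minimal sketch of that kind of stringent filter over a VCF of candidates might look like the following. The thresholds (QUAL >= 100, DP >= 20) and filenames are illustrative assumptions on my part, not values from the talk:

```shell
# Keep only de novo candidates passing stringent quality and read depth
# cut-offs. QUAL >= 100 and DP >= 20 are illustrative thresholds only.
awk -F'\t' '/^#/ { print; next }
  {
    info = ";" $8; dp = 0
    if (match(info, /;DP=[0-9]+/)) {
      split(substr(info, RSTART, RLENGTH), a, "=")
      dp = a[2] + 0
    }
    if ($6 >= 100 && dp >= 20) print
  }' denovo_candidates.vcf > denovo_stringent.vcf
```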

Andrew Stubbs. HuVariome: A high quality human variation study and resource for rare variant detection and validation.

This one is going to be pretty brief, though it was a good talk especially for the conversation points it brought up.

HuVariome is a database they're putting together of well-annotated known variants. The goal is to use it as an alternative to dbSNP for filtering (which honestly shouldn't be used for filtering in the first place). It will be available at: Definitely a possible alternative to using dbSNP, especially if it's as well annotated as they suggest it will be.

Panel Discussion
Worth mentioning because the same themes kept coming up. Specifically, there was quite a bit about everyone developing their own list of false positive variants/genes and how to filter SNPs adequately. Generally, it's agreed that (1) there is a subset of variants and genes that show up in nearly every sequencing experiment and therefore are more likely false positives than anything else (they don't tend to validate, either) and (2) that dbSNP is too inclusive of disease samples and lacking adequate phenomic info to be used as a blind filter. I've personally always told people to use dbSNP as a guide. Never dismiss a variant just because it's in dbSNP. You can look at the non-dbSNP variants first if you want, since those might be the "jackpot" spots, but if you find nothing there, try looking at those in dbSNP.

That then leads to the need for things like HuVariome and lists of "bad" genes/variants (like the hypervariable genes mentioned by Dr. Kong). But the problem then becomes how to share variants publicly that are present in protected samples, even if they're artifacts, because consent wasn't given to make any variants publicly available.

Personally, I see that as an issue that will solve itself, but of course, we want it sooner rather than later. Solutions such as projects specifically intended to produce these lists are possible, though (and some in the audience said they were doing just that).
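Putting "dbSNP as a guide" into practice is easy if your caller writes rs IDs into the VCF ID column (a "." meaning novel). A hedged sketch, assuming that convention and a hypothetical calls.vcf:

```shell
# Split calls into novel and dbSNP-annotated sets based on the ID column
# ("." = no rs ID). Assumes the caller fills column 3 for known sites.
awk -F'\t' '!/^#/ {
    print > ($3 == "." ? "novel.vcf" : "in_dbsnp.vcf")
  }' calls.vcf
```

Look through novel.vcf first for the "jackpot" spots, then come back to in_dbsnp.vcf rather than discarding it.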

That's a Wrap
Anyway, that's a wrap on this conference. I hope it was informative for everyone. Also, props to Complete Genomics for putting on a pretty decent corporate conference. I didn't think it was overly biased, I found it useful and interesting, and the venue and food were quite good.

Thursday, June 16, 2011

Complete Genomics Community

Apparently they just released this publicly:

Pretty cool, actually. It has a forum, a wiki-ish knowledgebase (or at least, hopefully it'll be like a wiki in the future) and a tools section.

Currently the tools section has CGAtools and some other scripts written by Steve Lincoln's team at CG. Seems like in the future it will include community-derived data as well.

Also, I'm just gonna throw this out there: It seems like the community is really not satisfied with annotation tools. And by that I do mean Annovar and the Ensembl annotation tools. So I think that's really an open area for development for an ambitious bioinformaticist or two out there.

Cancer Grant Program
Will be officially announced the week of June 20th. Basically, it'll be an abstract plus some simple questions. There will be two winners, in the US and Europe. Applications will be due July 29th, winners will be decided by August 12th, and samples will need to be submitted by September 16th.
Bonus: All applicants will get a future discount regardless.

Pretty cool... time to find some cancer samples to sequence.

Live Polling
Kinda cool (and totally unrelated to genomics/CG... oh well). Never seen this before. Basically, you can host a live poll during a PowerPoint presentation. People can text message in a response to the poll and it'll update on the fly within the presentation.

Simon Lin Talk

Sequencing Outsourcing: Northwestern Experience

An overview of the needs and costs of running your own sequencing rather than simply outsourcing.

In-house: Numerous costs, need of additional staff, need of storage and analysis computers, et cetera.

Outsource: lower cost, access to bioinformatics, etc.

Need to consider cost, quality, and the impact on "internal culture/image/students".

Lose at least some of the control by outsourcing. (However, re-analysis of the data itself is obviously possible.) But tweaks and changes can't really be made with outsourced analyses.

They have a 454 (four-fifty-four... or is it four-five-four? I've always called it four-five-four.)

Zachary Hunter Talk

Paired Whole Genome Sequencing Studies in Waldenstrom's Macroglobulinemia

This type of lymphoma is pretty rare and also not much is known about it.

Approach: 10 paired genomes plus 20 unpaired

For normal they used CD19-depleted PBMCs and also took buccal cells as a backup.

They actually did do exome sequencing as well. 

They found a large number of strong candidates (>7?)... one was present in 100% of the 10 paired samples and in 87% (26/30) of all 30 patients. Strangely, it was the same exact SNP in all individuals, but it's a very good functional candidate.

Interestingly, with Sanger sequencing they saw a very small peak of the same variant in the trace from an individual with a very weak case (one of the four that didn't have it). Perhaps all patients with it have this variant?

Detailed Circos plots. Mapped out zygosity, CN, allele balance, CGI coverage levels and then testing results. At this detailed level, there were many notable CN/zygosity regions and UPD regions. (Everyone LOVES Circos.)

They made their own wrapper for Annovar because the input/output for Annovar is something they "don't like". (Hey, guess what? You're not alone! I made just such a wrapper myself!)

Nice talk. I loved the Circos. Again, I have to ask whether a lot of these findings couldn't have been done solely through exome-seq, though...

View from the CG user conference

Fantastic view and great San Francisco weather today!

Joke Reumers Talk

Two major experiments covered:
tumor/normal ovarian cancer
monozygotic twins (schizophrenia)

Even at the low error rate of CG data (1/10000), that's ~30k errors per genome, which is too many for a twin comparison.
Detected 2.7M shared variants, but 46k discordant variants between twins

Individual filters for quality/genomic complexity/bioinformatic errors used...
Quality: low read depth, low variation score, snp clusters, indel proximity to snp (5bp from snp)
Complexity: simple repeats, segdups, homopolymer stretches
Bioinformatic errors: Collaboration with RealTimeGenomics, re-analysis

1.7M shared variants, 846 discordant variants

So the filtering basically swung the errors from type I (false positives) to type II (false negatives).

All 846 discordances here were assessed by Sanger sequencing.

Also 2 of the shared variants were found to actually be discordant.

Reduced error rate down to 4.3x10^-7 (from 1.79x10^-4).

Of the 846, 541 were false positives.

NA19240, 1000 genomes Illumina sequencing versus CG

Before filtering, CG had more false negatives, Illumina had more false positives.

After filtering, they were both down to about 1% error rates.

As for tumor/normal, adding the filtering made quite a difference... from 437 down to 21. But of course, this kills some true positives.

Summary: Very good talk; I liked this one. I was pleasantly surprised to see RealTimeGenomics get a shout-out as one of their filter approaches. I've used their software myself; it's very good, and I hope to collaborate with them again, especially after seeing it help other groups with their filters. Also, I think there's a lot to be said about the different error rates with CG versus BWA/GATK et cetera. I'm leaning toward combined approaches... for example, why not do exome-seq on Illumina as a validation of CG and to adjust error rates?

Bonus: Hilarious pic of a desk covered in hard drives. In our lab's case, I think they're stuffed under people's desks. Someone needs to do something about the drive overload.

Complete Genomics User Conference 2011

Today I'm attending the Complete Genomics User Conference in San Francisco at the Fairmont. Nice venue, particularly for the guests that had to come from far away (actually, being from Palo Alto, I would have preferred if they had it down in Mountain View).

First talk I saw was from Dr. Kevin Jacobs of the National Cancer Institute. Thought it was very nice, and I'll provide a run-down in a few minutes.

Wednesday, May 4, 2011

The Earwax Trait Story

My first post related to 23andMe is going to be about something near and dear to my heart: the earwax type SNP rs17822931! Before I start, let me say that I do have wet earwax. Yes, I'm carrying the CC genotype at rs17822931. Not a big shocker considering I'm of 100% European descent.

You may wonder why this particular variant and trait is important at all to me. The real reason is because when I taught human genetics at UCLA, we used the original Nature article as an example of a "modern" genetic study. The students read it and (hopefully) found it interesting to learn that a single SNP (rs17822931) was dictating whether they had wet or dry earwax.

In fact, most of the students were basically unaware that there were two types of earwax! At UCLA, we had students of all different ethnic backgrounds, and it turned out about half of my class had dry and half had wet earwax. That should give you a hint at the ethnic mix there.

Anyway, the point of this post is to explain why, before jumping the gun and shouting that you have dry earwax when you have a CC at rs17822931, you should consider whether or not you're accurately self-diagnosing it, and then why you might have dry earwax when your genotype says you should have wet.

I remember clearly that one student of European descent claimed to have "dry" earwax. She may have (it's not impossible), but another student wisely pointed out something in the paper that may explain why she thought she had "dry" earwax.

In the original study that identified this variant as the one causing the Mendelian earwax type trait (Yoshiura et al., 2006), they explain that they actually had to use two different groups of patients to identify the variant.

The first group consisted of 64 "dry" and 54 "wet" control individuals. This group was "self-declared", meaning they basically checked off a box stating whether they had wet or dry earwax. This first pass group resulted in inconclusive results. Basically, they narrowed down the region with this group, but found some "phenotype-genotype inconsistency" in some of the samples and could not therefore narrow down exactly which was the causative SNP.

So they followed up with an association study on a second group of 126 individuals (88 dry and 38 wet) whose earwax types were identified by a medical practitioner.  In this set, 87/88 individuals with dry earwax were AA. All 38 with wet earwax were GA or GG.

That one GA individual with dry earwax turned out to have a deletion in exon 29 of the ABCC11 gene (downstream of rs17822931 and his G allele).

So this really teaches us two things:
1) Self-diagnosis is not accurate. You may have wet earwax and not know it! Just because it's flaky and seems dry to you does not mean it is dry earwax. And vice-versa.
2) Other mutations do exist! In this case, particularly if you are CT but truly have dry earwax, it's possible you're carrying around a secondary mutation not unlike the unique case in the original paper. It's exceedingly unlikely a CC individual would have such a secondary mutation damaging both alleles without some sort of inbreeding, though, so... keep that in mind if you're going to claim #2 with a CC genotype.

Much of this could be said for nearly any trait determined by SNPs, but the earwax trait makes a convenient real-world example of a simple trait where ascertainment problems (basically, the inability for the layman to know what his true trait is) and the rare secondary mutation can cause perceived discrepancies.

More in-depth stuff on my own 23andMe experiences as I go through them. So far I'm having a blast.

Friday, April 15, 2011

Happy DNA Day! Celebrate with 23andMe unboxing!

I'd like to start by wishing all of you a happy DNA Day! That's right, it's already April 15th again, which can mean only one thing: DNA Day is back! (...because it's not tax day in 2011, so you can celebrate DNA Day and get back to filling out your taxes tomorrow.)

Anyway, I got myself a gift for DNA Day, which arrived yesterday (and which you might be aware of if you read my last post). That's right, my 23andMe package arrived! So I thought I'd share the unboxing in case you were curious what you get in the mail when you order 23andMe.

It comes in a box a little bigger than a CD jewel case that looks like something from frog design. Very nice. The box is actually plastic wrapped with a sticker having your name and serial number on it (removed prior to taking these pics).

Opening it up, there's a set of instructions that look like something from Ikea explaining how to use the kit and send it back. (That's not a bad thing--they clearly put effort into making it dummy-proof.)

Savvy observers may have noted that the kit is a simple Oragene saliva kit. Spit in the tube. Close the lid. Remove the top. Screw the cap on and shake. (Again, dummy-proof instructions inside the plastic box containing the vial.)

Very straightforward. And I think the design is very well done. The box is the same one you send back to them. Just put the vial in the biohazard bag, put it in the box, re-seal it and throw it in the mail (it's postmarked already). The Oragene box does contain an instruction booklet with details in it for those interested.

You probably noticed in the second picture that they want you to go to their site to register. I did indeed do that. Pretty standard set of consent documents to go through.

I did find it interesting at the bottom of their consent form, they have three options about consent. One is to consent for yourself, one is to consent for another adult, and the third one is to convey your child's consent and authorize it as the parent/guardian. There goes my plan to genotype my children before they're old enough to deny me! (Kidding, kidding.)

You also get to choose whether to allow them to Biobank the sample. I said yes, of course. Then you provide a few other general details (DOB, sex).

Took no more than five minutes! The caveat is that you can't eat or drink for a half hour before spitting in the collection vial. So I will be doing that in, oh, about a half hour.

I'll update more on my 23andMe experience as it happens. Coming up next: Surveys and more surveys. They seem to have a ton of phenotyping through surveys, which I think is fantastic.

Monday, April 11, 2011

23andMe: FREE* genotyping, today only!

23andMe is offering FREE* genotyping today until 11:59PM PST to celebrate DNA Day early (apparently).

The asterisk is because it's not actually free. It's $108 plus shipping/handling costs because you have to pay for a 12-month subscription to their services.

Still, for genotyping at about 1,000,000 SNPs, this is a darn good deal.

I shot Dr. Wu (who posted the article on the 23andMe blog announcing the deal) a note asking about whether we get raw data back:
I’m a little curious about the subscription.
We get the raw data to keep and the subscription is to be able to view the data using your tools? So even after the year subscription is up, we’ll have the data for ourselves, right?
Her response:
Hi M.J. Clark,
Yes, you will always be able to download the raw data and retain access to content you had while you were subscribed. Some features, like the ability to Browse your raw data using our website, receipt of updates to your health reports and Relative Finder matches, and storage of your saliva sample (if you choose to biobank) may, however, be discontinued.
Sounds to me like we get whatever they qualify as "raw data" permanently regardless of the subscription, which is fantastic. Plus, gives those of us in the field something fun to play with.

I'm not actually sure yet what we receive back in terms of raw data (or if they actually give you back raw data). It's currently their version 3 platform, which is a modified Illumina OmniExpress Plus Genotyping BeadChip. Seems their v2 "raw data" took the form of simply genotype calls, positions and rsids. Not sure if that's all you get with v3 (but I'll ask and update later).
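If the v2 "raw data" really is just genotype calls, positions, and rsids in columns (the exact four-column layout below is my assumption, not confirmed), then pulling any single call out of the dump would be a one-liner:

```shell
# Look up one SNP in a 23andMe-style raw data dump. The four-column
# layout (rsid, chromosome, position, genotype) is an assumption.
awk '$1 == "rs17822931" { print $4 }' genome_raw_data.txt
```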

I'm interested for myself, of course, but if your lab has five or fewer human samples (per lab member) you've been meaning to genotype at 1,000,000 SNPs and you don't care about which specific platform it is, this is probably the best deal you'll get for a while. Food for thought! (Then again, why are you still genotyping? Go exome-seq those things!)

Update (18:40): Answer regarding the nature of raw data.

Dr Clark,
The raw data for v3 is formatted exactly the same as for v2, and is also against build 36 of the human reference assembly. That information is in the header of the raw data file; not sure if it is documented elsewhere except where people have made their raw data files public.
So it's NCBIv36 at least. I'm hoping they'll be willing to provide raw data in whatever basic format for those of us with interest later. Guess we'll find out!

Wednesday, March 9, 2011

Rising gas prices and their effect

This post is in response to an article I read on CNN. Now, let's be frank, CNN is in the business of "gotcha news", posting articles that are often inaccurate or misrepresented with the intent to draw views and reap chatter. Typically, I wouldn't let it bother me, but this particular article heavily misrepresents things to make a point. That point is: "Stop whining, California."

Here's the article. It presents some data in an interesting fashion. It shows states shaded by the average cost for a gallon of gas and then it shows states shaded by the amount spent on gas as a percent of income. The conclusion? Despite having dramatically higher gas prices, Californians should "quit whining" because they on average spend a smaller percent of their income on gas.

[Maps from the article: gas expenditure as a percent of income, and average gas price per gallon]
The article accounts for this phenomenon in a series of theories about why it occurs (rather than, you know, going ahead and proving any of them). Example:

"When you live in California or New York, where things are close by and there's public transit, that's great -- but we don't have those options here," said Hilary Hamblin, a Tupelo, Miss. resident who said her family typically spends about 10% of their monthly income on gas alone.

That's not uncommon in the state. Mississippi families earn a median household income of about $37,000 -- the lowest in the country -- but spend a whopping $402 per month, or 13.2% of their income, on gas.

In contrast, Californians earn a median of $59,000 per household, and spend about $380, or 7.8% of their income, on gas each month.
What we see here is a way of manipulating the facts to put together a dramatic article that will piss off Californians and New Yorkers and rally red-staters. What we don't see is any semblance of meaning.

Here's a little tidbit the article doesn't tell you: California is a major agricultural center, with significantly more farms than Mississippi (or, it appears, any other state). Check out this fun little tool from the US Government's Economic Research Service.

How does having the highest gas prices in America impact those rural workers? What percent of their income is spent on gas? Lumping all of California's 37M people together and calculating mean gas expenditure as a percent of income, then comparing to Mississippi's 3M people is like comparing apples to oranges.

Personally, my gas expenditure is very low. I usually ride a bike to work. I don't drive far or often. But I guarantee you farmers down in Kern County or in other rural counties of California are driving quite a bit and paying similarly inflated gas prices to what I pay. So people like me in California bring that "gas expenditure as a percent of income" mean value down quite a bit. That doesn't mean the cost to the farmers here in Cali is any lower, though!

I'm not saying Mississippi doesn't have it bad now. Quite the contrary. What I'm saying is that this article is misrepresenting facts to make a false point. Farmers and farm communities in California likely have just as much if not more reason to complain than those elsewhere because they're penalized by being in a state with huge urban areas that cause gas prices to be higher (due to increased taxes, more expensive gas blends to deal with urban air pollution, et cetera).

I grabbed the data from the ERS and played around a little bit with it. Here are some interesting stats:

Total number of farms:
California: 81,033
Mississippi: 41,959

Mean percent of land per county dedicated to agriculture:
California: 33.28%
Mississippi: 38.36%

Mean percent of farmland per county (for counties with >50% farmland)
California: 70.23%
Mississippi: 67.67%

Total Population Size
California: 36,961,664
Mississippi: 2,951,996

Population of rural counties (counties with >50% farmland)
California: 4,933,655 (13.35%)
Mississippi: 482,751 (16.35%)

Let these numbers settle in. Similar ratios of the population live in rural areas in both states. Rural counties appear very similar broken down by percentages here. But California has more than twelve times as many people as Mississippi. In rural counties, California has about ten times as many people. However, there are only about twice as many farms in California as there are in Mississippi.
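The rollup behind those numbers is simple: flag counties where more than half the land is farmland, then report their combined population as a share of the state. A minimal sketch of that step, using hypothetical county names and figures rather than the real ERS data:

```python
# Sketch of the rural-county rollup: counties with >50% farmland count as
# "rural"; report their share of state population.
# County names and numbers below are hypothetical, not the real ERS data.
counties = [
    # (name, percent farmland, population)
    ("CountyA", 70.2, 2_000_000),
    ("CountyB", 55.0, 500_000),
    ("CountyC", 30.0, 10_000_000),
]

rural = [(name, pop) for name, farm_pct, pop in counties if farm_pct > 50]
rural_pop = sum(pop for _, pop in rural)
total_pop = sum(pop for _, _, pop in counties)
share = 100 * rural_pop / total_pop
print(rural_pop, round(share, 2))  # 2500000 20.0
```

Run against the real ERS county tables for each state, this is all it takes to get the rural-population percentages quoted above.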

But what does it all mean? It means there are ten times as many people in California paying significantly higher gas prices while doing the same job as those in Mississippi. Californians may on average make more than people in Mississippi, but does that mean the farmers in California make more than the farmers in Mississippi? This is a hard number to determine. We can use what the ERS calls the "average value of agricultural products sold". In this case, I'm taking those "rural counties" I talked about earlier with >50% farmland and then just calculating the straight mean of these values.

Mean for rural counties of average value of agricultural products sold
California: $535,808
Mississippi: $291,167

In this light, Mississippi farmers appear to make much less on average than California farmers. But if we look on a county-by-county basis, we note that there are a few counties skewing California's rosy outlook. Let's look at that distribution to see if there aren't a lot of California farmers in more dire straits than simply taking a mean would show.

Mississippi is in blue, California is in red (didn't expect that, did you?). We can see pretty clearly here that there are two counties skewing the California number. These are Kings County and Monterey County. If we remove those, the mean value of farming in California's rural counties drops to $373,475. Not exactly making up for the fact that ten times as many people live in those counties.
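The trimming step above is just a mean, a sort, and a re-mean. Here's a sketch with hypothetical per-county values (in thousands of dollars; the real ERS figures aren't reproduced here) showing how dropping the two highest counties pulls the average down:

```python
# Sketch of the outlier-trimming step: compute the mean of per-county
# agricultural product values, then drop the two highest-value counties
# and re-average. Values are hypothetical (thousands of dollars).
from statistics import mean

value_by_county = {
    "HighCounty1": 1500, "HighCounty2": 1400,  # hypothetical outliers
    "CountyX": 400, "CountyY": 350, "CountyZ": 300,
}

overall = mean(value_by_county.values())
trimmed = sorted(value_by_county.values())[:-2]  # drop the top two
trimmed_mean = mean(trimmed)
print(overall, trimmed_mean)  # 790 350
```

Even in this toy case, two outliers more than double the mean, which is exactly the distortion the histogram reveals for California.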

Now, I don't want to make generalizations. The only point I truly want to make is that an article telling Californians to "stop whining" because farmers in Mississippi have it bad is ignoring the fact that there are many farmers in California in the same situation as those in Mississippi. I don't want to say they have it worse, either, but my guess (emphasis on guess, as I'm not proving it) is that these folks spend a similar percent of their income on gas as well.

Basically, I'm going to go out on a limb and say that given that the makeup of farming counties in California and Mississippi is very similar (despite a difference in scale, given California's much larger size), the California farmer has just as much a right to be distraught about rising gas prices as the Mississippi farmer. In fact, the California farmer has to pay significantly more per gallon, so even if they do make ten thousand dollars a year more (an estimate based on mean income in Kings County, California, which is down around $44k per year at this point), they probably spend a similar percent of income.

All of that said, you won't find me complaining about gas prices. But I may end up complaining about produce prices, which are obviously directly tied to gas prices on the farmer's end.