Thursday, July 28, 2011

Intersecting Indels with VCFtools

Indel detection in is not what I'd call accurate at this point in our history. I, along with probably every other bioinformatician and genomicist looking at next-gen data, have noticed that immediately adjacent indels called as separate events but which are really the same variant called differently due to sequence context and the nature of our variant callers get called all the time.

A band-aid approach is to simply look for overlap in indel calls within a window. Even a tiny window can make a big difference to small indels.

To do this, I currently use VCFtools, which makes it very simple. Specifically, use the vcf-isec command with the -w parameter.

If I compare two libraries sequenced from the same individual that had indels called independently (using the same method), I end up with a few thousand overlapping indels that would have been assessed as independent from one another if I looked for exact overlap.

Exact overlap:
vcf-isec -f -n =2 -o indels1.vcf.gz indels2.vcf.gz | wc -l
Overlap +/- 5b:
vcf-isec -f -n =2 -o -w 5 indels1.vcf.gz indels2.vcf.gz | wc -l
I realize it's not that astounding a difference, but keep in mind this is looking at two different libraries from the same individual. If you're comparing calls from two completely different sequencing platforms or variant callers, these numbers jump quite a bit.

Tuesday, July 26, 2011

The Ion Torrent Paper (Nature)

An integrated semiconductor device enabling non-optical genome sequencing

I'm just going to discuss my thoughts and comments on the paper, their findings, and how they relate to claimed specs for the IonTorrent PGM.

And just for some comparisons later on, the current Ion Torrent PGM product sheet is also attached [click].

I'm going to try to step around some of the issues with the paper that have been well covered at Daniel MacArthur's blog Genetic Future. I think he is pretty fair to the paper in criticizing its "validation rate" and so forth. Sufficed to say, a year and a half to two years ago, perhaps a 15x human genome would have been considered adequate, but in a paper coming out of LifeTech, manufacturer of the SOLiD sequencer, if you're going to use the whole genome sequence off the SOLiD as validation, let's go for at least 30x coverage.


"..there is a desire to continue to drop the cost of sequencing at an exponential rate consistent with the semiconductor industry's Moore's Law..."
They bring up Moore's law repeatedly, and they sequenced Moore himself in the paper. But wait a second... sequencing costs are dropping significantly faster than Moore's law! I suppose it's a minor complaint, but let's give sequencing the credit it's due--per-base cost of sequencing is dropping much faster than Moore's law!

Also, let me complain very briefly about the use of a few buzz terms:
"To overcome these limitations and further democratize the practice of sequencing, a paradigm shift based on non-optical sequencing on newly developed integrated circuits was pursued."
If there is any term that is going to supplant "paradigm shift" as the default excessively pretentious term in scientific papers, it has to be "democratize". Look, unless this device is offering $100 genomes, it's not democratizing sequencing. Can we leave sensationalist buzz words for the advertisements and stick to reality for the Nature papers? (Wait, what am I saying?) I appreciate that we're talking about a non-light system here, but the observation of protons rather than photons released upon base incorporation isn't really a paradigm shift. Once we're taking pictures of DNA with electron microscopes and reading the entire genome instantaneously in one shot, then we can start talking about paradigm shifts.


Okay, let's look at the scalability based on their data, and what's being touted in their product sheet.

A typical 2-h run using an ion chip with 1.2M sensors generates approximately 25 million bases.

That pretty much throws out the Ion 314 (1.3M wells) for human genome sequencing. A 30x diploid human genome would require 640 days on the 1.2M sensor chip in the paper. Even just 1x coverage would take three weeks. Yikes.

Later there is a rather astounding statement:
"At present, 20-40% of the sensors in a given run yield mappable reads."
Room for improvement there, methinks.

In Table 1, they test the 11M chip with E. coli and Human. The E. coli yields 273.9Mb of sequence off that chip. At about 20-40% of sensors yielding mappable reads, that gives an average read length of 62b-125b. This is consistent with their finding that 2.6M reads are >=21b and that 1.8M are >= 100b. Also with Figure S15, where it appears the majority of read lengths are around 110b-120b. So at least the read lengths are not disappointing.

Their Ion 318 is the 12 million wells chip. I think this is similar to their 11M chip in the paper. Going back to Table 1, they got 273.9Mb off the 11M chip. At issue is the promised "[starting] 1 Gb of high-quality sequence" off the Ion 318 chip in the Ion Torrent product sheet. Now, I completely believe that advancements have been made since the paper's acceptance on May 26th, 2011, but four times the yield? Not so sure about that claim. I'm not doubting it can get there, but I'll put it this way: This is a paper from the company that makes the product--if anyone can make it work optimally, it should be them. And their optimal report here has it at significantly lower than what they're advertising. Oh, and the specs sheet has small print next to the Ion 318 chip entry that says "the content provided herein [...] is subject to change without notice". Let's just say I'm skeptical.

Anyway, the rest of the paper is pretty vague. One issue on everyone's mind is the homopolymer issue, which is addressed in a single sentence stating the accuracy of 5-base homopolymers (97.328%--not terrible, but not overly good either) and that it's "better than pyrosequencing-based methods" (read: 454).  How about longer ones? No idea. Figure S16b only goes out to 5b also with a curve that isn't looking too encouraging, though.

Apparently the Ion 316, their 6.3M well chip is currently available, and they claim at least 100Mb of sequence per run. This is consistent with their mapped bases in the paper (169.6Mb off a 6.1M ion chip).  With this chip, you're talking about 3 days on one machine for 1x diploid human coverage, and about 90 days for 30x coverage. Better, but still not there when it comes to human sequencing.  Still, it's at the level of completing entire bacterial genomes in 2 hours. If you're into that sort of thing (and don't have access to a HiSeq or something else...).

Not Quite "Post-Light"

You know, there's a lot of "post-light" jibber jabber in the paper. It's Ion Torrent's favorite buzz phrase, and I'm a fan, actually (much more so than I am of "democratize" and "paradigm shift"). But at this point, with this performance, I'm not sure we're "post-light" yet. The technology is there, but it isn't scaled up enough yet. There's a claim made a the end of the paper that's interesting:
"The G. Moore genome sequence required on the order of a thousand individual ion chips comprising about one billion sensors. ...our work suggests that readily available CMOS nodes should enable the production of one-billion-sensor ion chips and low-cost routine human genome sequencing."
Doubtless this is long in the works already, and I hope it is a reality. Because making the leap from the things in this paper to a functional 1B sensor chip would make a huge difference.

I'd say I was a bit disappointed with this paper. It felt half done. I'm confused about the way the comparison to SOLiD was done--why wasn't the SOLiD WGS of G. Moore done to an adequate depth? I'm a bit annoyed at the lack of comprehensive information, as well. The homopolymer issue is known--why hide behind homopolymers of 5b or smaller? Just give the whole story in your paper--it's an article in Nature, not an advertisement.

Anyway, to quote a very smart man I know, "it is what it is." Ion Torrent is here to stay and it's only going to improve. I certainly hope it does--I'd love to see it pumping out 1Gb/2hrs with long read lengths.

Wednesday, July 20, 2011

Mendel Google Doodle

Google's current doodle is a very neat nod to the father of genetics, Gregor Mendel! They're celebrating his 189th birthday with a representation of his original experiment--crossing green and yellow pea plants and tracing the color trait.

Monday, July 4, 2011

Accurate genome-wide read depth calculation (how-to)

I'm currently working ferociously on revisions to a paper and need to calculate mean genome-wide read depth as a fine point in the paper. My first inclination was genomeCoverageBed from BEDtools. Trying it out on chromosome 22 first, I noted a huge number (>30%) of the bases had 0 coverage. Of course, this must be because genomeCoverageBed is including the massive centromere (chr22 is acrocentric--the entire p-arm is unmappable heterochromatin). I kind of already knew genomeCoverageBed wasn't meant for this purpose anyway, but I was hoping to stumble upon something.

I decided to Google "genomeCoverageBed centromere" and "genomeCoverageBed not include centromere" and came up with bunk (well, not quite bunk, I came across a blog post about genomeCoverageBed that happens to have been posted tomorrow... yes, tomorrow!).

As I explained in a comment there, I find genomeCoverageBed's approach lacking in meaning. Is it fair to include unmappable regions in a calculation of coverage? That's what led me to asking myself if "genome-wide coverage" really has a meaningful use as a statistic. We all know you're going to have a ton of bases with 0 coverage in the centromere because they're unmappable, but that says nothing about how your sequencing performed or aligner worked. All it says is that those bases are missing from the reference assembly. And that confounds any other meaning you might get from the number, really.

Not to say that genomicCoverageBed is useless. To the contrary, it does exactly what it's supposed to: generates a histogram of genome-wide coverage. But I do not think it's overly useful beyond that histogram. A lot of people like to see the "mean coverage" or "mean read depth" statistic when you talk about a project, and you'd certainly be selling yourself short if you're generating that number with genomicCoverageBed.

I think the author of BEDtools (Hi Aaron if you ever read this!) would be the first to say, "do not use genomeCoverageBed for calculating mean read depth". But BEDtools can, fortunately, give us a very strong way of doing just that through intersectBed and coverageBed.

My solution is quite simple. I take the "gaps" track from UCSC (Tables->All Tracks->Gap) and create a BED file of all regions that are NOT in gaps. You can easily generate this file by obtaining the gap track from UCSC as a BED file and then using BEDtools subtractBed to subtract those regions from a whole genome BED file.

$ subtractBed -a hg19.bed -b ~/resources/hg19_gaps.bed > hg19.gapless.bed

Then, take your gap-less whole genome, run intersectBed with your input BAM file and pipe it to coverageBed again using the gap-less whole genome bed file as your -b. Make sure to use the -hist option in coverageBed.

$ intersectBed -abam in.bam -b hg19.gapless.bed | coverageBed -abam stdin -b hg19.gapless.bed -hist > coverageHist.bed

If you want to save yourself some trouble, you can pipe that to grep and grep out "all"--that's your  histogram of genomic coverage across the gap-less genome. Actually, you can calculate mean read depth across this gapless genome in one line without generating any output with some awk magic:

$ intersectBed -abam in.bam -b hg19.gapless.bed | coverageBed -abam stdin -b hg19.gapless.bed -hist | grep all | awk '{NUM+=$2*$3; DEN+=$3} END {print NUM/DEN}'

Not everything outside the gaps is mappable, but at least those bases are present in the reference genome.

Here's an example from a real WGS experiment restricted to chr22:

genomeCoverageBed mean read depth: 0.68
gap-less genome coverageBed mean read depth: 30.637

And yes, calculated by other means, average read depth was right around 30x. For the sake of accuracy, I call it "mean read depth of the reference genome assemby". Doesn't roll off the tongue, but that's what it is, and it means a lot more than "genomic read depth" or "genomic coverage", in my opinion.

(All of that said, genomeCoverageBed will run a lot faster than the way above, but the info it reports is different and really serves a different purpose.)

Friday, July 1, 2011


I've been super busy writing a paper lately, so I apologize for the lack of updates. I do intend to comment on the recent 23andMe paper soon™.

A new feature on the blog is my Gist feed and a link to my Gist page. Gist is basically a quick-and-dirty code-sharing site from Github. Since I'm a quick-and-dirty bioinformatics programmer, I'll do my best to keep random code snippets that might have general application for others up there and publicly available. Also, please feel free to fork my code, fix it, et cetera.

My first Gist is an interesting one: A shell script for calculating mean heterozygous allele balance from a VCF4 file (basically GATK output with the "AB" INFO field). Very simplistic, but fast and easy to use for people wondering what the overall reference bias is in their variant calls.