Monday, July 4, 2011

Accurate genome-wide read depth calculation (how-to)

I'm currently working ferociously on revisions to a paper and need to calculate mean genome-wide read depth as a fine point in the paper. My first inclination was genomeCoverageBed from BEDtools. Trying it out on chromosome 22 first, I noted a huge number (>30%) of the bases had 0 coverage. Of course, this must be because genomeCoverageBed is including the massive centromere (chr22 is acrocentric--the entire p-arm is unmappable heterochromatin). I kind of already knew genomeCoverageBed wasn't meant for this purpose anyway, but I was hoping to stumble upon something.

I decided to Google "genomeCoverageBed centromere" and "genomeCoverageBed not include centromere" and came up with bunk (well, not quite bunk, I came across a blog post about genomeCoverageBed that happens to have been posted tomorrow... yes, tomorrow!).

As I explained in a comment there, I find genomeCoverageBed's approach lacking in meaning. Is it fair to include unmappable regions in a calculation of coverage? That's what led me to asking myself if "genome-wide coverage" really has a meaningful use as a statistic. We all know you're going to have a ton of bases with 0 coverage in the centromere because they're unmappable, but that says nothing about how your sequencing performed or aligner worked. All it says is that those bases are missing from the reference assembly. And that confounds any other meaning you might get from the number, really.

Not to say that genomicCoverageBed is useless. To the contrary, it does exactly what it's supposed to: generates a histogram of genome-wide coverage. But I do not think it's overly useful beyond that histogram. A lot of people like to see the "mean coverage" or "mean read depth" statistic when you talk about a project, and you'd certainly be selling yourself short if you're generating that number with genomicCoverageBed.

I think the author of BEDtools (Hi Aaron if you ever read this!) would be the first to say, "do not use genomeCoverageBed for calculating mean read depth". But BEDtools can, fortunately, give us a very strong way of doing just that through intersectBed and coverageBed.

My solution is quite simple. I take the "gaps" track from UCSC (Tables->All Tracks->Gap) and create a BED file of all regions that are NOT in gaps. You can easily generate this file by obtaining the gap track from UCSC as a BED file and then using BEDtools subtractBed to subtract those regions from a whole genome BED file.

$ subtractBed -a hg19.bed -b ~/resources/hg19_gaps.bed > hg19.gapless.bed

Then, take your gap-less whole genome, run intersectBed with your input BAM file and pipe it to coverageBed again using the gap-less whole genome bed file as your -b. Make sure to use the -hist option in coverageBed.

$ intersectBed -abam in.bam -b hg19.gapless.bed | coverageBed -abam stdin -b hg19.gapless.bed -hist > coverageHist.bed

If you want to save yourself some trouble, you can pipe that to grep and grep out "all"--that's your  histogram of genomic coverage across the gap-less genome. Actually, you can calculate mean read depth across this gapless genome in one line without generating any output with some awk magic:

$ intersectBed -abam in.bam -b hg19.gapless.bed | coverageBed -abam stdin -b hg19.gapless.bed -hist | grep all | awk '{NUM+=$2*$3; DEN+=$3} END {print NUM/DEN}'

Not everything outside the gaps is mappable, but at least those bases are present in the reference genome.

Here's an example from a real WGS experiment restricted to chr22:

genomeCoverageBed mean read depth: 0.68
gap-less genome coverageBed mean read depth: 30.637

And yes, calculated by other means, average read depth was right around 30x. For the sake of accuracy, I call it "mean read depth of the reference genome assemby". Doesn't roll off the tongue, but that's what it is, and it means a lot more than "genomic read depth" or "genomic coverage", in my opinion.

(All of that said, genomeCoverageBed will run a lot faster than the way above, but the info it reports is different and really serves a different purpose.)

1 comment:

  1. Today I noticed that genomeCoverageBed appears to have been fleshed out significantly and can probably now accomplish this quite well. I still recommend utilizing the gaps track to make a "gapless" reference before doing it. Then report the "zero" positions in your calculation. You can still utilize the awk command listed in the post.