Thursday, July 28, 2011

Intersecting Indels with VCFtools

Indel detection is not what I'd call accurate at this point in our history. I, along with probably every other bioinformatician and genomicist looking at next-gen data, have noticed that immediately adjacent indels get called as separate events all the time, even though they are really the same variant, placed at different positions because of sequence context and the nature of our variant callers.

A band-aid approach is to simply look for overlap in indel calls within a window. Even a tiny window can make a big difference to small indels.
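To make the windowing idea concrete, here is a toy Python sketch. It is not how any particular tool implements it; the function name `count_matches` and the positions are made up for illustration, and it assumes all positions sit on a single chromosome and ignores the alleles entirely.

```python
from bisect import bisect_left


def count_matches(positions_a, positions_b, window=0):
    """Count positions in positions_a with a partner in positions_b
    that lies within +/- window bases (single chromosome assumed)."""
    sorted_b = sorted(positions_b)
    matched = 0
    for pos in positions_a:
        # Index of the first position in b that is >= pos - window
        i = bisect_left(sorted_b, pos - window)
        # It is a match if that position is also <= pos + window
        if i < len(sorted_b) and sorted_b[i] <= pos + window:
            matched += 1
    return matched


# Hypothetical indel positions from two call sets (illustrative only)
lib1 = [1000, 2500, 7310, 9000]
lib2 = [1000, 2503, 7318, 9001]

print(count_matches(lib1, lib2, window=0))  # exact position matches only
print(count_matches(lib1, lib2, window=5))  # matches within +/- 5 bases
```

With these made-up positions, only one call matches exactly, while a 5-base window also picks up the calls that are 3 and 1 bases apart. The pair that is 8 bases apart still counts as two separate events.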

To do this, I currently use VCFtools, which makes it very simple. Specifically, use the vcf-isec command with the -w parameter, which sets how far apart two records can be and still be treated as matching.

If I compare two libraries sequenced from the same individual that had indels called independently (using the same method), I end up with a few thousand overlapping indels that would have been treated as independent of one another if I had only looked for exact overlap.

Exact overlap:

vcf-isec -f -n =2 -o indels1.vcf.gz indels2.vcf.gz | wc -l
468136

Overlap +/- 5 bp:

vcf-isec -f -n =2 -o -w 5 indels1.vcf.gz indels2.vcf.gz | wc -l
471047

I realize it's not that astounding a difference (2,911 more lines, or roughly 0.6%), but keep in mind this is looking at two different libraries from the same individual. If you're comparing calls from two completely different sequencing platforms or variant callers, these numbers jump quite a bit.