I currently use Annovar to annotate VCF files. Annovar's output is not particularly intuitive, so I wrote a Perl script that generates a VCF-based report. I thought I would share the annotations I've been using and the ones I plan to add, and see if anyone has other annotation ideas. These could be useful for annotating our own genomes (and potentially for 23andMe to provide in the future).
The annotations I've been including are:
- Gene annotation (type of mutation--exonic, intronic, splicing, etc.)
- Gene name
- Mutational description (e.g. the specific amino acid change)
- NHLBI Exome Variant Server (EVS), hosted at the University of Washington
- Transcription Factor Binding Site (TFBS)
- SIFT score
- PolyPhen 2 score (PP2)
- GWAS presence
- Segmental duplication
(I also annotate against both dbSNP130 and dbSNP135. I include both because 135 contains quite a few SNPs that are potentially meaningful from a disease and trait standpoint, while 130 is mostly markers that do not directly affect diseases or traits. 130 is a subset of 135. The EVS is potentially more useful than either of them as a filter.)
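For anyone who wants to work with a report like this, here is a minimal Python sketch of reading annotations back out of a VCF data line. It relies only on the standard VCF layout (eight tab-separated fixed columns, with INFO as semicolon-separated key=value pairs or bare flags). The INFO key names (FUNC, GENE, SIFT, PP2, SEGDUP) and the example record are made up for illustration; they are not what Annovar or my script actually emits.

```python
def parse_info(info_field):
    """Turn a VCF INFO string into a dict; bare flags map to True."""
    annotations = {}
    if info_field in ("", "."):
        return annotations
    for entry in info_field.split(";"):
        if not entry:
            continue
        key, sep, value = entry.partition("=")
        annotations[key] = value if sep else True
    return annotations


def parse_vcf_line(line):
    """Split one VCF data line into its fixed columns plus parsed INFO."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, var_id, ref, alt, qual, filt, info = fields[:8]
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": var_id,
        "ref": ref,
        "alt": alt.split(","),  # ALT may list several alleles
        "filter": filt,
        "info": parse_info(info),
    }


# Hypothetical record; the INFO keys are illustrative assumptions.
example = "1\t12345\t.\tG\tA\t50\tPASS\tFUNC=exonic;GENE=GENE1;SIFT=0.01;PP2=0.98;SEGDUP"
record = parse_vcf_line(example)
print(record["info"])
```

Keeping the annotations as a plain dict per variant makes it easy to add new annotation sources later without changing the parsing code.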
Ones that I would like to include in the future:
- MIE sites/scores (Mendelian inheritance errors)
- 23andMe annotations (anything from 23andMe's SNP databases--can 23andMe help with that?)
Any other ideas for great annotations that should be included?
The idea behind these types of annotations is to give us a way to sift through the data and extract biologically meaningful results. For example, we are most interested in mutations that actually cause a protein coding change, that are uncommon in the population, and that are predicted to have a dramatic effect on function.
So far these types of annotations have allowed me to narrow very long lists of exome results (on the order of 30,000-50,000 mutations) down to a handful (1-20) of candidate mutations for particular Mendelian disorders.
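To make the filtering idea concrete, here is a rough Python sketch of the three criteria: a protein-coding change, low population frequency, and a predicted damaging effect. The field names, effect labels, and cutoffs (1% frequency, SIFT <= 0.05, PolyPhen 2 >= 0.85) are assumptions chosen for illustration, not the exact values from my script. Note that SIFT and PolyPhen 2 run in opposite directions: low SIFT scores and high PolyPhen 2 scores both suggest a damaging change.

```python
# Effect labels treated as protein-altering (illustrative set, an assumption).
CODING_CHANGES = {"nonsynonymous", "stopgain", "stoploss", "frameshift", "splicing"}


def is_candidate(variant, max_freq=0.01, sift_max=0.05, pp2_min=0.85):
    """Return True if a variant passes all three filters."""
    # 1. Must change the protein.
    if variant["effect"] not in CODING_CHANGES:
        return False
    # 2. Must be uncommon; a missing frequency is treated as "not seen".
    freq = variant.get("pop_freq")
    if freq is not None and freq > max_freq:
        return False
    # 3. Must be predicted damaging by at least one tool.
    sift = variant.get("sift")
    pp2 = variant.get("pp2")
    if sift is None and pp2 is None:
        # No prediction available (e.g. stop-gain); keep for manual review.
        return True
    damaging_sift = sift is not None and sift <= sift_max
    damaging_pp2 = pp2 is not None and pp2 >= pp2_min
    return damaging_sift or damaging_pp2


# Made-up variants, only to exercise each branch of the filter.
variants = [
    {"gene": "GENE_A", "effect": "nonsynonymous", "pop_freq": 0.001, "sift": 0.01, "pp2": 0.99},
    {"gene": "GENE_B", "effect": "synonymous", "pop_freq": 0.001, "sift": None, "pp2": None},
    {"gene": "GENE_C", "effect": "nonsynonymous", "pop_freq": 0.20, "sift": 0.00, "pp2": 1.0},
    {"gene": "GENE_D", "effect": "stopgain", "pop_freq": None, "sift": None, "pp2": None},
    {"gene": "GENE_E", "effect": "nonsynonymous", "pop_freq": 0.002, "sift": 0.60, "pp2": 0.10},
]

candidates = [v["gene"] for v in variants if is_candidate(v)]
print(candidates)
```

Variants with no prediction score are kept rather than dropped in this sketch, since truncating mutations often have no SIFT or PolyPhen 2 score but are still worth a look.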
Anything I missed?