VPOT:A Customizable Variant Prioritization Ordering Tool for Annotated Variants
2019-03-07EddieIpGavinChapmanDavidWinlawSallyDunwoodieEleniGiannoulatou
Eddie Ip ,Gavin Chapman ,David Winlaw ,Sally L.Dunwoodie,Eleni Giannoulatou *
1 Victor Chang Cardiac Research Institute,Sydney 2010,Australia
2 St Vincent’s Clinical School,University of New South Wales,Sydney 2052,Australia
3 Heart Centre for Children,The Children’s Hospital at Westmead,Sydney 2145,Australia
4 Sydney Medical School,University of Sydney,Sydney 2050,Australia
5 School of Biotechnology and Biomolecular Sciences,University of New South Wales,Sydney 2033,Australia
KEYWORDS Next-generation sequencing;Pathogenicity predictions;Variant prioritization;Customizable ranking;Genomic annotation
Abstract Next-generation sequencing(NGS)technologies generatethousandsto millions of genetic variants per sample.Identification of potential disease-causal variants is labor intensive as it relies on filtering using variousannotation metricsand consideration of multiple pathogenicity prediction scores.We have developed VPOT(variant prioritization ordering tool),a python-based command line tool that allows researchers to create a single fully customizable pathogenicity ranking score from any number of annotation values,each with a user-defined weighting.The use of VPOT can be informative when analyzing entire cohorts,as variants in a cohort can be prioritized.VPOT also provides additional functions to allow variant filtering based on a candidate gene list or by affected status in a family pedigree.VPOT outperforms similar tools in terms of eff icacy,f lexibility,scalability,and computational performance.VPOT is freely available for public use at GitHub(https://github.com/VCCRI/VPOT/).Documentation for installation along with a user tutorial,a default parameter f ile,and test data are provided.
Introduction
With the increasing use of next-generation sequencing(NGS)methods,researchers are now faced with many genetic variants,from hundreds of thousands to millions,to evaluate.Software such as ANNOVAR and VEP[1,2]use databases that provide functional consequences,pathogenicity predictions,and population frequencies to annotate genetic variants.
There are many pathogenicity-prediction algorithms available,such as CADD,PolyPhen-2,SIFT,and MutationTaster2[3—6],but thereisno single algorithm that has been universally accepted as the best.Genetic variants predicted to be deleterious by multiple methods are likely to be of greater interest in disease studies[7].In practice,multiple pathogenicity prediction scores are utilized to increase the likelihood of identifying a disease-causing variant.Thus,to determine if a variant is likely to be disease-causal,all prediction scores are often considered together in addition to variant filtering based on other annotation metrics(such as variant frequency in control databases).This makes the prioritization of genetic variants a labor-intensive and cumbersome task.
To facilitate this process,several variant prioritization tools have been developed.However,they are either web-based(such as Variant Ranker[8]),making the analysis of wholegenome data diff icult,or they do not provide an aggregated score across all annotation values(such as VaRank[9]).We have developed variant prioritization ordering tool(VPOT),a python-based command line program that creates a single aggregated pathogenicity ranking score from any number of annotation values via customizable weighting.Using this score,VPOT ranks variants,allowing researchers to prioritize variants based on all annotation data and pathogenicityprediction outcomes.
Methods
The VPOT workf low consists of two main steps:variant prioritization and post-processing of thevariant priority ordered list(Figure 1,Figures S1 and S2).
Prioritization of variants
Creation of the prioritization parameter f ile(PPF)
Using ANNOVAR-annotated VCFs or tab-separated-values f iles(TSV,which can be annotated by any software)as input the VPOT priority function creates a prioritization parameter f ile(PPF)based on all the annotation elements found.The PPF will determine if the annotation fields are characters or numeric.It will list the range of values found within that field to aid customization by the user.By modifying the PPF,the user can select which annotation fields to use in the prioritization process and the weighting to apply to a specific range of values for each annotation field.Additionally,the PPF allows users to filter variants on fields attributes;for example,a population frequency threshold can be defined for Exome Aggregation Consortium(ExAC)/Genome Aggregation Database(gnomAD)[10].A PPF only needs to be set up once as it can be applied repeatedly to prioritize variants in different samples if the annotation fields used within the PPF are available.While utilization of new prediction annotations would require modification of the PPF,VPOT will still run successfully without PPF modification,but it would utilize only the annotations indicated in the PPF.
Figure 1 Variant prioritization ordering tool(VPOT)workf lowA
VPOT is designed to allow the user to customize their prioritization processbased on annotationsrelevant to thedisease of study.However,we also provide a default PPF with a list of recommended annotations based on our experience with one complex disease,congenital heart disease[11,12].The default PPF eliminates variants with a minor allele frequency higher than 0.1%with respect to control databases(ExAC,gnomAD,and 1000 Genomes Project),low quality variants(with coverage less than 8 times or 25%allelic balance),and synonymous variants.The weighting criteria for eachin silicopredictor used in the default PPF are set to identify pathogenic variants based on pathogenicity threshold recommended by the individual algorithms or informed by the literature(e.g.,CADD,PolyPhen-2,MutationTaster2,LRT,MCAP,GERP++,MetaSVM[13—16]).The default PPF also weighs the most disruptive variants such as stop-gain,frameshift indels,and splicing variants highly.
Creation of the variant priority ordered list(VPOL)
Annotated VCFs/TSV f iles and a PPF are passed as input to VPOT to perform the prioritization function on all variants.Using the PPF-customized weights,each variant is scored by aggregating all the user-defined values.This is done by calculating the sum of all encoded weights for each variant.A normalized score is also calculated by dividing by the maximum score found across all variants.All variants by default are returned and ordered in the output,which we call the variant priority ordered list(VPOL).Variants with low score(e.g.,synonymous variants)can be removed at this stage by providing a cutoff within the PPF so that only variants with scores greater or equal to the cutoff are included in the VPOL.
For each variant,VPOT performsquality control checkson each sample’s genotype based on coverage(number of reads at variant position)and allele balance(percentage of alternate allelereadsat variant position).Theuser,via the PPF,can customize the quality control check thresholds.If the sample genotype call fails these quality control checks,then it is marked in the VPOL.For each variant line in the VPOL each sample’s genotype is denoted as,‘‘0’’for reference,‘‘1”for heterozygous,‘‘2”for homozygousalternate,or‘‘.”for quality control failure.Thisprioritization step can be easily performed in parallel across many samples or repeated for new samples by using the same PPF as part of the input.
Post-processing of the VPOL
VPOT provides several post-prioritization options to explore the VPOL(Figure 1).A summary statistics option(stats)generates a quick and simplevariant report for thesupplied VPOL highlighting the number of scored variants,and a list of genes that scorein thetop 75th percentile(default)of variantsfound for each sample in the VPOL.VPOT allows researchers to apply a user-defined candidate gene list to filter any VPOL using the gene filtering option(genef).
VPOT can filter variants in the VPOL based on inheritance or absence from controls via the use of the sample filtering option(samplef).This option utilizes a ped(pedigree)format f ile.The sample filtering option can filter variants based on their case-control status by extracting variants that exist in case samples and not in control samples of a large cohort.The VPOT samplef option can also filter variants based on different Mendelian inheritance models.A complete family trio,defined by the presence of parents and proband,is required for thisoption.Thedenovo(DN)model identifiesvariantsthat only exist in the proband and not in any of the parents.The autosomal dominant(AD)model identifies variants that exist in both the proband and affected parent but not in the unaffected parent.The autosomal recessive(AR)model identifies variants that are homozygous for the alternative allele in the proband and heterozygous in both parents.The compound heterozygous(CH)model provides a filter that returns heterozygous variants in genes that have both probandpaternal and proband-maternal specific variants.
For large cohort studies,it is recommended to run multiple VPOT processes for small subsets of samples in parallel to reduce computational time.To facilitate the ability to view all thesamplesin a single VPOL f ile,VPOT hasa mergeoption(merge)to consolidate multiple numbers of VPOL f iles back to one VPOL.
Results
Application of VPOT to disease cohorts
We used VPOT to identify potentially pathogenic gene variants in a family with a proband that had multiple congenital malformations(family B in Shi,et al.[17]).The family was subjected to whole-genome sequencing(WGS)and over 7.7 million variants identified.Following filtering and prioritization by VPOT using the default PPF the number of candidate variants decreased to 587.Based on the family pedigree which shows that the parents were consanguineous,we used VPOT’s inheritance model filtering(within samplef option)to refine the number of candidate variants based on an autosomal recessive inheritance model(AR)(Figure 2).After application of inheritance model filtering,14 variants remained with aHAAOhomozygous variant ranked first,consistent with the reported genetic cause in this family(Table 1 and Table S1)[17].The identification of theHAAOvariant demonstrates the ability of VPOT to facilitate monogenic disease variant discovery in a systematic way.
VPOT has been successfully used to prioritize variants in a congenital heart disease(CHD)cohort of 30 families that were whole-exome sequenced[11],with the disease-causing variants in the three solved families ranked within the top 2%of all variants found.In another cohort of 97 CHD families that underwent WGS[12],clinically actionable variants were identified in 28 families,and VPOT ranked the majority of these variants within the top 1%of all variants found.Only two variants were not ranked within the top 1%of variants due to large disagreement in pathogenicity prediction between different methods.We have provided the PPF f ile used for the prioritization of variants in these studies as a default PPF for the study of complex diseases like CHD.
Figure 2 HAAO CHD family pedigree
Comparison with existing variant prioritization tools
VPOT’s approach to variant prioritization is to aggregate pathogenicity predictor scores since no single pathogenicity predictor score has been shown to predict pathogenic mutations reliably.Other packages have utilized this same approach,and we identified two for evaluation comparison that are most similar to VPOT,Variant Ranker[8]—a webbased tool,and VaRank[9]—a command line program.Both Variant Ranker and VaRank create a ranking value for variants based on a set of user-defined scores for pathogenicity predictors like VPOT.
We compared the overall features and functionality between the tools(Table 2).Both VPOT and VaRank have no restriction on the input f ile size,which is important for the analysis of variants resulting from whole genome sequencing.Annotation is controlled by the user for both VPOT and VaRank,although it is a separate process for VPOT and part of the tool for VaRank.This provides greater f lexibility for the user to adopt newer releases of the human reference genome,and novel pathogenicity predictors,such as for splicing and non-coding genetic variants.For Variant Ranker,the variant annotation process is embedded within its workf low and cannot be modif ied by the user.All threetools rank variants based on the scores of multiple pathogenicity prediction methods.However the number of predictors vary,with the lowest seen in VaRank that uses only three fixed pathogenicity prediction tools(phastCons[18],SIFT,and PolyPhen-2),then Variant Ranker that uses seven fixed tools(PolyPhen-2,SIFT,LRT,MutationTaster2, MutationAssessor, RadialSVM, and FATHMM[16,19,20]),and finally VPOT where the number is limited only by the predictors included in the annotation.Accounting for differences in the genetic architectures of different diseases,VPOT allows expert users to apply their specialized knowledge of disease to stratify results fromin silicopredictors.The user can select higher weighting for specific predictors to enhance the accuracy for the disease or study design in question.VPOT also allows f ine-tuning of variant ranking as the user can define any number of scoring intervals for an annotation category.This allows the user to define different pathogenicity thresholds instead of a binary nondamaging/damaging scenario.Finally,both VPOT and VaRank are local machine tools,so there is no security concern with sensitive study data being stored in the cloud.
We evaluated VPOT,VaRank,and Variant Ranker by prioritizing variants from an exome sequencing dataset on idiopathic hemolytic anemia(MIM:266200)[21]used previouslyby Variant Ranker to demonstrateitseffectiveness[8].Following as close as possible the default variant scoring criteria of Variant Ranker,VPOT also ranked the most likely causative genePKLRin the fourth position like Variant Ranker.We were not able to replicate the same scoring parameters as Variant Ranker using VaRank due to the limited number of pathogenicity predictors scoring options.With VaRank,using its default scoring parameters thePKLRvariant was ranked in 199th position with an annotation impact value of‘‘Moderate”.Both VaRank and Variant Ranker provide CADD Phred score annotation but do not include it in their final ranking.CADD score is a commonly used pathogenicity predictor,and a minimum score of 20 has been used as a lower threshold for variants considered to be possibly pathogenic[14].Utilizing the f lexibility of VPOT we added CADD into our annotation and PPF with a weighting for CADD Phred score above 20.Under these new ranking criteria,thePKLRvariant was ranked first by VPOT.This demonstrates the benef it of VPOT’s customizability to allow the users to refine and tune the variant prioritization process.
Table 1 Top ten variants for family B following autosomal recessive inheritance model filtering(Samplef—AR)
Table 2 Feature comparison of VPOT with similar variant prioritization tools
Finally,we compared the computational performance of the three tools when ranking f iles with different number of variants(Figure 3 and Table S2).The processing time for VPOT and VaRank includes the annotation of the input VCF(to emulate the Variant Ranker processing which includes its annotation).VPOT was consistently faster than both VaRank and Variant Ranker,and asthe number of variants increased the time difference between VPOT and the others were magnif ied.Additionally,VPOT was the only tool able to complete variant prioritization task for samples containing up to four million variants.In comparing the amount of central processing unit memory usage for the local machine tools,VPOT required a significantly smaller amount of memory to perform the prioritization tasks compared to VaRank.
Figure 3 Comparison of computational performance of VPOT with similar variant prioritization tools
Conclusion
VPOT provides a convenient way to prioritize genetic variants in disease sequencing studies.It is fully customizable,allowing researchers to filter on any annotation metrics and set weights for pathogenicity predictions that ref lect their specific diseasevariant hypothesis in question.The use of VPOT can be especially informative when analyzing sequencing cohorts containing many families,as the prioritization of variants can allow researchers to identify most likely disease-causal candidate variants quickly across all families.
VPOT is highly scalable for large genome analysis.Wholegenome sequencing generates very large variant f iles,and there are now increasing requirements for prioritization of noncoding variants that make up~98%of the genome.As larger sequencing studies are performed,VPOT will further prove to be an extremely valuable tool.
Availability
VPOT is freely available for public use at GitHub(https://github.com/VCCRI/VPOT/).Documentation for installation along with a user tutorial,default parameter f ile,and test data areprovided.Additional datasetsanalyzed in thecurrent study are available upon request from the corresponding author.
Authors’contributions
EI developed the application tool,performed the analyses and drafted the paper.GC,DW and SLD participated in the design of the tool.EG participated in the design and helped to draft the manuscript.All authors read and approved the final manuscript.
Competing interests
The authors have declared no competing interests.
Acknowledgments
This work was supported by an Australian Postgraduate Award(University of New South Wales)to EI,Chain Reaction(The Ultimate Corporate Bike Challenge),the Off ice of Health and Medical Research,NSW Government,Australia,the National Health and Medical Research Council Principal Research Fellowship(Grant No.1135886)to SLD,NSW Government,Australia and the National Heart Foundation of Australia Future Leader Fellowship(Grant No.101204)to EG.
Supplementary material
Supplementary data to this article can be found online at https://doi.org/10.1016/j.gpb.2019.11.001.
杂志排行
Genomics,Proteomics & Bioinformatics的其它文章
- In Memory of Vladimir B.Bajic(1952—2019)
- Deep Learning Deciphers Protein—RNA Interaction
- DeepCPI:A Deep Learning-based Framework for Large-scale in silico Drug Screening
- Gclust:A Parallel Clustering Tool for Microbial Genomic Data
- I3:A Self-organising Learning Workf low for Intuitive Integrative Interpretation of Complex Genetic Data
- CIRCexplorer3:A CLEAR Pipeline for Direct Comparison of Circular and Linear RNA Expression