
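Methods for mining omics data of complex diseases

2017-04-14 Li Xiong

Keywords: omics; epistasis; genome

Li Xiong

(School of Software, East China Jiaotong University, Nanchang, Jiangxi 330013, China)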

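For individual types of omics data, some genetic and environmental factors genuinely associated with tumors have already been identified, but these may be only the tip of the iceberg of the underlying complex genetic mechanisms. A key reason for this limitation may be that disease models are oversimplified, that is, they ignore the relationships among multi-level omics data. This study holds that, building on a deeper understanding of genome-wide SNP data, further integrating multi-source omics data and improving our understanding of phenomena such as epistasis and heterogeneity will improve tumor risk assessment and help achieve the goal of personalized medicine. This paper compares existing omics data mining methods for complex diseases from the perspectives of SNP data analysis and multi-source omics data analysis.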

SNP; genome-wide association study; systems biology; machine learning

1 Background and significance

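Complex diseases are a class of threats to human health that are caused by multiple factors and whose mechanisms of formation remain unclear. They include common conditions such as mental disorders, multiple sclerosis, and tumors, and tumors are among the most common complex diseases. According to the latest statistics of the Chinese Cancer Registry Annual Report, about six people are diagnosed with cancer every minute nationwide, and patients are increasingly young; tumors therefore pose a serious threat to the quality of life of the population. A single nucleotide polymorphism (SNP) is a genetic variation at the level of the DNA sequence that may substantially alter biomolecules such as regulatory elements, genes, and protein structures, thereby increasing an individual's risk of developing a tumor. Researchers worldwide have carried out genome-wide association studies (GWAS) for different tumors, and a number of important SNPs have been accurately identified and recorded in the GWAS Catalog [1]. Closer analysis, however, has shown that traditional GWAS suffers from poor reproducibility, low interpretability, and missing heritability. The key causes are an insufficient understanding of the interactions between susceptibility loci (epistasis) and the examination of SNP data in isolation. From a computer science perspective, these can be summarized in three points. First, genome-wide SNP data contain millions of loci, which poses a major challenge to the computational methods and hardware resources used in bioinformatics and makes in-depth mining difficult [2]. Second, complex diseases such as tumors are not yet understood systematically and completely, so their definitions are vague or even ambiguous; case samples therefore exhibit several different genetic architectures (heterogeneity), which partly masks the association between genetic variants and different tumor subtypes [3]. Third, tumor initiation and progression involve interactions among many kinds of biomolecules, so analyzing only one level of omics data departs further from the true disease model, makes it hard to find the true and complete set of risk factors, and leads to missing heritability [4].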

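High-throughput technologies for generating biological data have greatly enriched omics data such as genome-wide SNP data and epigenomic, transcriptomic, and metabolomic data, which makes it possible to apply big-data-driven research to tumors. The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC) each provide multi-level omics data for many cancers, laying a solid data foundation for systematically exploring the interaction mechanisms among the multi-source omics data underlying tumors. In addition, the team of Chen Luonan [5,6] compared classification and prediction methods for complex diseases based on molecular biomarkers and on network biomarkers, and pointed out that analyzing multi-molecule, multi-source data interactions on the basis of network biomarkers reflects complex systems more systematically and more completely.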

2 Methods for analyzing genome-wide SNP data

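In-depth study of a single type of omics data not only helps uncover new genetic variants but also lays the groundwork for the later integration of multi-source data. The key steps of SNP data analysis, such as dimensionality reduction and case-control studies, are summarized below as examples.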

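Dimensionality reduction: The genome contains millions of SNPs or more, whereas the number of samples available for a given disease is usually very limited. Such high-dimensional, small-sample data lead to problems such as overfitting. Cross-validation and permutation testing can alleviate this to some extent, but reducing dimensionality before training a learning model not only extracts more representative sample features but also greatly improves the efficiency of downstream analysis and markedly lowers the cost of multiple hypothesis testing [7]. For example, an exhaustive two-locus interaction analysis of genome-wide SNP data containing 5 million loci requires statistical tests on about 1.25×10^13 SNP-SNP pairs, and a three-locus analysis raises this to about 2.08×10^19 combinations. The computational cost of the subsequent association study (storage, running time, and so on) therefore grows exponentially with the interaction order. Assuming a device that performs 1 million tests per second, processing all two-locus and all three-locus interactions would take about 3,500 hours and about 5.8×10^9 hours, respectively; even with GPUs and similar hardware, the cost remains prohibitive. Effective dimensionality reduction is therefore of real practical value [8-15].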
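The figures in the example above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the number of loci and the throughput of 1 million tests per second are the assumptions stated in the text, and the function name `exhaustive_cost` is introduced here for illustration.

```python
from math import comb

N_SNPS = 5_000_000            # number of loci assumed in the text
TESTS_PER_SECOND = 1_000_000  # device throughput assumed in the text


def exhaustive_cost(n_snps: int, order: int, rate: int = TESTS_PER_SECOND):
    """Return (number of tests, hours needed) for an exhaustive scan.

    The number of tests is the number of unordered SNP combinations
    of the given interaction order.
    """
    n_tests = comb(n_snps, order)
    hours = n_tests / rate / 3600
    return n_tests, hours


for k in (2, 3):
    n_tests, hours = exhaustive_cost(N_SNPS, k)
    print(f"order {k}: {n_tests:.3e} tests, {hours:.3e} hours")
```

Raising the order from 2 to 3 multiplies the number of tests by roughly n/3, which is why each additional locus in the model makes exhaustive search so much more expensive.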

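Susceptibility locus identification: To control the error from multiple hypothesis testing, traditional single-locus genome-wide association studies set a stringent significance level such as p < 5×10^-8. As a result, association signals of weak or even moderate effect are ignored, which contributes to missing heritability for complex diseases and to findings that are hard to reproduce [16]. Studies have pointed out that ignoring epistasis among causal factors, among other things, oversimplifies the disease model and leads to missing heritability [17].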
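To illustrate how a stringent threshold discards moderate signals, the sketch below runs a single-locus genotype test on two hypothetical case-control tables. It assumes a Pearson chi-square test on a 2×3 genotype table (2 degrees of freedom, for which the p-value is exactly exp(-x/2)); the genotype counts are invented for illustration and are not taken from any study. The threshold 5×10^-8 is the one quoted above, which is commonly motivated as 0.05 divided by roughly one million independent tests.

```python
from math import exp

GENOME_WIDE_ALPHA = 5e-8  # threshold quoted in the text


def genotype_chi2(cases, controls):
    """Pearson chi-square statistic for a 2x3 genotype table.

    cases / controls: counts of the three genotypes (e.g. AA, Aa, aa).
    """
    row_totals = (sum(cases), sum(controls))
    total = sum(row_totals)
    chi2 = 0.0
    for j in range(3):
        col_total = cases[j] + controls[j]
        for i, observed in enumerate((cases[j], controls[j])):
            expected = row_totals[i] * col_total / total
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    return chi2


def p_value_df2(chi2):
    """Chi-square upper-tail probability for 2 degrees of freedom."""
    return exp(-chi2 / 2.0)


# Hypothetical counts: a strong signal and a moderate one.
strong = p_value_df2(genotype_chi2((300, 500, 200), (400, 480, 120)))
moderate = p_value_df2(genotype_chi2((330, 490, 180), (370, 490, 140)))

for name, p in (("strong", strong), ("moderate", moderate)):
    verdict = "passes" if p < GENOME_WIDE_ALPHA else "fails"
    print(f"{name}: p = {p:.2e}, {verdict} the genome-wide threshold")
```

In this toy setting the moderate locus would be nominally significant at 0.05 yet is dropped at the genome-wide level, which is the kind of signal loss the paragraph describes.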

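Genome-wide methods for identifying susceptibility loci have the advantage of analyzing every candidate locus in the data set, which avoids overlooking true susceptibility loci through manual preselection. In high-order epistasis analysis, however, they suffer from combinatorial explosion and hence enormous computational cost. Genome-wide epistasis detection can be further divided into four categories: exhaustive search, stochastic search, heuristic search, and machine learning [18].

Of these, exhaustive search examines the largest space of epistatic combinations. Multifactor dimensionality reduction (MDR) [19], one of the most representative exhaustive methods, partitions multi-locus genotype combinations into high-risk and low-risk groups, thereby reducing a high-dimensional genotype prediction problem to one dimension and enabling high-order epistasis analysis. Because MDR is suitable only for relatively small data sets, Yang et al. [20] proposed a fast MDR-based computing framework that improves efficiency. To make exhaustive strategies more widely applicable, Hemani et al. [21] and Kam-Thong et al. [22] each proposed GPU-based exhaustive epistasis search methods, and the methods in [23-25] improve efficiency with high-performance computing architectures such as cloud computing. These methods, however, largely neglect heterogeneity or do not bring in other levels of biological data.

Stochastic search methods based on random sampling make epistasis detection more efficient. BEAM [26], for example, combines a Bayesian partition model of loci with Markov chain Monte Carlo sampling to maximize the posterior probability of the model, and methods based on random forests [27] or on ant colony optimization, such as AntEpiSeeker [28], also improve the efficiency of searching epistatic combinations to some extent. As the combination space grows rapidly, however, stochastic strategies become much less stable: under the same experimental conditions they may return quite different epistatic combinations, which reduces their practical value. Jing et al. [18] therefore proposed MACOED, a multi-objective optimization method based on complementary criteria that uses ant colony optimization to find non-dominated Pareto solutions, which makes the results more robust.

Heuristic search strategies use additional information to guide the search, avoiding both randomness and exhaustive enumeration. CART, for example, optimizes the classification performance of SNP subsets according to measures such as information entropy and iteratively splits on SNPs until a classification tree meeting the stopping condition is produced [29]; this approach is poorly suited to pure epistasis. MSCD [30] heuristically searches the space of high-order epistatic combinations on the basis of energy distribution differences and can effectively identify more significant high-order epistasis. The effectiveness of such methods depends on how relevant the heuristic information is to the research objective.

Machine-learning-based epistasis detection is characterized by not requiring prior knowledge of the relationship between genotype and individual phenotype; instead, a learning model is trained to capture the complex relationship between them. Zhang et al. [31], for example, used a functional regression model to jointly examine epistasis between all pairs of loci across two genomic regions, with the main advantage of handling epistasis between rare SNPs effectively. A machine learning model, however, behaves like a black box: it is hard for researchers to understand the biological meaning behind it, and little is learned about the relative importance of the susceptibility loci [32-39].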
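The high-risk/low-risk pooling idea behind MDR, as described above, can be sketched in a few lines. This is a simplified illustration and not the published MDR implementation: it omits the cross-validation that MDR normally uses, scores combinations by plain training accuracy, and runs on an invented toy data set in which disease status depends only on the joint genotype of two SNPs (pure epistasis). The function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def mdr_accuracy(genotypes, labels, loci):
    """Pool multi-locus genotype cells into high/low risk and score the split.

    genotypes: list of per-sample genotype tuples
    labels:    list of 0 (control) / 1 (case)
    loci:      tuple of SNP column indices forming the candidate combination
    """
    n_cases = sum(labels)
    n_controls = len(labels) - n_cases
    threshold = n_cases / n_controls  # overall case:control ratio
    counts = defaultdict(lambda: [0, 0])  # cell -> [controls, cases]
    for row, y in zip(genotypes, labels):
        counts[tuple(row[i] for i in loci)][y] += 1
    # A cell is high risk when its case:control ratio exceeds the threshold.
    high_risk = {cell for cell, (ctrl, case) in counts.items()
                 if case > threshold * ctrl}
    # The resulting one-dimensional attribute predicts "case" for high-risk cells.
    correct = sum((tuple(row[i] for i in loci) in high_risk) == bool(y)
                  for row, y in zip(genotypes, labels))
    return correct / len(labels)


def best_combination(genotypes, labels, order):
    """Exhaustively score every SNP combination of the given order."""
    n_snps = len(genotypes[0])
    return max(combinations(range(n_snps), order),
               key=lambda loci: mdr_accuracy(genotypes, labels, loci))


# Toy data: status is the XOR of SNP 0 and SNP 1; SNP 2 is irrelevant.
genotypes = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
labels = [a ^ b for a, b, _ in genotypes]

print(best_combination(genotypes, labels, 2))
print(mdr_accuracy(genotypes, labels, (0,)))
```

In this toy example neither SNP 0 nor SNP 1 is informative on its own, yet the pair separates cases from controls, which is why single-locus scans can miss epistatic effects. The call to `combinations` is also where the combinatorial explosion discussed earlier appears.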

3 Integrative analysis of multi-source omics data

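Current methods for integrating multi-source omics data fall roughly into two categories: multi-stage integration and multi-feature integration [7]. In multi-stage integration, each stage builds a model from only two different levels of omics data, and the multi-source data are examined hierarchically, stage by stage. Multi-feature integration instead combines the features of all data sources simultaneously to build a model of the association between multi-source biological data and complex disease.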

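Multi-stage integration is an analysis approach akin to a filtering mechanism: the large number of genetic variants in the initial multi-source data pass through layered, staged filters so that variants unrelated to the trait of interest are removed. The filters usually rely on statistical significance or prior knowledge. Holzinger and Ritchie [40] proposed a three-stage method for integrating data such as genome-wide gene expression profiles and SNPs. The method first removes loci not significantly associated with the disease using a genome-wide significance threshold, then examines the relationship between the retained loci and gene expression values to identify eQTLs, and finally examines the relationship between susceptibility genes or loci and the trait under study; eQTLs associated with the expression of susceptibility genes have also been used in drug analysis [41]. For the eQTL identification step of the three-stage method, existing work has made improvements by refining the statistical tests [42], examining SNP-gene regulatory relationships in more depth [43], and speeding up the analysis [44]. The effectiveness of multi-stage integration thus depends on the reliability of the statistical tests and the prior information; it is limited by and biased toward prior knowledge, and it neglects, to some extent, interactions among the multi-source data [45,46].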
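The filtering character of the three-stage approach can be illustrated with a small sketch. It is not the method of [40]: the second and third stages here use a simple Pearson correlation cutoff as a stand-in for the statistical tests used in practice, and all identifiers (`rs_a`, `gene1`, the cutoffs, the toy values) are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def three_stage_filter(snp_pvalues, snp_genotypes, expression, trait,
                       p_cutoff=5e-8, r_cutoff=0.7):
    # Stage 1: keep SNPs whose disease association passes the threshold.
    stage1 = [s for s, p in snp_pvalues.items() if p < p_cutoff]
    # Stage 2: keep SNP-gene pairs where genotype tracks expression (eQTL-like).
    stage2 = [(s, g) for s in stage1 for g in expression
              if abs(pearson(snp_genotypes[s], expression[g])) >= r_cutoff]
    # Stage 3: keep pairs whose gene expression also tracks the trait.
    stage3 = [(s, g) for s, g in stage2
              if abs(pearson(expression[g], trait)) >= r_cutoff]
    return stage1, stage2, stage3


# Toy data for six samples (all values invented for illustration).
snp_pvalues = {"rs_a": 1e-9, "rs_b": 1e-10, "rs_c": 0.03}
snp_genotypes = {"rs_a": [0, 0, 1, 1, 2, 2],
                 "rs_b": [0, 1, 2, 0, 1, 2],
                 "rs_c": [2, 2, 1, 1, 0, 0]}
expression = {"gene1": [1.0, 1.2, 2.1, 1.9, 3.0, 3.2],
              "gene2": [5.0, 4.0, 5.5, 4.2, 5.1, 4.4]}
trait = [0, 0, 0, 1, 1, 1]

stage1, stage2, stage3 = three_stage_filter(
    snp_pvalues, snp_genotypes, expression, trait)
print(stage1, stage2, stage3)
```

Note that `rs_c` is discarded at the first stage and never reaches the later ones, whatever its relationship to gene expression. This shows how the outcome of a multi-stage strategy hinges on the reliability of each filter.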

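Multi-feature integration is divided, according to how feature information is combined, into numerical integration, transformation-based integration, and model-based integration [7,47,48]. Kim et al. [49] used transformation-based integration, first converting the features of each data type into subgraphs and then integrating them through the relationships among the subgraphs. The main advantages of this class of methods are that the distinctive properties of the original data are preserved and that measurement scales need not be unified across data types, but the transformation may lose some information. Model-based integration first trains separate models on each data layer and then combines them; common ensemble approaches include grammatical evolution neural networks [50], voting algorithms [51], and deep learning models [52]. These are well suited to heterogeneous data, but overlap among models may lead to bias or overfitting.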
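As a minimal sketch of model-based integration by voting, the example below trains one very simple classifier per data layer and combines their predictions by majority vote. The per-layer model (a threshold at the midpoint of the class means), the layer names, and all values are assumptions made for illustration; real studies would use far richer models for each layer.

```python
from collections import Counter


def train_threshold_model(values, labels):
    """Fit a one-feature classifier: cut at the midpoint of the class means."""
    case = [v for v, y in zip(values, labels) if y == 1]
    ctrl = [v for v, y in zip(values, labels) if y == 0]
    mean_case = sum(case) / len(case)
    mean_ctrl = sum(ctrl) / len(ctrl)
    cut = (mean_case + mean_ctrl) / 2
    case_is_high = mean_case >= mean_ctrl
    # Predict 1 (case) when the value lies on the case side of the cut.
    return lambda v: int((v >= cut) == case_is_high)


def majority_vote(models, sample):
    """Combine the per-layer predictions for one sample by majority vote."""
    votes = Counter(model(sample[layer]) for layer, model in models.items())
    return votes.most_common(1)[0][0]


# Toy training data: three omics layers measured on the same six samples.
labels = [0, 0, 0, 1, 1, 1]
layers = {
    "snp_score":   [0.1, 0.2, 0.3, 0.7, 0.8, 0.9],
    "expression":  [2.0, 2.2, 1.8, 3.0, 3.1, 2.9],
    "methylation": [0.9, 0.8, 0.85, 0.3, 0.2, 0.25],
}
models = {name: train_threshold_model(values, labels)
          for name, values in layers.items()}

new_sample = {"snp_score": 0.6, "expression": 2.4, "methylation": 0.4}
print(majority_vote(models, new_sample))
```

Because each layer keeps its own model and scale, no common normalization is needed. The vote, however, treats the layers as independent, so correlated layers can reinforce one another's errors, which is the overlap concern raised above.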

4 Conclusions and outlook

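This study holds that understanding the multi-level biomolecular interaction mechanisms behind complex diseases and deepening the application of cloud computing to big-data mining for complex diseases involve the intersection of several research fields and are of theoretical significance for explaining how complex diseases arise. Identifying molecular network biomarkers genuinely associated with complex diseases and building risk assessment models is also of practical value. Therefore, building on a deeper understanding of genome-wide SNP data, further integrating multi-source omics data and improving our understanding of epistasis, heterogeneity, and related phenomena will improve risk assessment for complex diseases and help achieve the goal of personalized medicine.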

[1]Welter D,MacArthur J,Morales J,et al.The NHGRI GWAS Catalog,a curated resource of SNP-trait associations[J].Nucleic Acids Research,2014,42(Database issue):1001-6.

[2]Xiong H Y,Alipanahi B,Lee L J,et al.The human splicing code reveals new insights into the genetic determinants of disease[J].Science,2015,347(6218):1254806.

[3]Urbanowicz R J,Andrew A S,Karagas M R,et al.Role of genetic heterogeneity and epistasis in bladder cancer susceptibility and outcome:a learning classifier system approach[J].Journal of the American Medical Informatics Association,2013,20(4):603-612.

[4]Li P,Guo M,Wang C,et al.An overview of SNP interactions in genome-wide association studies[J].Briefings in functional genomics,2014,14(2):143-55.

[5]Zeng T,Zhang WW,Yu X T,et al.Edge biomarkers for classification and prediction of phenotypes[J].Science China Life Sciences,2014,57(11):1103-1114.

[6]Liu R,Wang X,Aihara K,et al.Early diagnosis of complex diseases by molecular biomarkers,network biomarkers,and dynamical network biomarkers[J].Medicinal research reviews,2014,34(3):455-478.

[7]Ritchie M D,Holzinger E R,Li R,et al.Methods of integrating data to uncover genotype-phenotype interactions[J].Nature Reviews Genetics,2015,16(2):85-97.

[8]Patil N,Berno A J,Hinds D A,et al.Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21[J].Science,2001,294(5547):1719-1723.

[9]Ting C K,Lin W T,Huang Y T.Multi-objective tag SNPs selection using evolutionary algorithms[J].Bioinformatics,2010,26(11):1446-1452.

[10]Liao B,Li X,Zhu W,et al.A novel method to select informative SNPs and their application in genetic association studies[J].IEEE/ACM Transactions on Computational Biology and Bioinformatics,2012,9(5):1529-1534.

[11]Liao B,Li X,Cai L,et al.A Hierarchical Clustering Method of Selecting Kernel SNP to Unify Informative SNP and Tag SNP[J].IEEE/ACM Transactions on Computational Biology and Bioinformatics,2015,12(1):113-122.

[12]Li X,Liao B,Cai L,et al.Informative SNPs selection based on two-locus and multilocus linkage disequilibrium:Criteria of max-correlation and min-redundancy[J].IEEE/ACM Transactions on Computational Biology and Bioinformatics,2013,10(3):688-695.

[13]Hung C L,Chen W P,Hua G J,et al.Cloud computing-based tag SNP selection algorithm for Human Genome Data[J].International journal of molecular sciences,2015,16(1):1096-1110.

[14]Wu C,Cui Y.Boosting signals in gene-based association studies via efficient SNP selection[J].Briefings in bioinformatics,2014,15(2):279-291.

[15]Mooney M,Wilmot B,McWeeney S.The GA and the GWAS:using genetic algorithms to search for multilocus associations[J].IEEE/ACM Transactions on Computational Biology and Bioinformatics,2012,9(3):899-910.

[16]Jia P,Zhao Z.Network-assisted analysis to prioritize GWAS results:principles,methods and perspectives[J].Human genetics,2014,133(2):125-138.

[17]Gibson G.Hints of hidden heritability in GWAS[J].Nature genetics,2010,42(7):558-560.

[18]Jing P J,Shen H B.MACOED:a multi-objective ant colony optimization algorithm for SNP epistasis detection in genome-wide association studies[J].Bioinformatics,2015,31(5):634.

[19]Ritchie M D,Hahn L W,Roodi N,et al.Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer[J].The American Journal of Human Genetics,2001,69(1):138-147.

[20]Yang C H,Lin Y D,Yang C S,et al.An efficiency analysis of high-order combinations of gene-gene interactions using multifactor-dimensionality reduction[J].BMC Genomics,2015,16(1):489.

[21]Hemani G,Theocharidis A,Wei W,et al.EpiGPU:exhaustive pairwise epistasis scans parallelized on consumer level graphics cards[J].Bioinformatics,2011,27(11):1462-1465.

[22]Kam-Thong T,Pütz B,Karbalai N,et al.Epistasis detection on quantitative phenotypes by exhaustive enumeration using GPUs[J].Bioinformatics,2011,27(13):i214-i221.

[23]Sluga D,Curk T,Zupan B,et al.Heterogeneous computing architecture for fast detection of SNP-SNP interactions[J].BMC bioinformatics,2014,15(1):216.

[24]Kässens J C,Wienbrandt L,González-Domínguez J,et al.High-speed exhaustive 3-locus interaction epistasis analysis on FPGAs[J].Journal of Computational Science,2015,9:131-136.

[25]Guo X,Meng Y,Yu N,et al.Cloud computing for detecting high-order genome-wide epistatic interaction via dynamic clustering[J].BMC bioinformatics,2014,15(1):102.

[26]Zhang Y,Liu J S.Bayesian inference of epistatic interactions in case-control studies[J].Nature genetics,2007,39(9):1167-1173.

[27]Mao W,Lee J.A combinatorial analysis of genetic data for Crohn’s disease[C]//Bioinformatics and Biomedical Engineering,2007.ICBBE 2007.The 1st International Conference on.IEEE,2007:1031-1034.

[28]Wang Y,Liu X,Robbins K,et al.AntEpiSeeker:detecting epistatic interactions for case-control studies using a two-stage ant colony optimization algorithm[J].BMC research notes,2010,3(1):117.

[29]Chattopadhyay A S,Hsiao C L,Chang CC,et al.Summarizing techniques that combine three non-parametric scores to detect disease-associated 2-way SNP-SNP interactions[J].Gene,2014,533(1):304-312.

[30]Ding X,Wang J,Zelikovsky A,et al.Searching high-order SNP combinations for complex diseases based on energy distribution difference[J].IEEE/ACM Transactions on Computational Biology and Bioinformatics,2015,12(3):695-704.

[31]Zhang F,Boerwinkle E,Xiong M.Epistasis analysis for quantitative traits by functional regression model[J].Genome research,2014,24(6):989-998.

[32]Kam-Thong T,Azencott C A,Cayton L,et al.GLIDE:GPU-based linear regression for detection of epistasis[J].Human heredity,2012,73(4):220-236.

[33]Beam A L,Motsingerreif A,Doyle J.Bayesian neural networks for detecting epistasis in genetic association studies[J].BMC bioinformatics,2014,15(1):368.

[34]Lee I,Blom U M,Wang P I,et al.Prioritizing candidate disease genes by network-based boosting of genome-wide association data[J].Genome research,2011,21(7):1109-1121.

[35]Chen L S,Hutter C M,Potter J D,et al.Insights into colon cancer etiology via a regularized approach to gene set analysis of GWAS data[J].The American Journal of Human Genetics,2010,86(6):860-871.

[36]Braun R,Buetow K.Pathways of distinction analysis:a new technique for multi-SNP analysis of GWAS data[J].PLos Genetics,2011,7(6):e1002101.

[37]Askland K,Read C,O’Connell C,et al.Ion channels and schizophrenia:a gene set-based analytic approach to GWAS data for biological hypothesis testing[J].Human genetics,2012,131(3):373-391.

[38]Yang C H,Lin Y D,Chaung L Y,et al.Evaluation of breast cancer susceptibility using improved genetic algorithms to generate genotype SNP barcodes[J].IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB),2013,10(2):361-371.

[39]Li X,Liao B,Chen H.A new technique for generating pathogenic barcodes in breast cancer susceptibility analysis[J].Journal of theoretical biology,2015,366:84-90.

[40]Holzinger E R,Ritchie M D.Integrating heterogeneous high-throughput data for meta-dimensional pharmacogenomics and disease-related studies[J].Pharmacogenomics,2012,13(2):213-222.

[41]Huang R S,Duan S,Bleibel W K,et al.A genome-wide approach to identify genetic variants that contribute to etoposide-induced cytotoxicity[J].Proceedings of the National Academy of Sciences,2007,104(23):9758-9763.

[42]Huang Y T,VanderWeele T J,Lin X.Joint analysis of SNP and gene expression data in genetic association studies of complex diseases[J].The annals of applied statistics,2014,8(1):352.

[43]Kang M,Zhang C,Chun H W,et al.eQTL epistasis:detecting epistatic effects and inferring hierarchical relationships of genes in biological pathways[J].Bioinformatics,2015,31(5):656-664.

[44]Shabalin A A.Matrix eQTL:ultra fast eQTL analysis via large matrix operations[J].Bioinformatics,2012,28(10):1353-1358.

[45]Giacalone G,Clarelli F,Osiceanu A M,et al.Analysis of genes,pathways and networks involved in disease severity and age at onset in primary-progressive multiple sclerosis[J].Multiple Sclerosis,2015:21(11).

[46]Wang J G.Research on molecular network models of complex diseases[J].Scientia Sinica Mathematica (Chinese edition),2014,44(4):317-328.

[47]Fridley B L,Lund S,Jenkins G D,et al.A Bayesian integrative genomic model for pathway analysis of complex traits[J].Genetic epidemiology,2012,36(4):352-359.

[48]Mankoo P K,Shen R,Schultz N,et al.Time to recurrence and survival in serous ovarian tumors predicted from integrated genomic profiles[J].PLoS One,2011,6(11):e24709.

[49]Kim D,Shin H,Song Y S,et al.Synergistic effect of different levels of genomic data for cancer clinical outcome prediction[J].Journal of biomedical informatics,2012,45(6):1191-1198.

[50]Holzinger E R,Dudek S M,Frase A T,et al.ATHENA:the analysis tool for heritable and environmental network associations[J].Bioinformatics,2014,30(5):698-705.

[51]Drăghici S,Potter R B.Predicting HIV drug resistance with neural networks[J].Bioinformatics,2003,19(1):98-107.

[52]Liang M,Li Z,Chen T,et al.Integrative data analysis of multi-platform cancer data with a multimodal deep learning approach[J].IEEE/ACM Transactions on Computational Biology and Bioinformatics,2014,12(4):928-937.

Methods for mining omics data of complex diseases

LI Xiong

(School of Software,East China Jiaotong University,Nanchang 330013,China)

At present, for individual types of omics data, some of the genetic and environmental factors genuinely associated with tumors have been uncovered, but they may be only the tip of the iceberg of the complex genetic mechanisms involved. The key reason for this limitation may be that the disease model is oversimplified, namely that it ignores the interrelationships among multi-level omics data. This study holds that deepening the understanding of genome-wide SNP data, further integrating multi-source omics data, and better understanding epistasis, heterogeneity, and other phenomena will enhance cancer risk assessment and help realize the goal of personalized medicine. This paper compares existing omics data mining methods for complex diseases from the perspectives of SNP data analysis and multi-source omics data analysis.

SNP; genome-wide association study; systems biology; machine learning

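Article ID: 1672-7010(2017)02-0012-07

Received: 2017-02-01

Funding: National Natural Science Foundation of China (61602174)

About the author: Li Xiong (born 1985), from Shaoyang, Hunan; lecturer, PhD; research interests: data mining and bioinformatics. E-mail: lx_hncs@163.com

CLC number: TP311

Document code: A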
