Semantic judgment method for polysemous keywords in dynamic requirements traceability
2019-08-01
TANG Chen, LI Yonghua, RAO Mengni, HU Gangjun
Abstract: Compared with Information Retrieval (IR) methods, ontology-based dynamic requirements traceability methods can improve the precision of trace links, but constructing a reasonable and effective ontology, especially a domain ontology, is complicated and tedious. To reduce the time and labor cost of building a domain ontology, a Modifier Ontology-based Keyword Semantic Judgment Method (MOKSJM) is proposed, which combines modifiers with a general ontology. Firstly, the collocation relationships between keywords and modifiers are analyzed. Then, the semantics of each keyword are determined by combining a modifier ontology with rules, so that keyword polysemy does not bias the dynamic requirements traceability results. Finally, based on this analysis, the keyword semantics are adjusted and reflected in the similarity scores. Modifiers are relatively few in requirements documents, design documents and similar artifacts, so the time and labor cost of building a modifier ontology is comparatively small. Experimental results show that, at comparable recall, the gap in precision between MOKSJM and the domain ontology-based dynamic traceability method is small, and that, compared with the Vector Space Model (VSM) method, MOKSJM effectively improves the precision of the requirements traceability results.
Key words: dynamic requirements traceability; ontology; modifier; requirements engineering; software engineering
CLC number: TP311.5
Document code: A
0 Introduction
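Semantics is currently a key problem in dynamic requirements traceability[1]. As ontology research has deepened and ontology techniques have become widely applied, ontologies have drawn growing attention, and more and more researchers use them to address semantic problems in dynamic requirements traceability. Chen et al.[2] proposed a method for evaluating semantic mining: nine semantic relations from WordNet were combined with the synonym relation from the Unified Medical Language System (UMLS) to obtain a standard dataset, which was then used to evaluate word embeddings. The method applies to most semantic relations, but because it measures with cosine similarity, its results are not sufficiently accurate. Kolhe et al.[3], to ease retrieval and management of large text databases, used Latent Semantic Indexing (LSI) to cluster documents and create labels, and then computed similarity by combining WordNet-based query expansion with cosine similarity. The method handles problems such as polysemy through WordNet's semantic algorithms, but its memory demand grows as the number of matrix transformations increases. Besbes et al.[4], to help users understand or express medical terms, automatically extracted concepts from user queries and built a medical ontology, fuzzified the ontology by taking taxonomic relations and user profile information into account, and finally incorporated the ontology into query reformulation; however, the experimental results depend closely on the application domain. Matei et al.[5] computed the semantic distance between words with WordNet and then computed the similarity between texts based on dynamic time series. Compared with the traditional vector space model, the proposed time series model takes the effect of word order on semantics into account and improves the accuracy of the results. Kulathunga et al.[6] identified ambiguous word meanings in financial texts by fusing an ontology with a clustering method; although the method removes semantic ambiguity from the text and improves the performance of the clustering algorithm, it was not validated on a financial dataset. Mai et al.[7] proposed a semantic kernel function based on statistics and ontology and embedded it in a support vector machine for Chinese text classification; it makes full use of the semantic relations in the text to improve classification performance, but building the feature matrix associated with the semantic kernel is very time-consuming. Gong et al.[8] built a security-domain ontology knowledge base from microblog posts, expanded the initial query terms using ontology knowledge, filtered the candidate expansion terms with local query feedback, and obtained the final result through a second query and iteration. Because microblogs consist mainly of short texts with sparse keywords and information, the accuracy of this method decreases as the number of query results grows.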
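Related research indicates that 78% of the words in requirements documents are related to nouns[9], so verbs and nouns have become the main objects of study in dynamic requirements traceability. Nouns, however, are polysemous, and taking them as study objects easily introduces traceability errors caused by semantic divergence. Information Retrieval (IR) methods cannot resolve the polysemy ("one word, many meanings") and synonymy ("one meaning, many words") of nouns[10]. Domain ontology-based dynamic traceability methods can resolve these problems effectively, but they require a domain ontology, whose construction is complicated and tedious. Because modifiers are relatively few in requirements documents, building a modifier ontology costs less than building a domain ontology. This paper therefore proposes a Modifier Ontology-based Keyword Semantic Judgment Method (MOKSJM): on top of the general ontology WordNet, a modifier ontology is used jointly to determine the meaning of a noun in the source material, which reduces the semantic confusion caused by polysemy and synonymy and lowers the time and labor cost of building a domain ontology.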
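The idea of letting modifiers decide a noun's meaning can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the sense inventory `SENSES`, the modifier-to-category table `MODIFIER_CATEGORY`, and the function `judge_sense` are hypothetical toy stand-ins, not the WordNet senses, modifier ontology, or rules actually used by MOKSJM.

```python
# Hypothetical toy sense inventory: each polysemous keyword maps to candidate
# senses, and each sense lists the modifier categories it is compatible with.
# (The paper takes senses from WordNet; these entries are invented.)
SENSES = {
    "key": {
        "key.database": {"relational"},
        "key.cryptography": {"security"},
    },
}

# Hypothetical toy modifier ontology: modifier -> category.
MODIFIER_CATEGORY = {
    "primary": "relational",
    "foreign": "relational",
    "private": "security",
    "public": "security",
}


def judge_sense(keyword, modifiers):
    """Return the sense supported by the most modifiers, or None.

    A modifier supports a sense when its category is among the categories
    that the sense is compatible with. None means no modifier gave evidence.
    """
    best_sense, best_hits = None, 0
    for sense, categories in SENSES.get(keyword, {}).items():
        hits = sum(1 for m in modifiers
                   if MODIFIER_CATEGORY.get(m) in categories)
        if hits > best_hits:
            best_sense, best_hits = sense, hits
    return best_sense


if __name__ == "__main__":
    print(judge_sense("key", ["primary"]))   # modifier points to the database sense
    print(judge_sense("key", ["private"]))   # modifier points to the cryptography sense
```

Because only modifiers need to be catalogued, the hand-built table stays small compared with a full domain ontology, which is the cost argument made above.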
3 Conclusion
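Domain ontologies have become an important means of research in dynamic requirements traceability, but current methods for constructing them cannot be automated, and the quality and scale of the resulting ontologies are limited to some extent.
This paper proposes a Modifier Ontology-based Keyword Semantic Judgment Method (MOKSJM). Building on the general ontology WordNet, the method uses modifier categories and the semantic distance between modifiers to select a keyword's meaning, and reflects that selection by adjusting the keyword's similarity score, thereby eliminating semantic divergence. Experiments demonstrate the effectiveness of the method.
Future work will focus on combining sentence structure with modifiers and applying shallow semantic analysis to capture the central meaning of a sentence at the sentence level, so as to improve the accuracy of the recommended trace links.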
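One simple way to let a judged sense show up in a similarity score is sketched below in Python. It is an assumption-laden illustration, not the adjustment formula of MOKSJM, which is not given in this excerpt: keywords are tagged with their judged sense before a plain VSM cosine similarity is computed, so the same word with different senses no longer counts as a match. The helper names `cosine` and `tag`, the `#` tagging convention, and the sample tokens are all hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def tag(tokens, senses):
    """Build a term-frequency vector in which each keyword listed in
    `senses` is suffixed with its judged sense, so that different senses
    of the same word occupy different vector dimensions."""
    return Counter(f"{t}#{senses[t]}" if t in senses else t for t in tokens)


if __name__ == "__main__":
    requirement = ["user", "enters", "key"]   # "key" judged as cryptography
    design = ["table", "stores", "key"]       # "key" judged as database

    plain = cosine(Counter(requirement), Counter(design))
    adjusted = cosine(tag(requirement, {"key": "cryptography"}),
                      tag(design, {"key": "database"}))
    print(plain, adjusted)
```

In this sketch the untagged vectors share the term `key` and therefore receive a non-zero score, while the sense-tagged vectors share nothing, so a spurious candidate link caused by polysemy is suppressed.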
References
[1] CLELAND-HUANG J, SETTIMI R, DUAN C, et al. Utilizing supporting evidence to improve dynamic requirements traceability[C]// Proceedings of the 13th IEEE International Conference on Requirements Engineering. Piscataway, NJ: IEEE, 2005: 135-144.
[2] CHEN Z, HE Z, LIU X, et al. An exploration of semantic relations in neural word embeddings using extrinsic knowledge[C]// Proceedings of the 2017 IEEE International Conference on Bioinformatics and Biomedicine. Washington, DC: IEEE Computer Society, 2017: 1246-1251.
[3] KOLHE S R, SAWARKAR S D. A concept driven document clustering using WordNet[C]// Proceedings of the 2017 International Conference on Nascent Technologies in Engineering. Piscataway, NJ: IEEE, 2017: 1-5.
[4] BESBES G, BAAZAOUI-ZGHAL H. Fuzzy ontology-based medical information retrieval[C]// Proceedings of the 2016 IEEE International Conference on Fuzzy Systems. Piscataway, NJ: IEEE, 2016: 178-185.
[5] MATEI L S, MATU S T. Document semantic distance based on the time series model[C]// Proceedings of the 2016 15th RoEduNet Conference: Networking in Education and Research. Piscataway, NJ: IEEE, 2016: 1-4.
[6] KULATHUNGA C, KARUNARATNE D D. An ontology-based and domain specific clustering methodology for financial documents[C]// Proceedings of the 17th International Conference on Advances in ICT for Emerging Regions. Piscataway, NJ: IEEE, 2018: 1-8.
[7] MAI F J, HUANG L, TAN J, et al. The research of semantic kernel in SVM for Chinese text classification[C]// Proceedings of the 2nd International Conference on Intelligent Information Processing. New York: ACM, 2017: Article No. 8.
[8] GONG H, DU J P, LAI J C, et al. Microblog query expansion algorithm based on ontology and local query feedback[J]. Journal of Nanjing University (Natural Sciences), 2017, 53(6): 1004-1011. (in Chinese)
[9] CUNNINGHAM H, MAYNARD D, BONTCHEVA K, et al. Developing language processing components with GATE version 7 (a user guide)[EB/OL].[2018-03-20]. http://gate.ac.uk/sale/tao/tao.pdf.
[10] LI Y, LI J, LI M S. Research on dynamic requirement traceability method and traces precision[J]. Journal of Software, 2009, 20(2): 177-192. (in Chinese)
[11] Stanford University. The Stanford parser: a statistical parser[CP/OL]. [2018-07-21]. https://nlp.stanford.edu/software/lexparser.shtml.
[12] XU J, ZHANG Z X. A term similarity algorithm based on word soft matching and weight difference of modifying words[J]. Journal of the China Society for Scientific and Technical Information, 2011, 30(11): 1145-1151. (in Chinese)
[13] LI Y, CLELAND-HUANG J. Ontology-based trace retrieval[C]// Proceedings of the 2013 7th International Workshop on Traceability in Emerging Forms of Software Engineering. Washington, DC: IEEE Computer Society, 2013: 30-36.
[14] SALTON G, WONG A, YANG C S. A vector space model for automatic indexing[J]. Communications of the ACM, 1975, 18(11): 613-620.
[15] MANNING C D, RAGHAVAN P, SCHUTZE H. Introduction to Information Retrieval[M]. Cambridge: Cambridge University Press, 2008: 142-145.