APP下载

融合萤火虫方法的多标签懒惰学习算法

2019-08-01程玉胜钱坤王一宾赵大卫

计算机应用 2019年5期

程玉胜 钱坤 王一宾 赵大卫

摘 要:已有的多标签懒惰学习算法(IMLLA)在利用近邻标签时因仅考虑了近邻标签相关性信息,而忽略相似度的影响,这可能会使算法的鲁棒性有所降低。针对这个问题,引入萤火虫方法,将相似度信息与标签信息相结合,提出一种融合萤火虫方法的多标签懒惰学习算法(FFMLLA)。首先,利用Minkowski距离来度量样本间相似度,从而找到近邻点;然后,结合标签近邻点和萤火虫方法对标签计数向量进行改进;最后,使用奇异值分解(SVD)与核极限学习机(ELM)进行线性分类。该算法同时考虑了标签信息与相似度信息从而提高了鲁棒性。实验结果表明,所提算法较其他的多标签学习算法有一定优势,并使用统计假设检验与稳定性分析进一步说明所提出算法的合理性与有效性。

关键词:多标签学习;萤火虫方法;标签相关性;多标签懒惰学习算法;极限学习机

中图分类号:TP181

文献标志码:A

Abstract: The existing Improved Multilabel Lazy Learning Approach (IMLLA) has the problem that the influence of similarity information is ignored with only the neighbor label correlation information considered when the neighbor labels were used, which may reduce the robustness of the approach. To solve this problem, with firefly method introduced and the combination of similarity information with label information, a Multilabel Lazy Learning Approach based on FireFly method (FFMLLA) was proposed. Firstly, Minkowski distance was used to measure the similarity between samples to find the neighbor point. Secondly, the label count vector was improved by combining the neighbor point and firefly method. Finally, Singular Value Decomposition (SVD) and kernel Extreme Learning Machine (ELM) were used to realize linear classification. The robustness of the approach was improved due to considering both label information and similarity information. The experimental results demonstrate that the proposed approach improves the classification performance to a great extent compared to other multilabel learning approaches. And the statistical hypothesis testing and stability analysis are used to further illustrate the rationality and effectiveness of the proposed approach.

英文關键词Key words: multilabel learning; firefly method; label correlation; Improved Multilabel Lazy Learning Approach (IMLLA); Extreme Learning Machine (ELM)

0 引言

多标签学习[1]是一种应用非常广泛的学习范式,是机器学习研究的重要热点之一。传统的单标签学习,每个对象只与单个标签相关联;然而,真实世界中的对象往往具有多义性,比如一篇文章可能属于军事、体育、运动等多个主题[2]。

多标签学习作为处理具有丰富语义真实世界对象的学习框架之一,且其研究成果已经广泛应用到文本分类[3]、基因工程[4]、图像识别[5-6]、Web数据挖掘[7]和视频自动标注[8]等多个领域。对此许多学者提出了针对多标签分类的学习算法,例如BR(Binary Relevance)算法、LP(Label Power)算法[9]等,它们通过增加分类器个数或者标签的种类来解决多标签问题,但在一定程度上影响了分类器效率。经典的MLKNN(MultiLabel K Nearest Neighbors)算法[10]利用最大化后验概率(Maximum A Posteriori)来解决多标签学习预测问题,虽提升了分类器的性能,却增加了其计算的复杂度。

此外,现实世界中各样本所含标签并不相互独立,存在相关关系,然而目前绝大多数多标签学习算法未充分考虑其标签相关性。因此,充分利用标签之间的相关性信息,对构建强泛化性能的多标签分类学习算法具有重要意义[11]。

而针对标签间的相关性,许多学者提出了相关算法,取得了不错的效果。例如,RankSVM(Ranking Support Vector Machine)算法[12]采用最大间隔策略以适应多标签学习,采用类似BR策略构建SVM(Support Vector Machine)多标签分类器,但其时间消耗较大。由于极限学习机(Extreme Learning Machine, ELM)[13]训练速度快,MLRKELM(MultiLabel algorithm of Regression Kernel Extreme Learning Machine)算法[14]使用回归模式的核ELM,缩短了算法的运行时间。MLASRKELM(MLRKELM with Association Rules)算法[14]在MLRKELM算法的基础上引入了关联规则,保留了标签之间的信息。针对标签之间的相关性,张敏灵[15]在MLKNN算法基础上提出一种新型的多标签懒惰学习算法(Improved Multilabel Lazy Learning Approach, IMLLA)。IMLLA利用近邻的标签信息构建一个标记计数向量来进行分类, 此算法在构建标签计数向量时使用了近邻标签信息,认为近邻的标签具有相同的重要性。然而,近邻与样本间的相似度越大,此近邻的标签越重要, IMLLA因未考虑近邻相似度信息所以其泛化性有所降低。

在上述研究成果上,对于样本分布问题,本文在IMLLA的基础上引入萤火虫方法[16-17]。萤火虫方法作为模仿自然界中萤火虫发光行为而构造出的元启发式算法,具有操作简单、易于并行处理、鲁棒性强等特点。故利用萤火虫方法将近邻的标签信息与近邻的相似度信息相融合,以提高算法的鲁棒性,而提出一种融合萤火虫方法的多标签懒惰学习算法(Multilabel Lazy Learning Approach based on FireFly method, FFMLLA)。本文通过萤火虫方法根据相似度来计算样本与近邻间的吸引度,吸引度越大则该近邻的标签越重要。然后将吸引度作为权重与标签信息相结合,对IMLLA中的标签计数向量进行重构。由于Huang等提出的极限学习机算法[13]具有训练速度快、泛化能力强等优点,所以在使用线性分类器进行分类时,引入ELM进行权重求解。此外,还使用了奇异值分解(Singular Value Decomposition, SVD)求解权重。为了验证本文算法的有效性,本文将FFMLLA与标准IMLLA,以及其他经典的多标签算法在多个公开数据集上进行实验对比。实验结果表明,本文算法较其他对比算法具有一定优势。

1 理论介绍

1.1 多标签学习

多标签学习是针对现实生活中普遍存在的多义性对象而提出的一种学习框架。在这个框架之下,样本由多个特征和多个标签构成,学习的目标是将未知的实例对应更多正确的标签。在单实例多标签学习中,假设X={x1,x2,…,xn}T∈Rn*d表示有n個样本且每个样本的特征数为d,Y={1,2,…,Q}表示可能的概念构成的集合。T={(x1,Y1),(x2,Y2),…,(xm,Ym)}(xi∈X,Yi∈Y)表示训练集,多标签学习的目标就是得到映射关系f:X→{-1,1}Q,并对标签未知而特征已知的样本进行标签预测。

1.2 IMLLA

MLKNN是一种经典的多标签分类算法,它先获取近邻样本的标签信息,再通过“最大化后验概率”的方式推理未见实例的标签集合, 但它未充分考察标签之间的相关性。基于此问题,张敏灵提出一种新型的多标签懒惰学习算法IMLLA。IMLLA首先将测试样本在训练集中找出k个近邻及其k个近邻的标记信息,然后根据k个近邻的标记信息生成各标签计数向量,并提交给已训练的分类器进行标签预测。

4 结语

本文针对基于k近邻的多标签相关性算法未考虑样本分布问题进行了研究,运用萤火虫方法的思想将相似度信息与近邻标签信息进行融合。萤火虫方法是源于模拟自然界萤火虫在晚上的群聚活动的自然现象而提出的,其计算效率高、鲁棒性强,能够很好地将相似度信息与标签信息融合。本文算法在重构标签计数向量后分别使用了奇异值分解与核极限学习机进行权重求解,再进行线性分类。实验结果表明了本文提出的FFMLLA算法具有不错的效果和较好的稳定性。

虽然将相似度信息与近邻标签信息相结合的方法一定程度上提升了模型的分类精度,但与预期效果之间还存在一定差距,因此如何从近邻空间提取出比相似度信息更为有效的信息来辅助分类器进行分类是今后研究的重点。

参考文献 (References)

[1]     GIBAJA E, VENTURA S. A tutorial on multilabel learning[J]. ACM Computing Surveys, 2015,47(3):1-38.

[2]     何志芬, 杨明, 刘会东. 多标记分类和标记相关性的联合学习[J]. 软件学报, 2014, 25(9):1967-1981. (HE Z F, YANG M, LIU H D. Joint learning of multilabel classification and label correlations[J]. Journal of Software, 2014, 25(9):1967-1981.)

[3]     LIU J, CHANG W, WU Y, et al. Deep learning for extreme multilabel text classification[C]// Proceedings of the 40th International ACM SIGIR Conference on Research & Development in Information Retrieval. New York: ACM, 2017:115-124.

[4]     KORDMAHALLEH M M, HOMAIFAR A, DUKKA B K C. Hierarchical multilabel gene function prediction using adaptive mutation in crowding niching[C]// Proceedings of the 13th IEEE International Conference on BioInformatics and BioEngineering. Piscataway, NJ: IEEE, 2013:1-6.

[5]     ZHU X, LI X, ZHANG S. Blockrow sparse multiview multilabel learning for image classification[J]. IEEE Transactions on Cybernetics, 2016, 46(2):450.

[6]     WANG Z, CHEN T, LI G, et al. Multilabel image recognition by recurrently discovering attentional regions[C]// Proceedings of the 2017 IEEE International Conference on Computer Vision. Washington, DC: IEEE Computer Society, 2017:464-472.

[7]     OZONAT K M, YOUNG D E. Towards a universal marketplace over the Web: statistical multilabel classification of service provider forms with simulated annealing[C]// Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM, 2009:1295-1304.

[8]     HOU S, ZHOU S, CHEN L, et al. Multilabel learning with label relevance in advertising video[J]. Neurocomputing, 2016, 171(C):932-948.

[9]     BOUTELL M R, LUO J, SHEN X, et al. Learning multilabel scene classification [J]. Pattern Recognition, 2004, 37(9):1757-1771.

[10]    ZHANG M, ZHOU Z. MLKNN: a lazy learning approach to multilabel learning[J]. Pattern Recognition, 2007, 40(7):2038-2048.

[11]    LEE J, KIM H, KIM N R, et al. An approach for multilabel classification by directed acyclic graph with label correlation maximization[J]. Information Sciences, 2016, 351(C):101-114.

[12]    ELISSEEFF A E, WESTON J. A kernel method for multilabelled classification[C]// Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic. Cambridge, MA: MIT Press, 2002: 681-687.

[13]    HUANG G, ZHU Q, SIEW C K. Extreme learning machine: theory and applications[J]. Neurocomputing, 2006, 70(1/2/3):489-501.

[14]    王一賓, 程玉胜, 何月,等. 回归核极限学习机的多标记学习算法[J]. 模式识别与人工智能, 2018, 31(5):419-430. (WANG Y B, CHENG Y S, HE Y, et al. Multilabel learning algorithm of regression kernel extreme learning machine[J]. Pattern Recognition and Artificial Intelligence, 2018, 31(5): 419-430.)

[15]    张敏灵. 一种新型多标记懒惰学习算法[J]. 计算机研究与发展, 2012, 49(11):2271-2282. (ZHANG M L. An improved multilabel lazy learning approach[J]. Journal of Computer Research and Development, 2012, 49(11):2271-2282.)

[16]    YANG X, HE X. Firefly algorithm: recent advances and applications[J]. International Journal of Swarm Intelligence, 2013, 1(1):36-50.

[17]    HIDALGOPANIAGUA A, MIDUEL A V, JOAQUIN F, et al. Solving the multiobjective path planning problem in mobile robotics with a fireflybased approach[J]. Soft Computing, 2017, 21(4):1-16.

[18]    LEI Y, ZHAO D, CAI H B. Prediction of lengthofday using extreme learning machine[J]. Geodesy and Geodynamics, 2015, 6(2):151-159.

[19]    WANG Z, XIN J, TIAN S, et al. Distributed and weighted extreme learning machine for imbalanced big data learning[J]. Tsinghua Science and Technology, 2017, 22(2):160-173.

[20]    LUO F F, GUO W Z, YU Y L, et al. A multilabel classification algorithm based on kernel extreme learning machine[J]. Neurocomputing, 2017, 260: 313-320.

[21]    楊明极, 马池, 王娅, 等. 一种改进Kmeans 聚类的FCMM 算法[J/OL]. 计算机应用研究, 2019, 36(7)[2018-04-12]. http://www.arocmag.com/article/02201907006.html.(YANG M J, MA C, WANG Y, et al. Algorithm named FCMM to improve Kmeans clustering algorithm[J/OL].Application Research of Computers, 2019, 36(7)[2018-04-12]. http://www.arocmag.com/article/02201907006.html.)

[22]    WANG H, WANG W, ZHOU X, et al. Firefly algorithm with neighborhood attraction[J]. Information Sciences, 2017, 382/383:374-387.

[23]    程美英, 倪志伟, 朱旭辉. 萤火虫优化算法理论研究综述[J]. 计算机科学, 2015, 42(4):19-24.(CHENG M Y, NI Z W, ZHU X H. Overview on glowworm swarm optimization or firefly algorithm[J]. Computer Science, 2015, 42(4):19-24.)

[24]    ZHANG M L, ZHOU Z H. A Review on multilabel learning algorithms[J]. IEEE Transactions on Knowledge & Data Engineering, 2014, 26(8):1819-1837.

[25]    DEMSAR J. Statistical comparisons of classifiers over multiple data sets[J]. Journal of Machine Learning Research, 2006, 7(1):1-30.

[26]    ZHANG M, WU L. Lift: Multilabel learning with labelspecific features[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 37(1): 107-120.

[27]    LIN Y, LI Y, WANG C, et al. Attribute reduction for multilabel learning with fuzzy rough set[J]. KnowledgeBased Systems, 2018,152:51-56.