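Partial label learning algorithm for imbalanced data based on logistic regression

2017-04-07 ZHOU Yu, GU Hong

Journal of Dalian University of Technology, 2017, No. 2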

( Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Dalian 116024, China )

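Abstract: Partial label learning is a machine learning framework proposed in recent years, and existing logistic regression-based partial label learning algorithms cannot yet handle imbalanced data. A logistic regression partial label learning algorithm that can handle data imbalance is established. The basic idea is to define a new likelihood function in the multinomial logistic regression model so that imbalanced data can be handled. The algorithm first defines a new likelihood function according to the proportion of samples of each class in the training set, and then derives a smooth, solvable logistic regression partial label learning model by means of approximation and differentiation. Simulation experiments on UCI data sets and real-world data sets show that the proposed algorithm improves the average classification accuracy when the data are imbalanced.

Key words: partial label learning; data imbalance; logistic regression; damped Newton method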

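0 Introduction

Partial label learning is a machine learning framework proposed in recent years, and researchers in China and abroad have already obtained a number of results on it. The earliest work is Grandvalet's extension of the logistic regression model[1], which yielded a partial label learning algorithm; Jin et al.[2] subsequently formulated partial label learning as a new machine learning framework. The new framework prompted many researchers to study partial label learning, and methods such as k-nearest neighbors[3], maximum margin[4], and linear support vector machines[5-6] have all been applied to it. These methods adapt traditional classification models to partial label learning by defining new loss functions. In many practical applications, however, the numbers of samples in different classes are extremely imbalanced; for example, in protein subcellular localization prediction[7], the sample counts of two classes in the data set differ by nearly a factor of one hundred. This class imbalance (also called data imbalance) strongly affects the performance of a learning algorithm: it usually biases the decision boundary toward the minority class, which sharply reduces prediction accuracy, and the accuracy on minority-class samples in particular is far lower than that on majority-class samples[8]. None of the existing partial label learning algorithms takes data imbalance into account. Handling data imbalance is therefore a key problem to solve in making partial label learning more practical. This paper establishes a logistic regression partial label learning algorithm with the aim of improving the average classification accuracy on imbalanced data.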

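1 Logistic regression partial label learning model

1.1 Model formulation

Partial label learning is defined as follows:

Let $X$ be the feature space of the samples and $Y=\{1,2,\dots,l\}$ the set of class labels. Given the training set $D=\{(x_1,Y_1),(x_2,Y_2),\dots,(x_n,Y_n)\}$, where $x_i\in X$ is the feature vector of a sample and $Y_i\equiv\{y_{i1},y_{i2},\dots,y_{in_i}\}\subset Y$ is a set that contains the true label of $x_i$, the task is to determine a function $f:X\to Y$ that correctly outputs the class label of a new (to-be-predicted) sample $x^*\in X$.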
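As a minimal sketch of how such a training set might be held in memory, the Python snippet below encodes the candidate label sets $Y_i$ as an $n\times l$ table of Booleans. The function name `candidate_mask` and the table encoding are illustrative assumptions and do not come from the paper.

```python
def candidate_mask(candidate_sets, num_classes):
    """Encode candidate label sets Y_i (labels in 1..l) as an n x l Boolean table.

    Entry [i][k] is True when label k+1 belongs to the candidate set of sample i.
    """
    mask = []
    for labels in candidate_sets:
        if not labels:
            # By definition every Y_i contains the true label, so it cannot be empty.
            raise ValueError("each candidate set must contain at least one label")
        row = [False] * num_classes
        for y in labels:
            if not 1 <= y <= num_classes:
                raise ValueError(f"label {y} is outside 1..{num_classes}")
            row[y - 1] = True
        mask.append(row)
    return mask


# Two samples, three classes: Y_1 = {1, 3}, Y_2 = {2}
example_mask = candidate_mask([{1, 3}, {2}], num_classes=3)
```

A row with a single True entry corresponds to an ordinary, unambiguously labeled sample, so standard supervised data is a special case of this encoding.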

(1)

(2)

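Because the max(·) function is not differentiable, the aggregate function is used to approximate the maximum-value likelihood function. As $p\to+\infty$, we have

Then

(3)

As $p\to\infty$,

For $s\neq t$ with $s\in Y_i$,

For $s=t$ with $s,t\in Y_i$,

Then, as $p\to\infty$,

The first and second derivatives of $Z(W)$ with respect to $W$ can then be written in matrix form: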

(4)
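The aggregate-function approximation of max(·) used in this subsection can be checked numerically. In its common generic form, the aggregate function of numbers $a_1,\dots,a_m$ is $g_p(a)=\frac{1}{p}\ln\sum_{j}\exp(p\,a_j)$, which satisfies $\max_j a_j\le g_p(a)\le\max_j a_j+\frac{\ln m}{p}$ and therefore tends to the maximum as $p\to+\infty$. The sketch below implements this generic form only; the function names are illustrative assumptions, and the inputs are arbitrary numbers rather than the specific likelihood terms of the paper.

```python
import math


def aggregate_max(values, p):
    """Smooth approximation of max(values): (1/p) * ln(sum(exp(p * a_j)))."""
    m = max(values)
    # Subtract the maximum before exponentiating so that exp() cannot overflow.
    total = sum(math.exp(p * (v - m)) for v in values)
    return m + math.log(total) / p


def aggregate_weights(values, p):
    """Partial derivatives of aggregate_max with respect to each a_j.

    They are non-negative, sum to 1, and concentrate on the largest
    entry as p grows.
    """
    m = max(values)
    exps = [math.exp(p * (v - m)) for v in values]
    s = sum(exps)
    return [e / s for e in exps]


values = [0.2, 0.5, 0.1]
approximations = {p: aggregate_max(values, p) for p in (1, 10, 100, 1000)}
```

Because the approximation is smooth for every finite $p$, its first and second derivatives exist everywhere, which is what allows a Newton-type method to be applied.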

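1.2 Model solution

The model is solved with the damped Newton method, whose iteration formula is

$$W_{k+1}=W_k-\lambda_k\left(\nabla^2 Z(W_k)\right)^{-1}\nabla Z(W_k)$$

where $\lambda_k$ is the step size of the $k$-th iteration.

Fig. 1 Solving W by the damped Newton method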
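The iteration above can be sketched in a few lines of Python. This is a generic damped Newton minimizer under stated assumptions, not the authors' implementation: the step size $\lambda_k$ is chosen here by Armijo backtracking (the paper's own step-size rule is not given in the text), the helper names `solve` and `damped_newton` are hypothetical, and a simple convex toy function stands in for $Z(W)$.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [[float(v) for v in A[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-14:
            raise ValueError("Hessian is numerically singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def damped_newton(f, grad, hess, w0, tol=1e-8, max_iter=100, beta=0.5, c=1e-4):
    """Minimize f via w_{k+1} = w_k - lam_k * H(w_k)^{-1} g(w_k)."""
    w = [float(v) for v in w0]
    for _ in range(max_iter):
        g = grad(w)
        if math.sqrt(sum(v * v for v in g)) < tol:
            break
        d = solve(hess(w), g)  # Newton direction H^{-1} g
        slope = sum(gi * di for gi, di in zip(g, d))
        lam, fw = 1.0, f(w)
        # Assumed rule: shrink lam until the Armijo decrease condition holds.
        while lam > 1e-12:
            trial = [wi - lam * di for wi, di in zip(w, d)]
            if f(trial) <= fw - c * lam * slope:
                break
            lam *= beta
        w = [wi - lam * di for wi, di in zip(w, d)]
    return w


# Toy objective (stand-in for Z): sum(exp(w_i) - w_i), minimized at w = 0.
def toy_f(w):
    return sum(math.exp(v) - v for v in w)


def toy_grad(w):
    return [math.exp(v) - 1.0 for v in w]


def toy_hess(w):
    n = len(w)
    return [[math.exp(w[i]) if i == j else 0.0 for j in range(n)] for i in range(n)]


w_star = damped_newton(toy_f, toy_grad, toy_hess, [2.0, -3.0])
```

From the starting point `[2.0, -3.0]` the full Newton step overshoots, so the damping factor is reduced on the first iteration; afterwards full steps are accepted and the iterates converge to the minimizer at the origin.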

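2 Numerical experiments

Tab. 1 Data sets used for algorithm validation

Tab. 2 Prediction accuracy of the two algorithms on UCI data sets

Tab. 3 Average prediction accuracy of the two algorithms on UCI data sets

Tab. 4 Prediction accuracy of the two algorithms on real-world data sets

3 Conclusion

This paper proposes a logistic regression partial label learning algorithm that can handle data imbalance. Experimental results on the data sets verify the effectiveness of the algorithm and its advantage in handling imbalanced data. Future work is to define new likelihood functions and to apply machine learning algorithms better suited to partial label learning, so that imbalanced partial label learning problems can be handled more effectively.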

[1] GRANDVALET Y. Logistic regression for partial labels [C] // Proceedings of the 9th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems. Annecy: IPMU, 2002:1935-1941.

[2] JIN R, GHAHRAMANI Z. Learning with multiple labels [C] // Advances in Neural Information Processing Systems 15-Proceedings of the 2002 Conference, NIPS 2002. Vancouver: Neural Information Processing Systems Foundation, 2003.

[3] HÜLLERMEIER E, BERINGER J. Learning from ambiguously labeled examples [J]. Intelligent Data Analysis, 2006, 10(5):419-439.

[4] LUO J, ORABONA F. Learning from candidate labeling sets [C] // Advances in Neural Information Processing Systems 23: 24th Annual Conference on Neural Information Processing Systems 2010, NIPS 2010. Red Hook: Curran Associates Inc., 2010:1504-1512.

[5] COUR T, SAPP B, TASKAR B. Learning from partial labels [J]. Journal of Machine Learning Research, 2011, 12:1501-1536.

[6] NGUYEN N, CARUANA R. Classification with partial labels [C] // KDD 2008 - Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: Association for Computing Machinery, 2008:551-559.

[7] HE J, GU H, LIU W. Imbalanced multi-modal multi-label learning for subcellular localization prediction of human proteins with both single and multiple sites [J]. PLoS One, 2012, 7(6):e37155.

[8] LIU X Y, ZHOU Z H. Imbalanced Learning: Foundations, Algorithms, and Applications [M]. Hoboken: Wiley-IEEE Press, 2013:61-82.

[9] HORN R, JOHNSON C. Topics in Matrix Analysis [M]. Cambridge: Cambridge University Press, 1991:239-297.

[10] BACHE K, LICHMAN M. UCI machine learning repository [EB/OL]. (2013-04-04) [2016-08-12]. http://archive.ics.uci.edu/ml.

[11] ZHOU Yu, HE Jianjun, GU Hong, et al. A fast partial label learning algorithm based on max-loss function [J]. Journal of Computer Research and Development, 2016, 53(5):1053-1062. (in Chinese)

Partial label learning algorithm for imbalanced data based on logistic regression

ZHOU Yu, GU Hong*

( Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Dalian 116024, China )

Abstract: Partial label learning is a machine learning framework proposed in recent years, but existing partial label learning algorithms based on logistic regression cannot handle data imbalance. A logistic regression-based partial label learning algorithm for imbalanced data is presented. The basic idea is to define a new likelihood function in the multinomial logistic regression model to deal with imbalanced data. Firstly, a new likelihood function is defined according to the proportion of samples of each class in the training set; then, a smooth, solvable logistic regression-based partial label learning model is derived through approximation and differentiation. Simulation experiments on UCI data sets and real-world data sets show that the proposed algorithm improves the average classification accuracy when the data are imbalanced.

Key words: partial label learning; data imbalance; logistic regression; damped Newton method

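Received: 2016-09-05; Revised: 2016-11-07.

Supported by: National Natural Science Foundation of China (61502074, U1560102).

Authors: ZHOU Yu (1982-), female, doctoral candidate, E-mail: zhouyu829@163.com; GU Hong* (1961-), male, professor, doctoral supervisor, E-mail: guhong@dlut.edu.cn.

Article ID: 1000-8608(2017)02-0184-05

CLC number: TP391

DOI: 10.7511/dllgxb201702011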