Application of an entropy-weighted clustering mining algorithm to the selection of academic competition participants

2019-10-14 JIN Yuanyuan, LI Dan, YANG Ming

Modern Electronics Technique, 2019, Issue 19

Abstract: Existing selection procedures for academic competition participants make poor use of the evaluation data. To address this, a mining algorithm based on entropy-weighted clustering is proposed; it clusters the subject data sets to support a scientific and rational talent-selection mechanism. The data are collected by manual statistics and normalized during preprocessing, and sparse scores are used for feature selection so that non-essential clustering features are filtered out. The entropy-weighted clustering algorithm then mines the optimal allocation of competition team members. A case analysis shows that the entropy-weighted clustering algorithm runs more efficiently than the standard Apriori algorithm, which supports the rationality and effectiveness of the proposed method.

Keywords: cluster analysis; talent assessment; entropy weighting; data mining; normalization preprocessing; data feature selection

CLC number: TN911.1-34; TP309 Document code: A Article ID: 1004-373X(2019)19-0112-03

0 Introduction

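Data mining, an emerging branch of computer science, has gradually spread to every sector of society. It is a technique for finding valuable or correlated information in massive data sets and generally covers three areas: the knowledge discovery process, data mining classification, and data mining applications. Cluster analysis is currently one of the most widely used data mining methods and can be viewed as a process of partitioning a set of data objects. Reference [1] proposes a clustering method suited to mining trajectory patterns and routes. Reference [2] proposes an event process mining framework that correlates, predicts, and clusters dynamic behavior based on event logs. Reference [3] proposes an ad-hoc network optimization scheme based on an entropy-weighted clustering algorithm.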

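As teaching quality in China continues to improve, science competitions have become a platform for universities to demonstrate their teaching strength, and they do much to raise students' professional competence and foster their interest in learning. Because preparation time is short and selecting participants is difficult, deciding how to assign the stronger students in each subject to a competition is a problem of practical significance. Research in this area is nevertheless scarce at present and is limited to Apriori association rule mining, for example reference [4]. This paper therefore proposes a mining algorithm based on entropy-weighted clustering that clusters the subject data sets to provide a scientific and rational talent-selection mechanism, thereby solving the multi-objective problem of allocating competition team members. Experimental results show that both the entropy-weighted clustering algorithm and the Apriori algorithm can be applied effectively to participant selection, but the entropy-weighted clustering algorithm needs less time to reach a solution.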

1 Data preprocessing

1.1 Normalization of the statistical data

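First, the input sample data involved in competition selection are collected by manual statistics and then normalized during preprocessing [5-6], as follows: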
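The normalization formula itself is not reproduced in this text. As a minimal sketch only, assuming the common min-max scheme x' = (x - min) / (max - min) applied to each indicator column (an assumption, not necessarily the formula the authors used), the preprocessing step could look like the following. The function name `min_max_normalize` and the sample records are hypothetical.

```python
from typing import List


def min_max_normalize(rows: List[List[float]]) -> List[List[float]]:
    """Scale every column of `rows` to [0, 1] using min-max normalization.

    Assumed scheme: x' = (x - min) / (max - min), computed per column.
    """
    if not rows:
        return []
    n_cols = len(rows[0])
    lows = [min(r[j] for r in rows) for j in range(n_cols)]
    highs = [max(r[j] for r in rows) for j in range(n_cols)]
    out = []
    for r in rows:
        scaled = []
        for j in range(n_cols):
            span = highs[j] - lows[j]
            # A constant column has no spread; map it to 0.0 to avoid 0/0.
            scaled.append((r[j] - lows[j]) / span if span else 0.0)
        out.append(scaled)
    return out


# Hypothetical raw records: [exam score, interest index, potential index].
students = [
    [60.0, 2.0, 5.0],
    [80.0, 4.0, 5.0],
    [100.0, 3.0, 5.0],
]
normalized = min_max_normalize(students)
```

Scaling each indicator separately keeps a column with a large numeric range, such as an exam score, from dominating the distance computations used later in clustering.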

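It can be seen that the running time of both algorithms increases as the number of students grows. For the same number of students, however, the entropy-weighted clustering mining algorithm runs in less time than the standard Apriori association rule mining algorithm; that is, its mining efficiency is higher.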

4 Conclusion

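This paper proposes a mining algorithm based on entropy-weighted clustering that clusters the subject data sets to provide a scientific and rational talent-selection mechanism and to solve the multi-objective problem of allocating competition team members. A sparse score representation is used to reduce the data dimensionality, and the clustering matrix is computed from three evaluation indicators: academic performance, interest index, and potential index. A case study verifies the effectiveness and efficiency of the proposed algorithm. However, the running efficiency of the algorithm drops considerably as the support threshold is gradually increased during mining; this will be the focus of follow-up work.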
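The weighting formulas behind the three indicators are not included in this text. As a rough illustration, assuming the standard entropy weight method (each indicator's weight is proportional to one minus its normalized Shannon entropy), the weights could be computed as sketched below. This is not necessarily the authors' exact formulation; the function name `entropy_weights` and the sample values are hypothetical.

```python
import math
from typing import List


def entropy_weights(matrix: List[List[float]]) -> List[float]:
    """Return one weight per column via the standard entropy weight method.

    `matrix` holds non-negative, already normalized indicator values,
    one row per student and one column per indicator (assumed layout).
    """
    n = len(matrix)
    if n < 2:
        raise ValueError("need at least two samples")
    m = len(matrix[0])
    divergences = []
    for j in range(m):
        column = [row[j] for row in matrix]
        total = sum(column)
        if total <= 0:
            # An all-zero column cannot discriminate between students.
            divergences.append(0.0)
            continue
        entropy = 0.0
        for value in column:
            if value > 0:  # treat 0 * log(0) as 0
                p = value / total
                entropy -= p * math.log(p)
        entropy /= math.log(n)  # scale the entropy into [0, 1]
        # Clamp to guard against tiny negative values from rounding.
        divergences.append(max(0.0, 1.0 - entropy))
    norm = sum(divergences)
    if norm == 0:
        return [1.0 / m] * m  # no indicator is informative: equal weights
    return [d / norm for d in divergences]


# Hypothetical normalized values:
# [academic performance, interest index, potential index].
scores = [
    [0.2, 0.5, 0.9],
    [0.8, 0.5, 0.9],
    [0.5, 0.5, 1.0],
]
weights = entropy_weights(scores)
```

In this sample the first indicator varies most across students and so receives the largest weight, while the constant second indicator receives a weight close to zero.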

References

[1] HUNG C C, PENG W C, LEE W C. Clustering and aggregating clues of trajectories for mining trajectory patterns and routes [J]. The VLDB journal, 2015, 24(2): 169-192.

[2] LEONI M D, AALST W M P V D, DEES M. A general process mining framework for correlating, predicting and clustering dynamic behavior based on event logs [J]. Information systems, 2016, 56(3): 235-257.

[3] FATHIAN M, JAFARIAN-MOGHADDAM A R. New clustering algorithms for vehicular Ad-hoc network in a highway communication environment [J]. Wireless networks, 2015, 21(8): 2765-2780.

[4] LI Yulan. Improved Apriori algorithm and its application in the selection of informatics students [D]. Quanzhou: Huaqiao University, 2015. (in Chinese)

[5] CASTRO P M. Normalized multiparametric disaggregation: an efficient relaxation for mixed-integer bilinear problems [J]. Journal of global optimization, 2016, 64(4): 765-784.

[6] GLEASON S, RUF C S, CLARIZIA M P, et al. Calibration and unwrapping of the normalized scattering cross section for the cyclone global navigation satellite system [J]. IEEE transactions on geoscience & remote sensing, 2016, 54(5): 2495-2509.

[7] BORNMANN L, HAUNSCHILD R. Normalization of Mendeley reader impact on the reader- and paper-side: a comparison of the mean discipline normalized reader score (MDNRS) with the mean normalized reader score (MNRS) and bare reader counts [J]. Journal of informetrics, 2016, 10(3): 776-788.

[8] ZHANG C, ZHOU S. Renormalized and entropy solutions for nonlinear parabolic equations with variable exponents and L1 data [J]. Journal of differential equations, 2017, 248: 1376-1400.

[9] BORNMANN L, THOR A, MARX W, et al. The application of bibliometrics to research evaluation in the humanities and social sciences: an exploratory study using normalized Google Scholar data for the publications of a research institute [J]. Journal of the association for information science & technology, 2016, 67(11): 2778-2789.

[10] WEI Linjing, NING Lulu, GUO Bin, et al. Sparse-segment feature selection clustering algorithm based on entropy weighting in big data [J]. Application research of computers, 2018, 35(8): 2293-2294. (in Chinese)

[11] YANG M S, NATALIANI Y. A feature-reduction fuzzy clustering algorithm based on feature-weighted entropy [J]. IEEE transactions on fuzzy systems, 2018, 26(2): 817-835.

[12] KAWAMURA T, SEKINE M, MATSUMURA K. Detecting hypernym/hyponym in science and technology thesaurus using entropy-based clustering of word vectors [J]. International journal of semantic computing, 2017, 11(4): 17-24.

[13] LI Min, LI Caixia, WEI Linjing. Four-tree decomposition of single frame image defogging based on entropy weighting [J]. Computer engineering and design, 2017, 38(6): 1575-1579. (in Chinese)

[14] HAFEZALKOTOB Arian, HAFEZALKOTOB Ashkan. Extended MULTIMOORA method based on Shannon entropy weight for materials selection [J]. Journal of industrial engineering international, 2016, 12(1): 1-13.