
A Novel Kernel for Least Squares Support Vector Machine


Defence Technology, 2012, No. 4

FENG Wei (冯伟), ZHAO Yong-ping (赵永平), DU Zhong-hua (杜忠华), LI De-cai (李德才), WANG Li-feng (王立峰)

(1. School of Mechanical Engineering, Nanjing University of Science and Technology, Nanjing 210094, Jiangsu, China; 2. Xi'an Modern Chemistry Research Institute, Xi'an 710065, Shaanxi, China; 3. Heilongjiang North Tool Co., Ltd., Mudanjiang 157013, Heilongjiang, China)

Introduction

Recently, as an emergent machine learning method, the extreme learning machine (ELM) proposed by HUANG, et al in 2004[1-2] has shown good performance in regression applications as well as in large classification problems. Unlike the traditional slow gradient-descent based methods, in which all the parameters are tuned iteratively, ELM with arbitrarily assigned input weights and almost any nonzero activation function boils down to solving for the minimum norm least squares solution of a general system of linear equations. Hence, ELM has an extremely fast convergence speed, which is supported by experiments on real-world benchmark function approximation and classification problems. Additionally, it has been proved that ELM can universally approximate any continuous function on any compact input set and achieve performance comparable with the support vector machine (SVM) in regression and classification[3].

Similar to ELM, the least squares support vector machine (LSSVM)[4-5], a variant of SVM, needs to solve a linear equation set after replacing the hinge loss function of SVM with squared errors and the inequality constraints with equality ones. However, according to the no-free-lunch theorem[6], no algorithm has an advantage over another in every respect. An extensive empirical comparison[7] showed that LSSVM can obtain good performance on various classification problems and function approximations, but it has two obvious limitations. First, the final training problem of LSSVM comes down to solving a set of linear equations. Although this linear equation set is, in principle, solvable, in practice it is intractable for a large data set with classical techniques like Gaussian elimination, because their arithmetic complexity usually scales cubically with the number of training samples, viz. O(N^3) operations, where N is the size of the training set. Second, the solution of LSSVM lacks sparseness[8], which significantly affects its test speed.

As the ELM learning algorithm emerged, the trend of combining ELM with SVM appeared gradually. LIU, et al[9] explicitly constructed a mapping based on ELM to realize the data transformation from the input space to the feature space, which leads to an extremely simple and fast nonlinear SVM algorithm. Frénay, et al[10] merged ELM and SVM by defining a new kernel, viz. an ELM based kernel. This kernel is computed by the first layer of an ELM and used to train SVM. HUANG, et al[3] further studied ELM for classification from the standpoint of the standard optimization method and extended ELM to a specific type of generalized single-hidden-layer feedforward networks (SLFNs), named the support vector network. Recently, a new parameter-insensitive kernel[11] was proposed for nonlinear support vector regression (SVR), and the experiments showed that this approach reduces the computational complexity significantly and yields performance very close to that of the state-of-the-art SVR. Following Ref.[10] and [11], the ELM based kernel (the ELM kernel for short) is explicitly constructed using the first layer of an ELM and extended to LSSVR in this paper, i.e., the ELM kernel based LSSVR, named ELM-LSSVR. To validate the effectiveness and feasibility of the proposed ELM-LSSVR, experiments on four data sets are performed. These results support that ELM-LSSVR can simultaneously reduce the training and the test computational complexity, while keeping accuracy comparable to the existing learning algorithms with commonly used kernels such as the Gaussian kernel.

The least squares support vector regression is briefly introduced in section 1. The brief theory of the extreme learning machine is given in section 2. The ELM kernel is applied to least squares support vector regression and the ELM-LSSVR algorithm is proposed in section 3. The experiments implemented in section 4 show that the proposed ELM-LSSVR has advantages over other algorithms in the training and test time.

1 Brief Description of Least Squares Support Vector Regression

Given a training set {(x_i, d_i)}_{i=1}^{N} with inputs x_i and targets d_i, LSSVR is formulated as the following optimization problem

$$\min_{w,b,e}\ \frac{1}{2}w^{\mathrm T}w+\frac{C}{2}\sum_{i=1}^{N}e_i^2\quad \text{s.t.}\ d_i=w^{\mathrm T}\varphi(x_i)+b+e_i,\ i=1,\dots,N \tag{1}$$

where w controls the model complexity, b is the bias, e = [e_1,…,e_N]^T represents the training errors, C is the regularization parameter adjusting the tradeoff between the model flatness and the fitness, and φ(·) is the feature map realizing the transformation from the finite-dimensional input space to the high-dimensional feature space. Eq.(1) can be solved by constructing the Lagrangian function

$$L(w,b,e;\alpha)=\frac{1}{2}w^{\mathrm T}w+\frac{C}{2}\sum_{i=1}^{N}e_i^2-\sum_{i=1}^{N}\alpha_i\left(w^{\mathrm T}\varphi(x_i)+b+e_i-d_i\right) \tag{2}$$

where α = [α_1,…,α_N]^T is the Lagrangian multiplier vector. From the dual theorem, Eq.(2) can be solved through the following linear system

$$\begin{bmatrix}0 & \mathbf{1}^{\mathrm T}\\ \mathbf{1} & R\end{bmatrix}\begin{bmatrix}b\\ \alpha\end{bmatrix}=\begin{bmatrix}0\\ d\end{bmatrix} \tag{3}$$

where R = K + I/C, I is an identity matrix of appropriate dimension, d = [d_1,…,d_N]^T is the target vector, K_ij = k(x_i, x_j) = φ(x_i)^T φ(x_j), 1 is a column vector of all ones of appropriate dimension, and k(·,·) is the kernel function implicitly computing the inner product of two vectors in the feature space. In theory, Eq.(3) is solvable. However, it is intractable for a large data set with classical techniques, e.g., Gaussian elimination, whose cost scales cubically with the size of the training set. If Eq.(3) is solved successfully, for a new input sample x, its target can be predicted by the following regression machine

$$f(x)=\sum_{i=1}^{N}\alpha_i\,k(x_i,x)+b \tag{4}$$

where α and b are obtained from Eq.(3). Besides the difficulty of solving Eq.(3), there is another bottleneck which prevents LSSVR from gaining widespread application. That is to say, the solution of LSSVR is not sparse, i.e., every training sample is a support vector, which hampers its use in practical applications demanding fast responses. The main reason is that the required cost of LSSVR in the test phase is O(N·#teNum) time as well as O(N·#teNum) space, where #teNum denotes the number of test samples.
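To make the plain LSSVR training procedure concrete, the following minimal sketch (in Python with NumPy; the Gaussian kernel, the function names, and the parameter values are our illustrative assumptions, not part of the paper) solves the linear system of Eq.(3) directly and evaluates the regression machine of Eq.(4). The direct solve is exactly the O(N^3) bottleneck discussed above.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    # k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2))
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2.0 * X1 @ X2.T
    return np.exp(-d2 / (2.0 * sigma**2))

def lssvr_train(X, d, C=10.0, sigma=1.0):
    """Solve Eq.(3): [[0, 1^T], [1, K + I/C]] [b; alpha] = [0; d]."""
    N = X.shape[0]
    R = gaussian_kernel(X, X, sigma) + np.eye(N) / C
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = R
    rhs = np.concatenate(([0.0], d))
    sol = np.linalg.solve(A, rhs)   # direct O(N^3) solve, the bottleneck discussed above
    b, alpha = sol[0], sol[1:]
    return alpha, b

def lssvr_predict(X_test, X_train, alpha, b, sigma=1.0):
    # Eq.(4): f(x) = sum_i alpha_i k(x_i, x) + b
    return gaussian_kernel(X_test, X_train, sigma) @ alpha + b
```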

2 Brief Description of Extreme Learning Machine

In this section, the essence of ELM is briefly depicted. The ELM algorithm was originally proposed by HUANG, et al[1,12-13]. The main concept behind ELM lies in the random initialization of the SLFN weights and biases, which are independent not only of the training samples but also of each other; the input weights and biases do not need to be adjusted during the process of calculating the output weights. Consequently, the ELM network is obtained with very few steps and very low computational cost.

Assume that a training set consisting of N distinct samples (x_i, y_i) (1 ≤ i ≤ N) is given, where x_i ∈ ℝ^{1×n} is the input vector and y_i ∈ ℝ^{1×m} is the desired output vector. The output of an SLFN with L hidden nodes can be represented by

$$\hat y_i=\sum_{j=1}^{L}\beta_j\,g(a_j,b_j,x_i),\quad i=1,\dots,N \tag{5}$$

where a_j and b_j are the randomly generated parameters of the j-th hidden node, β_j, a row vector, is the learning weight connecting the j-th hidden node to the output nodes, and g(a_j, b_j, x) is the output of the j-th hidden node with respect to the input x. The commonly used hidden nodes include additive hidden nodes (e.g., sigmoid), radial basis function nodes, etc. In the case where the SLFN perfectly approximates the data, the error between the estimated output ŷ_i and the desired output y_i is zero, i.e.,

$$\sum_{j=1}^{L}\beta_j\,g(a_j,b_j,x_i)=y_i,\quad i=1,\dots,N \tag{6}$$

Rewrite Eq.(6) in matrix form as

$$H\beta=Y \tag{7}$$

where

$$H=\begin{bmatrix} g(a_1,b_1,x_1) & \cdots & g(a_L,b_L,x_1)\\ \vdots & \ddots & \vdots\\ g(a_1,b_1,x_N) & \cdots & g(a_L,b_L,x_N)\end{bmatrix}_{N\times L},\quad \beta=\begin{bmatrix}\beta_1\\ \vdots\\ \beta_L\end{bmatrix}_{L\times m},\quad Y=\begin{bmatrix}y_1\\ \vdots\\ y_N\end{bmatrix}_{N\times m} \tag{8}$$

Generally, L < N; that is, Eq.(7) is an overdetermined linear equation set. It can easily be seen from Eq.(7) that the training of an SLFN in ELM is simply equivalent to finding a least squares solution of this linear system, which can be analytically determined by the following expression:

$$\hat\beta=H^{\dagger}Y \tag{9}$$

where H† is the Moore-Penrose generalized inverse of the matrix H.
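As a concrete illustration of the ELM training procedure of Eq.(5)-(9), here is a minimal sketch in Python/NumPy with sigmoid additive hidden nodes; the hidden-layer size, the random number generator, and all names are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def elm_train(X, Y, L=100, rng=np.random.default_rng(0)):
    """Train a single-hidden-layer ELM: random (a_j, b_j), then beta = H^+ Y as in Eq.(9)."""
    n = X.shape[1]
    A = rng.standard_normal((n, L))          # random input weights a_j
    b = rng.standard_normal(L)               # random hidden biases b_j
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))   # sigmoid hidden-layer output matrix, N x L
    beta = np.linalg.pinv(H) @ Y             # Moore-Penrose least squares solution of H beta = Y
    return A, b, beta

def elm_predict(X, A, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))
    return H @ beta                          # Eq.(5) evaluated for all inputs at once
```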

3 ELM Kernel Based LSSVR

From Eq.(5), we can regard the hidden layer in the ELM algorithm as a mapping φ from the input space ℝ^n to a feature space ℝ^L, where we need to solve a linear problem instead of the nonlinear one in the input space. This is very similar to the idea behind the use of a kernel. In kernel based algorithms, the data in the input space become easy to classify correctly for classification problems, or to predict precisely in the regression domain, in the feature space spanned by the nonlinear mapping induced by the kernel function. Hence, the hidden layer of an ELM can be thought of as defining some kind of randomized kernel, which is here called the ELM kernel. Considering two data points x_i and x_j and an ELM with L hidden nodes defining a mapping φ from ℝ^n to ℝ^L, the corresponding ELM kernel function is defined as

$$k_{\mathrm{ELM}}(x_i,x_j)=\varphi(x_i)^{\mathrm T}\varphi(x_j)=\sum_{l=1}^{L}g(a_l,b_l,x_i)\,g(a_l,b_l,x_j) \tag{10}$$

where φ(x_i) = [g(a_1,b_1,x_i),…,g(a_L,b_L,x_i)]^T. From Eq.(10), based on the training sample set, the kernel matrix, viz. the Gram matrix, corresponding to this ELM kernel is given as

$$K=HH^{\mathrm T} \tag{11}$$

where H is the hidden-layer output matrix defined in Eq.(8).
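In code, the ELM kernel amounts to an explicit random feature map followed by an inner product, so the Gram matrix of Eq.(11) is obtained without any implicit kernel evaluation. A brief sketch (Python/NumPy, names ours), reusing random parameters A and b generated as in the ELM sketch of section 2:

```python
import numpy as np

def elm_feature_map(X, A, b):
    """Explicit ELM map: row i is phi(x_i)^T = [g(a_1,b_1,x_i), ..., g(a_L,b_L,x_i)] with sigmoid nodes."""
    return 1.0 / (1.0 + np.exp(-(X @ A + b)))

def elm_kernel_matrix(X, A, b):
    """Gram matrix of Eq.(11): K = H H^T, i.e. K_ij = phi(x_i)^T phi(x_j) as in Eq.(10)."""
    H = elm_feature_map(X, A, b)
    return H @ H.T
```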

Firstly, Eq.(3) is rearranged as

$$b=\frac{\mathbf{1}^{\mathrm T}R^{-1}d}{\mathbf{1}^{\mathrm T}R^{-1}\mathbf{1}} \tag{12}$$

and

$$\alpha=R^{-1}\left(d-\mathbf{1}b\right) \tag{13}$$

Defining

$$\eta=R^{-1}d \tag{14}$$

and

$$\nu=R^{-1}\mathbf{1} \tag{15}$$

Eq.(12) and (13) are hence reformulated, respectively, as

$$b=\frac{\mathbf{1}^{\mathrm T}\eta}{\mathbf{1}^{\mathrm T}\nu} \tag{16}$$

and

$$\alpha=\eta-\nu b \tag{17}$$

The crucial key to solving Eq.(12) to (15) is to calculate the inverse of the matrix R. Once this problem is solved, those equations are straightforward. Here, we rewrite the matrix R as

$$R=K+I/C \tag{18}$$

After applying the ELM kernel to LSSVR and plugging the kernel matrix K of Eq.(11) into Eq.(18), it becomes

$$R=HH^{\mathrm T}+I/C \tag{19}$$

To reduce the computational burden of Eq.(19), the inverse of the matrix R is given via the Woodbury formula[14]

$$R^{-1}=\left(HH^{\mathrm T}+I/C\right)^{-1}=C\left[I-H\left(I/C+H^{\mathrm T}H\right)^{-1}H^{\mathrm T}\right] \tag{20}$$

Substituting Eq.(20) into Eq.(14) and (15), the computational cost of calculating η and ν is cut down from O(N^3) to O(N·#FS^2), where #FS denotes the dimension of the ELM feature space, i.e., the number of hidden nodes L. In practice, the inverse of I/C + H^T H does not need to be computed explicitly. The Cholesky factorization technique can be used to compute (I/C + H^T H)^{-1}(H^T d) and (I/C + H^T H)^{-1}(H^T 1), because the matrix I/C + H^T H is positive definite. Especially in the MATLAB environment, the Cholesky factorization is realized with the simple backslash operator instead of the direct use of the inv command, thus saving training time. In sum, after applying the ELM kernel to LSSVR, the training cost is cut down from O(N^3) to O(N·#FS^2). In the testing phase, normal LSSVR with the Gaussian kernel needs a computational burden of O(N·#teNum), whereas the ELM kernel based LSSVR only needs a cost of max{O(N·#FS), O(#teNum·#FS)}. In general, #FS is far less than N and #teNum, so the computational complexity in the testing phase is reduced. As for the memory space, we only need O(N·#FS) instead of O(N^2) in the training phase. Likewise, the testing space is reduced from O(N·#teNum) to max{O(N·#FS), O(#teNum·#FS)}. From the discussion above, it is known that the ELM kernel based LSSVR reduces not only the training burden but also the testing cost without extra training tools or additional pruning mechanisms, apart from the use of the Woodbury formula. The same holds for the memory space, which is the essence of exploring and exploiting the ELM kernel.
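The whole training procedure discussed above can be summarized in a short computational sketch (Python with NumPy/SciPy; the variable names and the use of a Cholesky solve in place of MATLAB's backslash operator are our assumptions). It computes η and ν through the Woodbury identity of Eq.(20) without ever forming the N×N matrix R, then recovers b and α from Eq.(16) and (17), and predicts with a cost of max{O(N·#FS), O(#teNum·#FS)}.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def elm_lssvr_train(H, d, C=10.0):
    """ELM-LSSVR training: R = H H^T + I/C, handled via the Woodbury identity of Eq.(20).

    H : N x L hidden-layer output matrix (the explicit ELM feature map).
    Cost is O(N * L^2) instead of O(N^3)."""
    N, L = H.shape
    ones = np.ones(N)
    # Small L x L positive definite matrix I/C + H^T H, factorized once by Cholesky
    small = np.eye(L) / C + H.T @ H
    cf = cho_factor(small)

    def R_inv_times(v):
        # R^{-1} v = C * (v - H (I/C + H^T H)^{-1} H^T v), from the Woodbury formula
        return C * (v - H @ cho_solve(cf, H.T @ v))

    eta = R_inv_times(d)             # Eq.(14): eta = R^{-1} d
    nu = R_inv_times(ones)           # Eq.(15): nu  = R^{-1} 1
    b = (ones @ eta) / (ones @ nu)   # Eq.(16)
    alpha = eta - b * nu             # Eq.(17)
    return alpha, b

def elm_lssvr_predict(H_test, H_train, alpha, b):
    # f(x) = sum_i alpha_i k_ELM(x_i, x) + b = H_test (H_train^T alpha) + b;
    # grouping H_train^T alpha first keeps the cost at max{O(N*#FS), O(#teNum*#FS)}
    return H_test @ (H_train.T @ alpha) + b
```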

4 Experiments

Four data sets drawn from the classical mathematical functions defined in Tab.1 are used for the experiments. The dependent relations between the input and output data of these functions, viz. their curves, are shown in Fig.1. For each function, 10 000 data points generated uniformly with Gaussian noise (the mean of the noise is zero, and the ratio of the standard deviation of the noise to the mean of the signal is 0.1) are used as training data, and 1 000 noiseless data points drawn from the same function are utilized as testing data. The detailed specifications of these experiments are listed in Tab.2. From Tab.2, it is easily seen that ELM-LSSVR has advantages over the other algorithms in both the training time and the testing time with comparable accuracy. Naturally, the total time of ELM-LSSVR, viz. the training time (trTime) plus the testing time (teTime), is favorable among all the algorithms. MVP-SMO-LSSVR, FG-SMO-LSSVR, CG-LSSVR, and ICG-LSSVR were proposed to alleviate the training burden of LSSVR for medium or large problems. However, their solutions are not sparse, i.e., every training sample is a support vector. Hence, in Tab.2, the number of support vectors (#SV) is equal to N for these four algorithms. According to CHU, et al's viewpoint[16], ICG-LSSVR accelerates CG-LSSVR, whereas the training burden of ICG-LSSVR is still larger than ELM-LSSVR's, which was validated by our experiments. Though MVP-SMO-LSSVR and FG-SMO-LSSVR are both SMO-type algorithms, for most experiments FG-SMO-LSSVR is superior to MVP-SMO-LSSVR in terms of the training time. However, their training costs are more expensive than that needed by ELM-LSSVR. As a pruning method, JIAO, et al[19] proposed FSA-LSSVR to realize a sparse solution. It can be seen from Tab.2 that FSA-LSSVR indeed prunes the solution of LSSVR and shortens the prediction time, but its testing time is still longer than that of ELM-LSSVR. All in all, compared with the other algorithms, ELM-LSSVR needs less computational burden in both the training and testing phases, which is the main purpose of the ELM-LSSVR proposed in this paper.

Fig.1 Curves of the four classical functions

Tab.1 Four defined classical functions
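As an illustration of the data-generation procedure described above, the sampling can be sketched as follows (Python/NumPy). The four functions of Tab.1 are not reproduced here, so sinc is used only as a placeholder, and the noise model reflects our reading of the description in the text (zero-mean Gaussian noise whose standard deviation equals 0.1 times the mean of the clean signal).

```python
import numpy as np

def make_dataset(f, lo, hi, n_train=10_000, n_test=1_000, rng=np.random.default_rng(0)):
    """Sample n_train noisy training points and n_test noiseless test points of f on [lo, hi]."""
    x_tr = rng.uniform(lo, hi, n_train)
    y_clean = f(x_tr)
    noise_std = 0.1 * np.abs(np.mean(y_clean))        # our reading of the noise specification
    y_tr = y_clean + rng.normal(0.0, noise_std, n_train)
    x_te = np.linspace(lo, hi, n_test)                # noiseless test data
    y_te = f(x_te)
    return (x_tr[:, None], y_tr), (x_te[:, None], y_te)

# Example with a placeholder function (the functions of Tab.1 are not reproduced here)
(train_X, train_y), (test_X, test_y) = make_dataset(np.sinc, -5.0, 5.0)
```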

5 Conclusions

In this paper, we apply the ELM kernel to LSSVR, and the ELM-LSSVR algorithm is thus proposed. ELM-LSSVR can obviously mitigate the training burden of normal LSSVR. Meanwhile, ELM-LSSVR can reduce the testing time and enhance the machine response without extra techniques like pruning mechanisms. In contrast to the algorithms that merely accelerate the training of LSSVR, the memory space required by ELM-LSSVR is also reduced. To validate the effectiveness and feasibility of the proposed ELM-LSSVR, experiments were carried out; they show that ELM-LSSVR has advantages over the other algorithms in terms of the training and the testing time.

[1]HUANG G B,ZHU Q Y,Siew C K.Extreme learning machine:a new learning scheme of feedforward neural networks[C]∥Proceedings of 2004 IEEE International Joint Conference on Neural Networks,Budapest,Hungary,2004.

[2]HUANG G B,Siew C K.Extreme learning machine:RBF network case[C]∥8th International Conference on Control,Automation,Robotics and Vision(ICARCV),Kunming,China,2004.

[3]HUANG G B,DING X,ZHOU H.Optimization method based extreme learning machine for classification[J].Neurocomputing,2010,74(1-3):155-163.

[4]Suykens J A K,Vandewalle J.Least squares support vector machine classifiers[J].Neural Processing Letters,1999,9(3):293-300.

[5]Suykens J A K,Van Gestel T,De Brabanter J,et al.Least squares support vector machines[M].Singapore:World Scientific,2002.

[6]Duda R O,Hart P E,Stork D G.Pattern classification[M].UK:John Wiley & Sons Inc,2001.

[7]Van Gestel T,Suykens J A K,Baesens B,et al.Benchmarking least squares support vector machine classifiers[J].Machine Learning,2004,54(1):5-32.

[8]Suykens J A K,De Brabanter J,Lukas L,et al.Weighted least squares support vector machines:robustness and sparse approximation[J].Neurocomputing,2002,48(1-4):85-105.

[9]LIU Q,HE Q,SHI Z.Extreme support vector machine classifier[C]∥12th Pacific-Asia Conference on Knowledge Discovery and Data Mining,Osaka,Japan,2008.

[10]Frénay B,Verleysen M.Using SVMs with randomized feature spaces:an extreme learning approach[C]∥Proceedings of the 18th European Symposium on Artificial Neural Networks,2010.

[11]Frenay B,Verleysen M.Parameter-insensitive kernel in extreme learning for non-linear support vector regression[J].Neurocomputing,2011,74(16):2526-2531.

[12]HUANG G B,ZHU Q Y,Siew C K.Extreme learning machine:theory and applications[J].Neurocomputing,2006,70(1-3):489-501.

[13]HUANG G B,CHEN L,Siew C K.Universal approximation using incremental constructive feedforward networks with random hidden nodes[J].IEEE Transactions on Neural Networks,2006,17(4):879-892.

[14]ZHANG X.Matrix analysis and applications[M].New York:Springer,2004.

[15]Suykens J A K,Lukas L,Vandooren P,et al.Least squares support vector machine classifiers:a large scale algorithm[C]∥Proceedings of European Conference on Circuit Theory and Design,Torino,Italy,1999.

[16]Chu W,Ong C J,Keerthi S S.An improved conjugate gradient scheme to the solution of least squares SVM[J].IEEE Transactions on Neural Networks,2005,16(2):498-501.

[17]Keerthi S S,Shevade S K.SMO algorithm for least squares SVM formulations[J].Neural Computation,2003,15(2):487-507.

[18]BO L,JIAO L,WANG L.Working set selection using functional gain for LS-SVM[J].IEEE Transactions on Neural Networks,2007,18(5):1541-1544.

[19]JIAO L,BO L,WANG L.Fast sparse approximation for least squares support vector machine[J].IEEE Transactions on Neural Networks,2007,18(3):685-697.

[20]ZHAO Y P,DU Z H,ZHANG Z A,et al.A fast method of feature extraction for kernel MSE[J].Neurocomputing,2011,74(10):1654-1663.

[21]AN S,LIU W,Venkatesh S.Fast cross-validation algorithms for least squares support vector machine and kernel ridge regression[J].Pattern Recognition,2007,40(8):2154-2162.

[22]HUANG G B,CHEN L.Convex incremental extreme learning machine[J].Neurocomputing,2007,70(16-18):3056-3062.

[23]HUANG G B,CHEN L.Enhanced random search based incremental extreme learning machine[J].Neurocomputing,2008,71(16-18):3460-3468.

[24]ZHAO Y P,SUN J G,DU Z H,et al.Pruning least objective contribution in KMSE[J].Neurocomputing,2011,74(17):3009-3018.
