
Clustering Analysis of Interval Data Based on Kernel Density Estimation

2021-01-08

LI Mengyao, XIA Liyun, LIU Ye, CHEN Jiaolong

(1. Hunan Provincial Key Laboratory of Intelligent Computing and Language Information Processing, Hunan Normal University, Changsha 410081, China; 2. College of Information Science and Engineering, Hunan Normal University, Changsha 410081, China; 3. Hunan Normal University Journals, Hunan Normal University, Changsha 410081, China)

Abstract: As one of the vital tasks in mining interval data, clustering faces great difficulties in measuring the similarity or distance between objects. Existing traditional clustering methods have been extended to interval data via geometric distances which mainly consider the bounds of the interval data; these methods neglect the information inside the interval data. Therefore, we take the probability distribution of interval values into consideration by using the whole interval data to estimate the probability density function of one cluster. To estimate the probability density function of one cluster, we propose a new kernel density estimation approach, which is a nonparametric estimation for interval data. Then, we define a distance between interval objects via the probability density function given by the new kernel density estimation. Finally, we construct an adaptive clustering method for interval data. Experimental results show that the proposed method is effective and also indicate that it is more reasonable to use the probability distribution of interval values than to only consider the endpoints of intervals.

Key words: clustering; interval data; probability distribution; density estimation

0 Introduction

The clustering problem has been deemed a significant issue in data mining[1], machine learning[2], data streams[3-4] and information retrieval[5]. It tries to group data into clusters so that objects in the same cluster have a high similarity degree and objects belonging to different clusters have a high diversity degree.

Through cluster analysis, we can discover more knowledge from data. Cluster analysis methods can be divided into hierarchical methods[6], partition methods[7], density-based clustering methods[8] and so on. This article focuses on partition methods, which divide the input data into a fixed number of clusters. K-means[9], a representative partition method, assigns samples with high similarity or low distance to the same cluster. Due to its excellent speed and good scalability, the K-means clustering algorithm is considered a renowned clustering method.

Uncertainty is an important issue in data analysis[10-11]. Nowadays, in different application fields, representing uncertain data as interval data is increasingly commonplace[12]. Interval data have three characteristics: randomness, vagueness and imprecision. For the above characteristics, more and more scholars pay attention to interval data. The information hidden in interval data is that each point has a chance to be the true value with a different probability in real life, so we apply kernel density estimation to analyze the probability distribution of interval data. Concerning clustering methods, Souza[13] proposed a clustering approach based on the City-block distance for interval data. Carvalho[14] introduced a dynamic clustering algorithm for interval data. Bock[15] discussed a probabilistic and inferential framework for cluster analysis rather than heuristic algorithms. Jin[16] proposed a method based on a mixed interval slope technique and interval calculation. Based on the Wasserstein distance, Verde[17] presented a clustering technique for interval data. Based on an adaptive Mahalanobis distance, Hajjar[18] proposed a self-organizing map approach to cluster interval data with topology preservation. Carvalho[19] proposed adaptive Hausdorff distances. Mali[20] defined similarity and dissimilarity measures between interval data via their position, span and content and then introduced this distance into a clustering algorithm.

Though there have been some methods for measuring the similarity or distance between intervals, they mainly focus on the endpoints of an interval and ignore the probability distribution over the possible range. As a matter of fact, we can regard an interval as a variable over the interval value range. Each point has a chance to be the true value with a different probability in real life. We intend to estimate the possibility of each point in the whole interval for the respective cluster. Using the probability distribution, we measure the similarity or distance between objects and each cluster. So we extend kernel density estimation to interval data in order to estimate the probability distribution of a cluster and propose a new method to measure the distance between one object and a cluster. Moreover, we construct a clustering approach for interval data based on the proposed distance measure.

This paper is organized as follows. Section 1 reviews the basic clustering methods for interval-valued data and then recalls kernel density estimation on single values. In Section 2, we propose the kernel density estimation on interval data and use it in an adaptive clustering method with a newly defined distance. Section 3 presents experiments to show the effectiveness of the adaptive clustering algorithm. Section 4 concludes the paper.

1 Preliminary knowledge

We review the dynamic clustering of interval data in the first place, and then we introduce the basic knowledge of kernel density estimation.

1.1 Dynamic clustering of interval data

Let Ω be a set of objects. Every object xi is represented by a vector of feature values xi = (xi1, …, xip), where xij = [aij, bij]. Dynamic clustering divides the whole Ω into K clusters {C1, …, CK}, and we obtain the partition P = (C1, …, CK) by optimizing the given clustering criterion.

(1)

where dk(xi, yk) is a criterion to measure the dissimilarity between an object xi ∈ CK and the class prototype yk of CK, in which i denotes the ith element and K represents the Kth cluster. In a word, the representation step and the allocation step are the focus of dynamic clustering algorithms:

(a) Representation step

In this step, we fix the partition P and compute the prototype yk of each cluster CK by minimizing the criterion W(P, Y, d).

(b) Allocation step

In this step, we fix the vector of prototypes Y. The partition P is then obtained by minimizing W(P, Y, d), which assigns each object to its nearest prototype, that is, Ck = {xi ∈ Ω | d(xi, yk) ≤ d(xi, yh), ∀h = 1,…,K}.
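The equation image for (1) is not reproduced in this version of the text. In the dynamic clustering literature, the optimized criterion is usually the total within-cluster dissimilarity, which is the reading assumed here:

W(P, Y, d) = \sum_{k=1}^{K} \sum_{x_i \in C_k} d_k(x_i, y_k).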

1.2 Kernel density estimation on single value

Kernel density estimation, a nonparametric method, can estimate the probability density function of a continuous variable. The Gaussian kernel is widely used because of its continuity and differentiability; in this case, a Gaussian function is used to weight the data points. Kernel density estimation is quite different from parametric estimation: it does not require assuming a specific density model in advance. In this paper, we mainly consider the popular Gaussian kernel. For any xi ∈ Ω, xi denotes an object. Each Gaussian kernel function with bandwidth h takes a sample point xi as its center. The bandwidth controls the level of smoothness. The kernel density estimation is stated as follows:

f̂(x)=(1/(nh))∑K((x-xi)/h), K(u)=(1/√(2π))exp(-u^2/2),

(2)

In the d-dimensional (d ≥ 2) case, the Gaussian kernel function with bandwidths hj (1 ≤ j ≤ d) can be written as follows:

(3)
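As a concrete illustration of the standard estimator described above, the following Python sketch evaluates a univariate Gaussian kernel density estimate at a query point. The function name gaussian_kde_1d and the example data are illustrative choices, not part of the original paper.

import numpy as np

def gaussian_kde_1d(x, samples, h):
    # f_hat(x) = (1/(n*h)) * sum_i K((x - x_i)/h) with a Gaussian kernel K
    samples = np.asarray(samples, dtype=float)
    u = (x - samples) / h                               # scaled distance to every sample point
    k = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)    # Gaussian kernel values
    return k.sum() / (samples.size * h)

# Density of a small sample evaluated at x = 1.0 with bandwidth h = 0.1
print(gaussian_kde_1d(1.0, [0.8, 0.9, 1.1, 1.3], h=0.1))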

2 Clustering interval data based on the probability distribution

In this section, we propose the kernel density estimation on interval values in order to provide a basis for finding the density function which represents a cluster instead of the prototype yk in the basic dynamic clustering methods. Then we define the similarity between a cluster and an object. In the end, an adaptive clustering algorithm is constructed.

2.1 Kernel density estimation on interval value

Kernel density estimation is popular since it is a nonparametric method that does not assume a probability density model in advance. It uses a kernel function to weight the data points. Enlightened by this idea, we propose an adaptive strategy to fit interval values.

A set of objects Ω is made up of objects xi, each represented by a feature value xi = [ai, bi]. In the one-dimensional case, we can define the kernel density estimation on interval values as:

(4)

where n denotes the number of objects and y is a point in the interval [ai, bi]. We can estimate the density function for the given feature in this way.

Example 1. To better understand the proposed kernel density estimation method for interval values, we analyze the following example in detail.

(5)

So, according to the kernel density estimation, we can get the density function of the kth cluster for feature j.

(6)

where n denotes the number of objects belonging to the kth cluster, y is a point in the interval [aij, bij], and j denotes the jth attribute. We can estimate the density function of the given feature for each cluster in this way.
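The equation images for (4)-(6) are not reproduced in this version, so the Python sketch below shows one plausible reading of the estimator: each interval [ai, bi] contributes the Gaussian kernel smoothed uniformly over the interval, and the contributions of all intervals in a cluster are averaged. The closed form uses the Gaussian CDF; the function name and this exact weighting are assumptions for illustration, not the authors' code.

import numpy as np
from scipy.stats import norm

def interval_kde(y, intervals, h):
    # Assumed form: f_hat(y) = (1/n) * sum_i (1/(b_i - a_i)) * integral_{a_i}^{b_i} K_h(y - t) dt,
    # whose closed form is (Phi((y - a_i)/h) - Phi((y - b_i)/h)) / (b_i - a_i).
    total = 0.0
    for a, b in intervals:
        total += (norm.cdf((y - a) / h) - norm.cdf((y - b) / h)) / (b - a)
    return total / len(intervals)

# Density of a toy cluster for one feature, evaluated at y = 5.0 with h = 0.1
print(interval_kde(5.0, [(4.8, 5.2), (4.9, 5.5), (5.1, 5.6)], h=0.1))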

2.2 Distance between a cluster and an interval

We can get the density function of a cluster for a feature to represent the cluster instead of the prototype yk in the basic dynamic clustering methods. In the next step, we consider how to measure the distance between a cluster and a sample. Using the estimated probability density function, we know how likely each point in the value range is to occur in each cluster. For an interval, we do not know what the true value is, but by integrating the probability density function of a cluster over the interval, we obtain the probability of the interval appearing in the cluster. The greater the probability of the interval appearing in the cluster, the greater the probability that the interval belongs to this cluster, and the greater the similarity between the interval and this cluster. The higher the similarity between the interval and the cluster, the smaller the distance between them.

A set of objects Ω is made up of xi, which is represented by a vector of feature values xi = (xi1, …, xip), where xij = [aij, bij]. Dynamic clustering divides the whole Ω into K clusters {C1, …, CK}. We define the distance between a cluster and an object as:

(7)

Through the above distance formula, we can get the dissimilarity between a cluster and an object. The adequacy criterion we used is defined as follows:

W(P)=max{dk(CK,xi),CK∈P},

(8)

where dk(CK, xi) is an adaptive distance between an object xi ∈ CK and the cluster CK.

Fig.1 Estimated probability density distribution of Example 1

Fig.2 Schematic diagram of Example 2

We can find the fitting k by minimizing the distance between an object xi and the cluster CK, and assign the object to the most reasonable cluster.
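The equation image for (7) is likewise not reproduced, so the sketch below only illustrates the reasoning of this subsection: integrate a cluster's estimated density over the object's interval to obtain the probability that the interval appears in the cluster, and map a larger probability to a smaller distance. The function name and the 1 − p transform are illustrative assumptions rather than the paper's exact formula.

import numpy as np
from scipy.stats import norm

def cluster_interval_distance(interval, cluster_density, grid=200):
    # Probability mass that the cluster's density places on [a, b] (trapezoidal rule),
    # turned into a distance that shrinks as the probability grows (assumed transform).
    a, b = interval
    ys = np.linspace(a, b, grid)
    fs = cluster_density(ys)
    p = np.sum(0.5 * (fs[:-1] + fs[1:]) * np.diff(ys))
    return 1.0 - p

# Toy usage: a cluster whose density is, for illustration, a normal centred at 5.0
print(cluster_interval_distance((4.8, 5.2), lambda ys: norm.pdf(ys, loc=5.0, scale=0.5)))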

2.3 An adaptive clustering algorithm

In this part, we construct a clustering algorithm, shown in Algorithm 1, by estimating the density function of each cluster and measuring the above distance between the cluster and an object. When initializing the clusters, we choose a partition randomly. Then we compute the density function of each cluster and minimize the distance between the clusters and the objects. Finally, we allocate xi to the most probable cluster and repeat this process until the clusters do not change anymore. To show the detailed process, an example is listed as follows.

Example 3. We choose the Iris data set, which is from the UCI Machine Learning Repository, as an example to show the results of the improved algorithm. The Iris data set has 150 instances, 4 attributes and 3 classes. The three classes are Iris Virginica, Iris Versicolour and Iris Setosa. According to Algorithm 1, the samples are first randomly grouped into three categories. Then we estimate the density function of each cluster and calculate the distance from the sample to each cluster. After comparing the distances, the sample is put into the cluster with the smallest distance. The same operation is then applied to the second element, and so on, until all elements are reclassified. Then we recalculate the density function of each cluster and repeat the above operation until the allocation stays unchanged. Subfigures (a), (b) and (c) of Figure 3 are the probability density function images of the first, second and third clusters under the first attribute in the last iteration, respectively. According to subfigure (a), we can find that the values of instances belonging to Iris Virginica are more likely to appear between 6.0 and 7.0. Subfigure (b) indicates that values of instances belonging to Iris Versicolour are more likely to occur between 5.5 and 6.0. Values of instances belonging to Iris Setosa are more likely to occur between 4.5 and 5.5 according to subfigure (c). In Figure 4, we use three colors to represent the result of clustering based on Algorithm 1. We use the value of the second attribute as the x axis and the value of the fourth attribute as the y axis. Therefore, we can describe the objects as rectangles in Figure 4.
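For readers who want to trace the four phases of Algorithm 1 in code, the following Python sketch combines the assumed interval density and probability-based distance from the sketches above. The helper names, the 1 − p distance, and the synthetic demo data are illustrative assumptions, not the authors' implementation.

import numpy as np
from scipy.stats import norm

def interval_density(y, intervals, h):
    # Assumed interval KDE (see Sect. 2.1): Gaussian kernel smoothed uniformly over each interval
    a, b = intervals[:, 0], intervals[:, 1]
    return np.mean((norm.cdf((y - a) / h) - norm.cdf((y - b) / h)) / (b - a))

def feature_distance(obj_interval, member_intervals, h, grid=64):
    # 1 - (probability mass of the cluster density over the object's interval); assumed transform
    a, b = obj_interval
    ys = np.linspace(a, b, grid)
    fs = np.array([interval_density(y, member_intervals, h) for y in ys])
    p = np.sum(0.5 * (fs[:-1] + fs[1:]) * np.diff(ys))
    return 1.0 - p

def adaptive_clustering(X, K, h=0.1, max_iter=100, seed=0):
    # X has shape (n, p, 2): n objects, p features, each feature an interval [a, b]
    rng = np.random.default_rng(seed)
    n, p, _ = X.shape
    labels = rng.integers(0, K, size=n)                   # Phase 1: random initial partition
    for _ in range(max_iter):
        clusters = [X[labels == k] for k in range(K)]     # Phase 2: cluster representation
        changed = False
        for i in range(n):                                # Phase 3: allocation
            dists = [sum(feature_distance(X[i, j], clusters[k][:, j], h) for j in range(p))
                     if len(clusters[k]) else np.inf for k in range(K)]
            k_star = int(np.argmin(dists))
            if k_star != labels[i]:
                labels[i], changed = k_star, True
        if not changed:                                   # Phase 4: stop when no object moves
            break
    return labels

# Synthetic demo: two well-separated groups of one-feature interval objects
rng = np.random.default_rng(0)
centers = np.repeat([0.0, 5.0], 10)
X = np.stack([centers + rng.uniform(-1.0, 0.0, 20), centers + rng.uniform(0.0, 1.0, 20)], axis=-1)[:, None, :]
print(adaptive_clustering(X, K=2))   # the two groups should usually end up in different clusters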

3 Experimental results

In order to evaluate the performance of the proposed method comprehensively and fairly, we take ten interval data sets into account in our comparison experiments. In this section, we first introduce the data sets used in the experiments and the evaluation of the clustering results. Then we discuss the value of h. In the last part of this section, we show the effectiveness of our adaptive algorithm through comparative experiments.

3.1 Introduction of experimental data sets and evaluation criteria of experimental results

The data sets we used are the Wine data set

Fig.3 Density functions for each cluster

Algorithm 1 An adaptive clustering algorithm for interval data
Input: D ∥ the interval database; K ∥ the number of desired clusters
Output: clusters of D
1: /* Phase 1 - Initialization */
2: for each i ∈ D do
3:   assign xi ∈ D into C1,…,CK randomly;
4: end for
5: /* Phase 2 - Representation */
6: for j = 1 to m do
7:   for i = 1 to K do
8:     compute the density function f̂(x)jk of Ci for each feature j
9:   end for
10: end for
11: /* Phase 3 - Allocation */
12: test ← 0
13: for i = 1 to n do
14:   define the winning cluster Ck* such that k* = arg min_{k=1,…,K} ∑_{j=1}^{m} dk(Ck, xi)
15:   if i ∈ Ck and k* ≠ k then
16:     test ← 1
17:     Ck* ← Ck* ∪ {xi}
18:     Ck ← Ck - {xi}
19:   end if
20: end for
21: /* Phase 4 - Stopping criterion */
22: if test = 0 then
23:   stop
24: else
25:   go to /* Phase 2 - Representation */
26: end if

(Wine), the User Knowledge Modeling data set (Knowledge Modeling), the Seeds data set (Seeds), the Electrical Grid Stability Simulated Data Set (Electrical Stability), the Image Segmentation Data Set (Image Segmentation), the Facebook Live Sellers in Thailand Data Set (Live Sellers), the Glass Identification Data Set (Glass), the Website Phishing Data Set (Website Phishing), the Somerville Happiness Survey Data Set (Happiness Survey) and the Ionosphere Data Set (Ionosphere), which are all obtained from the UCI Machine Learning Repository. The details of these 10 data sets are shown in Table 1. They are all real data sets with numerical attributes. In order to perform the experiments, we use a preprocessing step to turn single-valued data into interval-valued data. For each object xi, we choose random numbers r1 and r2 ranging from 0 to 1, and we turn the single value into an interval value according to xi = [xi - r1, xi + r2]. At the beginning of the experiments, we randomly divide all the elements into K clusters, where K is known from the real labels.
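A minimal sketch of this preprocessing step, assuming the random numbers r1 and r2 are drawn independently for every single value (the text specifies them per object, so this granularity is an assumption):

import numpy as np

def to_intervals(X, seed=0):
    # For each value x draw r1, r2 uniformly from [0, 1] and form the interval [x - r1, x + r2]
    rng = np.random.default_rng(seed)
    r1 = rng.uniform(0.0, 1.0, size=X.shape)
    r2 = rng.uniform(0.0, 1.0, size=X.shape)
    return np.stack([X - r1, X + r2], axis=-1)    # shape (n, p, 2)

# A 3 x 2 single-valued matrix becomes a 3 x 2 x 2 interval array
print(to_intervals(np.array([[5.1, 3.5], [4.9, 3.0], [6.3, 2.5]])).shape)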

Fig.4 A scatter diagram of the Iris based on the two characteristics

Table 1 Experiment data sets

Since we have the real labels, the evaluation of the clustering results is based on the accuracy (AC) and the Normalized Mutual Information (NMI). The AC evaluates the consistency between the a priori partition and the partition produced by the clustering algorithm. AC is computed as follows:

AC=(1/n)∑δ(xi),

(9)

In this case, n is the total number of samples and δ(xi) is an indicator function: δ(xi) equals 1 when xi is divided correctly and 0 otherwise. Apparently, AC ranges from 0 to 1, and a larger AC means higher accuracy.

The Normalized Mutual Information is defined as follows:

(10)

Note that C and C' are two partitions and MI(C, C') is the Mutual Information, which measures the relevance between the two partitions. H(C) and H(C') are the information entropies of C and C', respectively. The information entropy is defined as follows:

H(C)=-∑p(x)log2(p(x)).

(11)

The Mutual Information is defined as follows:

MI(C,C')=∑∑p(x,y)log2(p(x,y)/(p(x)p(y))),

(12)

NMI measures the average mutual information between the clusters and the real labels. Note that NMI(C, C') ranges from 0 to 1. If the two partitions are consistent, the division is completely correct and NMI = 1; on the contrary, if the two partitions are completely unrelated, NMI = 0.
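The sketch below computes both criteria for a toy labelling. Since the matching of cluster indices to true labels behind δ(xi) is not spelled out above, the AC helper uses the common Hungarian-matching convention, and NMI is taken from scikit-learn, whose default normalization (arithmetic mean of the entropies) may differ from the exact form of Eq. (10); both choices are assumptions for illustration.

import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    # AC = (1/n) * sum_i delta(x_i); cluster ids are matched to true labels by maximum overlap
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    overlap = np.array([[np.sum((y_pred == k) & (y_true == c)) for c in classes] for k in clusters])
    row, col = linear_sum_assignment(-overlap)            # Hungarian matching (maximize overlap)
    return overlap[row, col].sum() / y_true.size

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]                               # same grouping with permuted cluster ids
print(clustering_accuracy(y_true, y_pred))                # 1.0
print(normalized_mutual_info_score(y_true, y_pred))       # 1.0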

3.2 The discussion about h

In this part, we discuss the influence of the parameter h in the improved clustering algorithm based on kernel density estimation, and give a suggested value of h.

The h in Formula (6) affects the estimation of the kernel density function of each cluster and further affects the clustering results. According to the requirement of kernel density estimation, the value of h should be small, tending to 0. Therefore, we discuss the results for h in the range [0.1, 0.5] and select a suggested value. In Table 2 and Table 3, we show the clustering results when h is 0.1, 0.2, 0.3, 0.4 and 0.5, respectively.

It can be seen from Table 2 that the ACs of most data sets are only slightly affected by the change of h, except for the fourth and fifth data sets, where the ACs are better when h = 0.1 and h = 0.2. From Table 3, we can see that the NMIs of most data sets are also only slightly affected by the change of h, again except for the fourth and fifth data sets, where the NMIs are better when h = 0.1 and h = 0.2. Considering the overall results, h = 0.1 is the most stable choice. We therefore suggest h = 0.1 and use this value in the comparative experiments.
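As a quick self-contained check of how h controls the smoothness of the estimate, one can evaluate the assumed interval estimator from Sect. 2.1 under the bandwidths examined in Tables 2 and 3; this is only a toy illustration, not the paper's tuning procedure.

import numpy as np
from scipy.stats import norm

def interval_kde(y, intervals, h):
    # Same assumed estimator as the sketch in Sect. 2.1
    a, b = intervals[:, 0], intervals[:, 1]
    return np.mean((norm.cdf((y - a) / h) - norm.cdf((y - b) / h)) / (b - a))

ivals = np.array([[4.8, 5.2], [4.9, 5.5], [5.1, 5.6], [6.0, 6.4]])
for h in (0.1, 0.2, 0.3, 0.4, 0.5):     # the range examined in Tables 2 and 3
    print(h, round(interval_kde(5.2, ivals, h), 4))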

Table 2 ACs under different values of h

Table 3 NMIs under different values of h

3.3 Comparative experiments

We show the effect of the improved method by comparing it with four other methods on ten data sets. The details of the other methods are as follows. The first two methods for comparison are partition methods based on the City-block distance and the Chebyshev distance, which are among the most influential distances between intervals. The other two methods are taken from reference [21], in which the Euclidean Hausdorff (EH) and the Span Normalized Euclidean Hausdorff (SNEH) distances are used.

The formula of City-block distance is as follows[22]:

D(xi,xj)=|ai-aj|+|bi-bj|,

(13)

where xi is [ai, bi] and xj is [aj, bj].

The Chebyshev distance is defined as follows[23]:

D(xi,xj)=max(|ai-aj|,|bi-bj|),

(14)

where xi is [ai, bi] and xj is [aj, bj].
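For reference, Eqs. (13) and (14) translate directly into code; the small sketch below only shows the per-feature distances themselves, not the full partition algorithms built on them.

def city_block(x, y):
    # Eq. (13): City-block distance between intervals x = [a_i, b_i] and y = [a_j, b_j]
    return abs(x[0] - y[0]) + abs(x[1] - y[1])

def chebyshev(x, y):
    # Eq. (14): Chebyshev distance between the same pair of intervals
    return max(abs(x[0] - y[0]), abs(x[1] - y[1]))

print(city_block((1.0, 2.0), (1.5, 2.75)))   # |1.0 - 1.5| + |2.0 - 2.75| = 1.25
print(chebyshev((1.0, 2.0), (1.5, 2.75)))    # max(0.5, 0.75) = 0.75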

Because of the influence of the initial classification, the first three methods are run 10 times to obtain the average ACs and NMIs. Table 4 and Table 5 give the average ACs and NMIs of the proposed clustering algorithm, the clustering algorithm with the City-block distance, the clustering algorithm with the Chebyshev distance, and the hierarchical clustering methods based on the EH distance and the SNEH distance.

The hardware condition and software environment are listed as follows:

· The hardware environment: Intel(R) CPU N3450 @ 1.10 GHz 4.00 GB Memory.

· The software environment: Matlab 9.4.

The first data set is Wine. From Table 4 and Table 5, we can see that the adaptive clustering algorithm is better than the other clustering algorithms. From the view of AC, the adaptive clustering is better than clustering using the City-block distance by about 15 percentage points and better than clustering using the Chebyshev distance by over 15 percentage points. Meanwhile, the adaptive clustering is better than hierarchical clustering based on the Euclidean Hausdorff distance by over 46 percentage points and better than hierarchical clustering based on the Span Normalized Euclidean Hausdorff distance by over 46 percentage points. From the view of NMI, the adaptive clustering is the best, with over 25 percentage points over clustering via the City-block distance and the Chebyshev distance, about 60 percentage points over hierarchical clustering using the EH distance and over 60 percentage points over hierarchical clustering via the SNEH distance. The adaptive clustering performs well on the Wine data set.

Table 4 Results of AC

Table 5 Results of NMI

The second data set is the Seeds Data Set. From Table 4 and Table 5, we can easily find that the adaptive clustering algorithm is much better than the clustering algorithms using the City-block distance and the Chebyshev distance and the hierarchical clustering via the EH and the SNEH. From the evaluation criterion of AC, the new method is better than the clustering algorithm concerning the City-block distance by about 3 percentage points, better than the clustering algorithm concerning the Chebyshev distance by over 7 percentage points, better than hierarchical clustering via the EH by about 49 percentage points and better than hierarchical clustering via the SNEH by over 50 percentage points. From the view of NMI, the proposed method runs much better than the other four methods.

The third data set is the User Knowledge Modeling Data Set. From Table 4 and Table 5, we can see that the adaptive clustering is better than the others. From the viewpoint of AC, our method outperforms the two comparative methods slightly. From the viewpoint of NMI, the adaptive clustering is better than the clustering algorithm concerning the City-block distance by about 6 percentage points, better than clustering concerning the Chebyshev distance by about 7 percentage points, better than hierarchical clustering via the EH by about 14 percentage points and better than hierarchical clustering via the SNEH by over 14 percentage points.

The fourth data set is the Electrical Grid Stability Simulated Data Set. From Table 4 and Table 5, we can see that the adaptive clustering is better than the others. From the perspective of AC, our method outperforms the two comparative methods. The average value over ten experiments of the adaptive clustering is 71.599 0, which is better than the clustering algorithms concerning the City-block distance and the Chebyshev distance by over 14 percentage points. From the viewpoint of NMI, the adaptive clustering is better than the clustering algorithm concerning the City-block distance by about 20 percentage points, better than clustering concerning the Chebyshev distance by about 20 percentage points, and better than the hierarchical clustering methods using the EH and SNEH by about 22 percentage points.

The fifth data set is the Image Segmentation Data Set. From Table 4 and Table 5, we can see that the adaptive clustering is much better than the others. From the perspective of AC, our method outperforms the four comparison methods. The average value over ten experiments of the adaptive clustering is 59.393 9, which is better than the clustering algorithms concerning the City-block distance and the Chebyshev distance by over 33 percentage points and better than the hierarchical clustering methods using the EH and SNEH by over 45 percentage points. From the point of view of NMI, the adaptive clustering reaches 53.604 8, which is better than the clustering algorithms concerning the City-block distance and the Chebyshev distance by about 33 percentage points and better than the hierarchical clustering methods using the EH and SNEH by about 53 percentage points. We can clearly see that the adaptive clustering performs much better. It is more reasonable to use the probability distribution of interval values than to only consider the endpoints of intervals.

The sixth data set is the Facebook Live Sellers in Thailand Data Set. From Table 4 and Table 5, we can see that the adaptive clustering is better than the others. From the perspective of AC, our method is better than the two comparison methods concerning the City-block distance and the Chebyshev distance by over 12 percentage points and better than the hierarchical clustering methods using the EH and SNEH by about 4 percentage points. From the point of view of NMI, the adaptive clustering is better than the hierarchical clustering methods using the EH and SNEH distances by over 4 percentage points.

The seventh data set is the Glass Identification Data Set. From Table 4 and Table 5, we can see that the adaptive clustering is much better than the others. From the perspective of AC, our method outperforms the four comparison methods. The average value over ten experiments of the adaptive clustering is 47.149 5, which is better than the clustering algorithms concerning the City-block distance and the Chebyshev distance by over 10 percentage points and better than the hierarchical clustering methods using the EH and SNEH by over 12 percentage points. From the point of view of NMI, the adaptive clustering reaches 31.121 6, which is better than the clustering algorithm concerning the City-block distance by about 15 percentage points, better than the clustering algorithm concerning the Chebyshev distance by about 10 percentage points and better than the hierarchical clustering methods using the EH and SNEH by about 30 percentage points. We can clearly see that the adaptive clustering performs much better. It is more reasonable to use the probability distribution of interval values than to only consider the endpoints of intervals.

The eighth data set is the Website Phishing Data Set. From Table 4 and Table 5, we can see that the adaptive clustering is better than the others. From the perspective of AC, our method is better than the other four methods by about 8 percentage points. From the viewpoint of NMI, the adaptive clustering is better than the clustering algorithms concerning the City-block distance and the Chebyshev distance by about 8 percentage points and better than the hierarchical clustering methods using the EH and SNEH distances by over 19 percentage points.

The ninth data set is the Somerville Happiness Survey Data Set. From Table 4 and Table 5, we can see that the adaptive clustering performs better than the others. From the viewpoint of AC, our method outperforms the four comparative methods by about 3 percentage points. From the viewpoint of NMI, the adaptive clustering is better than the clustering algorithms concerning the City-block distance and the Chebyshev distance by about 2 percentage points, and better than hierarchical clustering via the EH and the SNEH by about 4 percentage points.

The tenth data set is the Ionosphere Data Set. From Table 4 and Table 5, we can see that the adaptive clustering is better than the other four methods. Our method is better than the two comparison methods concerning the City-block distance and the Chebyshev distance by over 5 percentage points from the perspective of AC. From the viewpoint of NMI, the adaptive clustering is better than the clustering algorithms concerning the City-block distance and the Chebyshev distance by about 1 percentage point and better than the hierarchical clustering methods using the EH and SNEH distances by over 4 percentage points.

In order to show the effect of the improved clustering method more clearly, we draw the clustering results of three data sets. Figure 5, Figure 6 and Figure 7 show the clustering results of the Wine, Seeds and Image Segmentation data sets, respectively. In each group of graphs, Subfigure (a) to Subfigure (f) are the results of the real labels, clustering based on the adaptive method, clustering based on the City-block distance, clustering based on the Chebyshev distance, and hierarchical clustering using the EH distance and the SNEH distance, respectively. In each subfigure of Figure 5, we use three colors to represent the three classes. We use the value of the seventh attribute as the x axis and the value of the eighth attribute as the y axis. From Figure 5, we can clearly see that the result of our adaptive method is the most similar to the real labels. In each subfigure of Figure 6, there are three colors to represent the three classes. We use the value of the first attribute as the x axis and the value of the second attribute as the y axis. According to Figure 6, we can clearly find that the result of the adaptive clustering method is more similar to the real labels than the others. There are seven colors in each subfigure of Figure 7 to represent the seven classes in the real labels. We use the value of the eleventh attribute as the x axis and the value of the twelfth attribute as the y axis. From Figure 7, we can clearly see that the result of our adaptive method is closest to the real labels. The adaptive method performs much better than the other methods because it divides all instances into seven classes effectively. At the same time, we can clearly find that there are only two clusters in the results of the clustering algorithms concerning the City-block distance and the Chebyshev distance, and five clusters are lost.

Fig.5 Scatter diagrams of the Wine data set based on the two selected characteristics

Fig.6 Scatter diagrams of the Seeds data set based on the two selected characteristics

Fig.7 Scatter diagrams of the Image Segmentation Data Set based on the two selected characteristics

By comparing the ten experiments with the other four methods and the pictures of the results, we can clearly find that the new method performs much better. Compared with the hierarchical clustering methods using the EH and SNEH distances, our method achieves better and more stable results. Meanwhile, the adaptive clustering method outperforms the clustering algorithms concerning the City-block distance and the Chebyshev distance, which only consider the endpoints of intervals. So it is reasonable to apply kernel density estimation to interval data and take the distribution within the intervals into account, avoiding the problem of the City-block distance, the Chebyshev distance and the Hausdorff distance, which only consider the endpoints of intervals.

4 Conclusions

The main contribution of this paper is the application of kernel density estimation to interval data and the construction of an adaptive clustering approach for interval data. We estimate the density function of clusters and take the distribution over the intervals into account instead of only considering the endpoints of intervals. To estimate the density function, we propose a kernel density estimation approach for interval data, which is a nonparametric estimation method. Based on the above probability density function, we define the distance between a cluster and an instance. Moreover, we present a clustering method for interval data using the defined distance. Comparative results show the effectiveness of the proposed method. One may conclude that it is not reasonable to cluster interval data by considering only the endpoints of the intervals. In the future, we will further consider an adaptive parameter h to improve on the fixed parameter h used in this paper.