$\alpha$-Approximation Density-based Clustering of Multi-valued Objects

Zhang, Zhilin

arXiv.org Machine Learning 

Abstract

Multi-valued data are commonly found in many real applications. During the process of clustering multi-valued data, most existing methods use sampling or aggregation mechanisms that cannot reflect the real distribution of objects and their instances and thus fail to obtain high-quality clusters. In this paper, a concept of α-approximation distance is introduced to measure the connectivity between multi-valued objects by taking account of the distribution of the instances. An α-approximation density-based clustering algorithm (DBCMO) is proposed to efficiently cluster the multi-valued objects by using global and local R* tree structures. To speed up the algorithm, four pruning rules on the tree structures are implemented. Empirical studies on synthetic and real datasets demonstrate that DBCMO can efficiently and effectively discover the multi-valued object clusters. A comparison with two existing methods further shows that DBCMO can better handle a continuous decrease in the cluster density and detect clusters of varying density.

Keywords: Multi-valued objects · α-Approximation · Density-based · Clustering

1 Introduction

Multi-valued data (Zhang et al. 2010), including multi-instance data and uncertain data, are commonly found in many real applications. The check-in data of location-based social networks are one example. Each user is an object, and he/she can have multiple check-in records associated with different temporal and spatial information. The observation data of dynamic objects, such as seismic activity, sea floor bathymetry, and sea height, are other examples. Since the states of observed objects change constantly, the limited observation data can only reveal the objects' states with a certain probability. The clustering of multi-valued objects is the process of grouping objects into different partitions based on similarity measurements or connectivity calculations.
Based on the mechanism used for measuring similarity or connectivity, the clustering algorithms for multi-valued objects can be divided into two main categories: aggregation-based clustering and sampling-based clustering. Aggregation-based clustering first transforms the multi-valued objects into single-valued objects using an aggregation function (e.g., the mean), after which various traditional clustering algorithms can be applied directly. Sampling-based methods obtain a sequence of sample points for each object using sampling techniques. The distance density function or the expected distance between two objects can then be computed from the multiple discrete distance values given by the samples. Both aggregation and sampling are useful in reducing computational cost, especially when objects have a large number of values. However, determining a proper aggregation function or sampling strategy is not trivial.
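The difference between the two categories can be illustrated with a minimal sketch. It is not taken from the paper: the function names, the choice of the mean as the aggregation function, uniform sampling with replacement, and the toy objects are all assumptions made for illustration. Each object is represented as a list of instance points.

```python
import math
import random


def mean_point(instances):
    """Aggregate a multi-valued object into one point (component-wise mean)."""
    n = len(instances)
    dim = len(instances[0])
    return tuple(sum(p[d] for p in instances) / n for d in range(dim))


def aggregated_distance(obj_a, obj_b):
    """Aggregation-based: Euclidean distance between the two mean points."""
    return math.dist(mean_point(obj_a), mean_point(obj_b))


def sampled_expected_distance(obj_a, obj_b, n_samples=1000, seed=0):
    """Sampling-based: average distance over randomly drawn instance pairs.

    Uniform sampling with replacement is an assumption of this sketch.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        total += math.dist(rng.choice(obj_a), rng.choice(obj_b))
    return total / n_samples


# Hypothetical toy objects: both have the same mean, (0, 0),
# but obj_a's instances are spread out while obj_b has a single instance.
obj_a = [(-1.0, 0.0), (1.0, 0.0)]
obj_b = [(0.0, 0.0)]

agg = aggregated_distance(obj_a, obj_b)
exp = sampled_expected_distance(obj_a, obj_b)
print(agg, exp)
```

In this toy case the aggregated distance is zero because the means coincide, while every sampled instance pair is exactly one unit apart. The two measurements disagree, which suggests why the choice of aggregation function or sampling strategy matters.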
