AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

Density Adaptive Parallel Clustering

La Rocca, Marcello

arXiv.org Machine LearningJul-11-2014

In this paper we are going to introduce a new nearest neighbours based approach to clustering, and compare it with previous solutions; the resulting algorithm, which takes inspiration from both DBscan and minimum spanning tree approaches, is deterministic but proves simpler, faster and doesn't require to set in advance a value for k, the number of clusters.

artificial intelligence, data mining, machine learning, (16 more...)

arXiv.org Machine Learning

1407.3242

Genre: Research Report (0.50)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Nonparametric Hierarchical Clustering of Functional Data

Boullé, Marc, Guigourès, Romain, Rossi, Fabrice

arXiv.org Machine LearningJul-2-2014

In this paper, we deal with the problem of curves clustering. We propose a nonparametric method which partitions the curves into clusters and discretizes the dimensions of the curve points into intervals. The cross-product of these partitions forms a data-grid which is obtained using a Bayesian model selection approach while making no assumptions regarding the curves. Finally, a post-processing technique, aiming at reducing the number of clusters in order to improve the interpretability of the clustering, is proposed. It consists in optimally merging the clusters step by step, which corresponds to an agglomerative hierarchical classification whose dissimilarity measure is the variation of the criterion. Interestingly this measure is none other than the sum of the Kullback-Leibler divergences between clusters distributions before and after the merges. The practical interest of the approach for functional data exploratory analysis is presented and compared with an alternative approach on an artificial and a real world data set.

artificial intelligence, bayesian inference, machine learning, (16 more...)

arXiv.org Machine Learning

doi: 10.1007/978-3-319-02999-3_2

1407.0612

Country: North America > United States (0.46)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.88)

Add feedback

How Many Dissimilarity/Kernel Self Organizing Map Variants Do We Need?

Rossi, Fabrice

arXiv.org Machine LearningJul-2-2014

In numerous applicative contexts, data are too rich and too complex to be represented by numerical vectors. A general approach to extend machine learning and data mining techniques to such data is to really on a dissimilarity or on a kernel that measures how different or similar two objects are. This approach has been used to define several variants of the Self Organizing Map (SOM). This paper reviews those variants in using a common set of notations in order to outline differences and similarities between them.

artificial intelligence, machine learning, survey article, (18 more...)

arXiv.org Machine Learning

doi: 10.1007/978-3-319-07695-9_1

1407.0611

Country:

Europe (1.00)
North America > United States > California (0.28)

Genre:

Research Report (0.90)
Overview (0.88)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.93)

Add feedback

An Efficient Hybrid CS and K-Means Algorithm for the Capacitated PMedian Problem

Mazinan, Hassan Gholami, Ahmadi, Gholam Reza, Khaji, Erfan

arXiv.org Artificial IntelligenceJun-29-2014

The capacitated P-median problem (CPMP) is an NPcomplete problem which investigates the problem of partitioning a set of N nodes into M different disjoint clusters, each represented by a certain node that is designed as concentrator. The NM nodes that are not chosen as concentrators are referred as terminals. The partitioning of the initial N nodes must be performed in such a way that a measure of total distance between the terminals and their corresponding concentrators is minimized. In addition, a capacity constraint imposed on the concentrators must be met, in order to obtain feasible solutions to the problem [1-4]. A direct application of the CPMP is in the context of communication networks deployment, where a set of terminals in the network must be assigned to the corresponding concentrator in order to compose access networks that satisfy the rate requirements of such terminals [5]. In this context, most of the efforts so far has focused on the topological design of communication networks (e.g. Wireless Sensor Networks (WSN), backbone networks or mobile networks [6-8]) since many of the processes involved in such networks can be approached as a CPMP problem, e.g.

algorithm, artificial intelligence, machine learning, (14 more...)

arXiv.org Artificial Intelligence

1406.7473

Country: Asia > Middle East > Iran (0.14)

Genre: Research Report (0.40)

Industry: Telecommunications (0.66)

Technology:

Information Technology > Communications > Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)

Add feedback

On Soft Power Diagrams

Borgwardt, Steffen

arXiv.org Machine LearningJun-24-2014

Noname manuscript No. (will be inserted by the editor) Abstract Many applications in data analysis begin with a set of points in a Euclidean space that is partitioned into clusters. Common tasks then are to devise a classifier deciding which of the clusters a new point is associated to, finding outliers with respect to the clusters, or identifying the type of clustering used for the partition. One of the common kinds of clusterings are (balanced) least-squares assignments with respect to a given set of sites. For these, there is a'separating power diagram' for which each cluster lies in its own cell. In the present paper, we aim for efficient algorithms for outlier detection and the computation of thresholds that measure how similar a clustering is to a leastsquares assignment for fixed sites. For this purpose, we devise a new model for the computation of a'soft power diagram', which allows a soft separation of the clusters with'point counting properties'; e.g. As our results hold for a more general non-convex model of free sites, we describe it and our proofs in this more general way. Its locally optimal solutions satisfy the aforementioned point counting properties. For our target applications that use fixed sites, our algorithms are efficiently solvable to global optimality by linear programming.

artificial intelligence, machine learning, optimization problem, (17 more...)

arXiv.org Machine Learning

1307.3949

Country: Europe > Germany (0.28)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.48)

Add feedback

Divide-and-Conquer Learning by Anchoring a Conical Hull

Zhou, Tianyi, Bilmes, Jeff, Guestrin, Carlos

arXiv.org Machine LearningJun-22-2014

We reduce a broad class of machine learning problems, usually addressed by EM or sampling, to the problem of finding the $k$ extremal rays spanning the conical hull of a data point set. These $k$ "anchors" lead to a global solution and a more interpretable model that can even outperform EM and sampling on generalization error. To find the $k$ anchors, we propose a novel divide-and-conquer learning scheme "DCA" that distributes the problem to $\mathcal O(k\log k)$ same-type sub-problems on different low-D random hyperplanes, each can be solved by any solver. For the 2D sub-problem, we present a non-iterative solver that only needs to compute an array of cosine values and its max/min entries. DCA also provides a faster subroutine for other methods to check whether a point is covered in a conical hull, which improves algorithm design in multiple dimensions and brings significant speedup to learning. We apply our method to GMM, HMM, LDA, NMF and subspace clustering, then show its competitive performance and scalability over other methods on rich datasets.

artificial intelligence, bayesian inference, machine learning, (12 more...)

arXiv.org Machine Learning

1406.5752

Country: North America > United States (0.45)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.93)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)
(2 more...)

Add feedback

Further heuristics for $k$-means: The merge-and-split heuristic and the $(k,l)$-means

Nielsen, Frank, Nock, Richard

arXiv.org Machine LearningJun-22-2014

Finding the optimal $k$-means clustering is NP-hard in general and many heuristics have been designed for minimizing monotonically the $k$-means objective. We first show how to extend Lloyd's batched relocation heuristic and Hartigan's single-point relocation heuristic to take into account empty-cluster and single-point cluster events, respectively. Those events tend to increasingly occur when $k$ or $d$ increases, or when performing several restarts. First, we show that those special events are a blessing because they allow to partially re-seed some cluster centers while further minimizing the $k$-means objective function. Second, we describe a novel heuristic, merge-and-split $k$-means, that consists in merging two clusters and splitting this merged cluster again with two new centers provided it improves the $k$-means objective. This novel heuristic can improve Hartigan's $k$-means when it has converged to a local minimum. We show empirically that this merge-and-split $k$-means improves over the Hartigan's heuristic which is the {\em de facto} method of choice. Finally, we propose the $(k,l)$-means objective that generalizes the $k$-means objective by associating the data points to their $l$ closest cluster centers, and show how to either directly convert or iteratively relax the $(k,l)$-means into a $k$-means in order to reach better local minima.

artificial intelligence, hartigan, machine learning, (17 more...)

arXiv.org Machine Learning

1406.6314

Country: North America > United States (0.28)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.69)

Add feedback

Fast Computation of Wasserstein Barycenters

Cuturi, Marco, Doucet, Arnaud

arXiv.org Machine LearningJun-17-2014

We present new algorithms to compute the mean of a set of empirical probability measures under the optimal transport metric. This mean, known as the Wasserstein barycenter, is the measure that minimizes the sum of its Wasserstein distances to each element in that set. We propose two original algorithms to compute Wasserstein barycenters that build upon the subgradient method. A direct implementation of these algorithms is, however, too costly because it would require the repeated resolution of large primal and dual optimal transport problems to compute subgradients. Extending the work of Cuturi (2013), we propose to smooth the Wasserstein distance used in the definition of Wasserstein barycenters with an entropic regularizer and recover in doing so a strictly convex objective whose gradients can be computed for a considerably cheaper computational cost using matrix scaling algorithms. We use these algorithms to visualize a large family of images and to solve a constrained clustering problem.

algorithm, barycenter, wasserstein barycenter, (12 more...)

arXiv.org Machine Learning

1310.4375

Country:

Asia > Japan > Honshū > Kansai > Kyoto Prefecture > Kyoto (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.34)

Add feedback

The Laplacian K-modes algorithm for clustering

Wang, Weiran, Carreira-Perpiñán, Miguel Á.

arXiv.org Machine LearningJun-15-2014

In addition to finding meaningful clusters, centroid-based clustering algorithms such as K-means or mean-shift should ideally find centroids that are valid patterns in the input space, representative of data in their cluster. This is challenging with data having a nonconvex or manifold structure, as with images or text. We introduce a new algorithm, Laplacian K-modes, which naturally combines three powerful ideas in clustering: the explicit use of assignment variables (as in K-means); the estimation of cluster centroids which are modes of each cluster's density estimate (as in mean-shift); and the regularizing effect of the graph Laplacian, which encourages similar assignments for nearby points (as in spectral clustering). The optimization algorithm alternates an assignment step, which is a convex quadratic program, and a mean-shift step, which separates for each cluster centroid. The algorithm finds meaningful density estimates for each cluster, even with challenging problems where the clusters have manifold structure, are highly nonconvex or in high dimension. It also provides centroids that are valid patterns, truly representative of their cluster (unlike K-means), and an out-of-sample mapping that predicts soft assignments for a new point.

algorithm, artificial intelligence, machine learning, (18 more...)

arXiv.org Machine Learning

1406.3895

Country: North America > United States (0.28)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)

Add feedback

An Incremental Reseeding Strategy for Clustering

Bresson, Xavier, Hu, Huiyi, Laurent, Thomas, Szlam, Arthur, von Brecht, James

arXiv.org Machine LearningJun-15-2014

In this work we propose a simple and easily parallelizable algorithm for multiway graph partitioning. The algorithm alternates between three basic components: diffusing seed vertices over the graph, thresholding the diffused seeds, and then randomly reseeding the thresholded clusters. We demonstrate experimentally that the proper combination of these ingredients leads to an algorithm that achieves state-of-the-art performance in terms of cluster purity on standard benchmarks datasets. Moreover, the algorithm runs an order of magnitude faster than the other algorithms that achieve comparable results in terms of accuracy. We also describe a coarsen, cluster and refine approach similar to GRACLUS and METIS that removes an additional order of magnitude from the runtime of our algorithm while still maintaining competitive accuracy.

artificial intelligence, data mining, machine learning, (18 more...)

arXiv.org Machine Learning

1406.3837

Country: North America > United States > California (0.28)

Genre: Research Report (0.50)

Technology:

Information Technology > Data Science > Data Mining (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.69)

Add feedback