Clustering
Density Adaptive Parallel Clustering
In this paper we introduce a new nearest-neighbour-based approach to clustering and compare it with previous solutions. The resulting algorithm, which takes inspiration from both DBSCAN and minimum spanning tree approaches, is deterministic yet simpler and faster, and does not require setting a value for k, the number of clusters, in advance.
Nonparametric Hierarchical Clustering of Functional Data
Boullé, Marc, Guigourès, Romain, Rossi, Fabrice
In this paper, we deal with the problem of curve clustering. We propose a nonparametric method which partitions the curves into clusters and discretizes the dimensions of the curve points into intervals. The cross-product of these partitions forms a data grid, which is obtained using a Bayesian model selection approach while making no assumptions regarding the curves. Finally, a post-processing technique is proposed that reduces the number of clusters in order to improve the interpretability of the clustering. It consists in optimally merging the clusters step by step, which corresponds to an agglomerative hierarchical classification whose dissimilarity measure is the variation of the criterion. Interestingly, this measure is none other than the sum of the Kullback-Leibler divergences between cluster distributions before and after the merges. The practical interest of the approach for exploratory analysis of functional data is presented and compared with an alternative approach on an artificial and a real-world data set.
How Many Dissimilarity/Kernel Self Organizing Map Variants Do We Need?
In numerous application contexts, data are too rich and too complex to be represented by numerical vectors. A general approach to extending machine learning and data mining techniques to such data is to rely on a dissimilarity or on a kernel that measures how different or similar two objects are. This approach has been used to define several variants of the Self Organizing Map (SOM). This paper reviews those variants using a common set of notations in order to outline the differences and similarities between them.
An Efficient Hybrid CS and K-Means Algorithm for the Capacitated PMedian Problem
Mazinan, Hassan Gholami, Ahmadi, Gholam Reza, Khaji, Erfan
The capacitated P-median problem (CPMP) is an NP-complete problem which consists of partitioning a set of N nodes into M disjoint clusters, each represented by a certain node that is designated as a concentrator. The N - M nodes that are not chosen as concentrators are referred to as terminals. The partitioning of the initial N nodes must be performed in such a way that a measure of total distance between the terminals and their corresponding concentrators is minimized. In addition, a capacity constraint imposed on the concentrators must be met in order to obtain feasible solutions to the problem [1-4]. A direct application of the CPMP is in the context of communication network deployment, where a set of terminals in the network must be assigned to the corresponding concentrator in order to compose access networks that satisfy the rate requirements of such terminals [5]. In this context, most of the efforts so far have focused on the topological design of communication networks (e.g. Wireless Sensor Networks (WSN), backbone networks or mobile networks [6-8]) since many of the processes involved in such networks can be approached as a CPMP, e.g.
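The objective and constraint described in this abstract can be illustrated with a minimal Python sketch. It only evaluates a given assignment; it is not the hybrid CS and K-means algorithm of the paper. The function name `cpmp_cost`, the use of Euclidean distance, and the convention that a concentrator serves itself and that its own demand counts toward its capacity are assumptions made for illustration.

```python
import math

def cpmp_cost(coords, demand, capacity, assignment):
    """Evaluate a candidate CPMP solution (hypothetical helper).

    assignment[i] is the index of the concentrator serving node i;
    a concentrator is assumed to be assigned to itself.
    Returns (total distance, capacity feasibility).
    """
    load = {}
    total = 0.0
    for i, c in enumerate(assignment):
        # Assumption: every node, including the concentrator, consumes capacity.
        load[c] = load.get(c, 0) + demand[i]
        # Assumption: Euclidean distance as the distance measure.
        total += math.dist(coords[i], coords[c])
    feasible = all(load[c] <= capacity[c] for c in load)
    return total, feasible

coords = [(0, 0), (3, 4), (10, 0), (10, 1)]
demand = [1, 1, 1, 1]
capacity = [2, 2, 2, 2]
total, feasible = cpmp_cost(coords, demand, capacity, [0, 0, 2, 2])
```

A search heuristic would compare such costs across candidate assignments and discard or penalize those for which the feasibility flag is false.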
On Soft Power Diagrams
Many applications in data analysis begin with a set of points in a Euclidean space that is partitioned into clusters. Common tasks then are to devise a classifier deciding which of the clusters a new point is associated with, to find outliers with respect to the clusters, or to identify the type of clustering used for the partition. One common kind of clustering is the (balanced) least-squares assignment with respect to a given set of sites. For such assignments, there is a 'separating power diagram' for which each cluster lies in its own cell. In the present paper, we aim for efficient algorithms for outlier detection and the computation of thresholds that measure how similar a clustering is to a least-squares assignment for fixed sites. For this purpose, we devise a new model for the computation of a 'soft power diagram', which allows a soft separation of the clusters with 'point counting properties'; e.g. As our results hold for a more general non-convex model of free sites, we describe it and our proofs in this more general way. Its locally optimal solutions satisfy the aforementioned point counting properties. For our target applications that use fixed sites, our algorithms are efficiently solvable to global optimality by linear programming.
Divide-and-Conquer Learning by Anchoring a Conical Hull
Zhou, Tianyi, Bilmes, Jeff, Guestrin, Carlos
We reduce a broad class of machine learning problems, usually addressed by EM or sampling, to the problem of finding the $k$ extremal rays spanning the conical hull of a data point set. These $k$ "anchors" lead to a global solution and a more interpretable model that can even outperform EM and sampling on generalization error. To find the $k$ anchors, we propose a novel divide-and-conquer learning scheme "DCA" that distributes the problem to $\mathcal O(k\log k)$ same-type sub-problems on different low-D random hyperplanes, each of which can be solved by any solver. For the 2D sub-problem, we present a non-iterative solver that only needs to compute an array of cosine values and its max/min entries. DCA also provides a faster subroutine for other methods to check whether a point is covered in a conical hull, which improves algorithm design in multiple dimensions and brings significant speedup to learning. We apply our method to GMM, HMM, LDA, NMF and subspace clustering, then show its competitive performance and scalability over other methods on rich datasets.
Further heuristics for $k$-means: The merge-and-split heuristic and the $(k,l)$-means
Finding the optimal $k$-means clustering is NP-hard in general, and many heuristics have been designed for monotonically minimizing the $k$-means objective. We first show how to extend Lloyd's batched relocation heuristic and Hartigan's single-point relocation heuristic to take into account empty-cluster and single-point cluster events, respectively. Those events tend to occur increasingly often when $k$ or $d$ increases, or when performing several restarts. First, we show that those special events are a blessing because they allow some cluster centers to be partially re-seeded while further minimizing the $k$-means objective function. Second, we describe a novel heuristic, merge-and-split $k$-means, that consists in merging two clusters and splitting this merged cluster again with two new centers, provided it improves the $k$-means objective. This novel heuristic can improve Hartigan's $k$-means when it has converged to a local minimum. We show empirically that this merge-and-split $k$-means improves over Hartigan's heuristic, which is the {\em de facto} method of choice. Finally, we propose the $(k,l)$-means objective that generalizes the $k$-means objective by associating the data points to their $l$ closest cluster centers, and show how to either directly convert or iteratively relax the $(k,l)$-means into a $k$-means in order to reach better local minima.
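The $(k,l)$-means objective mentioned at the end of this abstract can be sketched in a few lines of Python. The abstract only states that each point is associated with its $l$ closest centers; summing the squared Euclidean distances to those $l$ centers, by analogy with the $k$-means objective, is an assumption here, as is the function name `kl_means_objective`.

```python
def kl_means_objective(points, centers, l):
    """Sum, over all points, of the squared distances to the l closest centers.

    Assumed form of the (k,l)-means objective; with l = 1 it reduces to
    the usual k-means objective.
    """
    total = 0.0
    for x in points:
        # Squared Euclidean distance from x to every center, smallest first.
        d2 = sorted(sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in centers)
        total += sum(d2[:l])
    return total

points = [(0.0, 0.0), (4.0, 0.0)]
centers = [(0.0, 0.0), (4.0, 0.0), (2.0, 0.0)]
k_means_value = kl_means_objective(points, centers, 1)
k2_means_value = kl_means_objective(points, centers, 2)
```

In this toy configuration each point coincides with a center, so the $l = 1$ value is zero, while for $l = 2$ each point also pays the squared distance to the middle center.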
Fast Computation of Wasserstein Barycenters
We present new algorithms to compute the mean of a set of empirical probability measures under the optimal transport metric. This mean, known as the Wasserstein barycenter, is the measure that minimizes the sum of its Wasserstein distances to each element in that set. We propose two original algorithms to compute Wasserstein barycenters that build upon the subgradient method. A direct implementation of these algorithms is, however, too costly because it would require repeatedly solving large primal and dual optimal transport problems to compute subgradients. Extending the work of Cuturi (2013), we propose to smooth the Wasserstein distance used in the definition of Wasserstein barycenters with an entropic regularizer, and in doing so recover a strictly convex objective whose gradients can be computed at a considerably lower computational cost using matrix scaling algorithms. We use these algorithms to visualize a large family of images and to solve a constrained clustering problem.
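The matrix scaling idea referred to in this abstract can be illustrated with a minimal Sinkhorn-style sketch in Python. It computes an entropy-regularized transport plan between two histograms only; it is not the barycenter algorithm of the paper. The function name `sinkhorn`, the regularization strength `eps` and the fixed iteration count are assumptions chosen for illustration.

```python
import math

def sinkhorn(a, b, M, eps=0.5, iters=500):
    """Entropy-regularized transport plan between histograms a and b.

    M is the cost matrix; eps and iters are illustrative choices.
    Returns the plan P and its transport cost <P, M>.
    """
    n, m = len(a), len(b)
    K = [[math.exp(-M[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns so the plan matches both marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    P = [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
    cost = sum(P[i][j] * M[i][j] for i in range(n) for j in range(m))
    return P, cost

a = [0.5, 0.5]
b = [0.25, 0.75]
M = [[0.0, 1.0], [1.0, 0.0]]
P, cost = sinkhorn(a, b, M)
```

Each iteration involves only matrix-vector products with the fixed kernel `K`, which is what makes this kind of scheme cheap compared with solving an exact transport problem at every step.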
The Laplacian K-modes algorithm for clustering
Wang, Weiran, Carreira-Perpiñán, Miguel Á.
In addition to finding meaningful clusters, centroid-based clustering algorithms such as K-means or mean-shift should ideally find centroids that are valid patterns in the input space, representative of data in their cluster. This is challenging with data having a nonconvex or manifold structure, as with images or text. We introduce a new algorithm, Laplacian K-modes, which naturally combines three powerful ideas in clustering: the explicit use of assignment variables (as in K-means); the estimation of cluster centroids which are modes of each cluster's density estimate (as in mean-shift); and the regularizing effect of the graph Laplacian, which encourages similar assignments for nearby points (as in spectral clustering). The optimization algorithm alternates an assignment step, which is a convex quadratic program, and a mean-shift step, which separates for each cluster centroid. The algorithm finds meaningful density estimates for each cluster, even with challenging problems where the clusters have manifold structure, are highly nonconvex or in high dimension. It also provides centroids that are valid patterns, truly representative of their cluster (unlike K-means), and an out-of-sample mapping that predicts soft assignments for a new point.
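The mean-shift step mentioned in this abstract can be sketched for a single centroid as follows. This is a minimal illustration, assuming a Gaussian kernel with bandwidth `sigma` and nonnegative per-point assignment weights for the cluster; the function name `mean_shift_step` and these modelling choices are assumptions, and the assignment step (the quadratic program) is not shown.

```python
import math

def mean_shift_step(points, weights, c, sigma):
    """One fixed-point update moving centroid c toward a mode of the
    weighted Gaussian kernel density estimate of its cluster.

    weights[n] is the (soft) assignment of point n to this cluster;
    at least one weight is assumed to be positive.
    """
    num = [0.0] * len(c)
    den = 0.0
    for x, z in zip(points, weights):
        d2 = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
        g = z * math.exp(-0.5 * d2 / sigma ** 2)  # assignment-weighted kernel value
        den += g
        for d, xi in enumerate(x):
            num[d] += g * xi
    return [v / den for v in num]

points = [(-1.0, 0.0), (1.0, 0.0)]
balanced = mean_shift_step(points, [1.0, 1.0], [0.0, 0.0], sigma=1.0)
one_sided = mean_shift_step(points, [1.0, 0.0], [0.0, 0.0], sigma=1.0)
```

Because the update for one centroid uses only that cluster's weights, the step can be run independently for every cluster, which matches the abstract's remark that it separates over centroids.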
An Incremental Reseeding Strategy for Clustering
Bresson, Xavier, Hu, Huiyi, Laurent, Thomas, Szlam, Arthur, von Brecht, James
In this work we propose a simple and easily parallelizable algorithm for multiway graph partitioning. The algorithm alternates between three basic components: diffusing seed vertices over the graph, thresholding the diffused seeds, and then randomly reseeding the thresholded clusters. We demonstrate experimentally that the proper combination of these ingredients leads to an algorithm that achieves state-of-the-art performance in terms of cluster purity on standard benchmark datasets. Moreover, the algorithm runs an order of magnitude faster than the other algorithms that achieve comparable results in terms of accuracy. We also describe a coarsen, cluster and refine approach similar to GRACLUS and METIS that removes an additional order of magnitude from the runtime of our algorithm while still maintaining competitive accuracy.
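The three components named in this abstract (diffuse, threshold, reseed) can be sketched on a small graph as follows. This is a rough illustration under stated assumptions, not the authors' algorithm: the diffusion is a fixed number of lazy neighbour-averaging steps, one seed is drawn per cluster, ties go to the lower cluster index, and an emptied cluster is reseeded from a random vertex. The names `diffuse` and `reseed_clustering` are hypothetical.

```python
import random

def diffuse(adj, F, steps):
    """Smooth each seed column by averaging over a vertex and its neighbours."""
    n, k = len(F), len(F[0])
    for _ in range(steps):
        F = [
            [sum(F[j][r] for j in adj[i] + [i]) / (len(adj[i]) + 1) for r in range(k)]
            for i in range(n)
        ]
    return F

def reseed_clustering(adj, k, iters=10, steps=3, seed=0):
    rng = random.Random(seed)
    n = len(adj)
    # Assumed initialisation: a shuffled balanced labelling, so no cluster starts empty.
    labels = [i % k for i in range(n)]
    rng.shuffle(labels)
    for _ in range(iters):
        # Reseed: pick one random vertex from each current cluster.
        F = [[0.0] * k for _ in range(n)]
        for r in range(k):
            members = [i for i in range(n) if labels[i] == r]
            s = rng.choice(members or list(range(n)))  # fallback if a cluster emptied
            F[s][r] = 1.0
        # Diffuse the seeds over the graph.
        F = diffuse(adj, F, steps)
        # Threshold: each vertex joins the cluster with the largest diffused mass.
        labels = [max(range(k), key=lambda r: F[i][r]) for i in range(n)]
    return labels

# Two triangles joined by a single edge (adjacency lists).
adj = [[1, 2], [0, 2], [0, 1, 3], [2, 4, 5], [3, 5], [3, 4]]
labels = reseed_clustering(adj, 2)
```

The diffusion for each seed column is independent of the others, which is one way to see why an algorithm of this shape parallelizes easily.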