 Clustering


Spectral Analysis Of Weighted Laplacians Arising In Data Clustering

arXiv.org Machine Learning

Graph Laplacians computed from weighted adjacency matrices are widely used to identify geometric structure in data, and clusters in particular; their spectral properties play a central role in a number of unsupervised and semi-supervised learning algorithms. When suitably scaled, graph Laplacians approach limiting continuum operators in the large data limit. Studying these limiting operators, therefore, sheds light on learning algorithms. This paper is devoted to the study of a parameterized family of divergence form elliptic operators that arise as the large data limit of graph Laplacians. The link between a three-parameter family of graph Laplacians and a three-parameter family of differential operators is explained. The spectral properties of these differential operators are analyzed in the situation where the data comprises two nearly separated clusters, in a sense which is made precise. In particular, we investigate how the spectral gap depends on the three parameters entering the graph Laplacian and on a parameter measuring the size of the perturbation from the perfectly clustered case. Numerical results are presented which exemplify and extend the analysis; in particular the computations study situations with more than two clusters. The findings provide insight into parameter choices made in learning algorithms which are based on weighted adjacency matrices; they also provide the basis for analysis of the consistency of various unsupervised and semi-supervised learning algorithms, in the large data limit.
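To make the notion of a spectral gap for nearly separated clusters concrete, here is a minimal Python sketch. It does not implement the paper's three-parameter family: it assumes the standard symmetric normalized Laplacian with Gaussian weights, and the function names, the bandwidth, and the synthetic two-cluster data are all illustrative assumptions.

```python
import numpy as np

def normalized_laplacian_spectrum(points, bandwidth):
    """Eigenvalues (ascending) of L = I - D^{-1/2} W D^{-1/2}."""
    # Pairwise squared Euclidean distances between all data points.
    diffs = points[:, None, :] - points[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)
    # Gaussian edge weights (an assumed kernel); no self-loops.
    weights = np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    np.fill_diagonal(weights, 0.0)
    degrees = weights.sum(axis=1)
    inv_sqrt_deg = 1.0 / np.sqrt(degrees)
    laplacian = (np.eye(len(points))
                 - inv_sqrt_deg[:, None] * weights * inv_sqrt_deg[None, :])
    return np.linalg.eigvalsh(laplacian)

def two_cluster_sample(separation, n_per_cluster=60, seed=0):
    """Two Gaussian blobs whose centers are `separation` apart (toy data)."""
    rng = np.random.default_rng(seed)
    left = rng.normal(loc=[-separation / 2, 0.0], scale=0.3,
                      size=(n_per_cluster, 2))
    right = rng.normal(loc=[separation / 2, 0.0], scale=0.3,
                       size=(n_per_cluster, 2))
    return np.vstack([left, right])

# As the clusters separate, the second eigenvalue should fall toward zero
# while the third stays bounded away from it.
for separation in (1.0, 2.0, 4.0):
    eigs = normalized_laplacian_spectrum(two_cluster_sample(separation),
                                         bandwidth=0.5)
    print(separation, eigs[1], eigs[2])
```

In this toy setting the separation plays a role loosely analogous to the perturbation parameter in the abstract: the gap between the second and third eigenvalues is what a spectral method would rely on to detect two clusters.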


Scala for Machine Learning - Programmer Books

#artificialintelligence

The discovery of information through data clustering and classification is becoming a key differentiator for competitive organizations. Machine learning applications are everywhere, from self-driving cars, engineering designs, biometrics, and trading strategies, to detection of genetic anomalies. The book begins with an introduction to the functional capabilities of the Scala programming language that are critical to the creation of machine learning algorithms, such as dependency injection and implicits. Next, you'll learn about data preprocessing and filtering techniques. A review of the Akka framework and Apache Spark clusters concludes the tutorial.


Multiple Partitions Aligned Clustering

arXiv.org Artificial Intelligence

Multi-view clustering is an important yet challenging task due to the difficulty of integrating the information from multiple representations. Most existing multi-view clustering methods explore the heterogeneous information in the space where the data points lie. Such common practice may cause significant information loss because of unavoidable noise or inconsistency among views. Since different views admit the same cluster structure, the natural space should be all partitions. Orthogonal to existing techniques, in this paper, we propose to leverage the multi-view information by fusing partitions. Specifically, we align each partition to form a consensus cluster indicator matrix through a distinct rotation matrix. Moreover, a weight is assigned for each view to account for the clustering capacity differences of views. Finally, the basic partitions, weights, and consensus clustering are jointly learned in a unified framework. We demonstrate the effectiveness of our approach on several real datasets, where significant improvement is found over other state-of-the-art multi-view clustering methods.
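The idea of aligning a partition to a consensus indicator matrix through a rotation can be illustrated with the classical orthogonal Procrustes problem. The sketch below is an assumption-laden toy, not the paper's joint learning framework: the helper names are hypothetical, there is a single view, and the "consensus" is simply a reference labeling of the same points.

```python
import numpy as np

def one_hot(labels, k):
    """n x k cluster indicator matrix for integer labels in [0, k)."""
    indicator = np.zeros((len(labels), k))
    indicator[np.arange(len(labels)), labels] = 1.0
    return indicator

def procrustes_rotation(source, target):
    """Orthogonal R minimizing ||target - source @ R||_F (via SVD)."""
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt

# Same partition of nine points, but with the cluster labels permuted.
reference = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
relabeled = np.array([2, 2, 2, 0, 0, 1, 1, 1, 1])

target = one_hot(reference, 3)
source = one_hot(relabeled, 3)
rotation = procrustes_rotation(source, target)
aligned = source @ rotation  # rotated indicator matrix, comparable to target

print(np.round(aligned, 3))
```

Because the two labelings describe the same partition, the recovered rotation is the permutation that undoes the relabeling; with noisy or inconsistent views the rotation would only approximately align the indicators.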


Accelerating Column Generation via Flexible Dual Optimal Inequalities with Application to Entity Resolution

arXiv.org Artificial Intelligence

In this paper, we introduce a new optimization approach to Entity Resolution. Traditional approaches tackle entity resolution with hierarchical clustering, which does not benefit from a formal optimization formulation. In contrast, we model entity resolution as correlation-clustering, which we treat as a weighted set-packing problem and write as an integer linear program (ILP). In this case, sources in the input data correspond to elements and entities in the output data correspond to sets/clusters. We tackle optimization of weighted set packing by relaxing integrality in our ILP formulation. The set of potential sets/clusters cannot be explicitly enumerated, thus motivating optimization via column generation. In addition to the novel formulation, we also introduce new dual optimal inequalities (DOIs), which we call flexible dual optimal inequalities, that tightly lower-bound dual variables during optimization and accelerate column generation. We apply our formulation to entity resolution (also called de-duplication of records), and achieve state-of-the-art accuracy on two popular benchmark datasets.


Modelling Efficient Military Deployments with Machine Learning -- K-Means Clustering in R

#artificialintelligence

Armed forces in Latin America & the Caribbean face the challenge of operating with a multi-dimensional mandate. In times of heightened civil unrest they are required to undertake peace-keeping operations; gang warfare driven by the arms-for-drugs trade calls for counter-insurgency-style deployments; and seasonal natural disasters often require them to support essential services under extreme conditions. With limited resources, every opportunity to prevent unnecessary expenditure while maintaining effectiveness needs to be taken. In this post I will demonstrate how the K-means clustering algorithm can be applied, in the context of Naval Forces in Latin America and the Caribbean, to schedule efficient Naval deployments and reduce the number of unnecessary operations. For this example I simulated 200 data points that represent the locations of incidents that would require Naval resources to be deployed in the Caribbean Sea. The data have a timestamp that indicates the time of day of each incident on a 24-hour clock cycle.
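The post itself works in R; as a rough illustration of the same idea, here is a minimal Python sketch of plain K-means (Lloyd's algorithm) on 200 simulated incident locations. The hotspot coordinates, the spread, the choice of three clusters, and the function names are all assumptions made for this example, and the timestamp described above is ignored for simplicity.

```python
import math
import random

def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from k distinct points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, labels

def simulate_incidents(n=200, seed=1):
    """Hypothetical (longitude, latitude) incidents around three hotspots."""
    rng = random.Random(seed)
    hotspots = [(-75.0, 15.0), (-65.0, 13.0), (-80.0, 20.0)]
    incidents = []
    for _ in range(n):
        lon, lat = rng.choice(hotspots)
        incidents.append((rng.gauss(lon, 1.0), rng.gauss(lat, 1.0)))
    return incidents

incidents = simulate_incidents()
centroids, labels = kmeans(incidents, k=3)
for j, (lon, lat) in enumerate(centroids):
    print(j, round(lon, 2), round(lat, 2), labels.count(j))
```

Each centroid can be read as a candidate staging position, with the cluster size indicating how many simulated incidents it would cover. K-means can settle in a local optimum, so in practice several random initializations would be compared.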


Subspace clustering without knowing the number of clusters: A parameter free approach

arXiv.org Machine Learning

Subspace clustering, the task of clustering high-dimensional data when the data points come from a union of subspaces, is one of the fundamental tasks in unsupervised machine learning. Most existing algorithms for this task require prior information to be supplied in the form of a parameter, such as the number of clusters. In this work, a parameter-free method for subspace clustering is proposed, where the data points are clustered on the basis of the difference between the statistical distribution of the angles made by data points within a subspace and that of the angles made by points belonging to different subspaces. Given an initial coarse clustering, the proposed algorithm merges the clusters until a true clustering is obtained. This, unlike many existing methods, does not involve the use of an unknown parameter or tuning for one through cross-validation. Also, a parameter-free method for producing a coarse initial clustering is discussed, which makes the whole process of subspace clustering parameter free. A comparison of the algorithm's performance with the existing state of the art on synthetic and real data sets shows the significance of the proposed method.


Differentially Private Algorithms for Learning Mixtures of Separated Gaussians

arXiv.org Machine Learning

Learning the parameters of a Gaussian mixture model is a fundamental and widely studied problem with numerous applications. In this work, we give new algorithms for learning the parameters of a high-dimensional, well separated, Gaussian mixture model subject to the strong constraint of differential privacy. In particular, we give a differentially private analogue of the algorithm of Achlioptas and McSherry. Our algorithm has two key properties not achieved by prior work: (1) The algorithm's sample complexity matches that of the corresponding non-private algorithm up to lower order terms in a wide range of parameters. (2) The algorithm does not require strong a priori bounds on the parameters of the mixture components.


A Flexible Framework for Anomaly Detection via Dimensionality Reduction

arXiv.org Artificial Intelligence

Anomaly detection is challenging, especially for large datasets in high dimensions. Here we explore a general anomaly detection framework based on dimensionality reduction and unsupervised clustering. We release DRAMA, a general Python package that implements the framework with a wide range of built-in options. We test DRAMA on a wide variety of simulated and real datasets, in up to 3000 dimensions, and find it robust and highly competitive with commonly-used anomaly detection algorithms, especially in high dimensions. The flexibility of the DRAMA framework allows for significant optimization once some examples of anomalies are available, making it ideal for online anomaly detection, active learning and highly unbalanced datasets.


Iterative Spectral Method for Alternative Clustering

arXiv.org Machine Learning

Clustering is extensively used for exploratory data analysis. Traditional clustering algorithms typically identify a single partitioning of a given dataset. However, data is often multifaceted and can be both interpreted and clustered through multiple viewpoints (or views). For example, the same face data can be clustered based on either identity or pose. In real applications, partitions generated by a clustering algorithm may not correspond to the view a user is interested in. In this paper, we address the problem of finding an alternative clustering, given a dataset and an existing, pre-computed clustering. Ideally, one would like the alternative clustering to be novel (i.e., non-redundant) w.r.t. the existing clustering to reveal a new viewpoint to the user. Simultaneously, one would like the result to reveal partitions of high clustering quality. Several recent papers propose algorithms for alternative [...]

Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS) 2018, Lanzarote, Spain.


Concentration of kernel matrices with application to kernel spectral clustering

arXiv.org Machine Learning

We study the concentration of random kernel matrices around their mean. We derive nonasymptotic exponential concentration inequalities for Lipschitz kernels assuming that the data points are independent draws from a class of multivariate distributions on $\mathbb{R}^d$, including the strongly log-concave distributions under affine transformations. A feature of our result is that the data points need not have identical distributions or have zero mean, which is key in certain applications such as clustering. For comparison, we also derive the companion result for the Euclidean (inner product) kernel under a slightly modified set of distributional assumptions, more precisely, a class of sub-Gaussian vectors. A notable difference between the two cases is that, in contrast to the Euclidean kernel, in the Lipschitz case, the concentration inequality does not depend on the mean of the underlying vectors. As an application of these inequalities, we derive a bound on the misclassification rate of a kernel spectral clustering (KSC) algorithm, under a perturbed nonparametric mixture model. We show an example where this bound establishes the high-dimensional consistency (as $d \to \infty$) of the KSC, when applied with a Gaussian kernel, to a signal consisting of nested nonlinear manifolds (e.g., spheres) plus noise.
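As a low-dimensional illustration of kernel spectral clustering with a Gaussian kernel on nested manifolds, the following Python sketch separates two noisy concentric circles. It is a generic spectral bipartition, not the specific KSC algorithm or mixture model analyzed in the paper; the radii, noise level, bandwidth, and function names are assumptions chosen for the toy example.

```python
import numpy as np

def gaussian_kernel_matrix(points, bandwidth):
    """K[i, j] = exp(-||x_i - x_j||^2 / (2 * bandwidth^2))."""
    diffs = points[:, None, :] - points[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def spectral_bipartition(kernel):
    """Split points in two by the sign of the second eigenvector of
    the symmetric normalized Laplacian I - D^{-1/2} K D^{-1/2}."""
    degrees = kernel.sum(axis=1)
    inv_sqrt_deg = 1.0 / np.sqrt(degrees)
    laplacian = (np.eye(len(kernel))
                 - inv_sqrt_deg[:, None] * kernel * inv_sqrt_deg[None, :])
    _, eigenvectors = np.linalg.eigh(laplacian)  # eigenvalues ascending
    return (eigenvectors[:, 1] > 0).astype(int)

def nested_circles(n_per_circle=100, radii=(1.0, 4.0), noise=0.05, seed=0):
    """Evenly spaced points on concentric circles plus Gaussian noise."""
    rng = np.random.default_rng(seed)
    angles = np.linspace(0.0, 2.0 * np.pi, n_per_circle, endpoint=False)
    unit = np.column_stack([np.cos(angles), np.sin(angles)])
    circles = [r * unit + rng.normal(scale=noise, size=unit.shape)
               for r in radii]
    truth = np.repeat(np.arange(len(radii)), n_per_circle)
    return np.vstack(circles), truth

points, truth = nested_circles()
labels = spectral_bipartition(gaussian_kernel_matrix(points, bandwidth=0.5))
# Cluster labels are only defined up to swapping 0 and 1.
agreement = max(np.mean(labels == truth), np.mean(labels != truth))
print(agreement)
```

The bandwidth matters: it is chosen here to be larger than the spacing between neighbors on a circle but much smaller than the distance between the circles, so the kernel matrix is nearly block diagonal.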