AITopics | Clustering

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

A time series distance measure for efficient clustering of input output signals by their underlying dynamics

Lauwers, Oliver, De Moor, Bart

arXiv.org Machine Learning | Mar-6-2017

Starting from a dataset with input/output time series generated by multiple deterministic linear dynamical systems, this paper tackles the problem of automatically clustering these time series. We propose an extension to the so-called Martin cepstral distance that allows these time series to be clustered efficiently, and apply it to simulated electrical circuit data. Traditionally, two ways of handling the problem are used. The first class of methods employs a distance measure on time series (e.g. Euclidean, Dynamic Time Warping) and a clustering technique (e.g. k-means, k-medoids, hierarchical clustering) to find natural groups in the dataset. It is, however, often not clear whether these distance measures effectively take into account the specific temporal correlations in these time series. The second class of methods uses the input/output data to identify a dynamic system using an identification scheme, and then applies a model norm-based distance (e.g. H2, H-infinity) to find out which systems are similar. This, however, can be very time consuming for large amounts of long time series data. We show that the new distance measure presented in this paper performs as well as when every input/output pair is modelled explicitly, but remains computationally much less complex. The complexity of calculating this distance between two time series of length N is O(N log N).
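A minimal sketch of a weighted cepstral distance between input/output pairs, to make the idea concrete. It is not the authors' implementation: the abstract does not give the formula, so the sketch assumes a Martin-style squared distance (cepstral differences weighted by lag) and relies on the fact that, for a linear system, the log-spectrum of the output equals the log-spectrum of the input plus that of the transfer function, so subtracting cepstra isolates the dynamics. The names `power_cepstrum`, `model_cepstrum`, `cepstral_distance_sq` and `simulate_ar1`, the lag count and the toy first-order systems are all illustrative assumptions. A naive DFT is used for readability; replacing it with an FFT is what would give the O(N log N) cost.

```python
import cmath
import math
import random


def power_cepstrum(x, n_lags):
    """Return power-cepstrum coefficients c_1..c_n_lags of a real signal."""
    n = len(x)
    log_power = []
    for f in range(n):
        # Naive O(N^2) DFT for readability; an FFT would make this O(N log N).
        coeff = sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
        # Tiny floor (an assumption of this sketch) avoids log(0).
        log_power.append(math.log(abs(coeff) ** 2 + 1e-12))
    # The log-spectrum of a real signal is even, so the inverse DFT reduces to cosines.
    return [
        sum(lp * math.cos(2 * math.pi * f * k / n) for f, lp in enumerate(log_power)) / n
        for k in range(1, n_lags + 1)
    ]


def model_cepstrum(u, y, n_lags):
    """Cepstrum attributed to the system: output cepstrum minus input cepstrum."""
    cu = power_cepstrum(u, n_lags)
    cy = power_cepstrum(y, n_lags)
    return [b - a for a, b in zip(cu, cy)]


def cepstral_distance_sq(pair_a, pair_b, n_lags=10):
    """Lag-weighted squared distance between two (input, output) pairs."""
    ca = model_cepstrum(pair_a[0], pair_a[1], n_lags)
    cb = model_cepstrum(pair_b[0], pair_b[1], n_lags)
    return sum(k * (a - b) ** 2 for k, (a, b) in enumerate(zip(ca, cb), start=1))


def simulate_ar1(u, a):
    """Hypothetical toy system y[t] = a * y[t-1] + u[t], starting at rest."""
    y, prev = [], 0.0
    for value in u:
        prev = a * prev + value
        y.append(prev)
    return y


# Toy data: random inputs followed by zeros so the outputs decay before the window ends.
random.seed(0)
u1 = [random.gauss(0, 1) for _ in range(32)] + [0.0] * 32
u2 = [random.gauss(0, 1) for _ in range(32)] + [0.0] * 32
same = cepstral_distance_sq((u1, simulate_ar1(u1, 0.5)), (u2, simulate_ar1(u2, 0.5)))
diff = cepstral_distance_sq((u1, simulate_ar1(u1, 0.5)), (u2, simulate_ar1(u2, -0.5)))
print("same system:", same, "different systems:", diff)
```

In this toy setting, two pairs driven by different inputs through the same system come out almost identical, while pairs from different systems are clearly separated. The resulting pairwise distances could then be passed to any of the clustering techniques named above, such as k-medoids or hierarchical clustering.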

artificial intelligence, distance measure, machine learning, (17 more...)

doi: 10.1109/LCSYS.2017.2715399

1703.01923

Country: Europe > Belgium (0.14)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.88)

Spectral Clustering via Graph Filtering: Consistency on the High-Dimensional Stochastic Block Model

Pydi, Muni Sreenivas, Dukkipati, Ambedkar

arXiv.org Machine Learning | Mar-5-2017

Spectral clustering is amongst the most popular methods for community detection in graphs. A key step in spectral clustering algorithms is the eigen-decomposition of the $n{\times}n$ graph Laplacian matrix to extract its $k$ leading eigenvectors, where $k$ is the desired number of clusters among $n$ objects. This is prohibitively complex to implement for very large datasets. However, it has recently been shown that it is possible to bypass the eigen-decomposition by computing an approximate spectral embedding through graph filtering of random signals. In this paper, we prove that spectral clustering performed via graph filtering can still recover the planted clusters consistently, under mild conditions. We analyse the effects of sparsity, dimensionality and filter approximation error on the consistency of the algorithm.
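A minimal sketch of how graph filtering of random signals can stand in for an eigen-decomposition. It is not the filter analysed in the paper: as a simple substitute it applies the polynomial low-pass filter (I - L/b)^p to random Gaussian signals, where L is the combinatorial Laplacian and b is twice the maximum degree (an upper bound on the largest eigenvalue). The function names, the filter order, the number of signals and the toy graph are illustrative assumptions.

```python
import random


def laplacian(adj):
    """Combinatorial graph Laplacian L = D - A for a 0/1 adjacency matrix."""
    n = len(adj)
    return [
        [(sum(adj[i]) if i == j else 0) - adj[i][j] for j in range(n)]
        for i in range(n)
    ]


def filtered_embedding(adj, n_signals=4, order=30, seed=0):
    """Embed nodes by low-pass filtering random signals (assumes at least one edge)."""
    rng = random.Random(seed)
    n = len(adj)
    lap = laplacian(adj)
    # Twice the maximum degree bounds the largest Laplacian eigenvalue from above.
    bound = 2.0 * max(sum(row) for row in adj)
    signals = [[rng.gauss(0, 1) for _ in range(n_signals)] for _ in range(n)]
    for _ in range(order):
        # One step x <- (I - L / bound) x damps high-frequency components.
        signals = [
            [
                signals[i][s] - sum(lap[i][j] * signals[j][s] for j in range(n)) / bound
                for s in range(n_signals)
            ]
            for i in range(n)
        ]
    return signals  # row i is the embedding of node i


# Toy graph: two disjoint 4-node cliques (nodes 0-3 and 4-7).
adj = [[1 if i != j and (i < 4) == (j < 4) else 0 for j in range(8)] for i in range(8)]
emb = filtered_embedding(adj)
within = max(abs(a - b) for a, b in zip(emb[0], emb[3]))
across = max(abs(a - b) for a, b in zip(emb[0], emb[4]))
print("within-cluster gap:", within, "between-cluster gap:", across)
```

Only matrix-vector products with the Laplacian are needed, which is the point of the approach for large graphs. The rows of the embedding would then be clustered with an ordinary method such as k-means.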

artificial intelligence, machine learning, spectral, (16 more...)

1702.03522

Country: Asia (0.28)

Genre: Research Report (0.50)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.88)

Semi-analytical approximations to statistical moments of sigmoid and softmax mappings of normal variables

Daunizeau, Jean

arXiv.org Machine Learning | Mar-3-2017

This note is concerned with accurate and computationally efficient approximations of moments of Gaussian random variables passed through sigmoid or softmax mappings. These approximations are semi-analytical (i.e. they involve the numerical adjustment of parametric forms) and highly accurate (they yield 5% error at most). We also highlight a few niche applications of these approximations, which arise in the context of, e.g., drift-diffusion models of decision making or non-parametric data clustering approaches. We provide these as examples of efficient alternatives to more tedious derivations that would be needed if one were to approach the underlying mathematical issues in a more formal way. We hope that this technical note will be helpful to modellers facing similar mathematical issues, although maybe stemming from different academic prospects.
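To illustrate what such an approximation looks like, here is a minimal sketch for the mean of a sigmoid applied to a Gaussian variable. It does not use the fitted parametric forms from the note, which the abstract does not reproduce. Instead it uses the classic probit-based closed form with the constant pi/8, and checks it against brute-force numerical integration. The function names and the test values of the mean and variance are illustrative assumptions.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def approx_mean_sigmoid(mu, var, a=math.pi / 8):
    """Closed-form approximation of E[sigmoid(x)] for x ~ N(mu, var)."""
    # Probit-style correction: the variance shrinks the effective mean.
    return sigmoid(mu / math.sqrt(1.0 + a * var))


def numeric_mean_sigmoid(mu, var, n_steps=4001, width=8.0):
    """Reference value by trapezoidal integration over mu +/- width std devs (var > 0)."""
    sd = math.sqrt(var)
    lo = mu - width * sd
    step = 2.0 * width * sd / (n_steps - 1)
    total = 0.0
    for i in range(n_steps):
        x = lo + i * step
        weight = 0.5 if i in (0, n_steps - 1) else 1.0
        density = math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))
        total += weight * sigmoid(x) * density * step
    return total


# Compare the closed form with the numerical reference on a few (mean, variance) pairs.
cases = [(0.0, 1.0), (1.0, 4.0), (-2.0, 9.0)]
errors = [abs(approx_mean_sigmoid(m, v) - numeric_mean_sigmoid(m, v)) for m, v in cases]
for (m, v), err in zip(cases, errors):
    print("mu =", m, "var =", v, "abs error =", err)
```

The closed form costs one function evaluation, whereas the reference needs thousands, which is the kind of trade-off the note is concerned with.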

approximation, artificial intelligence, machine learning, (18 more...)

1703.00091

Country: Europe > France (0.15)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.34)

Phylogenetic Tools in Astrophysics

Fraix-Burnet, Didier

arXiv.org Machine Learning | Mar-1-2017

Multivariate clustering in astrophysics is a recent development justified by ever larger surveys of the sky. The phylogenetic approach is probably the most unexpected technique that has appeared for the unsupervised classification of galaxies, stellar populations or globular clusters. On the one hand, this is a somewhat natural way of classifying astrophysical entities, which are all evolving objects. On the other hand, several conceptual and practical difficulties arise, such as the hierarchical representation of the astrophysical diversity, the continuous nature of the parameters, and the fit between the result and the usual practice of physical interpretation. Most of these have now been solved through the studies of limited samples of stellar clusters and galaxies. Up to now, only Maximum Parsimony (cladistics) has been used, since it is the simplest and most general phylogenetic technique. Probabilistic and network approaches are obvious extensions that should be explored in the future.

artificial intelligence, galaxy, machine learning, (17 more...)

doi: 10.1002/9781118445112.stat07935

1703.00286

Country: Europe > France (0.14)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)

A description length approach to determining the number of k-means clusters

Mizutani, Hiromitsu, Kanai, Ryota

arXiv.org Machine Learning | Feb-28-2017

We present an asymptotic criterion to determine the optimal number of clusters in k-means. We consider k-means as data compression, and propose to adopt the number of clusters that minimizes the estimated description length after compression. Here we report two types of compression ratio based on two ways to quantify the description length of data after compression. This approach further offers a way to evaluate whether clusters obtained with k-means have a hierarchical structure by examining whether multi-stage compression can further reduce the description length. We applied our criteria for determining the number of clusters to synthetic data and empirical neuroimaging data, to observe the behavior of the criteria across different types of data set and the suitability of the two types of criteria for different datasets. We found that our method can offer reasonable clustering results that are useful for dimension reduction. While our numerical results revealed a dependency of our criteria on various aspects of the dataset, such as the dimensionality, the description length approach proposed here provides useful guidance for determining the number of clusters in a principled manner when the underlying properties of the data are unknown and can only be inferred from observation of the data.

artificial intelligence, dataset, machine learning, (17 more...)

1703.00039

Genre: Research Report (0.82)

Industry: Health & Medicine > Therapeutic Area > Neurology (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Multi-Sensor Data Pattern Recognition for Multi-Target Localization: A Machine Learning Approach

Suresh, Kasthurirengan, Silva, Samuel, Votion, Johnathan, Cao, Yongcan

arXiv.org Machine Learning | Feb-28-2017

Conducting surveillance missions using sensor networks is essential for many civilian and military applications, such as disaster response [1], border patrol [2], force protection [3], [4], combat missions [5], and traffic management [6]. One main task in these missions is to collect data regarding the operational environment and then obtain intelligence information from the data. Because the sensors used to collect data are often spatially distributed, extracting data patterns becomes critical to obtain accurate knowledge about the underlying activities. The existing work on identifying data patterns from spatially distributed sensors is focused on developing probabilistic reasoning techniques without recognizing the specific data association or data patterns. Existing approaches for multitarget state estimation can be characterized by two features: a data-to-target assignment algorithm, and an algorithm for single target state estimation under preexisting data-to-target associations. With unknown data association, probabilistic data association (PDA) [7] and multiple hypothesis tracking (MHT) [8] are two common approaches where dense measurements are available. In the study of traffic patterns, the existing research is focused on estimating traffic density and smart routes [6] without analyzing the data pattern to obtain better knowledge of traffic information.

algorithm, artificial intelligence, machine learning, (18 more...)

1703.00084

Country: North America > United States > Texas (0.28)

Genre: Research Report (0.40)

Industry:

Transportation (0.88)
Government > Military (0.48)

Technology:

Information Technology > Communications > Networks > Sensor Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.48)

Uniform Deviation Bounds for Unbounded Loss Functions like k-Means

Bachem, Olivier, Lucic, Mario, Hassani, S. Hamed, Krause, Andreas

arXiv.org Machine Learning | Feb-27-2017

Uniform deviation bounds limit the difference between a model's expected loss and its loss on an empirical sample uniformly for all models in a learning problem. As such, they are a critical component to empirical risk minimization. In this paper, we provide a novel framework to obtain uniform deviation bounds for loss functions which are *unbounded*. In our main application, this allows us to obtain bounds for $k$-Means clustering under weak assumptions on the underlying distribution. If the fourth moment is bounded, we prove a rate of $\mathcal{O}\left(m^{-\frac12}\right)$ compared to the previously known $\mathcal{O}\left(m^{-\frac14}\right)$ rate. Furthermore, we show that the rate also depends on the kurtosis - the normalized fourth moment which measures the "tailedness" of a distribution. We further provide improved rates under progressively stronger assumptions, namely, bounded higher moments, subgaussianity and bounded support.
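For orientation, the generic shape of such a statement can be written out as follows. The symbols are illustrative assumptions rather than the paper's notation, and the paper's exact formulation may differ: $\mathcal{Q}$ is the family of models, $f_Q$ the loss of model $Q$, $P$ the data distribution, and $x_1, \dots, x_m$ an i.i.d. sample from $P$. A uniform deviation bound says that, with high probability,

$$
\sup_{Q \in \mathcal{Q}} \left| \mathbb{E}_{x \sim P}\left[ f_Q(x) \right] - \frac{1}{m} \sum_{i=1}^{m} f_Q(x_i) \right| \le \varepsilon(m).
$$

For $k$-Means, $Q$ is a set of $k$ centers and $f_Q(x) = \min_{q \in Q} \lVert x - q \rVert_2^2$, which is unbounded whenever the data are. The rates quoted in the abstract describe how quickly $\varepsilon(m)$ shrinks as the sample size $m$ grows.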

artificial intelligence, machine learning, unbounded loss function, (12 more...)

1702.08249

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)

Scalable and Distributed Clustering via Lightweight Coresets

Bachem, Olivier, Lucic, Mario, Krause, Andreas

arXiv.org Machine Learning | Feb-27-2017

Coresets are compact representations of data sets such that models trained on a coreset are provably competitive with models trained on the full data set. As such, they have been successfully used to scale up clustering models to massive data sets. While existing approaches generally only allow for multiplicative approximation errors, we propose a novel notion of coresets called lightweight coresets that allows for both multiplicative and additive errors. We provide a single algorithm to construct lightweight coresets for k-Means clustering, Bregman clustering and maximum likelihood estimation of Gaussian mixture models. The algorithm is substantially faster than existing constructions and embarrassingly parallel, and the resulting coresets are smaller. In an extensive experimental evaluation, we demonstrate that the proposed method outperforms existing coreset constructions.
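A minimal sketch of a lightweight-coreset-style sampler for k-Means. The abstract does not spell out the construction, so the sampling rule here is an assumption based on the construction as it is commonly described: each point is drawn with probability that mixes a uniform term with a term proportional to its squared distance from the data mean, and each sampled point is weighted by the inverse of its expected count. The function name, the toy data and the coreset size are illustrative.

```python
import random


def lightweight_coreset(points, m, rng=None):
    """Sample m weighted points; returns (coreset, sampling_probabilities)."""
    rng = rng or random.Random(0)
    n = len(points)
    dim = len(points[0])
    # One pass for the mean, one pass for squared distances to the mean.
    mean = [sum(p[j] for p in points) / n for j in range(dim)]
    sq_dist = [sum((p[j] - mean[j]) ** 2 for j in range(dim)) for p in points]
    total = sum(sq_dist)
    if total == 0.0:
        # All points coincide, so fall back to uniform sampling.
        q = [1.0 / n] * n
    else:
        # Half uniform mass, half mass proportional to squared distance from the mean.
        q = [0.5 / n + 0.5 * d / total for d in sq_dist]
    indices = rng.choices(range(n), weights=q, k=m)
    # Weight 1 / (m * q) makes the weighted sample an unbiased stand-in for the data.
    coreset = [(points[i], 1.0 / (m * q[i])) for i in indices]
    return coreset, q


# Toy data: nine points on a line plus one distant outlier.
points = [(float(i), 0.0) for i in range(9)] + [(100.0, 0.0)]
coreset, q = lightweight_coreset(points, m=5)
for point, weight in coreset:
    print(point, round(weight, 3))
```

Because the probabilities depend only on the mean and on per-point distances, the two passes over the data can be split across machines, which is consistent with the "embarrassingly parallel" claim above. A weighted k-Means run on the coreset would then replace a run on the full data set.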

artificial intelligence, bayesian inference, machine learning, (14 more...)

1702.08248

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.50)

The Shape of Data and Probability Measures

Martínez, Diego Hernán Díaz, Mémoli, Facundo, Mio, Washington

arXiv.org Machine Learning | Feb-27-2017

We introduce the notion of multiscale covariance tensor fields (CTF) associated with Euclidean random variables as a gateway to the shape of their distributions. Multiscale CTFs quantify variation of the data about every point in the data landscape at all spatial scales, unlike the usual covariance tensor that only quantifies global variation about the mean. Empirical forms of localized covariance previously have been used in data analysis and visualization, but we develop a framework for the systematic treatment of theoretical questions and computational models based on localized covariance. We prove strong stability theorems with respect to the Wasserstein distance between probability measures, obtain consistency results, as well as estimates for the rate of convergence of empirical CTFs. These results ensure that CTFs are robust to sampling, noise and outliers. We provide numerous illustrations of how CTFs let us extract shape from data and also apply CTFs to manifold clustering, the problem of categorizing data points according to their noisy membership in a collection of possibly intersecting, smooth submanifolds of Euclidean space. We prove that the proposed manifold clustering method is stable and carry out several experiments to validate the method.

artificial intelligence, kernel, machine learning, (15 more...)

1509.04632

Country:

North America > United States (0.93)
Europe > United Kingdom > England (0.28)

Genre: Research Report (0.64)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Vision (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.88)

On Context-Dependent Clustering of Bandits

Gentile, Claudio, Li, Shuai, Kar, Purushottam, Karatzoglou, Alexandros, Etrue, Evans, Zappella, Giovanni

arXiv.org Artificial Intelligence | Feb-27-2017

We investigate a novel cluster-of-bandit algorithm CAB for collaborative recommendation tasks that implements the underlying feedback sharing mechanism by estimating the neighborhood of users in a context-dependent manner. CAB makes sharp departures from the state of the art by incorporating collaborative effects into inference as well as learning processes in a manner that seamlessly interleaves explore-exploit tradeoffs and collaborative steps. We prove regret bounds under various assumptions on the data, which exhibit a crisp dependence on the expected number of clusters over the users, a natural measure of the statistical difficulty of the learning task. Experiments on production and real-world datasets show that CAB offers significantly increased prediction performance against a representative pool of state-of-the-art methods.

algorithm, artificial intelligence, machine learning, (17 more...)

1608.03544

Country: Europe > United Kingdom (0.28)

Genre: Research Report (1.00)

Industry: Media (0.46)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.34)
