AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

A Study of the Spatio-Temporal Correlations in Mobile Calls Networks

Guigourès, Romain, Boullé, Marc, Rossi, Fabrice

arXiv.org Machine LearningOct-30-2015

For the last few years, the amount of data has significantly increased in the companies. It is the reason why data analysis methods have to evolve to meet new demands. In this article, we introduce a practical analysis of a large database from a telecommunication operator. The problem is to segment a territory and characterize the retrieved areas owing to their inhabitant behavior in terms of mobile telephony. We have call detail records collected during five months in France. We propose a two stages analysis. The first one aims at grouping source antennas which originating calls are similarly distributed on target antennas and conversely for target antenna w.r.t. source antenna. A geographic projection of the data is used to display the results on a map of France. The second stage discretizes the time into periods between which we note changes in distributions of calls emerging from the clusters of source antennas. This enables an analysis of temporal changes of inhabitants behavior in every area of the country.

antenna, artificial intelligence, machine learning, (16 more...)

arXiv.org Machine Learning

doi: 10.1007/978-3-319-23751-0_1

1510.09005

Country: Europe > France (0.87)

Genre: Research Report (1.00)

Industry: Telecommunications (0.48)

Technology:

Information Technology > Communications (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.70)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)
Information Technology > Data Science (0.66)

Add feedback

Robust Subspace Clustering via Tighter Rank Approximation

Kang, Zhao, Peng, Chong, Cheng, Qiang

arXiv.org Machine LearningOct-30-2015

Matrix rank minimization problem is in general NP-hard. The nuclear norm is used to substitute the rank function in many recent studies. Nevertheless, the nuclear norm approximation adds all singular values together and the approximation error may depend heavily on the magnitudes of singular values. This might restrict its capability in dealing with many practical problems. In this paper, an arctangent function is used as a tighter approximation to the rank function. We use it on the challenging subspace clustering problem. For this nonconvex minimization problem, we develop an effective optimization procedure based on a type of augmented Lagrange multipliers (ALM) method. Extensive experiments on face clustering and motion segmentation show that the proposed method is effective for rank approximation.

artificial intelligence, machine learning, optimization problem, (16 more...)

arXiv.org Machine Learning

doi: 10.1145/2806416.2806506

1510.08971

Country: North America > United States > Illinois (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.35)

Add feedback

A Unified Framework for Representation-based Subspace Clustering of Out-of-sample and Large-scale Data

Peng, Xi, Tang, Huajin, Zhang, Lei, Yi, Zhang, Xiao, Shijie

arXiv.org Machine LearningOct-30-2015

Under the framework of spectral clustering, the key of subspace clustering is building a similarity graph which describes the neighborhood relations among data points. Some recent works build the graph using sparse, low-rank, and $\ell_2$-norm-based representation, and have achieved state-of-the-art performance. However, these methods have suffered from the following two limitations. First, the time complexities of these methods are at least proportional to the cube of the data size, which make those methods inefficient for solving large-scale problems. Second, they cannot cope with out-of-sample data that are not used to construct the similarity graph. To cluster each out-of-sample datum, the methods have to recalculate the similarity graph and the cluster membership of the whole data set. In this paper, we propose a unified framework which makes representation-based subspace clustering algorithms feasible to cluster both out-of-sample and large-scale data. Under our framework, the large-scale problem is tackled by converting it as out-of-sample problem in the manner of "sampling, clustering, coding, and classifying". Furthermore, we give an estimation for the error bounds by treating each subspace as a point in a hyperspace. Extensive experimental results on various benchmark data sets show that our methods outperform several recently-proposed scalable methods in clustering large-scale data set.

artificial intelligence, data mining, machine learning, (19 more...)

arXiv.org Machine Learning

doi: 10.1109/TNNLS.2015.2490080

1309.6487

Country:

Europe (1.00)
Asia > China (1.00)
North America > United States > California (0.67)

Genre:

Overview (0.87)
Research Report > New Finding (0.46)

Industry: Education (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Vision (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)

Add feedback

Fast Landmark Subspace Clustering

Wang, Xu, Lerman, Gilad

arXiv.org Machine LearningOct-28-2015

Kernel methods obtain superb performance in terms of accuracy for various machine learning tasks since they can effectively extract nonlinear relations. However, their time complexity can be rather large especially for clustering tasks. In this paper we define a general class of kernels that can be easily approximated by randomization. These kernels appear in various applications, in particular, traditional spectral clustering, landmark-based spectral clustering and landmark-based subspace clustering. We show that for $n$ data points from $K$ clusters with $D$ landmarks, the randomization procedure results in an algorithm of complexity $O(KnD)$. Furthermore, we bound the error between the original clustering scheme and its randomization. To illustrate the power of this framework, we propose a new fast landmark subspace (FLS) clustering algorithm. Experiments over synthetic and real datasets demonstrate the superior performance of FLS in accelerating subspace clustering with marginal sacrifice of accuracy.

artificial intelligence, kernel, machine learning, (16 more...)

arXiv.org Machine Learning

1510.08406

Country: North America > United States > California (0.28)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.35)
Information Technology > Artificial Intelligence > Machine Learning > Kernel Methods (0.35)

Add feedback

Fully adaptive density-based clustering

Steinwart, Ingo

arXiv.org Machine LearningOct-28-2015

The clusters of a distribution are often defined by the connected components of a density level set. However, this definition depends on the user-specified level. We address this issue by proposing a simple, generic algorithm, which uses an almost arbitrary level set estimator to estimate the smallest level at which there are more than one connected components. In the case where this algorithm is fed with histogram-based level set estimates, we provide a finite sample analysis, which is then used to show that the algorithm consistently estimates both the smallest level and the corresponding connected components. We further establish rates of convergence for the two estimation problems, and last but not least, we present a simple, yet adaptive strategy for determining the width-parameter of the involved density estimator in a data-depending way.

artificial intelligence, lemma, machine learning, (18 more...)

arXiv.org Machine Learning

doi: 10.1214/15-AOS1331

1409.8437

Genre: Research Report (0.81)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Spectral Convergence Rate of Graph Laplacian

Wang, Xu

arXiv.org Machine LearningOct-27-2015

Laplacian Eigenvectors of the graph constructed from a data set are used in many spectral manifold learning algorithms such as diffusion maps and spectral clustering. Given a graph constructed from a random sample of a d-dimensional compact submanifold M in R D, we establish the spectral convergence rate of the graph Laplacian. It implies the consistency of the spectral clustering algorithm via a standard perturbation argument. A simple numerical study indicates the necessity of a denoising step before applying spectral algorithms. 1. Introduction High-dimensional data appears naturally in real-world applications. A common assumption is that the data resides on a low-dimensional manifold.

artificial intelligence, convergence rate, machine learning, (15 more...)

arXiv.org Machine Learning

1510.0811

Country:

North America > United States > California (0.16)
North America > Canada > British Columbia (0.15)

Genre: Research Report (0.40)

Industry: Education (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.34)

Add feedback

Adaptive Mixtures of Factor Analyzers

Kaya, Heysem, Salah, Albert Ali

arXiv.org Machine LearningOct-22-2015

A mixture of factor analyzers is a semi-parametric density estimator that generalizes the well-known mixtures of Gaussians model by allowing each Gaussian in the mixture to be represented in a different lower-dimensional manifold. This paper presents a robust and parsimonious model selection algorithm for training a mixture of factor analyzers, carrying out simultaneous clustering and locally linear, globally nonlinear dimensionality reduction. Permitting different number of factors per mixture component, the algorithm adapts the model complexity to the data complexity. We compare the proposed algorithm with related automatic model selection algorithms on a number of benchmarks. The results indicate the effectiveness of this fast and robust approach in clustering, manifold learning and class-conditional modeling.

algorithm, artificial intelligence, machine learning, (13 more...)

arXiv.org Machine Learning

1507.02801

Country: Europe (0.68)

Genre: Research Report (1.00)

Industry: Education (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)

Add feedback

Multiple co-clustering based on nonparametric mixture models with heterogeneous marginal distributions

Tokuda, Tomoki, Yoshimoto, Junichiro, Shimizu, Yu, Toki, Shigeru, Okada, Go, Takamura, Masahiro, Yamamoto, Tetsuya, Yoshimura, Shinpei, Okamoto, Yasumasa, Yamawaki, Shigeto, Doya, Kenji

arXiv.org Machine LearningOct-21-2015

We propose a novel method for multiple clustering that assumes a co-clustering structure (partitions in both rows and columns of the data matrix) in each view. The new method is applicable to high-dimensional data. It is based on a nonparametric Bayesian approach in which the number of views and the number of feature-/subject clusters are inferred in a data-driven manner. We simultaneously model different distribution families, such as Gaussian, Poisson, and multinomial distributions in each cluster block. This makes our method applicable to datasets consisting of both numerical and categorical variables, which biomedical data typically do. Clustering solutions are based on variational inference with mean field approximation. We apply the proposed method to synthetic and real data, and show that our method outperforms other multiple clustering methods both in recovering true cluster structures and in computation time. Finally, we apply our method to a depression dataset with no true cluster structure available, from which useful inferences are drawn about possible clustering structures of the data.

artificial intelligence, machine learning, multiple co-clustering method, (17 more...)

arXiv.org Machine Learning

1510.06138

Country: Asia > Japan (0.46)

Genre: Research Report > Promising Solution (0.66)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (0.87)
Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (0.70)
Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Mental Health (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.96)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.34)

Add feedback

A Bayesian alternative to mutual information for the hierarchical clustering of dependent random variables

Marrelec, Guillaume, Messé, Arnaud, Bellec, Pierre

arXiv.org Machine LearningOct-19-2015

The use of mutual information as a similarity measure in agglomerative hierarchical clustering (AHC) raises an important issue: some correction needs to be applied for the dimensionality of variables. In this work, we formulate the decision of merging dependent multivariate normal variables in an AHC procedure as a Bayesian model comparison. We found that the Bayesian formulation naturally shrinks the empirical covariance matrix towards a matrix set a priori (e.g., the identity), provides an automated stopping rule, and corrects for dimensionality using a term that scales up the measure as a function of the dimensionality of the variables. Also, the resulting log Bayes factor is asymptotically proportional to the plug-in estimate of mutual information, with an additive correction for dimensionality in agreement with the Bayesian information criterion. We investigated the behavior of these Bayesian alternatives (in exact and asymptotic forms) to mutual information on simulated and real data. An encouraging result was first derived on simulations: the hierarchical clustering based on the log Bayes factor outperformed off-the-shelf clustering techniques as well as raw and normalized mutual information in terms of classification accuracy. On a toy example, we found that the Bayesian approaches led to results that were similar to those of mutual information clustering techniques, with the advantage of an automated thresholding. On real functional magnetic resonance imaging (fMRI) datasets measuring brain activity, it identified clusters consistent with the established outcome of standard procedures. On this application, normalized mutual information had a highly atypical behavior, in the sense that it systematically favored very large clusters. These initial experiments suggest that the proposed Bayesian alternatives to mutual information are a useful new tool for hierarchical clustering.

artificial intelligence, machine learning, mutual information, (18 more...)

arXiv.org Machine Learning

doi: 10.1371/journal.pone.0137278

1501.05194

Country:

North America > United States (1.00)
Europe (0.92)

Genre: Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Therapeutic Area > Neurology (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Health Care Technology (1.00)
(3 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)

Add feedback

Clustering Noisy Signals with Structured Sparsity Using Time-Frequency Representation

Hope, Tom, Wagner, Avishai, Zuk, Or

arXiv.org Machine LearningOct-18-2015

Clustering of high-dimensional signals, sequences or functional data is a common task that arises in many domains [18, 19]. Such data come up in diverse fields, as in speech analysis, genomics, mass spectrometry, MRI or EEG measurements, and many more. Clustering seeks to partition data into groups with high overall similarity between members (instances) of the same group and dissimilarity to members of other groups. For time-series signals, this means partitioning the instances into groups of similarly behaving functions over time, where the measure of similarity is crucial and often application-specific. In many real-world scenarios, signals are high-dimensional (such as in genomics), noisy (as in low-quality speech recordings), and exhibit non-stationary behavior: for example peaks and other non-smooth local patterns, or changes in frequency over time.

artificial intelligence, data mining, machine learning, (19 more...)

arXiv.org Machine Learning

1510.05214

Country:

North America > United States (0.28)
Asia > Middle East (0.28)

Genre: Research Report (0.64)

Industry: Health & Medicine (0.54)

Technology:

Information Technology > Data Science > Data Mining (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)

Add feedback