AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

Overlapping Communities Detection via Measure Space Embedding

Kozdoba, Mark, Mannor, Shie

arXiv.org Machine LearningJun-28-2015

We present a new algorithm for community detection. The algorithm uses random walks to embed the graph in a space of measures, after which a modification of $k$-means in that space is applied. The algorithm is therefore fast and easily parallelizable. We evaluate the algorithm on standard random graph benchmarks, including some overlapping community benchmarks, and find its performance to be better or at least as good as previously known algorithms. We also prove a linear time (in number of edges) guarantee for the algorithm on a $p,q$-stochastic block model with $p \geq c\cdot N^{-\frac{1}{2} + \epsilon}$ and $p-q \geq c' \sqrt{p N^{-\frac{1}{2} + \epsilon} \log N}$.

artificial intelligence, data mining, machine learning, (19 more...)

arXiv.org Machine Learning

1504.06796

Country: North America > United States (0.28)

Genre: Research Report (0.64)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback

CRAFT: ClusteR-specific Assorted Feature selecTion

Garg, Vikas K., Rudin, Cynthia, Jaakkola, Tommi

arXiv.org Machine LearningJun-25-2015

We present a framework for clustering with cluster-specific feature selection. The framework, CRAFT, is derived from asymptotic log posterior formulations of nonparametric MAP-based clustering models. CRAFT handles assorted data, i.e., both numeric and categorical data, and the underlying objective functions are intuitively appealing. The resulting algorithm is simple to implement and scales nicely, requires minimal parameter tuning, obviates the need to specify the number of clusters a priori, and compares favorably with other methods on real datasets.

artificial intelligence, feature selection, machine learning, (17 more...)

arXiv.org Machine Learning

1506.07609

Country: North America > United States > Massachusetts (0.14)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.50)

Add feedback

Graphs in machine learning: an introduction

Latouche, Pierre, Rossi, Fabrice

arXiv.org Machine LearningJun-23-2015

Graphs are commonly used to characterise interactions between objects of interest. Because they are based on a straightforward formalism, they are used in many scientific fields from computer science to historical sciences. In this paper, we give an introduction to some methods relying on graphs for learning. This includes both unsupervised and supervised methods. Unsupervised learning algorithms usually aim at visualising graphs in latent spaces and/or clustering the nodes. Both focus on extracting knowledge from graph topologies. While most existing techniques are only applicable to static graphs, where edges do not evolve through time, recent developments have shown that they could be extended to deal with evolving networks. In a supervised context, one generally aims at inferring labels or numerical values attached to nodes using both the graph and, when they are available, node characteristics. Balancing the two sources of information can be challenging, especially as they can disagree locally or globally. In both contexts, supervised and un-supervised, data can be relational (augmented with one or several global graphs) as described above, or graph valued. In this latter case, each object of interest is given as a full graph (possibly completed by other characteristics). In this context, natural tasks include graph clustering (as in producing clusters of graphs rather than clusters of nodes in a single graph), graph classification, etc. 1 Real networks One of the first practical studies on graphs can be dated back to the original work of Moreno [51] in the 30s. Since then, there has been a growing interest in graph analysis associated with strong developments in the modelling and the processing of these data. Graphs are now used in many scientific fields. In Biology [54, 2, 7], for instance, metabolic networks can describe pathways of biochemical reactions [41], while in social sciences networks are used to represent relation ties between actors [66, 56, 36, 34]. Other examples include powergrids [71] and the web [75]. Recently, networks have also been considered in other areas such as geography [22] and history [59, 39]. In machine learning, networks are seen as powerful tools to model problems in order to extract information from data and for prediction purposes. This is the object of this paper. For more complete surveys, we refer to [28, 62, 49, 45]. In this section, we introduce notations and highlight properties shared by most real networks. In Section 2, we then consider methods aiming at extracting information from a unique network. We will particularly focus on clustering methods where the goal is to find clusters of vertices. Finally, in Section 3, techniques that take a series of networks into account, where each network is

artificial intelligence, bayesian inference, machine learning, (18 more...)

arXiv.org Machine Learning

1506.06962

Country:

North America > Canada (0.46)
North America > United States (0.28)
Europe > Belgium (0.28)

Genre: Overview (0.86)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

Add feedback

Clustering, Classification, Discriminant Analysis, and Dimension Reduction via Generalized Hyperbolic Mixtures

Morris, Katherine, McNicholas, Paul D.

arXiv.org Machine LearningJun-23-2015

A method for dimension reduction with clustering, classification, or discriminant analysis is introduced. This mixture model-based approach is based on fitting generalized hyperbolic mixtures on a reduced subspace within the paradigm of model-based clustering, classification, or discriminant analysis. A reduced subspace of the data is derived by considering the extent to which group means and group covariances vary. The members of the subspace arise through linear combinations of the original data, and are ordered by importance via the associated eigenvalues. The observations can be projected onto the subspace, resulting in a set of variables that captures most of the clustering information available. The use of generalized hyperbolic mixtures gives a robust framework capable of dealing with skewed clusters. Although dimension reduction is increasingly in demand across many application areas, the authors are most familiar with biological applications and so two of the five real data examples are within that sphere. Simulated data are also used for illustration. The approach introduced herein can be considered the most general such approach available, and so we compare results to three special and limiting cases. Comparisons with several well established techniques illustrate its promising performance.

artificial intelligence, discriminant analysis, machine learning, (17 more...)

arXiv.org Machine Learning

doi: 10.1016/j.csda.2015.10.008

1308.6315

Country: North America > Canada > Ontario (0.46)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning in High Dimensional Spaces (0.82)

Add feedback

Detectability thresholds and optimal algorithms for community structure in dynamic networks

Ghasemian, Amir, Zhang, Pan, Clauset, Aaron, Moore, Cristopher, Peel, Leto

arXiv.org Machine LearningJun-19-2015

We study the fundamental limits on learning latent community structure in dynamic networks. Specifically, we study dynamic stochastic block models where nodes change their community membership over time, but where edges are generated independently at each time step. In this setting (which is a special case of several existing models), we are able to derive the detectability threshold exactly, as a function of the rate of change and the strength of the communities. Below this threshold, we claim that no algorithm can identify the communities better than chance. We then give two algorithms that are optimal in the sense that they succeed all the way down to this limit. The first uses belief propagation (BP), which gives asymptotically optimal accuracy, and the second is a fast spectral clustering algorithm, based on linearizing the BP equations. We verify our analytic and algorithmic results via numerical simulation, and close with a brief discussion of extensions and open questions.

artificial intelligence, dynamic network, us government, (20 more...)

arXiv.org Machine Learning

doi: 10.1103/PhysRevX.6.031005

1506.06179

Country:

North America > United States > Colorado > Boulder County > Boulder (0.14)
Europe > Spain (0.14)

Genre: Research Report (0.50)

Industry:

Energy > Oil & Gas (0.88)
Government > Regional Government > North America Government > United States Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Data Science > Data Mining (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.48)

Add feedback

A general framework for the IT-based clustering methods

Qiu, Teng, Li, Yongjie

arXiv.org Machine LearningJun-19-2015

Previously, we proposed a physically inspired rule to organize the data points in a sparse yet effective structure, called the in-tree (IT) graph, which is able to capture a wide class of underlying cluster structures in the datasets, especially for the density-based datasets. Although there are some redundant edges or lines between clusters requiring to be removed by computer, this IT graph has a big advantage compared with the k-nearest-neighborhood (k-NN) or the minimal spanning tree (MST) graph, in that the redundant edges in the IT graph are much more distinguishable and thus can be easily determined by several methods previously proposed by us. In this paper, we propose a general framework to re-construct the IT graph, based on an initial neighborhood graph, such as the k-NN or MST, etc, and the corresponding graph distances. For this general framework, our previous way of constructing the IT graph turns out to be a special case of it. This general framework 1) can make the IT graph capture a wider class of underlying cluster structures in the datasets, especially for the manifolds, and 2) should be more effective to cluster the sparse or graph-based datasets.

artificial intelligence, graph, machine learning, (16 more...)

arXiv.org Machine Learning

1506.06068

Country: Asia > China (0.14)

Genre: Research Report (0.50)

Industry: Health & Medicine (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Representation Learning for Clustering: A Statistical Framework

Ashtiani, Hassan, Ben-David, Shai

arXiv.org Machine LearningJun-19-2015

We address the problem of communicating domain knowledge from a user to the designer of a clustering algorithm. We propose a protocol in which the user provides a clustering of a relatively small random sample of a data set. The algorithm designer then uses that sample to come up with a data representation under which $k$-means clustering results in a clustering (of the full data set) that is aligned with the user's clustering. We provide a formal statistical model for analyzing the sample complexity of learning a clustering representation with this paradigm. We then introduce a notion of capacity of a class of possible representations, in the spirit of the VC-dimension, showing that classes of representations that have finite such dimension can be successfully learned with sample size error bounds, and end our discussion with an analysis of that dimension for classes of representations induced by linear embeddings.

artificial intelligence, machine learning, mapping, (14 more...)

arXiv.org Machine Learning

1506.059

Country: North America > Canada (0.28)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.93)

Add feedback

Representative Selection in Non Metric Datasets

Liebman, Elad, Chor, Benny, Stone, Peter

arXiv.org Artificial IntelligenceJun-19-2015

This paper considers the problem of representative selection: choosing a subset of data points from a dataset that best represents its overall set of elements. This subset needs to inherently reflect the type of information contained in the entire set, while minimizing redundancy. For such purposes, clustering may seem like a natural approach. However, existing clustering methods are not ideally suited for representative selection, especially when dealing with non-metric data, where only a pairwise similarity measure exists. In this paper we propose $\delta$-medoids, a novel approach that can be viewed as an extension to the $k$-medoids algorithm and is specifically suited for sample representative selection from non-metric data. We empirically validate $\delta$-medoids in two domains, namely music analysis and motion analysis. We also show some theoretical bounds on the performance of $\delta$-medoids and the hardness of representative selection in general.

algorithm, artificial intelligence, machine learning, (15 more...)

arXiv.org Artificial Intelligence

doi: 10.1080/08839514.2015.1071092

1502.07428

Country: North America > United States > Texas (0.28)

Genre: Research Report > Promising Solution (0.34)

Industry:

Leisure & Entertainment (1.00)
Media > Music (0.93)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

A Topological Approach to Spectral Clustering

Rieser, Antonio

arXiv.org Machine LearningJun-8-2015

The analysis of complex, high-dimensional data is one of the major research challenges in contemporary computer science and statistics. In recent years, geometric and topological approaches to data analysis have begun to yield important insights into the structure of complex data sets (see, for instance, [1] for an example of spectral geometry applied to dimension reduction, and [6], [2] for surveys on homological methods of data analysis and visualization). The common point of departure of these methods is the assumption that data in highdimensional spaces is often concentrated around a low-dimensional manifold or other topological space. In this note, we begin from the assumption that the data comes from a uniform distribution supported on a topologically disconnected space, and that clusters in the data reflect this lack of topological connectivity. Geometric techniques for data analysis have concentrated on approximating the geometry of the data as a step toward nonlinear dimension reduction. Once the dimension is reduced, standard statistical techniques are then used to analyze the data in the lower-dimensional space.

algorithm, artificial intelligence, machine learning, (17 more...)

arXiv.org Machine Learning

1506.02633

Genre: Research Report (0.50)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.49)

Add feedback

Robust and computationally feasible community detection in the presence of arbitrary outlier nodes

Cai, T. Tony, Li, Xiaodong

arXiv.org Machine LearningJun-3-2015

Community detection, which aims to cluster $N$ nodes in a given graph into $r$ distinct groups based on the observed undirected edges, is an important problem in network data analysis. In this paper, the popular stochastic block model (SBM) is extended to the generalized stochastic block model (GSBM) that allows for adversarial outlier nodes, which are connected with the other nodes in the graph in an arbitrary way. Under this model, we introduce a procedure using convex optimization followed by $k$-means algorithm with $k=r$. Both theoretical and numerical properties of the method are analyzed. A theoretical guarantee is given for the procedure to accurately detect the communities with small misclassification rate under the setting where the number of clusters can grow with $N$. This theoretical result admits to the best-known result in the literature of computationally feasible community detection in SBM without outliers. Numerical results show that our method is both computationally fast and robust to different kinds of outliers, while some popular computationally fast community detection algorithms, such as spectral clustering applied to adjacency matrices or graph Laplacians, may fail to retrieve the major clusters due to a small portion of outliers. We apply a slight modification of our method to a political blogs data set, showing that our method is competent in practice and comparable to existing computationally feasible methods in the literature. To the best of the authors' knowledge, our result is the first in the literature in terms of clustering communities with fast growing numbers under the GSBM where a portion of arbitrary outlier nodes exist.

data mining, machine learning, node, (19 more...)

arXiv.org Machine Learning

doi: 10.1214/14-AOS1290

1404.6

Country: North America > United States > Pennsylvania (0.28)

Genre: Research Report > New Finding (0.54)

Industry: Information Technology (0.34)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback