AITopics | Clustering

Collaborating Authors

Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. (Wikipedia)

News Overviews Instructional Materials AI-Alerts Classics

Dirichlet Fragmentation Processes

Ge, Hong, Gal, Yarin, Ghahramani, Zoubin

arXiv.org Machine LearningSep-15-2015

Tree structures are ubiquitous in data across many domains, and many datasets are naturally modelled by unobserved tree structures. In this paper, first we review the theory of random fragmentation processes [Bertoin, 2006], and a number of existing methods for modelling trees, including the popular nested Chinese restaurant process (nCRP). Then we define a general class of probability distributions over trees: the Dirichlet fragmentation process (DFP) through a novel combination of the theory of Dirichlet processes and random fragmentation processes. This DFP presents a stick-breaking construction, and relates to the nCRP in the same way the Dirichlet process relates to the Chinese restaurant process. Furthermore, we develop a novel hierarchical mixture model with the DFP, and empirically compare the new model to similar models in machine learning. Experiments show the DFP mixture model to be convincingly better than existing state-of-the-art approaches for hierarchical clustering and density modelling. The process of random fragmentation is common to many areas, such as the degradation of large polymer chains in chemistry, or the evolution of phylogenetic trees in biology. An elegant mathematical tool for describing such phenomena is the fragmentation process (FP) [Bertoin, 2006]. As a concrete example of a FP, consider a stick of unit length.

artificial intelligence, fragmentation process, machine learning, (16 more...)

arXiv.org Machine Learning

1509.04781

Country: Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)

Genre:

Overview (0.54)
Research Report > Promising Solution (0.34)

Industry: Health & Medicine (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.51)

Add feedback

Weakly supervised clustering: Learning fine-grained signals from coarse labels

Wager, Stefan, Blocker, Alexander, Cardin, Niall

arXiv.org Machine LearningSep-15-2015

Consider a classification problem where we do not have access to labels for individual training examples, but only have average labels over subpopulations. We give practical examples of this setup and show how such a classification task can usefully be analyzed as a weakly supervised clustering problem. We propose three approaches to solving the weakly supervised clustering problem, including a latent variables model that performs well in our experiments. We illustrate our methods on an analysis of aggregated elections data and an industry data set that was the original motivation for this research.

artificial intelligence, information, machine learning, (18 more...)

arXiv.org Machine Learning

doi: 10.1214/15-AOAS812

1310.1363

Country: North America > United States > California > Santa Clara County (0.28)

Genre: Research Report > New Finding (0.89)

Industry: Government > Voting & Elections (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (1.00)

Add feedback

Performance Bounds for Pairwise Entity Resolution

Barnes, Matt, Miller, Kyle, Dubrawski, Artur

arXiv.org Machine LearningSep-10-2015

One significant challenge to scaling entity resolution algorithms to massive datasets is understanding how performance changes after moving beyond the realm of small, manually labeled reference datasets. Unlike traditional machine learning tasks, when an entity resolution algorithm performs well on small hold-out datasets, there is no guarantee this performance holds on larger hold-out datasets. We prove simple bounding properties between the performance of a match function on a small validation set and the performance of a pairwise entity resolution algorithm on arbitrarily sized datasets. Thus, our approach enables optimization of pairwise entity resolution algorithms for large datasets, using a small set of labeled data.

information retrieval, machine learning, natural language, (18 more...)

arXiv.org Machine Learning

1509.03302

Country: North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.15)

Genre: Research Report > New Finding (0.94)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback

A deep matrix factorization method for learning attribute representations

Trigeorgis, George, Bousmalis, Konstantinos, Zafeiriou, Stefanos, Schuller, Bjoern W.

arXiv.org Machine LearningSep-10-2015

Semi-Non-negative Matrix Factorization is a technique that learns a low-dimensional representation of a dataset that lends itself to a clustering interpretation. It is possible that the mapping between this new representation and our original data matrix contains rather complex hierarchical information with implicit lower-level hidden attributes, that classical one level clustering methodologies can not interpret. In this work we propose a novel model, Deep Semi-NMF, that is able to learn such hidden representations that allow themselves to an interpretation of clustering according to different, unknown attributes of a given dataset. We also present a semi-supervised version of the algorithm, named Deep WSF, that allows the use of (partial) prior information for each of the known attributes of a dataset, that allows the model to be used on datasets with mixed attribute knowledge. Finally, we show that our models are able to learn low-dimensional representations that are better suited for clustering, but also classification, outperforming Semi-Non-negative Matrix Factorization, but also other state-of-the-art methodologies variants.

artificial intelligence, machine learning, representation, (15 more...)

arXiv.org Machine Learning

1509.03248

Country: Europe > United Kingdom > England (0.28)

Genre: Research Report > Promising Solution (0.48)

Industry: Health & Medicine (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)

Add feedback

Fuzzy Jets

Mackey, Lester, Nachman, Benjamin, Schwartzman, Ariel, Stansbury, Conrad

arXiv.org Machine LearningSep-7-2015

While some particles can be identified by their type, such as electrons [3, 4] and muons [5, 6], most of the detected particles are light hadrons produced in collimated sprays called jets. Jets are the consequence of high energy quarks or gluons fragmenting into colorless hadrons. Experimentally, jets are defined by clustering schemes which group together measured calorimeter energy deposits or reconstructed charged particle tracks. A jet algorithm is a clustering scheme that connects the measured objects with theoretical quantities that can be calculated and simulated. At a hadron collider, the natural coordinates for describing particles arep T, y, and φ, where p T is the magnitude of the momentum transverse to the proton beam,y is the rapidity, andφ is the azimuthal angle. Particles or calorimeter energy deposits are clustered using jet algorithms based on distance metrics on their coordinates in (p T, ρ) (p T,y,φ) . In order for a jet algorithm to be useful to experimentalists and theorists, the collection of jets should be IRC safe in the following sense: 1. Infrared safe (IR): if a particlei is added with p T 0, the jets are unaffected.

algorithm, artificial intelligence, machine learning, (18 more...)

arXiv.org Machine Learning

doi: 10.1007/JHEP06(2016)010

1509.02216

Country: North America > United States > California (0.28)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)

Add feedback

Simultaneous Clustering and Model Selection for Multinomial Distribution: A Comparative Study

Hasnat, Md. Abul, Velcin, Julien, Bonnevay, Stéphane, Jacques, Julien

arXiv.org Machine LearningSep-6-2015

In this paper, we study different discrete data clustering methods, which use the Model-Based Clustering (MBC) framework with the Multinomial distribution. Our study comprises several relevant issues, such as initialization, model estimation and model selection. Additionally, we propose a novel MBC method by efficiently combining the partitional and hierarchical clustering techniques. We conduct experiments on both synthetic and real data and evaluate the methods using accuracy, stability and computation time. Our study identifies appropriate strategies to be used for discrete data analysis with the MBC methods. Moreover, our proposed method is very competitive w.r.t.

artificial intelligence, dataset, machine learning, (12 more...)

arXiv.org Machine Learning

1505.02324

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.47)

Add feedback

Toward a generic representation of random variables for machine learning

Marti, Gautier, Very, Philippe, Donnat, Philippe

arXiv.org Machine LearningSep-3-2015

This paper presents a pre-processing and a distance which improve the performance of machine learning algorithms working on independent and identically distributed stochastic processes. We introduce a novel non-parametric approach to represent random variables which splits apart dependency and distribution without losing any information. We also propound an associated metric leveraging this representation and its statistical estimate. Besides experiments on synthetic datasets, the benefits of our contribution is illustrated through the example of clustering financial time series, for instance prices from the credit default swaps market. Results are available on the website www.datagrapple.com and an IPython Notebook tutorial is available at www.datagrapple.com/Tech for reproducible research.

artificial intelligence, machine learning, random variable, (18 more...)

arXiv.org Machine Learning

1506.00976

Country: Europe (0.28)

Genre: Research Report (0.64)

Industry: Banking & Finance > Trading (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.94)

Add feedback

A Review of Nonnegative Matrix Factorization Methods for Clustering

Türkmen, Ali Caner

arXiv.org Machine LearningAug-28-2015

Clustering, the problem of partitioning observations with high intragroup similarity, has always been one of the central themes in unsupervised learning. Although a wide variety of algorithms for clustering applications have been studied in literature, the subject still remains an active avenue for research. In fact, it is possible to formulate clustering as a matrix decomposition problem. This formulation leads to interesting interpretations as well as novel algorithms that benefit from favorable computational properties of numerical linear algebra. Popularized by Lee and Seung [11], Nonnegative Matrix Factorization (NMF) has turned into one of the primary tools for decomposing data sets into low-rank factorizing matrices in order to yield a parts-based representation.

artificial intelligence, machine learning, matrix, (16 more...)

arXiv.org Machine Learning

1507.03194

Country:

North America > United States (0.28)
Asia > Middle East (0.28)

Genre: Research Report (0.40)

Industry: Health & Medicine (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Add feedback

Robust Subspace Clustering via Thresholding

Heckel, Reinhard, Bölcskei, Helmut

arXiv.org Machine LearningAug-21-2015

The problem of clustering noisy and incompletely observed high-dimensional data points into a union of low-dimensional subspaces and a set of outliers is considered. The number of subspaces, their dimensions, and their orientations are assumed unknown. We propose a simple low-complexity subspace clustering algorithm, which applies spectral clustering to an adjacency matrix obtained by thresholding the correlations between data points. In other words, the adjacency matrix is constructed from the nearest neighbors of each data point in spherical distance. A statistical performance analysis shows that the algorithm exhibits robustness to additive noise and succeeds even when the subspaces intersect. Specifically, our results reveal an explicit tradeoff between the affinity of the subspaces and the tolerable noise level. We furthermore prove that the algorithm succeeds even when the data points are incompletely observed with the number of missing entries allowed to be (up to a log-factor) linear in the ambient dimension. We also propose a simple scheme that provably detects outliers, and we present numerical results on real and synthetic data.

outlier, probability, subspace, (14 more...)

arXiv.org Machine Learning

1307.4891

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.48)

Add feedback

Review and Perspective for Distance Based Trajectory Clustering

Besse, Philippe, Guillouet, Brendan, Loubes, Jean-Michel, François, Royer

arXiv.org Machine LearningAug-20-2015

TRAJECTORY is a set of positional information for a moving object, ordered by time. This kind of multidimensional data is prevalent in many fields and applications, for example, to understand migration patterns by studying trajectories of animals, predict meteorology with hurricane data, improve athletes performance, etc. Our study is concentrated on vehicle trajectories within a road network. The growing use of GPS receivers and WIFI embedded mobile devices equipped with hardware for storing data enables us to collect a very large amount of data, that has to be analyzed in order to extract any relevant information. The complexity of the extracted data makes it a difficult challenge.

artificial intelligence, machine learning, trajectory, (17 more...)

arXiv.org Machine Learning

1508.04904

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.48)

Add feedback