Clustering
Dirichlet Fragmentation Processes
Ge, Hong, Gal, Yarin, Ghahramani, Zoubin
Tree structures are ubiquitous in data across many domains, and many datasets are naturally modelled by unobserved tree structures. In this paper, first we review the theory of random fragmentation processes [Bertoin, 2006], and a number of existing methods for modelling trees, including the popular nested Chinese restaurant process (nCRP). Then we define a general class of probability distributions over trees: the Dirichlet fragmentation process (DFP) through a novel combination of the theory of Dirichlet processes and random fragmentation processes. This DFP presents a stick-breaking construction, and relates to the nCRP in the same way the Dirichlet process relates to the Chinese restaurant process. Furthermore, we develop a novel hierarchical mixture model with the DFP, and empirically compare the new model to similar models in machine learning. Experiments show the DFP mixture model to be convincingly better than existing state-of-the-art approaches for hierarchical clustering and density modelling. The process of random fragmentation is common to many areas, such as the degradation of large polymer chains in chemistry, or the evolution of phylogenetic trees in biology. An elegant mathematical tool for describing such phenomena is the fragmentation process (FP) [Bertoin, 2006]. As a concrete example of a FP, consider a stick of unit length.
Weakly supervised clustering: Learning fine-grained signals from coarse labels
Wager, Stefan, Blocker, Alexander, Cardin, Niall
Consider a classification problem where we do not have access to labels for individual training examples, but only have average labels over subpopulations. We give practical examples of this setup and show how such a classification task can usefully be analyzed as a weakly supervised clustering problem. We propose three approaches to solving the weakly supervised clustering problem, including a latent variables model that performs well in our experiments. We illustrate our methods on an analysis of aggregated elections data and an industry data set that was the original motivation for this research.
Performance Bounds for Pairwise Entity Resolution
Barnes, Matt, Miller, Kyle, Dubrawski, Artur
One significant challenge to scaling entity resolution algorithms to massive datasets is understanding how performance changes after moving beyond the realm of small, manually labeled reference datasets. Unlike traditional machine learning tasks, when an entity resolution algorithm performs well on small hold-out datasets, there is no guarantee this performance holds on larger hold-out datasets. We prove simple bounding properties between the performance of a match function on a small validation set and the performance of a pairwise entity resolution algorithm on arbitrarily sized datasets. Thus, our approach enables optimization of pairwise entity resolution algorithms for large datasets, using a small set of labeled data.
A deep matrix factorization method for learning attribute representations
Trigeorgis, George, Bousmalis, Konstantinos, Zafeiriou, Stefanos, Schuller, Bjoern W.
Semi-Non-negative Matrix Factorization is a technique that learns a low-dimensional representation of a dataset that lends itself to a clustering interpretation. It is possible that the mapping between this new representation and our original data matrix contains rather complex hierarchical information with implicit lower-level hidden attributes, that classical one level clustering methodologies can not interpret. In this work we propose a novel model, Deep Semi-NMF, that is able to learn such hidden representations that allow themselves to an interpretation of clustering according to different, unknown attributes of a given dataset. We also present a semi-supervised version of the algorithm, named Deep WSF, that allows the use of (partial) prior information for each of the known attributes of a dataset, that allows the model to be used on datasets with mixed attribute knowledge. Finally, we show that our models are able to learn low-dimensional representations that are better suited for clustering, but also classification, outperforming Semi-Non-negative Matrix Factorization, but also other state-of-the-art methodologies variants.
Fuzzy Jets
Mackey, Lester, Nachman, Benjamin, Schwartzman, Ariel, Stansbury, Conrad
While some particles can be identified by their type, such as electrons [3, 4] and muons [5, 6], most of the detected particles are light hadrons produced in collimated sprays called jets. Jets are the consequence of high energy quarks or gluons fragmenting into colorless hadrons. Experimentally, jets are defined by clustering schemes which group together measured calorimeter energy deposits or reconstructed charged particle tracks. A jet algorithm is a clustering scheme that connects the measured objects with theoretical quantities that can be calculated and simulated. At a hadron collider, the natural coordinates for describing particles arep T, y, and φ, where p T is the magnitude of the momentum transverse to the proton beam,y is the rapidity, andφ is the azimuthal angle. Particles or calorimeter energy deposits are clustered using jet algorithms based on distance metrics on their coordinates in (p T, ρ) (p T,y,φ) . In order for a jet algorithm to be useful to experimentalists and theorists, the collection of jets should be IRC safe in the following sense: 1. Infrared safe (IR): if a particlei is added with p T 0, the jets are unaffected.
Simultaneous Clustering and Model Selection for Multinomial Distribution: A Comparative Study
Hasnat, Md. Abul, Velcin, Julien, Bonnevay, Stéphane, Jacques, Julien
In this paper, we study different discrete data clustering methods, which use the Model-Based Clustering (MBC) framework with the Multinomial distribution. Our study comprises several relevant issues, such as initialization, model estimation and model selection. Additionally, we propose a novel MBC method by efficiently combining the partitional and hierarchical clustering techniques. We conduct experiments on both synthetic and real data and evaluate the methods using accuracy, stability and computation time. Our study identifies appropriate strategies to be used for discrete data analysis with the MBC methods. Moreover, our proposed method is very competitive w.r.t.
Toward a generic representation of random variables for machine learning
Marti, Gautier, Very, Philippe, Donnat, Philippe
This paper presents a pre-processing and a distance which improve the performance of machine learning algorithms working on independent and identically distributed stochastic processes. We introduce a novel non-parametric approach to represent random variables which splits apart dependency and distribution without losing any information. We also propound an associated metric leveraging this representation and its statistical estimate. Besides experiments on synthetic datasets, the benefits of our contribution is illustrated through the example of clustering financial time series, for instance prices from the credit default swaps market. Results are available on the website www.datagrapple.com and an IPython Notebook tutorial is available at www.datagrapple.com/Tech for reproducible research.
A Review of Nonnegative Matrix Factorization Methods for Clustering
Clustering, the problem of partitioning observations with high intragroup similarity, has always been one of the central themes in unsupervised learning. Although a wide variety of algorithms for clustering applications have been studied in literature, the subject still remains an active avenue for research. In fact, it is possible to formulate clustering as a matrix decomposition problem. This formulation leads to interesting interpretations as well as novel algorithms that benefit from favorable computational properties of numerical linear algebra. Popularized by Lee and Seung [11], Nonnegative Matrix Factorization (NMF) has turned into one of the primary tools for decomposing data sets into low-rank factorizing matrices in order to yield a parts-based representation.
Robust Subspace Clustering via Thresholding
Heckel, Reinhard, Bölcskei, Helmut
The problem of clustering noisy and incompletely observed high-dimensional data points into a union of low-dimensional subspaces and a set of outliers is considered. The number of subspaces, their dimensions, and their orientations are assumed unknown. We propose a simple low-complexity subspace clustering algorithm, which applies spectral clustering to an adjacency matrix obtained by thresholding the correlations between data points. In other words, the adjacency matrix is constructed from the nearest neighbors of each data point in spherical distance. A statistical performance analysis shows that the algorithm exhibits robustness to additive noise and succeeds even when the subspaces intersect. Specifically, our results reveal an explicit tradeoff between the affinity of the subspaces and the tolerable noise level. We furthermore prove that the algorithm succeeds even when the data points are incompletely observed with the number of missing entries allowed to be (up to a log-factor) linear in the ambient dimension. We also propose a simple scheme that provably detects outliers, and we present numerical results on real and synthetic data.
Review and Perspective for Distance Based Trajectory Clustering
Besse, Philippe, Guillouet, Brendan, Loubes, Jean-Michel, François, Royer
TRAJECTORY is a set of positional information for a moving object, ordered by time. This kind of multidimensional data is prevalent in many fields and applications, for example, to understand migration patterns by studying trajectories of animals, predict meteorology with hurricane data, improve athletes performance, etc. Our study is concentrated on vehicle trajectories within a road network. The growing use of GPS receivers and WIFI embedded mobile devices equipped with hardware for storing data enables us to collect a very large amount of data, that has to be analyzed in order to extract any relevant information. The complexity of the extracted data makes it a difficult challenge.