Clustering
Constrained Clustering and Multiple Kernel Learning without Pairwise Constraint Relaxation
Boecking, Benedikt, Jeanselme, Vincent, Dubrawski, Artur
Clustering under pairwise constraints is an important knowledge discovery tool that enables the learning of appropriate kernels or distance metrics to improve clustering performance. These pairwise constraints, which come in the form of must-link and cannot-link pairs, arise naturally in many applications and are intuitive for users to provide. However, the common practice of relaxing discrete constraints to a continuous domain to ease optimization when learning kernels or metrics can harm generalization, as information which only encodes linkage is transformed to informing distances. We introduce a new constrained clustering algorithm that jointly clusters data and learns a kernel in accordance with the available pairwise constraints. To generalize well, our method is designed to maximize constraint satisfaction without relaxing pairwise constraints to a continuous domain where they inform distances. We show that the proposed method outperforms existing approaches on a large number of diverse publicly available datasets, and we discuss how our method can scale to handling large data.
Adaptative clustering by minimization of the mixing entropy criterion
We present a clustering method and provide a theoretical analysis and an explanation to a phenomenon encountered in the applied statistical literature since the 1990's. This phenomenon is the natural adaptability of the order when using a clustering method derived from the famous EM algorithm. We define a new statistic, the relative entropic order, that represents the number of clumps in the target distribution. We prove in particular that the empirical version of this relative entropic order is consistent. Our approach is easy to implement and has a high potential of applications. Perspectives of this works are algorithmic and theoretical, with possible natural extensions to various cases such as dependent or multidimensional data.
Unsupervised machine learning approaches to the $q$-state Potts model
Tirelli, Andrea, Carvalho, Danyella O., Oliveira, Lucas A., Lima, J. P., Costa, Natanael C., Santos, Raimundo R. dos
In this paper with study phase transitions of the $q$-state Potts model, through a number of unsupervised machine learning techniques, namely Principal Component Analysis (PCA), $k$-means clustering, Uniform Manifold Approximation and Projection (UMAP), and Topological Data Analysis (TDA). Even though in all cases we are able to retrieve the correct critical temperatures $T_c(q)$, for $q = 3, 4$ and $5$, results show that non-linear methods as UMAP and TDA are less dependent on finite size effects, while still being able to distinguish between first and second order phase transitions. This study may be considered as a benchmark for the use of different unsupervised machine learning algorithms in the investigation of phase transitions.
Hierarchical Clustering and Matrix Completion for the Reconstruction of World Input-Output Tables
Metulini, Rodolfo, Gnecco, Giorgio, Biancalani, Francesco, Riccaboni, Massimo
World Input-Output (I/O) matrices provide the networks of within- and cross-country economic relations. In the context of I/O analysis, the methodology adopted by national statistical offices in data collection raises the issue of obtaining reliable data in a timely fashion and it makes the reconstruction of (part of) the I/O matrices of particular interest. In this work, we propose a method combining hierarchical clustering and Matrix Completion (MC) with a LASSO-like nuclear norm penalty, to impute missing entries of a partially unknown I/O matrix. Through simulations based on synthetic matrices we study the effectiveness of the proposed method to predict missing values from both previous years data and current data related to countries similar to the one for which current data are obscured. To show the usefulness of our method, an application based on World Input-Output Database (WIOD) tables - which are an example of industry-by-industry I/O tables - is provided. Strong similarities in structure between WIOD and other I/O tables are also found, which make the proposed approach easily generalizable to them.
Natural Hierarchical Cluster Analysis by Nearest Neighbors with Near-Linear Time Complexity
We propose a nearest neighbor based clustering algorithm that results in a naturally defined hierarchy of clusters. In contrast to the agglomerative and divisive hierarchical clustering algorithms, our approach is not dependent on the iterative working of the algorithm, in the sense that the partitions of the hierarchical clusters are purely defined in accordance with the input dataset. Our method is a universal hierarchical clustering approach since it can be implemented as bottom up or top down versions, both of which result in the same clustering. We show that for certain types of datasets, our algorithm has near-linear time and space complexity.
Geometric reconstructions of density based clusterings
Garcia-Pulido, A. L., Samardzhiev, K. P.
DBSCAN* and HDBSCAN* are well established density based clustering algorithms. However, obtaining the clusters of very large datasets is infeasible, limiting their use in real world applications. By exploiting the geometry of Euclidean space, we prove that it is possible to systematically construct the DBSCAN* and HDBSCAN* clusters of a finite $X\subset \mathbb{R}^n$ from specific subsets of $X$. We are able to control the size of these subsets and therefore our results make it possible to cluster very large datasets. To illustrate our theory, we cluster the Microsoft Building Footprint Database of the US, which is not possible using the standard implementations.
Generalized Spectral Clustering for Directed and Undirected Graphs
Sevi, Harry, Jonckheere, Matthieu, Kalogeratos, Argyris
Spectral clustering is a popular approach for clustering undirected graphs, but its extension to directed graphs (digraphs) is much more challenging. A typical workaround is to naively symmetrize the adjacency matrix of the directed graph, which can however lead to discarding valuable information carried by edge directionality. In this paper, we present a generalized spectral clustering framework that can address both directed and undirected graphs. Our approach is based on the spectral relaxation of a new functional that we introduce as the generalized Dirichlet energy of a graph function, with respect to an arbitrary positive regularizing measure on the graph edges. We also propose a practical parametrization of the regularizing measure constructed from the iterated powers of the natural random walk on the graph. We present theoretical arguments to explain the efficiency of our framework in the challenging setting of unbalanced classes. Experiments using directed K-NN graphs constructed from real datasets show that our graph partitioning method performs consistently well in all cases, while outperforming existing approaches in most of them.
Cluster Assignment in Multi-Agent Systems
Abstract--We study cluster assignment in multi-agent networks. The process of reaching an agreement between agents is In this work we focus on homogeneous networks, that is one of the fundamental tasks for a multi-agent system (MAS). The problem we aim to solve is how to design graphs computation [1], robotics [2], biochemical systems [3], and that ensure the networked system will converge to a prescribed sensor networks [4]. A natural extension to the agreement cluster configuration, i.e., specifying the number of clusters problem is the cluster agreement problem, which seeks to and the number of agents within each cluster. Employing tools drive agents into groups. All the agents within the same group from group theory, we show that it is possible to design an should then reach an agreement.
From graph cuts to isoperimetric inequalities: Convergence rates of Cheeger cuts on data clouds
Trillos, Nicolas Garcia, Murray, Ryan, Thorpe, Matthew
In this work we study statistical properties of graph-based clustering algorithms that rely on the optimization of balanced graph cuts, the main example being the optimization of Cheeger cuts. We consider proximity graphs built from data sampled from an underlying distribution supported on a generic smooth compact manifold $M$. In this setting, we obtain high probability convergence rates for both the Cheeger constant and the associated Cheeger cuts towards their continuum counterparts. The key technical tools are careful estimates of interpolation operators which lift empirical Cheeger cuts to the continuum, as well as continuum stability estimates for isoperimetric problems. To our knowledge the quantitative estimates obtained here are the first of their kind.
Sparse Subspace Clustering for Concept Discovery (SSCCD)
Vielhaben, Johanna, Blücher, Stefan, Strodthoff, Nils
Concepts are key building blocks of higher level human understanding. Explainable AI (XAI) methods have shown tremendous progress in recent years, however, local attribution methods do not allow to identify coherent model behavior across samples and therefore miss this essential component. In this work, we study concept-based explanations and put forward a new definition of concepts as low-dimensional subspaces of hidden feature layers. We novelly apply sparse subspace clustering to discover these concept subspaces. Moving forward, we derive insights from concept subspaces in terms of localized input (concept) maps, show how to quantify concept relevances and lastly, evaluate similarities and transferability between concepts. We empirically demonstrate the soundness of the proposed Sparse Subspace Clustering for Concept Discovery (SSCCD) method for a variety of different image classification tasks. This approach allows for deeper insights into the actual model behavior that would remain hidden from conventional input-level heatmaps.