Clustering
T-Stochastic Graphs
Previous statistical approaches to hierarchical clustering for social network analysis all construct an "ultrametric" hierarchy. While the assumption of ultrametricity has been discussed and studied in the phylogenetics literature, it has not yet been acknowledged in the social network literature. We show that "non-ultrametric structure" in the network introduces significant instabilities in the existing top-down recovery algorithms. To address this issue, we introduce an instability diagnostic plot and use it to examine a collection of empirical networks. These networks appear to violate the "ultrametric" assumption. We propose a deceptively simple and yet general class of probabilistic models called $\mathbb{T}$-Stochastic Graphs which impose no topological restrictions on the latent hierarchy. To illustrate this model, we propose six alternative forms of hierarchical network models and then show that all six are equivalent to the $\mathbb{T}$-Stochastic Graph model. These alternative models motivate a novel approach to hierarchical clustering that combines spectral techniques with the well-known Neighbor-Joining algorithm from phylogenetic reconstruction. We prove this spectral approach is statistically consistent.
Normalised clustering accuracy: An asymmetric external cluster validity measure
There is no, nor will there ever be, single best clustering algorithm, but we would still like to be able to distinguish between methods which work well on certain task types and those that systematically underperform. Clustering algorithms are traditionally evaluated using either internal or external validity measures. Internal measures quantify different aspects of the obtained partitions, e.g., the average degree of cluster compactness or point separability. Yet, their validity is questionable, because the clusterings they promote can sometimes be meaningless. External measures, on the other hand, compare the algorithms' outputs to the reference, ground truth groupings that are provided by experts. In this paper, we argue that the commonly-used classical partition similarity scores, such as the normalised mutual information, Fowlkes-Mallows, or adjusted Rand index, miss some desirable properties, e.g., they do not identify worst-case scenarios correctly or are not easily interpretable. This makes comparing clustering algorithms across many benchmark datasets difficult. To remedy these issues, we propose and analyse a new measure: a version of the optimal set-matching accuracy, which is normalised, monotonic, scale invariant, and corrected for the imbalancedness of cluster sizes (but neither symmetric nor adjusted for chance).
Parallel Computation of Multi-Slice Clustering of Third-Order Tensors
Andriantsiory, Dina Faneva, Coti, Camille, Geloun, Joseph Ben, Lebbah, Mustapha
Machine Learning approaches like clustering methods deal with massive datasets that present an increasing challenge. We devise parallel algorithms to compute the Multi-Slice Clustering (MSC) for 3rd-order tensors. The MSC method is based on spectral analysis of the tensor slices and works independently on each tensor mode. Such features fit well in the parallel paradigm via a distributed memory system. We show that our parallel scheme outperforms sequential computing and allows for the scalability of the MSC method.
Gaussian Latent Representations for Uncertainty Estimation using Mahalanobis Distance in Deep Classifiers
Venkataramanan, Aishwarya, Benbihi, Assia, Laviale, Martin, Pradalier, Cedric
Recent works show that the data distribution in a network's latent space is useful for estimating classification uncertainty and detecting Out-of-distribution (OOD) samples. To obtain a well-regularized latent space that is conducive for uncertainty estimation, existing methods bring in significant changes to model architectures and training procedures. In this paper, we present a lightweight, fast, and high-performance regularization method for Mahalanobis distance-based uncertainty prediction, and that requires minimal changes to the network's architecture. To derive Gaussian latent representation favourable for Mahalanobis Distance calculation, we introduce a self-supervised representation learning method that separates in-class representations into multiple Gaussians. Classes with non-Gaussian representations are automatically identified and dynamically clustered into multiple new classes that are approximately Gaussian. Evaluation on standard OOD benchmarks shows that our method achieves state-of-the-art results on OOD detection with minimal inference time, and is very competitive on predictive probability calibration. Finally, we show the applicability of our method to a real-life computer vision use case on microorganism classification.
Statistically Optimal K-means Clustering via Nonnegative Low-rank Semidefinite Programming
Zhuang, Yubo, Chen, Xiaohui, Yang, Yun, Zhang, Richard Y.
$K$-means clustering is a widely used machine learning method for identifying patterns in large datasets. Semidefinite programming (SDP) relaxations have recently been proposed for solving the $K$-means optimization problem that enjoy strong statistical optimality guarantees, but the prohibitive cost of implementing an SDP solver renders these guarantees inaccessible to practical datasets. By contrast, nonnegative matrix factorization (NMF) is a simple clustering algorithm that is widely used by machine learning practitioners, but without a solid statistical underpinning nor rigorous guarantees. In this paper, we describe an NMF-like algorithm that works by solving a nonnegative low-rank restriction of the SDP relaxed $K$-means formulation using a nonconvex Burer--Monteiro factorization approach. The resulting algorithm is just as simple and scalable as state-of-the-art NMF algorithms, while also enjoying the same strong statistical optimality guarantees as the SDP. In our experiments, we observe that our algorithm achieves substantially smaller mis-clustering errors compared to the existing state-of-the-art.
Attribute Graph Clustering via Learnable Augmentation
Yang, Xihong, Liu, Yue, Liang, Ke, Zhou, Sihang, Liu, Xinwang, Zhu, En
Contrastive deep graph clustering (CDGC) utilizes contrastive learning to group nodes into different clusters. Better augmentation techniques benefit the quality of the contrastive samples, thus being one of key factors to improve performance. However, the augmentation samples in existing methods are always predefined by human experiences, and agnostic from the downstream task clustering, thus leading to high human resource costs and poor performance. To this end, we propose an Attribute Graph Clustering method via Learnable Augmentation (\textbf{AGCLA}), which introduces learnable augmentors for high-quality and suitable augmented samples for CDGC. Specifically, we design two learnable augmentors for attribute and structure information, respectively. Besides, two refinement matrices, including the high-confidence pseudo-label matrix and the cross-view sample similarity matrix, are generated to improve the reliability of the learned affinity matrix. During the training procedure, we notice that there exist differences between the optimization goals for training learnable augmentors and contrastive learning networks. In other words, we should both guarantee the consistency of the embeddings as well as the diversity of the augmented samples. Thus, an adversarial learning mechanism is designed in our method. Moreover, a two-stage training strategy is leveraged for the high-confidence refinement matrices. Extensive experimental results demonstrate the effectiveness of AGCLA on six benchmark datasets.
Multi-Swap $k$-Means++
Beretta, Lorenzo, Cohen-Addad, Vincent, Lattanzi, Silvio, Parotsidis, Nikos
The $k$-means++ algorithm of Arthur and Vassilvitskii (SODA 2007) is often the practitioners' choice algorithm for optimizing the popular $k$-means clustering objective and is known to give an $O(\log k)$-approximation in expectation. To obtain higher quality solutions, Lattanzi and Sohler (ICML 2019) proposed augmenting $k$-means++ with $O(k \log \log k)$ local search steps obtained through the $k$-means++ sampling distribution to yield a $c$-approximation to the $k$-means clustering problem, where $c$ is a large absolute constant. Here we generalize and extend their local search algorithm by considering larger and more sophisticated local search neighborhoods hence allowing to swap multiple centers at the same time. Our algorithm achieves a $9 + \varepsilon$ approximation ratio, which is the best possible for local search. Importantly we show that our approach yields substantial practical improvements, we show significant quality improvements over the approach of Lattanzi and Sohler (ICML 2019) on several datasets.
ShapeDBA: Generating Effective Time Series Prototypes using ShapeDTW Barycenter Averaging
Ismail-Fawaz, Ali, Fawaz, Hassan Ismail, Petitjean, François, Devanne, Maxime, Weber, Jonathan, Berretti, Stefano, Webb, Geoffrey I., Forestier, Germain
Time series data can be found in almost every domain, ranging from the medical field to manufacturing and wireless communication. Generating realistic and useful exemplars and prototypes is a fundamental data analysis task. In this paper, we investigate a novel approach to generating realistic and useful exemplars and prototypes for time series data. Our approach uses a new form of time series average, the ShapeDTW Barycentric Average. We therefore turn our attention to accurately generating time series prototypes with a novel approach. The existing time series prototyping approaches rely on the Dynamic Time Warping (DTW) similarity measure such as DTW Barycentering Average (DBA) and SoftDBA. These last approaches suffer from a common problem of generating out-of-distribution artifacts in their prototypes. This is mostly caused by the DTW variant used and its incapability of detecting neighborhood similarities, instead it detects absolute similarities. Our proposed method, ShapeDBA, uses the ShapeDTW variant of DTW, that overcomes this issue. We chose time series clustering, a popular form of time series analysis to evaluate the outcome of ShapeDBA compared to the other prototyping approaches. Coupled with the k-means clustering algorithm, and evaluated on a total of 123 datasets from the UCR archive, our proposed averaging approach is able to achieve new state-of-the-art results in terms of Adjusted Rand Index.
Ellipsoid fitting with the Cayley transform
Melikechi, Omar, Dunson, David B.
We introduce Cayley transform ellipsoid fitting (CTEF), an algorithm that uses the Cayley transform to fit ellipsoids to noisy data in any dimension. Unlike many ellipsoid fitting methods, CTEF is ellipsoid specific, meaning it always returns elliptic solutions, and can fit arbitrary ellipsoids. It also significantly outperforms other fitting methods when data are not uniformly distributed over the surface of an ellipsoid. Inspired by growing calls for interpretable and reproducible methods in machine learning, we apply CTEF to dimension reduction, data visualization, and clustering in the context of cell cycle and circadian rhythm data and several classical toy examples. Since CTEF captures global curvature, it extracts nonlinear features in data that other machine learning methods fail to identify. For example, on the clustering examples CTEF outperforms 10 popular algorithms.
A Comprehensive Review of Community Detection in Graphs
Lai, Songning, Li, Jiakang, Lu, Yonggang
The study of complex networks has significantly advanced our understanding of community structures which serves as a crucial feature of real-world graphs. Detecting communities in graphs is a challenging problem with applications in sociology, biology, and computer science. Despite the efforts of an interdisciplinary community of scientists, a satisfactory solution to this problem has not yet been achieved. This review article delves into the topic of community detection in graphs, which serves as a crucial role in understanding the organization and functioning of complex systems. We begin by introducing the concept of community structure, which refers to the arrangement of vertices into clusters, with strong internal connections and weaker connections between clusters. Then, we provide a thorough exposition of various community detection methods, including a new method designed by us. Additionally, we explore real-world applications of community detection in diverse networks. In conclusion, this comprehensive review provides a deep understanding of community detection in graphs. It serves as a valuable resource for researchers and practitioners in multiple disciplines, offering insights into the challenges, methodologies, and applications of community detection in complex networks.