Clustering
The ParClusterers Benchmark Suite (PCBS): A Fine-Grained Analysis of Scalable Graph Clustering
Yu, Shangdi, Shi, Jessica, Meindl, Jamison, Eisenstat, David, Ju, Xiaoen, Tavakkol, Sasan, Dhulipala, Laxman, Łącki, Jakub, Mirrokni, Vahab, Shun, Julian
We introduce the ParClusterers Benchmark Suite (PCBS) -- a collection of highly scalable parallel graph clustering algorithms and benchmarking tools that streamline comparing different graph clustering algorithms and implementations. The benchmark includes clustering algorithms that target a wide range of modern clustering use cases, including community detection, classification, and dense subgraph mining. The benchmark toolkit makes it easy to run and evaluate multiple instances of different clustering algorithms, which can be useful for fine-tuning the performance of clustering on a given task, and for comparing different clustering algorithms based on different metrics of interest, including clustering quality and running time. Using PCBS, we evaluate a broad collection of real-world graph clustering datasets. Somewhat surprisingly, we find that the best quality results are obtained by algorithms that not included in many popular graph clustering toolkits. The PCBS provides a standardized way to evaluate and judge the quality-performance tradeoffs of the active research area of scalable graph clustering algorithms. We believe it will help enable fair, accurate, and nuanced evaluation of graph clustering algorithms in the future.
Low-Rank Optimal Transport through Factor Relaxation with Latent Coupling
Halmos, Peter, Liu, Xinhao, Gold, Julian, Raphael, Benjamin J
Optimal transport (OT) is a general framework for finding a minimum-cost transport plan, or coupling, between probability distributions, and has many applications in machine learning. A key challenge in applying OT to massive datasets is the quadratic scaling of the coupling matrix with the size of the dataset. [Forrow et al. 2019] introduced a factored coupling for the k-Wasserstein barycenter problem, which [Scetbon et al. 2021] adapted to solve the primal low-rank OT problem. We derive an alternative parameterization of the low-rank problem based on the $\textit{latent coupling}$ (LC) factorization previously introduced by [Lin et al. 2021] generalizing [Forrow et al. 2019]. The LC factorization has multiple advantages for low-rank OT including decoupling the problem into three OT problems and greater flexibility and interpretability. We leverage these advantages to derive a new algorithm $\textit{Factor Relaxation with Latent Coupling}$ (FRLC), which uses $\textit{coordinate}$ mirror descent to compute the LC factorization. FRLC handles multiple OT objectives (Wasserstein, Gromov-Wasserstein, Fused Gromov-Wasserstein), and marginal constraints (balanced, unbalanced, and semi-relaxed) with linear space complexity. We provide theoretical results on FRLC, and demonstrate superior performance on diverse applications -- including graph clustering and spatial transcriptomics -- while demonstrating its interpretability.
Temporal Patterns of Multiple Long-Term Conditions in Individuals with Intellectual Disability Living in Wales: An Unsupervised Clustering Approach to Disease Trajectories
Kousovista, Rania, Cosma, Georgina, Abakasanga, Emeka, Akbari, Ashley, Zaccardi, Francesco, Jun, Gyuchan Thomas, Kiani, Reza, Gangadharan, Satheesh
Identifying and understanding the co-occurrence of multiple long-term conditions (MLTC) in individuals with intellectual disabilities (ID) is vital for effective healthcare management. These individuals often face earlier onset and higher prevalence of MLTCs, yet specific co-occurrence patterns remain unexplored. This study applies an unsupervised approach to characterise MLTC clusters based on shared disease trajectories using electronic health records (EHRs) from 13069 individuals with ID in Wales (2000-2021). Disease associations and temporal directionality were assessed, followed by spectral clustering to group shared trajectories. The population consisted of 52.3% males and 47.7% females, with an average of 4.5 conditions per patient. Males under 45 formed a single cluster dominated by neurological conditions (32.4%), while males above 45 had three clusters, the largest characterised circulatory (51.8%). Females under 45 formed one cluster with digestive conditions (24.6%) as most prevalent, while those aged 45 and older showed two clusters: one dominated by circulatory (34.1%), and the other by digestive (25.9%) and musculoskeletal (21.9%) system conditions. Mental illness, epilepsy, and reflux were common across groups. These clusters offer insights into disease progression in individuals with ID, informing targeted interventions and personalised healthcare strategies.
Fully Dynamic Adversarially Robust Correlation Clustering in Polylogarithmic Update Time
Braverman, Vladimir, Dharangutte, Prathamesh, Pai, Shreyas, Shah, Vihan, Wang, Chen
We study the dynamic correlation clustering problem with $\textit{adaptive}$ edge label flips. In correlation clustering, we are given a $n$-vertex complete graph whose edges are labeled either $(+)$ or $(-)$, and the goal is to minimize the total number of $(+)$ edges between clusters and the number of $(-)$ edges within clusters. We consider the dynamic setting with adversarial robustness, in which the $\textit{adaptive}$ adversary could flip the label of an edge based on the current output of the algorithm. Our main result is a randomized algorithm that always maintains an $O(1)$-approximation to the optimal correlation clustering with $O(\log^{2}{n})$ amortized update time. Prior to our work, no algorithm with $O(1)$-approximation and $\text{polylog}{(n)}$ update time for the adversarially robust setting was known. We further validate our theoretical results with experiments on synthetic and real-world datasets with competitive empirical performances. Our main technical ingredient is an algorithm that maintains $\textit{sparse-dense decomposition}$ with $\text{polylog}{(n)}$ update time, which could be of independent interest.
A Dual Adaptive Assignment Approach for Robust Graph-Based Clustering
Xiang, Yang, Fan, Li, Saha, Tulika, Pang, Xiaoying, Pan, Yushan, Zhang, Haiyang, Ji, Chengtao
Graph clustering is an essential aspect of network analysis that involves grouping nodes into separate clusters. Recent developments in deep learning have resulted in advanced deep graph clustering techniques, which have proven effective in many applications. Nonetheless, these methods often encounter difficulties when dealing with the complexities of real-world graphs, particularly in the presence of noisy edges. Additionally, many denoising graph clustering strategies tend to suffer from lower performance compared to their non-denoised counterparts, training instability, and challenges in scaling to large datasets. To tackle these issues, we introduce a new framework called the Dual Adaptive Assignment Approach for Robust Graph-Based Clustering (RDSA). RDSA consists of three key components: (i) a node embedding module that effectively integrates the graph's topological features and node attributes; (ii) a structure-based soft assignment module that improves graph modularity by utilizing an affinity matrix for node assignments; and (iii) a node-based soft assignment module that identifies community landmarks and refines node assignments to enhance the model's robustness. We assess RDSA on various real-world datasets, demonstrating its superior performance relative to existing state-of-the-art methods. Our findings indicate that RDSA provides robust clustering across different graph types, excelling in clustering effectiveness and robustness, including adaptability to noise, stability, and scalability.
KULCQ: An Unsupervised Keyword-based Utterance Level Clustering Quality Metric
Guruprasad, Pranav, Mokhberian, Negar, Varghese, Nikhil, Khatri, Chandra, Kelkar, Amol
Intent discovery is crucial for both building new conversational agents and improving existing ones. While several approaches have been proposed for intent discovery, most rely on clustering to group similar utterances together. Traditional evaluation of these utterance clusters requires intent labels for each utterance, limiting scalability. Although some clustering quality metrics exist that do not require labeled data, they focus solely on cluster geometry while ignoring the linguistic nuances present in conversational transcripts. In this paper, we introduce Keyword-based Utterance Level Clustering Quality (KULCQ), an unsupervised metric that leverages keyword analysis to evaluate clustering quality. We demonstrate KULCQ's effectiveness by comparing it with existing unsupervised clustering metrics and validate its performance through comprehensive ablation studies. Our results show that KULCQ better captures semantic relationships in conversational data while maintaining consistency with geometric clustering principles.
Adaptive Transfer Clustering: A Unified Framework
Gu, Yuqi, Lyu, Zhongyuan, Wang, Kaizheng
We propose a general transfer learning framework for clustering given a main dataset and an auxiliary one about the same subjects. The two datasets may reflect similar but different latent grouping structures of the subjects. We propose an adaptive transfer clustering (ATC) algorithm that automatically leverages the commonality in the presence of unknown discrepancy, by optimizing an estimated bias-variance decomposition. It applies to a broad class of statistical models including Gaussian mixture models, stochastic block models, and latent class models. A theoretical analysis proves the optimality of ATC under the Gaussian mixture model and explicitly quantifies the benefit of transfer. Extensive simulations and real data experiments confirm our method's effectiveness in various scenarios.
Revealing the Evolution of Order in Materials Microstructures Using Multi-Modal Computer Vision
Ter-Petrosyan, Arman, Holden, Michael, Bilbrey, Jenna A., Akers, Sarah, Doty, Christina, Yano, Kayla H., Wang, Le, Paudel, Rajendra, Lang, Eric, Hattar, Khalid, Comes, Ryan B., Du, Yingge, Matthews, Bethany E., Spurgeon, Steven R.
The development of high-performance materials for microelectronics, energy storage, and extreme environments depends on our ability to describe and direct property-defining microstructural order. Our present understanding is typically derived from laborious manual analysis of imaging and spectroscopy data, which is difficult to scale, challenging to reproduce, and lacks the ability to reveal latent associations needed for mechanistic models. Here, we demonstrate a multi-modal machine learning (ML) approach to describe order from electron microscopy analysis of the complex oxide La$_{1-x}$Sr$_x$FeO$_3$. We construct a hybrid pipeline based on fully and semi-supervised classification, allowing us to evaluate both the characteristics of each data modality and the value each modality adds to the ensemble. We observe distinct differences in the performance of uni- and multi-modal models, from which we draw general lessons in describing crystal order using computer vision.
Deep Autoencoders for Unsupervised Anomaly Detection in Wildfire Prediction
Üstek, İrem, Arana-Catania, Miguel, Farr, Alexander, Petrunin, Ivan
Wildfires pose a significantly increasing hazard to global ecosystems due to the climate crisis. Due to its complex nature, there is an urgent need for innovative approaches to wildfire prediction, such as machine learning. This research took a unique approach, differentiating from classical supervised learning, and addressed the gap in unsupervised wildfire prediction using autoencoders and clustering techniques for anomaly detection. Historical weather and normalised difference vegetation index datasets of Australia for 2005 - 2021 were utilised. Two main unsupervised approaches were analysed. The first used a deep autoencoder to obtain latent features, which were then fed into clustering models, isolation forest, local outlier factor and one-class SVM for anomaly detection. The second approach used a deep autoencoder to reconstruct the input data and use reconstruction errors to identify anomalies. Long Short-Term Memory (LSTM) autoencoders and fully connected (FC) autoencoders were employed in this part, both in an unsupervised way learning only from nominal data. The FC autoencoder outperformed its counterparts, achieving an accuracy of 0.71, an F1-score of 0.74, and an MCC of 0.42. These findings highlight the practicality of this method, as it effectively predicts wildfires in the absence of ground truth, utilising an unsupervised learning technique.
Partial Multi-View Clustering via Meta-Learning and Contrastive Feature Alignment
Partial multi-view clustering (PVC) presents significant challenges practical research problem for data analysis in real-world applications, especially when some views of the data are partially missing. Existing clustering methods struggle to handle incomplete views effectively, leading to suboptimal clustering performance. In this paper, we propose a novel dual optimization framework based on contrastive learning, which aims to maximize the consistency of latent features in incomplete multi-view data and improve clustering performance through deep learning models. By combining a fine-tuned Vision Transformer and k-nearest neighbors (KNN), we fill in missing views and dynamically adjust view weights using self-supervised learning and meta-learning. Experimental results demonstrate that our framework outperforms state-of-the-art clustering models on the BDGP and HW datasets, particularly in handling complex and incomplete multi-view data.