Clustering
S3GC: Scalable Self-Supervised Graph Clustering
We study the problem of clustering graphs with additional side-information of node features. The problem has been extensively studied, and several existing methods exploit Graph Neural Networks to learn node representations. However, most existing methods either focus on generic representations rather than their clusterability or do not scale to large-scale graph datasets. In this work, we propose S3GC, which uses contrastive learning along with Graph Neural Networks and node features to learn clusterable features. We empirically demonstrate that S3GC is able to learn the correct cluster structure even when graph information or node features are individually not informative enough to learn correct clusters.
Fair Clustering via Alignment
Kim, Kunwoong, Lee, Jihu, Park, Sangchul, Kim, Yongdai
Algorithmic fairness in clustering aims to balance the proportions of instances assigned to each cluster with respect to a given sensitive attribute. While recently developed fair clustering algorithms optimize clustering objectives under specific fairness constraints, their inherent complexity or approximation often results in suboptimal clustering utility or numerical instability in practice. To resolve these limitations, we propose a new fair clustering algorithm based on a novel decomposition of the fair $K$-means clustering objective function. The proposed algorithm, called Fair Clustering via Alignment (FCA), operates by alternately (i) finding a joint probability distribution to align the data from different protected groups, and (ii) optimizing cluster centers in the aligned space. A key advantage of FCA is that it theoretically guarantees approximately optimal clustering utility for any given fairness level without complex constraints, thereby enabling high-utility fair clustering in practice. Experiments show that FCA outperforms existing methods by (i) attaining a superior trade-off between fairness level and clustering utility, and (ii) achieving near-perfect fairness without numerical instability.
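The notion of balancing group proportions within each cluster can be made concrete with a small check. The sketch below is not FCA; it only measures, for a binary sensitive attribute, the share of one group inside each cluster so that it can be compared with the overall share. The function name `group_share_per_cluster` and the toy labels are assumptions made for illustration.

```python
import numpy as np

def group_share_per_cluster(labels, groups):
    """Fraction of group-1 members in each cluster (binary attribute assumed)."""
    labels = np.asarray(labels)
    groups = np.asarray(groups)
    return {int(c): float(groups[labels == c].mean()) for c in np.unique(labels)}

# Hypothetical toy assignment: 8 instances, 2 clusters, binary attribute.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
groups = [0, 1, 0, 1, 1, 1, 1, 0]

shares = group_share_per_cluster(labels, groups)
overall = float(np.mean(groups))

# A perfectly balanced clustering would have every share equal to `overall`.
max_gap = max(abs(s - overall) for s in shares.values())
```

Here cluster 0 matches a 50% share while cluster 1 sits at 75%, against an overall share of 62.5%, so the largest gap is 0.125.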
Unsupervised Clustering for Fault Analysis in High-Voltage Power Systems Using Voltage and Current Signals
Oelhaf, Julian, Kordowich, Georg, Maier, Andreas, Jager, Johann, Bayer, Siming
The widespread use of sensors in modern power grids has led to the accumulation of large amounts of voltage and current waveform data, especially during fault events. However, the lack of labeled datasets poses a significant challenge for fault classification and analysis. This paper explores the application of unsupervised clustering techniques for fault diagnosis in high-voltage power systems. A dataset provided by the Reseau de Transport d'Electricite (RTE) is analyzed, with frequency domain features extracted using the Fast Fourier Transform (FFT). The K-Means algorithm is then applied to identify underlying patterns in the data, enabling automated fault categorization without the need for labeled training samples. The resulting clusters are evaluated in collaboration with power system experts to assess their alignment with real-world fault characteristics. The results demonstrate the potential of unsupervised learning for scalable and data-driven fault analysis, providing a robust approach to detecting and classifying power system faults with minimal prior assumptions.
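The pipeline described above (FFT features followed by K-Means) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the RTE dataset is replaced by synthetic sine waves, the feature choice (magnitudes of the first few non-DC frequency bins) is a guess, and K-Means is a plain Lloyd's loop with a deterministic farthest-first initialization chosen for reproducibility.

```python
import numpy as np

def fft_features(signals, n_bins=8):
    """Normalized magnitudes of the first n_bins non-DC FFT bins per waveform."""
    spectrum = np.abs(np.fft.rfft(signals, axis=1))
    return spectrum[:, 1:n_bins + 1] / signals.shape[1]

def farthest_first_init(X, k):
    """Deterministic seeding: start at X[0], then repeatedly take the farthest point."""
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    return np.array(centers)

def kmeans(X, k, n_iter=50):
    """Plain Lloyd's algorithm; returns (labels, centers)."""
    centers = farthest_first_init(X, k)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Synthetic stand-in for recorded waveforms (assumption, not RTE data):
# ten noisy signals at 2 cycles per window and ten at 6 cycles per window.
t = np.arange(256) / 256.0
rng = np.random.default_rng(1)
low = [np.sin(2 * np.pi * 2 * t) + 0.05 * rng.standard_normal(256) for _ in range(10)]
high = [np.sin(2 * np.pi * 6 * t) + 0.05 * rng.standard_normal(256) for _ in range(10)]
signals = np.vstack(low + high)

features = fft_features(signals)
labels, centers = kmeans(features, k=2)
```

On this toy input the two frequency groups land in separate clusters. With real fault recordings, the number of clusters and the feature set would have to be chosen and then validated by domain experts, as the abstract describes.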
Supervised Models Can Generalize Also When Trained on Random Labels
Allerbo, Oskar, Schön, Thomas B.
The success of unsupervised learning raises the question of whether supervised models can also be trained without using the information in the output $y$. In this paper, we demonstrate that this is indeed possible. The key step is to formulate the model as a smoother, i.e. of the form $\hat{f}=Sy$, and to construct the smoother matrix $S$ independently of $y$, e.g. by training on random labels. We present a simple model selection criterion based on the distribution of the out-of-sample predictions and show that, in contrast to cross-validation, this criterion can also be used without access to $y$. We demonstrate on real and synthetic data that $y$-free trained versions of linear and kernel ridge regression, smoothing splines, and neural networks perform similarly to their standard, $y$-based, versions and, most importantly, significantly better than random guessing.
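To illustrate the smoother form $\hat{f}=Sy$, the sketch below builds the smoother matrix of kernel ridge regression, $S = K(K+\lambda I)^{-1}$, from the inputs alone and only afterwards applies it to $y$. The bandwidth and ridge values are fixed by hand here, which is an assumption; the paper's $y$-free model selection criterion and its random-label training are not reproduced. The helper names are hypothetical.

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def krr_smoother(X_train, X_eval, bandwidth, ridge):
    """Smoother matrix S with predictions S @ y; note that y is never an input."""
    K = gaussian_kernel(X_train, X_train, bandwidth)
    K_eval = gaussian_kernel(X_eval, X_train, bandwidth)
    n = len(X_train)
    # S = K_eval (K + ridge*I)^{-1}; the system matrix is symmetric, so
    # solving against K_eval.T and transposing avoids an explicit inverse.
    return np.linalg.solve(K + ridge * np.eye(n), K_eval.T).T

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))

# Step 1: construct S from the inputs only.
S = krr_smoother(X, X, bandwidth=1.0, ridge=0.1)

# Step 2: the outputs enter only through a matrix-vector product.
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(40)
f_hat = S @ y

# Reference: the usual kernel ridge regression fit, for comparison.
K = gaussian_kernel(X, X, 1.0)
f_ref = K @ np.linalg.solve(K + 0.1 * np.eye(40), y)
```

Because the fit is linear in $y$, the same `S` can be reused for any output vector, which is what makes it possible to choose `S` before $y$ is seen.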
Redefining Clustered Federated Learning for System Identification: The Path of ClusterCraft
Keçeci, Ertuğrul, Güzelkaya, Müjde, Kumbasar, Tufan
This paper addresses the System Identification (SYSID) problem within the framework of federated learning. We introduce a novel algorithm, Incremental Clustering-based federated learning method for SYSID (IC-SYSID), designed to tackle SYSID challenges across multiple data sources without prior knowledge. IC-SYSID utilizes an incremental clustering method, ClusterCraft (CC), to eliminate the dependency on prior knowledge of the dataset. CC starts with a single cluster model and assigns similar local workers to the same clusters by dynamically increasing the number of clusters. To reduce the number of clusters generated by CC, we introduce ClusterMerge, where similar cluster models are merged. We also introduce enhanced ClusterCraft to reduce the generation of similar cluster models during training. Moreover, IC-SYSID addresses cluster model instability by integrating a regularization term into the loss function and initializing cluster models with scaled Glorot initialization. It also utilizes a mini-batch deep learning approach to manage large SYSID datasets during local training. Through experiments conducted on a representative real-world SYSID problem, where a fleet of vehicles collaboratively learns vehicle dynamics, we show that IC-SYSID achieves high SYSID performance while preventing the learning of unstable clusters.
Quantum Feature Optimization for Enhanced Clustering of Blockchain Transaction Data
Tsai, Yun-Cheng, Chen, Samuel Yen-Chi
Blockchain transaction data exhibits high dimensionality, noise, and intricate feature entanglement, presenting significant challenges for traditional clustering algorithms. In this study, we conduct a comparative analysis of three clustering approaches: (1) Classical K-Means Clustering, applied to pre-processed feature representations; (2) Hybrid Clustering, wherein classical features are enhanced with quantum random features extracted using randomly initialized quantum neural networks (QNNs); and (3) Fully Quantum Clustering, where a QNN is trained in a self-supervised manner leveraging a SwAV-based loss function to optimize the feature space for clustering directly. The proposed experimental framework systematically investigates the impact of quantum circuit depth and the number of learned prototypes, demonstrating that even shallow quantum circuits can effectively extract meaningful non-linear representations, significantly improving clustering performance.
Infinite hierarchical contrastive clustering for personal digital envirotyping
Huang, Ya-Yun, McClernon, Joseph, Oliver, Jason A., Engelhard, Matthew M.
Daily environments have profound influence on our health and behavior. Recent work has shown that digital envirotyping, where computer vision is applied to images of daily environments taken during ecological momentary assessment (EMA), can be used to identify meaningful relationships between environmental features and health outcomes of interest. To systematically study such effects on an individual level, it is helpful to group images into distinct environments encountered in an individual's daily life; these may then be analyzed, further grouped into related environments with similar features, and linked to health outcomes. Here we introduce infinite hierarchical contrastive clustering to address this challenge. Building on the established contrastive clustering framework, our method a) allows an arbitrary number of clusters without requiring the full Dirichlet Process machinery by placing a stick-breaking prior on predicted cluster probabilities; and b) encourages distinct environments to form well-defined sub-clusters within each cluster of related environments by incorporating a participant-specific prediction loss. Our experiments show that our model effectively identifies distinct personal environments and groups these environments into meaningful environment types. We then illustrate how the resulting clusters can be linked to various health outcomes, highlighting the potential of our approach to advance the envirotyping paradigm.
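The stick-breaking construction mentioned above can be sketched in a few lines. This shows only the generic mapping from break fractions $v_k \in (0,1)$ to weights $\pi_k = v_k \prod_{j<k}(1-v_j)$, not how the paper attaches the prior to predicted cluster probabilities. The function name, the truncation at a finite number of sticks, and the concentration value `alpha` are assumptions for illustration.

```python
import numpy as np

def stick_breaking(v):
    """Map break fractions v_k in (0, 1) to weights pi_k = v_k * prod_{j<k}(1 - v_j).

    Returns the weights and the length of stick left over after the last break.
    """
    v = np.asarray(v, dtype=float)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)))
    weights = v * remaining[:-1]
    return weights, remaining[-1]

# Breaking the stick in half three times.
weights, leftover = stick_breaking([0.5, 0.5, 0.5])
# weights -> [0.5, 0.25, 0.125], leftover -> 0.125

# A truncated draw from the prior: v_k ~ Beta(1, alpha), with alpha assumed.
rng = np.random.default_rng(0)
alpha = 2.0
prior_weights, prior_leftover = stick_breaking(rng.beta(1.0, alpha, size=10))
```

The weights and the leftover always sum to one, so adding more sticks never requires renormalizing earlier weights. This is why the number of clusters does not have to be fixed in advance.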
LOBSTUR: A Local Bootstrap Framework for Tuning Unsupervised Representations in Graph Neural Networks
Graph Neural Networks (GNNs) are increasingly used in conjunction with unsupervised learning techniques to learn powerful node representations, but their deployment is hindered by their high sensitivity to hyperparameter tuning and the absence of established methodologies for selecting the optimal models. To address these challenges, we propose LOBSTUR-GNN (Local Bootstrap for Tuning Unsupervised Representations in GNNs), a novel framework designed to adapt bootstrapping techniques for unsupervised graph representation learning. LOBSTUR-GNN tackles two main challenges: (a) adapting the bootstrap edge and feature resampling process to account for local graph dependencies in creating alternative versions of the same graph, and (b) establishing robust metrics for evaluating learned representations without ground-truth labels. Using locally bootstrapped resampling and leveraging Canonical Correlation Analysis (CCA) to assess embedding consistency, LOBSTUR provides a principled approach for hyperparameter tuning in unsupervised GNNs. We validate the effectiveness and efficiency of our proposed method through extensive experiments on established academic datasets, showing a 65.9% improvement in the classification accuracy compared to an uninformed selection of hyperparameters. Finally, we deploy our framework on a real-world application, thereby demonstrating its validity and practical utility in various settings. The code is available at https://github.com/sowonjeong/lobstur-graph-bootstrap.
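One way to see why CCA suits the comparison of embeddings is that canonical correlations are unchanged by invertible linear maps of either embedding, so two runs that differ only by a rotation score as fully consistent. The sketch below computes canonical correlations with the standard QR-plus-SVD approach. It is not the LOBSTUR metric itself: the random matrices standing in for GNN embeddings, the mean as a summary score, and the function name are all assumptions.

```python
import numpy as np

def canonical_correlations(A, B):
    """Canonical correlations between the column spaces of two embedding matrices.

    Rows are nodes, columns are embedding dimensions. Both matrices are
    assumed to have full column rank after centering.
    """
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    Qa, _ = np.linalg.qr(A)  # orthonormal basis of the centered columns of A
    Qb, _ = np.linalg.qr(B)
    s = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.clip(s, 0.0, 1.0)  # guard against tiny numerical overshoot

rng = np.random.default_rng(0)
emb = rng.standard_normal((200, 4))          # stand-in for one embedding

# The same embedding under a random orthogonal change of basis.
rotation, _ = np.linalg.qr(rng.standard_normal((4, 4)))
emb_rotated = emb @ rotation

# An unrelated embedding of the same shape.
emb_unrelated = rng.standard_normal((200, 4))

score_same = canonical_correlations(emb, emb_rotated).mean()
score_diff = canonical_correlations(emb, emb_unrelated).mean()
```

`score_same` is 1 up to rounding, while `score_diff` is much lower. A consistency score of this kind could then be compared across hyperparameter settings without any labels.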
Are machine learning interpretations reliable? A stability study on global interpretations
Gan, Luqin, Zikry, Tarek M., Allen, Genevera I.
As machine learning systems are increasingly used in high-stakes domains, there is a growing emphasis placed on making them interpretable to improve trust in these systems. In response, a range of interpretable machine learning (IML) methods have been developed to generate human-understandable insights into otherwise black box models. With these methods, a fundamental question arises: Are these interpretations reliable? Unlike with prediction accuracy or other evaluation metrics for supervised models, the proximity to the true interpretation is difficult to define. Instead, we ask a closely related question that we argue is a prerequisite for reliability: Are these interpretations stable? We define stability as findings that are consistent or reliable under small random perturbations to the data or algorithms. In this study, we conduct the first systematic, large-scale empirical stability study on popular machine learning global interpretations for both supervised and unsupervised tasks on tabular data. Our findings reveal that popular interpretation methods are frequently unstable, notably less stable than the predictions themselves, and that there is no association between the accuracy of machine learning predictions and the stability of their associated interpretations. Moreover, we show that no single method consistently provides the most stable interpretations across a range of benchmark datasets. Overall, these results suggest that interpretability alone does not warrant trust, and underscore the need for rigorous evaluation of interpretation stability in future work. To support these principles, we have developed and released an open source IML dashboard and Python package to enable researchers to assess the stability and reliability of their own data-driven interpretations and discoveries.
Data Balancing Strategies: A Survey of Resampling and Augmentation Methods
Yousefimehr, Behnam, Ghatee, Mehdi, Seifi, Mohammad Amin, Fazli, Javad, Tavakoli, Sajed, Rafei, Zahra, Ghaffari, Shervin, Nikahd, Abolfazl, Gandomani, Mahdi Razi, Orouji, Alireza, Kashani, Ramtin Mahmoudi, Heshmati, Sarina, Mousavi, Negin Sadat
Imbalanced data poses a significant obstacle in machine learning, as an unequal distribution of class labels often results in skewed predictions and diminished model accuracy. To mitigate this problem, various resampling strategies have been developed, encompassing both oversampling and undersampling techniques aimed at modifying class proportions. Conventional oversampling approaches like SMOTE enhance the representation of the minority class, whereas undersampling methods focus on trimming down the majority class. Advances in deep learning have facilitated the creation of more complex solutions, such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), which are capable of producing high-quality synthetic examples. This paper reviews a broad spectrum of data balancing methods, classifying them into categories including synthetic oversampling, adaptive techniques, generative models, ensemble-based strategies, hybrid approaches, undersampling, and neighbor-based methods. Furthermore, it highlights current developments in resampling techniques and discusses practical implementations and case studies that validate their effectiveness. The paper concludes by offering perspectives on potential directions for future exploration in this domain.
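The core idea of SMOTE-style oversampling can be sketched briefly: each synthetic point is placed at a random position on the segment between a minority sample and one of its nearest minority neighbors. The code below is a simplified illustration of that idea, not a reference implementation of any method in the survey. The function name, the toy data, and the choice of `k` are assumptions.

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples by neighbor interpolation.

    X_min holds only minority-class rows and must contain more than k samples.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise distances within the minority class; exclude self-matches.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    neighbors = np.argsort(d, axis=1)[:, :k]  # k nearest neighbors per sample

    base = rng.integers(0, len(X_min), size=n_new)        # seed samples
    picked = neighbors[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                          # position on the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Hypothetical minority class with six 2-D points.
X_min = np.array([[0.0, 0.0], [1.0, 0.2], [0.8, 1.0],
                  [0.1, 0.9], [0.5, 0.5], [0.9, 0.6]])
synthetic = smote_like(X_min, n_new=10)
```

Because every synthetic point is a convex combination of two existing minority points, the new samples stay inside the region already occupied by the minority class. This is also why the plain method can struggle when that region overlaps the majority class.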