Goto

Collaborating Authors

 Clustering


Cold-Start Active Correlation Clustering

arXiv.org Artificial Intelligence

We study active correlation clustering where pairwise similarities are not provided upfront and must be queried in a cost-efficient manner through active learning. Specifically, we focus on the cold-start scenario, where no true initial pairwise similarities are available for active learning. To address this challenge, we propose a coverage-aware method that encourages diversity early in the process. We demonstrate the effectiveness of our approach through several synthetic and real-world experiments.


Hybrid Quantum-Classical Optimisation of Traveling Salesperson Problem

arXiv.org Artificial Intelligence

The Traveling Salesperson Problem (TSP), a quintessential NP-hard combinatorial optimisation challenge, is vital for logistics and network design but limited by exponential complexity in large instances. We propose a hybrid quantum-classical framework integrating variational quantum eigensolver (VQE) optimisation with classical machine learning, using K-means clustering for problem decomposition and a RandomForestRegressor for path refinement. Evaluated on 80 European cities (from 4 to 80 cities, 38,500 samples in total) via Qiskit's AerSimulator and ibm_kyiv 127-qubit backend, the hybrid approach outperforms quantum-only methods, achieving an approximation ratio of 1.0287 at 80 cities, a 47.5% improvement over quantum-only's 1.9614, nearing the classical baseline. Machine learning reduces variability in tour distances (interquartile range, IQR - the spread of the middle 50% of results relative to the median - from 0.06 to 0.04), enhancing stability despite noisy intermediate-scale quantum (NISQ) noise. This framework underscores hybrid strategies' potential for scalable TSP optimisation, with future hardware advancements promising practical quantum advantages.


Adaptive Graph Coarsening for Efficient GNN Training

arXiv.org Artificial Intelligence

We propose an adaptive graph coarsening method to jointly learn graph neural network (GNN) parameters and merge nodes via K-means clustering during training. As real-world graphs grow larger, processing them directly becomes increasingly challenging and sometimes infeasible. Tailoring algorithms to large-scale data may sacrifice performance, so we instead consider graph reduction to decrease the amount of data used during training. In particular, we propose a method to simultaneously train a GNN and coarsen its graph by partitioning nodes via K-means clustering based on their embeddings. Unlike past graph coarsening works, our approach allows us to merge nodes during training. Not only does this preclude coarsening as a preprocessing step, but our node clusters can adapt to the learning task instead of relying solely on graph connectivity and features. Thus, our method is amenable to scenarios that are challenging for other methods, such as heterophilic data. We validate our approach on both homophilic and heterophilic node classification datasets. We further visualize relationships between node embeddings and their corresponding clusters to illustrate that our coarsened graph adapts to the learning task during training.


Infrastructure Sensor-enabled Vehicle Data Generation using Multi-Sensor Fusion for Proactive Safety Applications at Work Zone

arXiv.org Artificial Intelligence

INFRASTRUCTURE SENSOR-ENABLED VEHICLE DA T A GENERA TION USING MUL TI-SENSOR FUSION FOR PROACTIVE SAFETY APPLICA TIONS A T WORK ZONE Suhala Rabab Saba Department of Civil, Construction & Environmental Engineering, The University of Alabama Smart Communities and Innovation Building (SCIB), 28 Kirkbride Lane, Tuscaloosa, AL 35487-0288 Email: ssaba@crimson.ua.edu Saba, Khan, Ahmad, Cao, Rahman, Zhao, Huynh, and Ozguven 3 ABSTRACT Infrastructure-based sensing and real-time trajectory generation hold significant promise for improving safety in high-risk roadway segments like work zones, yet practical deployments are hindered by perspective distortion, complex geometry, occlusions, and costs. This study tackles these barriers by (i) integrating roadside camera and LiDAR sensors into a cosimulation environment to develop a scalable, cost-effective vehicle detection and localization framework, and (ii) employing a Kalman Filter-based late fusion strategy to enhance trajectory consistency and accuracy. In simulation, the fusion algorithm reduced longitudinal error by up to 70% compared to individual sensors while preserving lateral accuracy within 1-3 meters. Field validation in an active work zone, using LiDAR, a radar-camera rig, and RTK-GPS as ground truth, demonstrated that the fused trajectories closely match real vehicle paths, even when single-sensor data are intermittent or degraded. These results confirm that KF based sensor fusion can reliably compensate for individual sensor limitations, providing precise and robust vehicle tracking capabilities. Our approach thus offers a practical pathway to deploy infrastructure-enabled multi-sensor systems for proactive safety measures in complex traffic environments. Keywords: work zone, fusion, lidar, camera, localization, safety Saba, Khan, Ahmad, Cao, Rahman, Zhao, Huynh, and Ozguven 4 INTRODUCTION Work zone crashes do not necessarily impact only the vehicles and people directly involved; instead, they have cascading effects that cause operational delays for passing vehicles and project completion delays for work zone contractors. The Federal Motor Carrier Safety Administration (FMCSA) report indicates that commercial motor vehicles (CMVs) are involved in one-third of work zone fatal crashes, although they represent only 5% of all vehicular traffic (1). In addition, speed is a contributing factor in 26% of all fatal work zone crashes (2). According to Jiao (2022) (3), 13% of CMV drivers are fatigued when they are involved in crashes.


Clustering Aggregation as Maximum-Weight Independent Set

Neural Information Processing Systems

We formulate clustering aggregation as a special instance of Maximum-Weight Independent Set (MWIS) problem. For a given dataset, an attributed graph is constructed from the union of the input clusterings generated by different underlying clustering algorithms with different parameters. The vertices, which represent the distinct clusters, are weighted by an internal index measuring both cohesion and separation. The edges connect the vertices whose corresponding clusters overlap. Intuitively, an optimal aggregated clustering can be obtained by selecting an optimal subset of non-overlapping clusters partitioning the dataset together.


Binary to Bushy: Bayesian Hierarchical Clustering with the Beta Coalescent

Neural Information Processing Systems

Discovering hierarchical regularities in data is a key problem in interacting with large datasets, modeling cognition, and encoding knowledge. A previous Bayesian solution---Kingman's coalescent---provides a convenient probabilistic model for data represented as a binary tree. Unfortunately, this is inappropriate for data better described by bushier trees. We generalize an existing belief propagation framework of Kingman's coalescent to the beta coalescent, which models a wider range of tree structures. Because of the complex combinatorial search over possible structures, we develop new sampling schemes using sequential Monte Carlo and Dirichlet process mixture models, which render inference efficient and tractable.


Dependent nonparametric trees for dynamic hierarchical clustering

Neural Information Processing Systems

Hierarchical clustering methods offer an intuitive and powerful way to model a wide variety of data sets. However, the assumption of a fixed hierarchy is often overly restrictive when working with data generated over a period of time: We expect both the structure of our hierarchy, and the parameters of the clusters, to evolve with time. In this paper, we present a distribution over collections of time-dependent, infinite-dimensional trees that can be used to model evolving hierarchies, and present an efficient and scalable algorithm for performing approximate inference in such a model. We demonstrate the efficacy of our model and inference algorithm on both synthetic data and real-world document corpora.


Community detection robustness of graph neural networks

arXiv.org Machine Learning

Graph neural networks (GNNs) are increasingly widely used for community detection in attributed networks. They combine structural topology with node attributes through message passing and pooling. However, their robustness or lack of thereof with respect to different perturbations and targeted attacks in conjunction with community detection tasks is not well understood. To shed light into latent mechanisms behind GNN sensitivity on community detection tasks, we conduct a systematic computational evaluation of six widely adopted GNN architectures: GCN, GAT, Graph-SAGE, DiffPool, MinCUT, and DMoN. The analysis covers three perturbation categories: node attribute manipulations, edge topology distortions, and adversarial attacks. We use element-centric similarity as the evaluation metric on synthetic benchmarks and real-world citation networks. Our findings indicate that supervised GNNs tend to achieve higher baseline accuracy, while unsupervised methods, particularly DMoN, maintain stronger resilience under targeted and adversarial perturbations. Furthermore, robustness appears to be strongly influenced by community strength, with well-defined communities reducing performance loss. Across all models, node attribute perturbations associated with targeted edge deletions and shift in attribute distributions tend to cause the largest degradation in community recovery. These findings highlight important trade-offs between accuracy and robustness in GNN-based community detection and offer new insights into selecting architectures resilient to noise and adversarial attacks.


STRAPSim: A Portfolio Similarity Metric for ETF Alignment and Portfolio Trades

arXiv.org Artificial Intelligence

Accurately measuring portfolio similarity is critical for a wide range of financial applications, including Exchange-traded Fund (ETF) recommendation, portfolio trading, and risk alignment. Existing similarity measures often rely on exact asset overlap or static distance metrics, which fail to capture similarities among the constituents (e.g., securities within the portfolio) as well as nuanced relationships between partially overlapping portfolios with heterogeneous weights. We introduce STRAPSim (Semantic, Two-level, Residual-Aware Portfolio Similarity), a novel method that computes portfolio similarity by matching constituents based on semantic similarity, weighting them according to their portfolio share, and aggregating results via residual-aware greedy alignment. We benchmark our approach against Jaccard, weighted Jaccard, as well as BERTScore-inspired variants across public classification, regression, and recommendation tasks, as well as on corporate bond ETF datasets. Empirical results show that our method consistently outperforms baselines in predictive accuracy and ranking alignment, achieving the highest Spearman correlation with return-based similarity. By leveraging constituent-aware matching and dynamic reweighting, portfolio similarity offers a scalable, interpretable framework for comparing structured asset baskets, demonstrating its utility in ETF benchmarking, portfolio construction, and systematic execution.


GBSK: Skeleton Clustering via Granular-ball Computing and Multi-Sampling for Large-Scale Data

arXiv.org Artificial Intelligence

To effectively handle clustering task for large-scale datasets, we propose a novel scalable skeleton clustering algorithm, namely GBSK, which leverages the granular-ball technique to capture the underlying structure of data. By multi-sampling the dataset and constructing multi-grained granular-balls, GBSK progressively uncovers a statistical "skeleton" -- a spatial abstraction that approximates the essential structure and distribution of the original data. This strategy enables GBSK to dramatically reduce computational overhead while maintaining high clustering accuracy. In addition, we introduce an adaptive version, AGBSK, with simplified parameter settings to enhance usability and facilitate deployment in real-world scenarios. Extensive experiments conducted on standard computing hardware demonstrate that GBSK achieves high efficiency and strong clustering performance on large-scale datasets, including one with up to 100 million instances across 256 dimensions. Our implementation and experimental results are available at: https://github.com/XFastDataLab/GBSK/.