Clustering
Illuminating Patterns of Divergence: DataDios SmartDiff for Large-Scale Data Difference Analysis
Poduri, Aryan, Tailor, Yashwant
Data engineering workflows require reliable differencing across files, databases, and query outputs, yet existing tools falter under schema drift, heterogeneous types, and limited explainability. SmartDiff is a unified system that combines schema-aware mapping, type-specific comparators, and parallel execution. It aligns evolving schemas, compares structured and semi-structured data (strings, numbers, dates, JSON/XML), and clusters results with labels that explain how and why differences occur. On multi-million-row datasets, SmartDiff achieves over 95 percent precision and recall, runs 30 to 40 percent faster, and uses 30 to 50 percent less memory than baselines; in user studies, it reduces root-cause analysis time from 10 hours to 12 minutes. An LLM-assisted labeling pipeline produces deterministic, schema-valid multilabel explanations using retrieval augmentation and constrained decoding; ablations show further gains in label accuracy and time to diagnosis over rules-only baselines. These results indicate SmartDiff's utility for migration validation, regression testing, compliance auditing, and continuous data quality monitoring. Index Terms: data differencing, schema evolution, data quality, parallel processing, clustering, explainable validation, big data
Friend or Foe
Cherendichenko, Oleksandr, Solowiej-Wedderburn, Josephine, Carroll, Laura M., Libby, Eric
A fundamental challenge in microbial ecology is determining whether bacteria compete or cooperate in different environmental conditions. With recent advances in genome-scale metabolic models, we are now capable of simulating interactions between thousands of pairs of bacteria in thousands of different environmental settings at a scale infeasible experimentally. These approaches can generate tremendous amounts of data that can be exploited by state-of-the-art machine learning algorithms to uncover the mechanisms driving interactions. Here, we present Friend or Foe, a compendium of 64 tabular environmental datasets, consisting of more than 26M shared environments for more than 10K pairs of bacteria sampled from two of the largest collections of metabolic models. The Friend or Foe datasets are curated for a wide range of machine learning tasks -- supervised, unsupervised, and generative -- to address specific questions underlying bacterial interactions. We benchmarked a selection of the most recent models for each of these tasks and our results indicate that machine learning can be successful in this application to microbial ecology. Going beyond, analyses of the Friend or Foe compendium can shed light on the predictability of bacterial interactions and highlight novel research directions into how bacteria infer and navigate their relationships.
Hybrid Perception and Equivariant Diffusion for Robust Multi-Node Rebar Tying
Wang, Zhitao, Xiong, Yirong, Horowitz, Roberto, Wang, Yanke, Han, Yuxing
Rebar tying is a repetitive but critical task in reinforced concrete construction, typically performed manually at considerable ergonomic risk. Recent advances in robotic manipulation hold the potential to automate the tying process, yet face challenges in accurately estimating tying poses in congested rebar nodes. In this paper, we introduce a hybrid perception and motion planning approach that integrates geometry-based perception with Equivariant Denoising Diffusion on SE(3) (Diffusion-EDFs) to enable robust multi-node rebar tying with minimal training data. Our perception module utilizes density-based clustering (DBSCAN), geometry-based node feature extraction, and principal component analysis (PCA) to segment rebar bars, identify rebar nodes, and estimate orientation vectors for sequential ranking, even in complex, unstructured environments. The motion planner, based on Diffusion-EDFs, is trained on as few as 5-10 demonstrations to generate sequential end-effector poses that optimize collision avoidance and tying efficiency. The proposed system is validated on various rebar meshes, including single-layer, multi-layer, and cluttered configurations, demonstrating high success rates in node detection and accurate sequential tying. Compared with conventional approaches that rely on large datasets or extensive manual parameter tuning, our method achieves robust, efficient, and adaptable multi-node tying while significantly reducing data requirements. This result underscores the potential of hybrid perception and diffusion-driven planning to enhance automation in on-site construction tasks, improving both safety and labor efficiency.
From Experts to a Generalist: Toward General Whole-Body Control for Humanoid Robots
Wang, Yuxuan, Yang, Ming, Ding, Ziluo, Zhang, Yu, Zeng, Weishuai, Xu, Xinrun, Jiang, Haobin, Lu, Zongqing
Achieving general agile whole-body control on humanoid robots remains a major challenge due to diverse motion demands and data conflicts. While existing frameworks excel in training single motion-specific policies, they struggle to generalize across highly varied behaviors due to conflicting control requirements and mismatched data distributions. In this work, we propose BumbleBee (BB), an expert-generalist learning framework that combines motion clustering and sim-to-real adaptation to overcome these challenges. BB first leverages an autoencoder-based clustering method to group behaviorally similar motions using motion features and motion descriptions. Expert policies are then trained within each cluster and refined with real-world data through iterative delta action modeling to bridge the sim-to-real gap. Finally, these experts are distilled into a unified generalist controller that preserves agility and robustness across all motion types. Experiments on two simulations and a real humanoid robot demonstrate that BB achieves state-of-the-art general whole-body control, setting a new benchmark for agile, robust, and generalizable humanoid performance in the real world. The project webpage is available at https://beingbeyond.github.io/BumbleBee/.
Normalisation of SWIFT Message Counterparties with Feature Extraction and Clustering
Schoinas, Thanasis, Guinard, Benjamin, Esbati, Diba, Chalk, Richard
Short text clustering is a known use case in the text analytics community. When the structure and content falls in the natural language domain e.g. Twitter posts or instant messages, then natural language techniques can be used, provided texts are of sufficient length to allow for use of (pre)trained models to extract meaningful information, such as part-of-speech or topic annotations. However, natural language models are not suitable for clustering transaction counterparties, as they are found in bank payment messaging systems, such as SWIFT. The manually typed tags are typically physical or legal entity details, which lack sentence structure, while containing all the variations and noise that manual entry introduces. This leaves a gap in an investigator or counter-fraud professional's toolset when looking to augment their knowledge of payment flow originator and beneficiary entities and trace funds and assets. A gap that vendors traditionally try to close with fuzzy matching tools. With these considerations in mind, we are proposing a hybrid string similarity, topic modelling, hierarchical clustering and rule-based pipeline to facilitate clustering of transaction counterparties, also catering for unknown number of expected clusters. We are also devising metrics to supplement the evaluation of the approach, based on the well-known measures of precision and recall. Testing on a real-life labelled dataset demonstrates significantly improved performance over a baseline rule-based ('keyword') approach. The approach retains most of the interpretability found in rule-based systems, as the former adds an additional level of cluster refinement to the latter. The resulting workflow reduces the need for manual review. When only a subset of the population needs to be investigated, such as in sanctions investigations, the approach allows for better control of the risks of missing entity variations.
Human-AI Collaborative Bot Detection in MMORPGs
In Massively Multiplayer Online Role-Playing Games (MMORPGs), auto-leveling bots exploit automated programs to level up characters at scale, undermining gameplay balance and fairness. Detecting such bots is challenging, not only because they mimic human behavior, but also because punitive actions require explainable justification to avoid legal and user experience issues. In this paper, we present a novel framework for detecting auto-leveling bots by leveraging contrastive representation learning and clustering techniques in a fully unsupervised manner to identify groups of characters with similar level-up patterns. To ensure reliable decisions, we incorporate a Large Language Model (LLM) as an auxiliary reviewer to validate the clustered groups, effectively mimicking a secondary human judgment. We also introduce a growth curve-based visualization to assist both the LLM and human moderators in assessing leveling behavior. This collaborative approach improves the efficiency of bot detection workflows while maintaining explainability, thereby supporting scalable and accountable bot regulation in MMORPGs.
MS-ConTab: Multi-Scale Contrastive Learning of Mutation Signatures for Pan Cancer Representation and Stratification
Dou, Yifan, Khadre, Adam, Petreaca, Ruben C, Mirzaei, Golrokh
Motivation. Understanding the pan-cancer mutational landscape offers critical insights into the molecular mechanisms underlying tumorigenesis. While patient-level machine learning techniques have been widely employed to identify tumor subtypes, cohort-level clustering, where entire cancer types are grouped based on shared molecular features, has largely relied on classical statistical methods. Results. In this study, we introduce a novel unsupervised contrastive learning framework to cluster 43 cancer types based on coding mutation data derived from the COSMIC database. For each cancer type, we construct two complementary mutation signatures: a gene-level profile capturing nucleotide substitution patterns across the most frequently mutated genes, and a chromosome-level profile representing normalized substitution frequencies across chromosomes. These dual views are encoded using TabNet encoders and optimized via a multi-scale contrastive learning objective (NT-Xent loss) to learn unified cancer-type embeddings. We demonstrate that the resulting latent representations yield biologically meaningful clusters of cancer types, aligning with known mutational processes and tissue origins. Our work represents the first application of contrastive learning to cohort-level cancer clustering, offering a scalable and interpretable framework for mutation-driven cancer subtyping.
Bounds on Perfect Node Classification: A Convex Graph Clustering Perspective
Shahriari-Mehr, Firooz, Aliakbari, Javad, Amat, Alexandre Graell i, Panahi, Ashkan
We present an analysis of the transductive node classification problem, where the underlying graph consists of communities that agree with the node labels and node features. For node classification, we propose a novel optimization problem that incorporates the node-specific information (labels and features) in a spectral graph clustering framework. Studying this problem, we demonstrate a synergy between the graph structure and node-specific information. In particular, we show that suitable node-specific information guarantees the solution of our optimization problem perfectly recovering the communities, under milder conditions than the bounds on graph clustering alone. We present algorithmic solutions to our optimization problem and numerical experiments that confirm such a synergy.
Parameter-Free Structural-Diversity Message Passing for Graph Neural Networks
Kong, Mingyue, Zhang, Yinglong, Xu, Chengda, Xia, Xuewen, Xu, Xing
Graph Neural Networks (GNNs) have shown remarkable performance in structured data modeling tasks such as node classification. However, mainstream approaches generally rely on a large number of trainable parameters and fixed aggregation rules, making it difficult to adapt to graph data with strong structural heterogeneity and complex feature distributions. This often leads to over-smoothing of node representations and semantic degradation. To address these issues, this paper proposes a parameter-free graph neural network framework based on structural diversity, namely SDGNN (Structural-Diversity Graph Neural Network). The framework is inspired by structural diversity theory and designs a unified structural-diversity message passing mechanism that simultaneously captures the heterogeneity of neighborhood structures and the stability of feature semantics, without introducing additional trainable parameters. Unlike traditional parameterized methods, SDGNN does not rely on complex model training, but instead leverages complementary modeling from both structure-driven and feature-driven perspectives, thereby effectively improving adaptability across datasets and scenarios. Experimental results show that on eight public benchmark datasets and an interdisciplinary PubMed citation network, SDGNN consistently outperforms mainstream GNNs under challenging conditions such as low supervision, class imbalance, and cross-domain transfer. This work provides a new theoretical perspective and general approach for the design of parameter-free graph neural networks, and further validates the importance of structural diversity as a core signal in graph representation learning. To facilitate reproducibility and further research, the full implementation of SDGNN has been released at: https://github.com/mingyue15694/SGDNN/tree/main
Comparing Cluster-Based Cross-Validation Strategies for Machine Learning Model Evaluation
Spezia, Afonso Martini, Fontanari, Thomas, Recamonde-Mendoza, Mariana
Cross-validation plays a fundamental role in Machine Learning, enabling robust evaluation of model performance and preventing overestimation on training and validation data. However, one of its drawbacks is the potential to create data subsets (folds) that do not adequately represent the diversity of the original dataset, which can lead to biased performance estimates. The objective of this work is to deepen the investigation of cluster-based cross-validation strategies by analyzing the performance of different clustering algorithms through experimental comparison. Additionally, a new cross-validation technique that combines Mini Batch K-Means with class stratification is proposed. Experiments were conducted on 20 datasets (both balanced and imbalanced) using four supervised learning algorithms, comparing cross-validation strategies in terms of bias, variance, and computational cost. The technique that uses Mini Batch K-Means with class stratification outperformed others in terms of bias and variance on balanced datasets, though it did not significantly reduce computational cost. On imbalanced datasets, traditional stratified cross-validation consistently performed better, showing lower bias, variance, and computational cost, making it a safe choice for performance evaluation in scenarios with class imbalance. In the comparison of different clustering algorithms, no single algorithm consistently stood out as superior. Overall, this work contributes to improving predictive model evaluation strategies by providing a deeper understanding of the potential of cluster-based data splitting techniques and reaffirming the effectiveness of well-established strategies like stratified cross-validation. Moreover, it highlights perspectives for increasing the robustness and reliability of model evaluations, especially in datasets with clustering characteristics.