AITopics

Data engineering workflows require reliable differencing across files, databases, and query outputs, yet existing tools falter under schema drift, heterogeneous types, and limited explainability. SmartDiff is a unified system that combines schema-aware mapping, type-specific comparators, and parallel execution. It aligns evolving schemas, compares structured and semi-structured data (strings, numbers, dates, JSON/XML), and clusters results with labels that explain how and why differences occur. On multi-million-row datasets, SmartDiff achieves over 95 percent precision and recall, runs 30 to 40 percent faster, and uses 30 to 50 percent less memory than baselines; in user studies, it reduces root-cause analysis time from 10 hours to 12 minutes. An LLM-assisted labeling pipeline produces deterministic, schema-valid multilabel explanations using retrieval augmentation and constrained decoding; ablations show further gains in label accuracy and time to diagnosis over rules-only baselines. These results indicate SmartDiff's utility for migration validation, regression testing, compliance auditing, and continuous data quality monitoring. Index Terms: data differencing, schema evolution, data quality, parallel processing, clustering, explainable validation, big data

data mining, machine learning, natural language, (23 more...)

2509.00293

Genre: Research Report (1.00)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(3 more...)

Cherendichenko, Oleksandr, Solowiej-Wedderburn, Josephine, Carroll, Laura M., Libby, Eric

Friend or Foe

A fundamental challenge in microbial ecology is determining whether bacteria compete or cooperate in different environmental conditions. With recent advances in genome-scale metabolic models, we are now capable of simulating interactions between thousands of pairs of bacteria in thousands of different environmental settings at a scale infeasible experimentally. These approaches can generate tremendous amounts of data that can be exploited by state-of-the-art machine learning algorithms to uncover the mechanisms driving interactions. Here, we present Friend or Foe, a compendium of 64 tabular environmental datasets, consisting of more than 26M shared environments for more than 10K pairs of bacteria sampled from two of the largest collections of metabolic models. The Friend or Foe datasets are curated for a wide range of machine learning tasks -- supervised, unsupervised, and generative -- to address specific questions underlying bacterial interactions. We benchmarked a selection of the most recent models for each of these tasks and our results indicate that machine learning can be successful in this application to microbial ecology. Going beyond, analyses of the Friend or Foe compendium can shed light on the predictability of bacterial interactions and highlight novel research directions into how bacteria infer and navigate their relationships.

artificial intelligence, interaction, machine learning, (18 more...)

2509.00123

Country:

North America > United States (0.15)
Europe > Sweden (0.14)

Genre: Research Report > New Finding (0.48)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)

Hybrid Perception and Equivariant Diffusion for Robust Multi-Node Rebar Tying

Wang, Zhitao, Xiong, Yirong, Horowitz, Roberto, Wang, Yanke, Han, Yuxing

Rebar tying is a repetitive but critical task in reinforced concrete construction, typically performed manually at considerable ergonomic risk. Recent advances in robotic manipulation hold the potential to automate the tying process, yet face challenges in accurately estimating tying poses in congested rebar nodes. In this paper, we introduce a hybrid perception and motion planning approach that integrates geometry-based perception with Equivariant Denoising Diffusion on SE(3) (Diffusion-EDFs) to enable robust multi-node rebar tying with minimal training data. Our perception module utilizes density-based clustering (DBSCAN), geometry-based node feature extraction, and principal component analysis (PCA) to segment rebar bars, identify rebar nodes, and estimate orientation vectors for sequential ranking, even in complex, unstructured environments. The motion planner, based on Diffusion-EDFs, is trained on as few as 5-10 demonstrations to generate sequential end-effector poses that optimize collision avoidance and tying efficiency. The proposed system is validated on various rebar meshes, including single-layer, multi-layer, and cluttered configurations, demonstrating high success rates in node detection and accurate sequential tying. Compared with conventional approaches that rely on large datasets or extensive manual parameter tuning, our method achieves robust, efficient, and adaptable multi-node tying while significantly reducing data requirements. This result underscores the potential of hybrid perception and diffusion-driven planning to enhance automation in on-site construction tasks, improving both safety and labor efficiency.

artificial intelligence, machine learning, point cloud, (14 more...)

2509.00065

Country:

North America > United States (0.46)
Asia > China (0.29)
Europe > Spain (0.28)

Genre: Research Report > New Finding (0.34)

Industry: Materials > Construction Materials (0.73)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)

From Experts to a Generalist: Toward General Whole-Body Control for Humanoid Robots

Wang, Yuxuan, Yang, Ming, Ding, Ziluo, Zhang, Yu, Zeng, Weishuai, Xu, Xinrun, Jiang, Haobin, Lu, Zongqing

Achieving general agile whole-body control on humanoid robots remains a major challenge due to diverse motion demands and data conflicts. While existing frameworks excel in training single motion-specific policies, they struggle to generalize across highly varied behaviors due to conflicting control requirements and mismatched data distributions. In this work, we propose BumbleBee (BB), an expert-generalist learning framework that combines motion clustering and sim-to-real adaptation to overcome these challenges. BB first leverages an autoencoder-based clustering method to group behaviorally similar motions using motion features and motion descriptions. Expert policies are then trained within each cluster and refined with real-world data through iterative delta action modeling to bridge the sim-to-real gap. Finally, these experts are distilled into a unified generalist controller that preserves agility and robustness across all motion types. Experiments on two simulations and a real humanoid robot demonstrate that BB achieves state-of-the-art general whole-body control, setting a new benchmark for agile, robust, and generalizable humanoid performance in the real world. The project webpage is available at https://beingbeyond.github.io/BumbleBee/.

artificial intelligence, expert policy, machine learning, (14 more...)

2506.12779

Genre: Research Report (0.50)

Industry: Leisure & Entertainment > Games > Computer Games (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Robots > Humanoid Robots (0.91)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.34)

Schoinas, Thanasis, Guinard, Benjamin, Esbati, Diba, Chalk, Richard

Normalisation of SWIFT Message Counterparties with Feature Extraction and Clustering

arXiv.org Artificial IntelligenceSep-1-2025

Short text clustering is a known use case in the text analytics community. When the structure and content falls in the natural language domain e.g. Twitter posts or instant messages, then natural language techniques can be used, provided texts are of sufficient length to allow for use of (pre)trained models to extract meaningful information, such as part-of-speech or topic annotations. However, natural language models are not suitable for clustering transaction counterparties, as they are found in bank payment messaging systems, such as SWIFT. The manually typed tags are typically physical or legal entity details, which lack sentence structure, while containing all the variations and noise that manual entry introduces. This leaves a gap in an investigator or counter-fraud professional's toolset when looking to augment their knowledge of payment flow originator and beneficiary entities and trace funds and assets. A gap that vendors traditionally try to close with fuzzy matching tools. With these considerations in mind, we are proposing a hybrid string similarity, topic modelling, hierarchical clustering and rule-based pipeline to facilitate clustering of transaction counterparties, also catering for unknown number of expected clusters. We are also devising metrics to supplement the evaluation of the approach, based on the well-known measures of precision and recall. Testing on a real-life labelled dataset demonstrates significantly improved performance over a baseline rule-based ('keyword') approach. The approach retains most of the interpretability found in rule-based systems, as the former adds an additional level of cluster refinement to the latter. The resulting workflow reduces the need for manual review. When only a subset of the population needs to be investigated, such as in sanctions investigations, the approach allows for better control of the risks of missing entity variations.

data mining, machine learning, variation, (22 more...)

2508.21081

Country: Europe > United Kingdom (0.15)

Genre: Research Report (0.82)

Industry:

Banking & Finance (0.68)
Law (0.48)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Rule-Based Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
(2 more...)

Son, Jaeman, Kim, Hyunsoo

Human-AI Collaborative Bot Detection in MMORPGs

In Massively Multiplayer Online Role-Playing Games (MMORPGs), auto-leveling bots exploit automated programs to level up characters at scale, undermining gameplay balance and fairness. Detecting such bots is challenging, not only because they mimic human behavior, but also because punitive actions require explainable justification to avoid legal and user experience issues. In this paper, we present a novel framework for detecting auto-leveling bots by leveraging contrastive representation learning and clustering techniques in a fully unsupervised manner to identify groups of characters with similar level-up patterns. To ensure reliable decisions, we incorporate a Large Language Model (LLM) as an auxiliary reviewer to validate the clustered groups, effectively mimicking a secondary human judgment. We also introduce a growth curve-based visualization to assist both the LLM and human moderators in assessing leveling behavior. This collaborative approach improves the efficiency of bot detection workflows while maintaining explainability, thereby supporting scalable and accountable bot regulation in MMORPGs.

large language model, machine learning, natural language, (12 more...)

2508.20578

Country: Asia (0.28)

Genre: Research Report > New Finding (0.47)

Industry:

Leisure & Entertainment > Games > Computer Games (1.00)
Information Technology (1.00)
Law (0.86)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.67)

Dou, Yifan, Khadre, Adam, Petreaca, Ruben C, Mirzaei, Golrokh

MS-ConTab: Multi-Scale Contrastive Learning of Mutation Signatures for Pan Cancer Representation and Stratification

Motivation. Understanding the pan-cancer mutational landscape offers critical insights into the molecular mechanisms underlying tumorigenesis. While patient-level machine learning techniques have been widely employed to identify tumor subtypes, cohort-level clustering, where entire cancer types are grouped based on shared molecular features, has largely relied on classical statistical methods. Results. In this study, we introduce a novel unsupervised contrastive learning framework to cluster 43 cancer types based on coding mutation data derived from the COSMIC database. For each cancer type, we construct two complementary mutation signatures: a gene-level profile capturing nucleotide substitution patterns across the most frequently mutated genes, and a chromosome-level profile representing normalized substitution frequencies across chromosomes. These dual views are encoded using TabNet encoders and optimized via a multi-scale contrastive learning objective (NT-Xent loss) to learn unified cancer-type embeddings. We demonstrate that the resulting latent representations yield biologically meaningful clusters of cancer types, aligning with known mutational processes and tissue origins. Our work represents the first application of contrastive learning to cohort-level cancer clustering, offering a scalable and interpretable framework for mutation-driven cancer subtyping.

artificial intelligence, cancer type, machine learning, (16 more...)

2508.19424

Country: North America > United States > Ohio (0.28)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine > Therapeutic Area > Oncology (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Shahriari-Mehr, Firooz, Aliakbari, Javad, Amat, Alexandre Graell i, Panahi, Ashkan

Bounds on Perfect Node Classification: A Convex Graph Clustering Perspective

We present an analysis of the transductive node classification problem, where the underlying graph consists of communities that agree with the node labels and node features. For node classification, we propose a novel optimization problem that incorporates the node-specific information (labels and features) in a spectral graph clustering framework. Studying this problem, we demonstrate a synergy between the graph structure and node-specific information. In particular, we show that suitable node-specific information guarantees the solution of our optimization problem perfectly recovering the communities, under milder conditions than the bounds on graph clustering alone. We present algorithmic solutions to our optimization problem and numerical experiments that confirm such a synergy.

artificial intelligence, machine learning, optimization problem, (16 more...)

2508.20231

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)

Parameter-Free Structural-Diversity Message Passing for Graph Neural Networks

Kong, Mingyue, Zhang, Yinglong, Xu, Chengda, Xia, Xuewen, Xu, Xing

Graph Neural Networks (GNNs) have shown remarkable performance in structured data modeling tasks such as node classification. However, mainstream approaches generally rely on a large number of trainable parameters and fixed aggregation rules, making it difficult to adapt to graph data with strong structural heterogeneity and complex feature distributions. This often leads to over-smoothing of node representations and semantic degradation. To address these issues, this paper proposes a parameter-free graph neural network framework based on structural diversity, namely SDGNN (Structural-Diversity Graph Neural Network). The framework is inspired by structural diversity theory and designs a unified structural-diversity message passing mechanism that simultaneously captures the heterogeneity of neighborhood structures and the stability of feature semantics, without introducing additional trainable parameters. Unlike traditional parameterized methods, SDGNN does not rely on complex model training, but instead leverages complementary modeling from both structure-driven and feature-driven perspectives, thereby effectively improving adaptability across datasets and scenarios. Experimental results show that on eight public benchmark datasets and an interdisciplinary PubMed citation network, SDGNN consistently outperforms mainstream GNNs under challenging conditions such as low supervision, class imbalance, and cross-domain transfer. This work provides a new theoretical perspective and general approach for the design of parameter-free graph neural networks, and further validates the importance of structural diversity as a core signal in graph representation learning. To facilitate reproducibility and further research, the full implementation of SDGNN has been released at: https://github.com/mingyue15694/SGDNN/tree/main

artificial intelligence, machine learning, representation, (14 more...)

2508.19884

Country: North America > United States (0.28)

Genre: Research Report > New Finding (0.87)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)

Spezia, Afonso Martini, Fontanari, Thomas, Recamonde-Mendoza, Mariana

Comparing Cluster-Based Cross-Validation Strategies for Machine Learning Model Evaluation

arXiv.org Artificial IntelligenceAug-28-2025

Cross-validation plays a fundamental role in Machine Learning, enabling robust evaluation of model performance and preventing overestimation on training and validation data. However, one of its drawbacks is the potential to create data subsets (folds) that do not adequately represent the diversity of the original dataset, which can lead to biased performance estimates. The objective of this work is to deepen the investigation of cluster-based cross-validation strategies by analyzing the performance of different clustering algorithms through experimental comparison. Additionally, a new cross-validation technique that combines Mini Batch K-Means with class stratification is proposed. Experiments were conducted on 20 datasets (both balanced and imbalanced) using four supervised learning algorithms, comparing cross-validation strategies in terms of bias, variance, and computational cost. The technique that uses Mini Batch K-Means with class stratification outperformed others in terms of bias and variance on balanced datasets, though it did not significantly reduce computational cost. On imbalanced datasets, traditional stratified cross-validation consistently performed better, showing lower bias, variance, and computational cost, making it a safe choice for performance evaluation in scenarios with class imbalance. In the comparison of different clustering algorithms, no single algorithm consistently stood out as superior. Overall, this work contributes to improving predictive model evaluation strategies by providing a deeper understanding of the potential of cluster-based data splitting techniques and reaffirming the effectiveness of well-established strategies like stratified cross-validation. Moreover, it highlights perspectives for increasing the robustness and reliability of model evaluations, especially in datasets with clustering characteristics.

artificial intelligence, dataset, machine learning, (14 more...)

2507.22299

Country: South America > Brazil (0.28)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.94)

Industry: Health & Medicine (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Cross Validation (1.00)