Goto

Collaborating Authors

 Overview


Innovative tokenisation of structured data for LLM training

arXiv.org Artificial Intelligence

Data representation remains a fundamental challenge in machine learning, particularly when adapting sequence-based architectures like Transformers and Large Language Models (LLMs) for structured tabular data. Existing methods often fail to cohesively encode the mix of numerical and categorical features or preserve the inherent structure of tables. This paper introduces a novel, hybrid tokenisation methodology designed to convert tabular data into a unified, sequential format suitable for LLM training. Our approach combines predefined fixed tokens to represent structural elements and low-cardinality categorical features, with a learned subword vocabulary using Byte-Pair Encoding (BPE) for high-cardinality and continuous values. We demonstrate the efficacy of this technique by applying it to a large-scale NetFlow dataset (CIDDS-001), preparing a corpus for a Network Intrusion Detection System (NIDS) foundation model. The evaluation shows that our method is highly efficient, processing over 31 million network flows in under five hours and achieving a significant data compression ratio of 6.18:1. This process resulted in a computationally manageable corpus of over one billion tokens, establishing a viable and generalisable pathway for training foundation models on structured data.


Diffusion Models for Future Networks and Communications: A Comprehensive Survey

arXiv.org Artificial Intelligence

The rise of Generative AI (GenAI) in recent years has catalyzed transformative advances in wireless communications and networks. Among the members of the GenAI family, Diffusion Models (DMs) have risen to prominence as a powerful option, capable of handling complex, high-dimensional data distribution, as well as consistent, noise-robust performance. In this survey, we aim to provide a comprehensive overview of the theoretical foundations and practical applications of DMs across future communication systems. We first provide an extensive tutorial of DMs and demonstrate how they can be applied to enhance optimizers, reinforcement learning and incentive mechanisms, which are popular approaches for problems in wireless networks. Then, we review and discuss the DM-based methods proposed for emerging issues in future networks and communications, including channel modeling and estimation, signal detection and data reconstruction, integrated sensing and communication, resource management in edge computing networks, semantic communications and other notable issues. We conclude the survey with highlighting technical limitations of DMs and their applications, as well as discussing future research directions.


Empowering Tabular Data Preparation with Language Models: Why and How?

arXiv.org Artificial Intelligence

Data preparation is a critical step in enhancing the usability of tabular data and thus boosts downstream data-driven tasks. Traditional methods often face challenges in capturing the intricate relationships within tables and adapting to the tasks involved. Recent advances in Language Models (LMs), especially in Large Language Models (LLMs), offer new opportunities to automate and support tabular data preparation. However, why LMs suit tabular data preparation (i.e., how their capabilities match task demands) and how to use them effectively across phases still remain to be systematically explored. In this survey, we systematically analyze the role of LMs in enhancing tabular data preparation processes, focusing on four core phases: data acquisition, integration, cleaning, and transformation. For each phase, we present an integrated analysis of how LMs can be combined with other components for different preparation tasks, highlight key advancements, and outline prospective pipelines.


Regression Augmentation With Data-Driven Segmentation

arXiv.org Artificial Intelligence

Imbalanced regression arises when the target distribution is skewed, causing models to focus on dense regions and struggle with underrepresented (minority) samples. Despite its relevance across many applications, few methods have been designed specifically for this challenge. Existing approaches often rely on fixed, ad hoc thresholds to label samples as rare or common, overlooking the continuous complexity of the joint feature-target space and fail to represent the true underlying rare regions. To address these limitations, we propose a fully data-driven GAN-based augmentation framework that uses Mahalanobis-Gaussian Mixture Modeling (GMM) to automatically identify minority samples and employs deterministic nearest-neighbour matching to enrich sparse regions. Rather than preset thresholds, our method lets the data determine which observations are truly rare. Evaluation on 32 benchmark imbalanced regression datasets demonstrates that our approach consistently outperforms state-of-the-art data augmentation methods.


Foundation Models for Bioacoustics -- a Comparative Review

arXiv.org Artificial Intelligence

Automated bioacoustic analysis is essential for biodiversity monitoring and conservation, requiring advanced deep learning models that can adapt to diverse bioacoustic tasks. This article presents a comprehensive review of large-scale pretrained bioacoustic foundation models and systematically investigates their transferability across multiple bioacoustic classification tasks. We overview bioacoustic representation learning including major pretraining data sources and benchmarks. On this basis, we review bioacoustic foundation models by thoroughly analysing design decisions such as model architecture, pretraining scheme, and training paradigm. Additionally, we evaluate selected foundation models on classification tasks from the BEANS and BirdSet benchmarks, comparing the generalisability of learned representations under both linear and attentive probing strategies. Our comprehensive experimental analysis reveals that BirdMAE, trained on large-scale bird song data with a self-supervised objective, achieves the best performance on the BirdSet benchmark. On BEANS, BEATs$_{NLM}$, the extracted encoder of the NatureLM-audio large audio model, is slightly better. Both transformer-based models require attentive probing to extract the full performance of their representations. ConvNext$_{BS}$ and Perch models trained with supervision on large-scale bird song data remain competitive for passive acoustic monitoring classification tasks of BirdSet in linear probing settings. Training a new linear classifier has clear advantages over evaluating these models without further training. While on BEANS, the baseline model BEATs trained with self-supervision on AudioSet outperforms bird-specific models when evaluated with attentive probing. These findings provide valuable guidance for practitioners selecting appropriate models to adapt them to new bioacoustic classification tasks via probing.


A Survey on Agent Workflow -- Status and Future

arXiv.org Artificial Intelligence

In the age of large language models (LLMs), autonomous agents have emerged as a powerful paradigm for achieving general intelligence. These agents dynamically leverage tools, memory, and reasoning capabilities to accomplish user-defined goals. As agent systems grow in complexity, agent workflows-structured orchestration frameworks-have become central to enabling scalable, controllable, and secure AI behaviors. This survey provides a comprehensive review of agent workflow systems, spanning academic frameworks and industrial implementations. We classify existing systems along two key dimensions: functional capabilities (e.g., planning, multi-agent collaboration, external API integration) and architectural features (e.g., agent roles, orchestration flows, specification languages). By comparing over 20 representative systems, we highlight common patterns, potential technical challenges, and emerging trends. We further address concerns related to workflow optimization strategies and security. Finally, we outline open problems such as standardization and multimodal integration, offering insights for future research at the intersection of agent design, workflow infrastructure, and safe automation.


Towards Bridging Review Sparsity in Recommendation with Textual Edge Graph Representation

arXiv.org Artificial Intelligence

Textual reviews enrich recommender systems with fine-grained preference signals and enhanced explainability. However, in real-world scenarios, users rarely leave reviews, resulting in severe sparsity that undermines the effectiveness of existing models. A natural solution is to impute or generate missing reviews to enrich the data. However, conventional imputation techniques -- such as matrix completion and LLM-based augmentation -- either lose contextualized semantics by embedding texts into vectors, or overlook structural dependencies among user-item interactions. To address these shortcomings, we propose TWISTER (ToWards Imputation on Sparsity with Textual Edge Graph Representation), a unified framework that imputes missing reviews by jointly modeling semantic and structural signals. Specifically, we represent user-item interactions as a Textual-Edge Graph (TEG), treating reviews as edge attributes. To capture relational context, we construct line-graph views and employ a large language model as a graph-aware aggregator. For each interaction lacking a textual review, our model aggregates the neighborhood's natural-language representations to generate a coherent and personalized review. Experiments on the Amazon and Goodreads datasets show that TWISTER consistently outperforms traditional numeric, graph-based, and LLM baselines, delivering higher-quality imputed reviews and, more importantly, enhanced recommendation performance. In summary, TWISTER generates reviews that are more helpful, authentic, and specific, while smoothing structural signals for improved recommendations.


AutoEDA: Enabling EDA Flow Automation through Microservice-Based LLM Agents

arXiv.org Artificial Intelligence

Modern Electronic Design Automation (EDA) workflows, especially the RTL-to-GDSII flow, require heavily manual scripting and demonstrate a multitude of tool-specific interactions which limits scalability and efficiency. While LLMs introduces strides for automation, existing LLM solutions require expensive fine-tuning and do not contain standardized frameworks for integration and evaluation. We introduce AutoEDA, a framework for EDA automation that leverages paralleled learning through the Model Context Protocol (MCP) specific for standardized and scalable natural language experience across the entire RTL-to-GDSII flow. AutoEDA limits fine-tuning through structured prompt engineering, implements intelligent parameter extraction and task decomposition, and provides an extended CodeBLEU metric to evaluate the quality of TCL scripts. Results from experiments over five previously curated benchmarks show improvements in automation accuracy and efficiency, as well as script quality when compared to existing methods. AutoEDA is released open-sourced to support reproducibility and the EDA community. Available at: https://github.com/AndyLu666/MCP-EDA-Server


Cooperative Perception: A Resource-Efficient Framework for Multi-Drone 3D Scene Reconstruction Using Federated Diffusion and NeRF

arXiv.org Artificial Intelligence

The proposal introduces an innovative drone swarm perception system that aims to solve problems related to computational limitations and low-bandwidth communication, and real-time scene reconstruction. The framework enables efficient multi-agent 3D/4D scene synthesis through federated learning of shared diffusion model and YOLOv12 lightweight semantic extraction and local NeRF updates while maintaining privacy and scalability. The framework redesigns generative diffusion models for joint scene reconstruction, and improves cooperative scene understanding, while adding semantic-aware compression protocols. The approach can be validated through simulations and potential real-world deployment on drone testbeds, positioning it as a disruptive advancement in multi-agent AI for autonomous systems.


Exploring Agentic Artificial Intelligence Systems: Towards a Typological Framework

arXiv.org Artificial Intelligence

Artificial intelligence (AI) systems are evolving beyond passive tools into autonomous agents capable of reasoning, adapting, and acting with minimal human intervention. Despite their growing presence, a structured framework is lacking to classify and compare these systems . This paper develops a typology of agentic AI systems, introducing eight dimensions that define their cognitive and environmental agency in an ordinal structure. Using a multi - phase methodological approach, we construct and refine this typology, which is then evaluated through a human - AI hybrid approach and further distilled into constructed types. The framework enables researchers and practitioners to analyze varying levels of agency in AI systems. By offering a structured perspective on the progression o f AI capabilities, the typology provides a foundation for assessing current systems and anticipating future developments in agentic AI.