 Reinforcement Learning


Self-Explaining Reinforcement Learning for Mobile Network Resource Allocation

arXiv.org Artificial Intelligence

Abstract: Reinforcement Learning (RL) methods that incorporate deep neural networks (DNNs), though powerful, often lack transparency. Their black-box character hinders interpretability and reduces trustworthiness, particularly in critical domains. To address this challenge in RL tasks, we propose a solution based on Self-Explaining Neural Networks (SENNs) along with explanation extraction methods to enhance interpretability while maintaining predictive accuracy. Our approach targets low-dimensionality problems to generate robust local and global explanations of the model's behaviour. We evaluate the proposed method on the resource allocation problem in mobile networks, demonstrating that SENNs can constitute interpretable solutions with competitive performance. This work highlights the potential of SENNs to improve transparency and trust in AI-driven decision-making for low-dimensional tasks. Interest in Explainable Artificial Intelligence (XAI) has been growing rapidly, driven by the need for transparency. Although powerful, Deep Neural Network (DNN) models often operate as black boxes, making their decisions difficult to interpret, which leads to a lack of trust among stakeholders and consequently hinders their applicability.


Multi-Fidelity Hybrid Reinforcement Learning via Information Gain Maximization

arXiv.org Artificial Intelligence

Optimizing a reinforcement learning (RL) policy typically requires extensive interactions with a high-fidelity simulator of the environment, which are often costly or impractical. Offline RL addresses this problem by allowing training from pre-collected data, but its effectiveness is strongly constrained by the size and quality of the dataset. Hybrid offline-online RL leverages both offline data and interactions with a single simulator of the environment. In many real-world scenarios, however, multiple simulators with varying levels of fidelity and computational cost are available. In this work, we study multi-fidelity hybrid RL for policy optimization under a fixed cost budget. We introduce multi-fidelity hybrid RL via information gain maximization (MF-HRL-IGM), a hybrid offline-online RL algorithm that implements fidelity selection based on information gain maximization through a bootstrapping approach. Theoretical analysis establishes the no-regret property of MF-HRL-IGM, while empirical evaluations demonstrate its superior performance compared to existing benchmarks.


Scalable Multi-Objective Robot Reinforcement Learning through Gradient Conflict Resolution

arXiv.org Artificial Intelligence

Reinforcement Learning (RL) robot controllers usually aggregate many task objectives into one scalar reward. While large-scale proximal policy optimisation (PPO) has enabled impressive results such as robust robot locomotion in the real world, many tasks still require careful reward tuning and are brittle to local optima. Tuning cost and sub-optimality grow with the number of objectives, limiting scalability. Modelling reward vectors and their trade-offs can address these issues; however, multi-objective methods remain underused in RL for robotics because of computational cost and optimisation difficulty. In this work, we investigate the conflicts between the per-objective gradient contributions that emerge from scalarising the task objectives. In particular, we explicitly address the conflict between task-based rewards and terms that regularise the policy towards realistic behaviour. We propose GCR-PPO, a modification to actor-critic optimisation that decomposes the actor update into objective-wise gradients using a multi-headed critic and resolves conflicts based on the objective priority. GCR-PPO is evaluated on the well-known IsaacLab manipulation and locomotion benchmarks and on additional multi-objective modifications of two related tasks. We show superior scalability compared to parallel PPO (p = 0.04), without significant computational overhead. We also show higher performance with more conflicting tasks. GCR-PPO improves on large-scale PPO by 9.5% on average, with high-conflict tasks observing a greater improvement. The code is available at https://github.com/humphreymunn/GCR-PPO.
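
The abstract does not spell out the conflict-resolution rule, so the following is only a rough sketch of one common way to resolve gradient conflicts by priority: a lower-priority gradient that points against a higher-priority one (negative dot product) has the conflicting component projected out before the gradients are summed. The function name `resolve_conflicts`, the projection rule, and the toy vectors are assumptions for illustration, not the authors' implementation.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Euclidean inner product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def resolve_conflicts(grads: List[Vector]) -> Vector:
    """Combine per-objective gradients ordered from highest to lowest priority.

    Hypothetical rule: each gradient is projected, in turn, onto the normal
    plane of every higher-priority gradient it conflicts with, then all
    resolved gradients are summed into one update direction.
    """
    resolved: List[Vector] = []
    for index, grad in enumerate(grads):
        g = list(grad)
        for higher in grads[:index]:
            overlap = dot(g, higher)
            norm_sq = dot(higher, higher)
            if overlap < 0 and norm_sq > 0:
                # Remove only the component that opposes the higher priority.
                g = [gi - (overlap / norm_sq) * hi for gi, hi in zip(g, higher)]
        resolved.append(g)
    return [sum(column) for column in zip(*resolved)]


# Toy example: a task gradient and a regulariser gradient that partly opposes it.
task_grad = [1.0, 0.0]
regulariser_grad = [-1.0, 1.0]
update = resolve_conflicts([task_grad, regulariser_grad])
```

In this toy case a plain sum would cancel all progress along the task direction, whereas the projected update keeps it. With more than two objectives, sequential projection does not guarantee that the result is conflict-free with respect to every higher-priority gradient.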


LEED: A Highly Efficient and Scalable LLM-Empowered Expert Demonstrations Framework for Multi-Agent Reinforcement Learning

arXiv.org Artificial Intelligence

Multi-agent reinforcement learning (MARL) holds substantial promise for intelligent decision-making in complex environments. However, it suffers from a coordination and scalability bottleneck as the number of agents increases. To address these issues, we propose the LLM-empowered expert demonstrations framework for multi-agent reinforcement learning (LEED). LEED consists of two components: a demonstration generation (DG) module and a policy optimization (PO) module. Specifically, the DG module leverages large language models to generate instructions for interacting with the environment, thereby producing high-quality demonstrations. The PO module adopts a decentralized training paradigm, where each agent utilizes the generated demonstrations to construct an expert policy loss, which is then integrated with its own policy loss. This enables each agent to effectively personalize and optimize its local policy based on both expert knowledge and individual experience. Experimental results show that LEED achieves superior sample efficiency, time efficiency, and robust scalability compared to state-of-the-art baselines.


Online reinforcement learning via sparse Gaussian mixture model Q-functions

arXiv.org Artificial Intelligence

This paper introduces a structured and interpretable online policy-iteration framework for reinforcement learning (RL), built around the novel class of sparse Gaussian mixture model Q-functions (S-GMM-QFs). Extending earlier work that trained GMM-QFs offline, the proposed framework develops an online scheme that leverages streaming data to encourage exploration. Model complexity is regulated through sparsification by Hadamard overparametrization, which mitigates overfitting while preserving expressiveness. The parameter space of S-GMM-QFs is naturally endowed with a Riemannian manifold structure, allowing for principled parameter updates via online gradient descent on a smooth objective. Numerical tests show that S-GMM-QFs match the performance of dense deep RL (DeepRL) methods on standard benchmarks while using significantly fewer parameters, and maintain strong performance even in low-parameter-count regimes where sparsified DeepRL methods fail to generalize.
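
As a hedged illustration of how Hadamard overparametrization can induce sparsity, the scalar sketch below writes a weight as a product `w = u * v` and applies a smooth L2 penalty to `u` and `v`. Minimising over the factors is then equivalent to an L1-penalised problem in `w`, whose solution is the soft-thresholding operator. The scalar loss, step size, and function names are assumptions chosen for illustration; the abstract does not state the exact objective used for S-GMM-QFs.

```python
import math


def soft_threshold(a: float, lam: float) -> float:
    """Minimiser of 0.5 * (w - a)**2 + lam * abs(w)."""
    return math.copysign(max(abs(a) - lam, 0.0), a)


def fit_hadamard(a: float, lam: float, lr: float = 0.05, steps: int = 5000) -> float:
    """Gradient descent on 0.5 * (u*v - a)**2 + 0.5 * lam * (u**2 + v**2).

    The objective is smooth in (u, v), yet the product u*v converges to the
    L1-regularised solution, so small targets are driven exactly towards zero.
    """
    u, v = 1.0, 0.5  # asymmetric start so that the product can change sign
    for _ in range(steps):
        residual = u * v - a
        grad_u = residual * v + lam * u
        grad_v = residual * u + lam * v
        u, v = u - lr * grad_u, v - lr * grad_v
    return u * v


# A large target is shrunk by lam, a small target is pruned to (numerically) zero.
kept = fit_hadamard(2.0, lam=0.5)
pruned = fit_hadamard(0.3, lam=0.5)
```

The point of the construction is that ordinary gradient steps on a smooth objective suffice, with no non-smooth proximal step for the L1 term.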


Analysis of AI Techniques for Orchestrating Edge-Cloud Application Migration

arXiv.org Artificial Intelligence

Application migration in edge-cloud systems enables high QoS and cost-effective service delivery. However, automatically orchestrating such migration is typically solved with heuristic approaches. Starting from the Markov Decision Process (MDP), in this paper we identify, analyze, and compare selected state-of-the-art Artificial Intelligence (AI) planning and Reinforcement Learning (RL) approaches for solving the class of edge-cloud application migration problems that can be modeled as Towers of Hanoi (ToH) problems. We introduce a new classification based on state space definition and also analyze the compared models through this lens. The aim is to understand available techniques capable of orchestrating such application migration in emerging computing continuum environments.
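
To make the Towers of Hanoi framing concrete, here is a minimal sketch of the classic puzzle as a deterministic state-transition problem solved by breadth-first search. The state encoding (one peg index per disk) and the function names are assumptions made for illustration; how applications and hosts map onto disks and pegs is not specified in the abstract and is not attempted here.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

State = Tuple[int, ...]          # state[d] = peg holding disk d (disk 0 is smallest)
Action = Tuple[int, int, int]    # (disk, source peg, target peg)


def legal_moves(state: State) -> List[Tuple[Action, State]]:
    """Enumerate every legal single-disk move from the given state."""
    moves = []
    for peg in range(3):
        here = [d for d, p in enumerate(state) if p == peg]
        if not here:
            continue
        top = min(here)  # only the smallest disk on a peg can move
        for target in range(3):
            if target == peg:
                continue
            there = [d for d, p in enumerate(state) if p == target]
            if not there or min(there) > top:
                successor = list(state)
                successor[top] = target
                moves.append(((top, peg, target), tuple(successor)))
    return moves


def shortest_plan(num_disks: int) -> List[Action]:
    """Breadth-first search from all disks on peg 0 to all disks on peg 2."""
    start: State = (0,) * num_disks
    goal: State = (2,) * num_disks
    parent: Dict[State, Optional[Tuple[State, Action]]] = {start: None}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state == goal:
            break
        for action, successor in legal_moves(state):
            if successor not in parent:
                parent[successor] = (state, action)
                queue.append(successor)
    plan: List[Action] = []
    state = goal
    while parent[state] is not None:
        state, action = parent[state]
        plan.append(action)
    plan.reverse()
    return plan


plan = shortest_plan(3)
```

The state space has 3^n states for n disks, which is why the choice of state definition matters when comparing planning and RL approaches on this problem class.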


A Test-Function Approach to Incremental Stability

arXiv.org Artificial Intelligence

Abstract: This paper presents a novel framework for analyzing Incremental-Input-to-State Stability (δISS) based on the idea of using rewards as "test functions." Whereas control theory traditionally deals with Lyapunov functions that satisfy a time-decrease condition, reinforcement learning (RL) value functions are constructed by exponentially decaying a Lipschitz reward function that may be non-smooth and unbounded on both sides. Thus, these RL-style value functions cannot be directly understood as Lyapunov certificates. We develop a new equivalence between a variant of incremental input-to-state stability of a closed-loop system under a given policy, and the regularity of RL-style value functions under adversarial selection of a Hölder-continuous reward function. This result highlights that the regularity of value functions, and their connection to incremental stability, can be understood in a way that is distinct from the traditional Lyapunov-based approach to certifying stability in control theory.
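
As a loose numerical illustration of the easy direction of this idea (a contracting closed loop yields a regular value function), the sketch below uses a hypothetical scalar closed-loop system `x_next = a * x` with `a = 0.8`, a discount factor of 0.9, and a Lipschitz reward that is non-smooth at zero and unbounded above and below. All of these choices are assumptions for illustration and are not taken from the paper.

```python
def reward(x: float) -> float:
    """Lipschitz (constant 1.5), kinked at 0, unbounded above and below."""
    return x + 0.5 * abs(x)


def value(x0: float, gamma: float = 0.9, a: float = 0.8, horizon: int = 500) -> float:
    """Discounted sum of rewards along the closed-loop trajectory x_next = a * x."""
    total, x, discount = 0.0, x0, 1.0
    for _ in range(horizon):
        total += discount * reward(x)
        x = a * x
        discount *= gamma
    return total


def lipschitz_bound(gamma: float = 0.9, a: float = 0.8, reward_lipschitz: float = 1.5) -> float:
    """Trajectories contract as a**t, so the value function inherits the
    Lipschitz constant reward_lipschitz / (1 - gamma * a)."""
    return reward_lipschitz / (1.0 - gamma * a)


x, y = 1.0, -0.5
gap = abs(value(x) - value(y))
bound = lipschitz_bound() * abs(x - y)
```

Here the gap between the two values stays below the bound because nearby trajectories approach each other geometrically; the paper's contribution concerns the converse and more general statements, which this toy example does not cover.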


Resolve Highway Conflict in Multi-Autonomous Vehicle Controls with Local State Attention

arXiv.org Artificial Intelligence

In mixed-traffic environments, autonomous vehicles must adapt to human-controlled vehicles and other unusual driving situations. This setting can be framed as a multi-agent reinforcement learning (MARL) environment with a fully cooperative reward among the autonomous vehicles. While methods such as Multi-agent Proximal Policy Optimization can be effective in training MARL tasks, they often fail to resolve local conflicts between agents and are unable to generalize to stochastic events. In this paper, we propose a Local State Attention module to assist the input state representation. By relying on the self-attention operator, the module is expected to compress the essential information of nearby agents to resolve conflicts in traffic situations. Using a simulated highway merging scenario with a priority vehicle as the unexpected event, our approach is able to prioritize other vehicles' information to manage the merging process. The results demonstrate significant improvements in merging efficiency compared to popular baselines, especially in high-density traffic settings.


Mastering Multi-Drone Volleyball through Hierarchical Co-Self-Play Reinforcement Learning

arXiv.org Artificial Intelligence

Competitive tasks have long served as benchmarks for progress in artificial intelligence. Landmark results have been achieved in domains such as Go [1], poker [2], and real-time strategy games [3], where agents learn to plan, adapt, and compete under structured rules. As research moves from virtual environments to the physical world, robot sports (structured, rule-based competitions involving physical agents) have emerged as a promising frontier for embodied intelligence. Examples include robot soccer [4, 5], table tennis [6, 7], and multi-drone pursuit-evasion [8], which combine high-level strategy with low-level motion control in physically grounded settings. In this paper, we tackle a new embodied competitive task proposed by the VolleyBots testbed [9]: 3v3 multi-drone volleyball. This task exemplifies the structure of a robot sport (well-defined objectives, explicit rules, and head-to-head competition) while presenting a set of unique and underexplored challenges. Each team must coordinate three quadrotors to rally a ball over a net, switching roles dynamically between offense and defense in a turn-based fashion. The environment is highly dynamic and demands precise timing, agile 3D maneuvering, and strategic team-level behavior. The turn-based nature of ball exchange introduces long-horizon temporal dependencies; the multi-agent setting requires tightly coupled tactics; and the underactuated dynamics of quadrotors call for fine-grained, reactive motor skills.


Learning Conservative Neural Control Barrier Functions from Offline Data

arXiv.org Artificial Intelligence

Safety filters, particularly those based on control barrier functions, have gained increased interest as effective tools for safe control of dynamical systems. Existing correct-by-construction synthesis algorithms for such filters, however, suffer from the curse of dimensionality. Deep learning approaches have been proposed in recent years to address this challenge. In this paper, we add to this set of approaches an algorithm for training neural control barrier functions from offline datasets. Such functions can be used to design constraints for quadratic programs that are then used as safety filters. Our algorithm trains these functions so that the system is not only prevented from reaching unsafe states but is also disincentivized from reaching out-of-distribution ones, where the learned functions would be less reliable. It is inspired by Conservative Q-learning, an offline reinforcement learning algorithm. We call its outputs Conservative Control Barrier Functions (CCBFs). Our empirical results demonstrate that CCBFs outperform existing methods in maintaining safety while minimally affecting task performance. Source code is available at https://github.com/tabz23/CCBF.
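
For readers unfamiliar with how a barrier function turns into a quadratic-program safety filter, the following minimal sketch may help. It assumes a single affine constraint `a . u + b >= 0` on the control input (the usual form of a control barrier function condition for control-affine dynamics), in which case the QP "stay as close as possible to the nominal control" has a closed-form solution. The function name, the single-integrator example, and the gain `alpha` are illustrative assumptions and do not come from the CCBF code.

```python
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    """Euclidean inner product."""
    return sum(x * y for x, y in zip(a, b))


def safety_filter(u_nom: List[float], a: List[float], b: float) -> List[float]:
    """Solve  min ||u - u_nom||^2  subject to  a . u + b >= 0  in closed form.

    If the nominal control already satisfies the constraint it is returned
    unchanged; otherwise it is projected onto the constraint boundary.
    """
    slack = dot(a, u_nom) + b
    norm_sq = dot(a, a)
    if slack >= 0 or norm_sq == 0:
        return list(u_nom)
    scale = -slack / norm_sq
    return [ui + scale * ai for ui, ai in zip(u_nom, a)]


# Hypothetical example: single integrator x_dot = u, barrier h(x) = 1 - x
# (safe set x <= 1), condition dh/dx * u + alpha * h(x) >= 0.
alpha, x = 1.0, 0.9
a = [-1.0]                  # dh/dx times the input matrix (here 1)
b = alpha * (1.0 - x)       # alpha * h(x); the drift term is zero
u_safe = safety_filter([1.0], a, b)       # pushing towards the boundary is slowed
u_untouched = safety_filter([-1.0], a, b)  # moving away from it is left alone
```

With several constraints or input bounds, the closed form no longer applies and a QP solver would be needed; in the learned setting, the barrier `h` and its gradient would come from the trained network rather than a hand-written formula.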