 Reinforcement Learning


Enhancing Robustness of Offline Reinforcement Learning Under Data Corruption via Sharpness-Aware Minimization

arXiv.org Artificial Intelligence

Offline reinforcement learning (RL) is vulnerable to real-world data corruption, with even robust algorithms failing under challenging observation and mixture corruptions. We posit this failure stems from data corruption creating sharp minima in the loss landscape, leading to poor generalization. To address this, we are the first to apply Sharpness-Aware Minimization (SAM) as a general-purpose, plug-and-play optimizer for offline RL. SAM seeks flatter minima, guiding models to more robust parameter regions. We integrate SAM into strong baselines for data corruption: IQL, a top-performing offline RL algorithm in this setting, and RIQL, an algorithm designed specifically for data-corruption robustness. We evaluate them on D4RL benchmarks with both random and adversarial corruption. Our SAM-enhanced methods consistently and significantly outperform the original baselines. Visualizations of the reward surface confirm that SAM finds smoother solutions, providing strong evidence for its effectiveness in improving the robustness of offline RL agents.
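
As a rough illustration of the two-step SAM update mentioned in the abstract above, the sketch below applies it to a toy quadratic loss in plain Python. The function names (`sam_step`, `quadratic_grad`), the toy loss, and the values of `lr` and `rho` are assumptions made for this example; this is not the paper's IQL or RIQL integration.

```python
import math

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One Sharpness-Aware Minimization update on a list of floats.

    grad_fn(w) must return the loss gradient at w as a list of floats.
    lr and rho are illustrative values, not taken from the paper.
    """
    g = grad_fn(w)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(w)  # zero gradient: no ascent direction, leave w unchanged
    # Ascent step: move to the first-order worst-case point in an L2 ball of radius rho.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Descent step: apply the gradient taken at the perturbed point to the original weights.
    g_adv = grad_fn(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

def quadratic_grad(w):
    # Gradient of the toy loss 0.5 * (w[0]**2 + 10 * w[1]**2) (hypothetical example loss).
    return [w[0], 10.0 * w[1]]

w_next = sam_step([1.0, 0.0], quadratic_grad)
```

In a deep RL setting the same pattern would presumably wrap the critic or policy loss, at the cost of two gradient evaluations per update.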


Q-Learning-Based Time-Critical Data Aggregation Scheduling in IoT

arXiv.org Artificial Intelligence

Time-critical data aggregation in Internet of Things (IoT) networks demands efficient, collision-free scheduling to minimize latency for applications like smart cities and industrial automation. Traditional heuristic methods, with two-phase tree construction and scheduling, often suffer from high computational overhead and suboptimal delays due to their static nature. To address this, we propose a novel Q-learning framework that unifies aggregation tree construction and scheduling, modeling the process as a Markov Decision Process (MDP) with hashed states for scalability. By leveraging a reward function that promotes large, interference-free batch transmissions, our approach dynamically learns optimal scheduling policies. Simulations on static networks with up to 300 nodes demonstrate up to 10.87% lower latency compared to a state-of-the-art heuristic algorithm, highlighting its robustness for delay-sensitive IoT applications. This framework enables timely insights in IoT environments, paving the way for scalable, low-latency data aggregation.
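
The tabular Q-learning update with hashed state keys described above might look roughly like the following sketch. The state encoding (the set of nodes still holding data), the action format (a tuple of senders sharing one time slot), the reward value, and the hyperparameters are all assumptions for illustration; the abstract does not specify them.

```python
from collections import defaultdict

def state_key(pending_nodes):
    # Hypothetical state encoding: hash of the set of nodes that still hold data.
    return hash(frozenset(pending_nodes))

def q_update(q, s, a, reward, s_next, next_actions, alpha=0.5, gamma=0.9):
    """Standard Q-learning update; returns the new Q(s, a)."""
    # Value of the best action available in the next state (0.0 if terminal).
    best_next = max((q[(s_next, a2)] for a2 in next_actions), default=0.0)
    td_target = reward + gamma * best_next
    q[(s, a)] += alpha * (td_target - q[(s, a)])
    return q[(s, a)]

q = defaultdict(float)          # unseen (state, action) pairs start at 0.0
s = state_key({1, 2, 3})
s_next = state_key({3})
action = (1, 2)                 # nodes 1 and 2 transmit in the same slot
# Assumed reward: the size of the interference-free batch.
value = q_update(q, s, action, reward=2.0, s_next=s_next, next_actions=[(3,)])
```

Keying the table by a hash keeps memory proportional to the states actually visited rather than to the full state space.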


Concise Reasoning via Reinforcement Learning

arXiv.org Artificial Intelligence

A major drawback of reasoning models is their excessive token usage, inflating computational cost, resource demand, and latency. We show this verbosity stems not from deeper reasoning but from reinforcement learning loss minimization when models produce incorrect answers. With unsolvable problems dominating training, this effect compounds into a systematic tendency toward longer outputs. Through theoretical analysis of PPO and GRPO, we prove that incorrect answers inherently drive policies toward verbosity even when $\gamma=1$, reframing response lengthening as an optimization artifact. We further uncover a consistent correlation between conciseness and correctness across reasoning and non-reasoning models. Building on these insights, we propose a two-phase RL procedure where a brief secondary stage, trained on a small set of solvable problems, significantly reduces response length while preserving or improving accuracy. Finally, we show that while GRPO shares properties with PPO, it exhibits collapse modes, limiting its reliability for concise reasoning. Our claims are supported by extensive experiments.


R2PS: Worst-Case Robust Real-Time Pursuit Strategies under Partial Observability

arXiv.org Artificial Intelligence

Computing worst-case robust strategies in pursuit-evasion games (PEGs) is time-consuming, especially when real-world factors like partial observability are considered. While important for general security purposes, real-time applicable pursuit strategies for graph-based PEGs are currently missing when the pursuers only have imperfect information about the evader's position. Although state-of-the-art reinforcement learning (RL) methods like Equilibrium Policy Generalization (EPG) and Grasper provide guidelines for learning graph neural network (GNN) policies robust to different game dynamics, they are restricted to the scenario of perfect information and do not take into account the possible case where the evader can predict the pursuers' actions. This paper introduces the first approach to worst-case robust real-time pursuit strategies (R2PS) under partial observability. We first prove that a traditional dynamic programming (DP) algorithm for solving Markov PEGs maintains optimality under the asynchronous moves by the evader. Then, we propose a belief preservation mechanism about the evader's possible positions, extending the DP pursuit strategies to a partially observable setting. Finally, we embed the belief preservation into the state-of-the-art EPG framework to finish our R2PS learning scheme, which leads to a real-time pursuer policy through cross-graph reinforcement learning against the asynchronous-move DP evasion strategies. After reinforcement learning, our policy achieves robust zero-shot generalization to unseen real-world graph structures and consistently outperforms the policy directly trained on the test graphs by the existing game RL approach.
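
To make the idea of tracking the evader's possible positions concrete, here is a generic set-based belief update on a graph. It is only a sketch under assumed rules (the evader stays or moves to a neighbor each step, and pursuers observe a known set of visible nodes); the function name and arguments are hypothetical and it is not the paper's belief preservation mechanism.

```python
def update_belief(belief, graph, visible, sighting=None):
    """Return the set of nodes where the evader could be after one step.

    belief:   set of nodes the evader might currently occupy
    graph:    dict mapping each node to an iterable of its neighbors
    visible:  set of nodes the pursuers can currently observe
    sighting: node where the evader was actually seen, or None
    """
    if sighting is not None:
        return {sighting}          # a direct observation collapses the belief
    spread = set(belief)           # the evader may stay in place ...
    for v in belief:
        spread.update(graph[v])    # ... or move to any adjacent node
    return spread - set(visible)   # observed-but-empty nodes are ruled out

# Path graph 0 - 1 - 2 - 3 (toy example)
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
belief = update_belief({1}, path, visible={0})
```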


Convergence and stability of Q-learning in Hierarchical Reinforcement Learning

arXiv.org Artificial Intelligence

Decision-making architectures have played a central role for decades [1] both in engineering and other domains, e.g., guidance, navigation and control of Apollo missions [2], chemical plants [3], smart grids [4], unmanned aerial vehicles [5], recommender systems [6], and algorithms [7]. Moreover, architectures are ubiquitous in nature, e.g., diversity in the nervous system enables humans to have fast and accurate sensorimotor control [8]. Reinforcement Learning (RL) is a framework in which an agent learns to make sequential decisions through interaction with an environment in order to maximize cumulative reward [9]. Decision-making architectures have also been proposed and studied in RL. Hierarchical Reinforcement Learning (HRL) is a subfield of RL that deals with hierarchical structures for decision-making agents. Prospective advantages include improved long-term credit assignment, continual learning, interpretability, and the integration of preexisting policies [10], [11].


MIR: Efficient Exploration in Episodic Multi-Agent Reinforcement Learning via Mutual Intrinsic Reward

arXiv.org Artificial Intelligence

Episodic rewards present a significant challenge in reinforcement learning. While intrinsic reward methods have demonstrated effectiveness in single-agent reinforcement learning scenarios, their application to multi-agent reinforcement learning (MARL) remains problematic. The primary difficulties stem from two factors: (1) the exponential sparsity of joint action trajectories that lead to rewards as the exploration space expands, and (2) existing methods often fail to account for joint actions that can influence team states. To address these challenges, this paper introduces Mutual Intrinsic Reward (MIR), a simple yet effective enhancement strategy for MARL with extremely sparse rewards like episodic rewards. MIR incentivizes individual agents to explore actions that affect their teammates, and when combined with original strategies, effectively stimulates team exploration and improves algorithm performance. For comprehensive experimental validation, we extend the representative single-agent MiniGrid environment to create MiniGrid-MA, a series of MARL environments with sparse rewards. Our evaluation compares the proposed method against state-of-the-art approaches in the MiniGrid-MA setting, with experimental results demonstrating superior performance.


Dissecting Quantum Reinforcement Learning: A Systematic Evaluation of Key Components

arXiv.org Artificial Intelligence

Parameterised quantum circuit (PQC) based Quantum Reinforcement Learning (QRL) has emerged as a promising paradigm at the intersection of quantum computing and reinforcement learning (RL). By design, PQCs create hybrid quantum-classical models, but their practical applicability remains uncertain due to training instabilities, barren plateaus (BPs), and the difficulty of isolating the contribution of individual pipeline components. In this work, we dissect PQC based QRL architectures through a systematic experimental evaluation of three aspects recurrently identified as critical: (i) data embedding strategies, with Data Reuploading (DR) as an advanced approach; (ii) ansatz design, particularly the role of entanglement; and (iii) post-processing blocks after quantum measurement, with a focus on the underexplored Output Reuse (OR) technique. Using a unified PPO-CartPole framework, we perform controlled comparisons between hybrid and classical agents under identical conditions. Our results show that OR, though purely classical, exhibits distinct behaviour in hybrid pipelines, that DR improves trainability and stability, and that stronger entanglement can degrade optimisation, offsetting classical gains. Together, these findings provide controlled empirical evidence of the interplay between quantum and classical contributions, and establish a reproducible framework for systematic benchmarking and component-wise analysis in QRL.


FIRM: Federated In-client Regularized Multi-objective Alignment for Large Language Models

arXiv.org Artificial Intelligence

Aligning Large Language Models (LLMs) with human values often involves balancing multiple, conflicting objectives such as helpfulness and harmlessness. Training these models is computationally intensive, and centralizing the process raises significant data privacy concerns. Federated Learning (FL) offers a compelling alternative, but existing Federated Multi-Objective Optimization (FMOO) methods face severe communication bottlenecks as their reliance on transmitting multiple gradients to a server is unscalable for large models. We introduce FIRM (Federated In-client Regularized Multi-objective alignment), a novel algorithm that achieves both client disagreement drift mitigation and communication efficiency. In FIRM, each client locally solves a regularized multi-objective optimization problem. By directly mitigating client disagreement drift through in-client regularization, our method eliminates the need for the multi-gradient transmissions common in prior works. Consequently, clients need only to transmit a single set of adapted parameters, maintaining high communication efficiency. We prove that our algorithm converges to Pareto-stationary points and, to our knowledge, provide the first finite-time convergence guarantees for this federated multi-objective alignment setting. Empirically, we show that FIRM leads to smoother training dynamics, reduced client disagreement drift, and improved reward trade-offs compared to baselines. We further propose a method to incorporate a preference over the objectives and report empirical Pareto plots, demonstrating that FIRM can smoothly adapt trade-offs between objectives in response to specified preferences.


CroTad: A Contrastive Reinforcement Learning Framework for Online Trajectory Anomaly Detection

arXiv.org Artificial Intelligence

Detecting trajectory anomalies is a vital task in modern Intelligent Transportation Systems (ITS), enabling the identification of unsafe, inefficient, or irregular travel behaviours. While deep learning has emerged as the dominant approach, several key challenges remain unresolved. First, sub-trajectory anomaly detection, capable of pinpointing the precise segments where anomalies occur, remains underexplored compared to whole-trajectory analysis. Second, many existing methods depend on carefully tuned thresholds, limiting their adaptability in real-world applications. Moreover, the irregular sampling of trajectory data and the presence of noise in training sets further degrade model performance, making it difficult to learn reliable representations of normal routes. To address these challenges, we propose a contrastive reinforcement learning framework for online trajectory anomaly detection, CroTad. Our method is threshold-free and robust to noisy, irregularly sampled data. By incorporating contrastive learning, CroTad learns to extract diverse normal travel patterns for different itineraries and effectively distinguish anomalous behaviours at both sub-trajectory and point levels. The detection module leverages deep reinforcement learning to perform online, real-time anomaly scoring, enabling timely and fine-grained identification of abnormal segments. Extensive experiments on two real-world datasets demonstrate the effectiveness and robustness of our framework across various evaluation scenarios.


Hybrid Differential Reward: Combining Temporal Difference and Action Gradients for Efficient Multi-Agent Reinforcement Learning in Cooperative Driving

arXiv.org Artificial Intelligence

In multi-vehicle cooperative driving tasks involving high-frequency continuous control, traditional state-based reward functions suffer from the issue of vanishing reward differences. This phenomenon results in a low signal-to-noise ratio (SNR) for policy gradients, significantly hindering algorithm convergence and performance improvement. To address this challenge, this paper proposes a novel Hybrid Differential Reward (HDR) mechanism. We first theoretically elucidate how the temporal quasi-steady nature of traffic states and the physical proximity of actions lead to the failure of traditional reward signals. Building on this analysis, the HDR framework innovatively integrates two complementary components: (1) a Temporal Difference Reward (TRD) based on a global potential function, which utilizes the evolutionary trend of potential energy to ensure optimal policy invariance and consistency with long-term objectives; and (2) an Action Gradient Reward (ARG), which directly measures the marginal utility of actions to provide a local guidance signal with a high SNR. Furthermore, we formulate the cooperative driving problem as a Multi-Agent Partially Observable Markov Game (POMDPG) with a time-varying agent set and provide a complete instantiation scheme for HDR within this framework. Extensive experiments conducted using both online planning (MCTS) and Multi-Agent Reinforcement Learning (QMIX, MAPPO, MADDPG) algorithms demonstrate that the HDR mechanism significantly improves convergence speed and policy stability. The results confirm that HDR guides agents to learn high-quality cooperative policies that effectively balance traffic efficiency and safety.
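
The two reward components named above can be illustrated with a small sketch. The temporal part uses the standard potential-based shaping form, gamma * Phi(s') - Phi(s), which is known to preserve optimal policies; the action part is approximated here by a central finite difference of a utility function with respect to a scalar action. Both function names, the weights, and the toy potential and utility are assumptions for illustration, not the paper's definitions.

```python
def temporal_difference_reward(phi, state, next_state, gamma=0.99):
    # Potential-based shaping term: discounted change in the potential function.
    return gamma * phi(next_state) - phi(state)

def action_gradient_reward(utility, state, action, eps=1e-3):
    # Assumed proxy for marginal utility: central finite difference in the action.
    return (utility(state, action + eps) - utility(state, action - eps)) / (2 * eps)

def hybrid_reward(phi, utility, state, action, next_state,
                  gamma=0.99, w_td=1.0, w_ag=1.0):
    # Weighted sum of the two components; the weights are hypothetical.
    return (w_td * temporal_difference_reward(phi, state, next_state, gamma)
            + w_ag * action_gradient_reward(utility, state, action))

# Toy example: potential rewards closeness to position 10,
# utility prefers an acceleration of 2.0.
phi = lambda s: -abs(s - 10.0)
utility = lambda s, a: -(a - 2.0) ** 2
r = hybrid_reward(phi, utility, state=8.0, action=1.0, next_state=9.0, gamma=1.0)
```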