Reinforcement Learning
Improving Stochastic Action-Constrained Reinforcement Learning via Truncated Distributions
Stolz, Roland, Eichelbeck, Michael, Althoff, Matthias
In reinforcement learning (RL), it is often advantageous to consider additional constraints on the action space to ensure safety or action relevance. Existing work on such action-constrained RL faces challenges regarding effective policy updates, computational efficiency, and predictable runtime. Recent work proposes to use truncated normal distributions for stochastic policy gradient methods. However, the computation of key characteristics, such as the entropy, log-probability, and their gradients, becomes intractable under complex constraints. Hence, prior work approximates these using the non-truncated distributions, which severely degrades performance. We argue that accurate estimation of these characteristics is crucial in the action-constrained RL setting, and propose efficient numerical approximations for them. We also provide an efficient sampling strategy for truncated policy distributions and validate our approach on three benchmark environments, which demonstrate significant performance improvements when using accurate estimations.
Nonholonomic Narrow Dead-End Escape with Deep Reinforcement Learning
Xiong, Denghan, Zhao, Yanzhe, Chen, Yutong, Wang, Zichun
Nonholonomic constraints restrict feasible velocities without reducing configuration-space dimension, which makes collision-free geometric paths generally non-executable for car-like robots. Ackermann steering further imposes curvature bounds and forbids in-place rotation, so escaping from narrow dead ends typically requires tightly sequenced forward and reverse maneuvers. Classical planners that decouple global search and local steering struggle in these settings because narrow passages occupy low-measure regions and nonholonomic reachability shrinks the set of valid connections, which degrades sampling efficiency and increases sensitivity to clearances. We study nonholonomic narrow dead-end escape for Ackermann vehicles and contribute three components. First, we construct a generator that samples multi-phase forward-reverse trajectories compatible with Ackermann kinematics and inflates their envelopes to synthesize families of narrow dead ends that are guaranteed to admit at least one feasible escape. Second, we construct a training environment that enforces kinematic constraints and train a policy using the soft actor-critic algorithm. Third, we evaluate against representative classical planners that combine global search with nonholonomic steering. Across parameterized dead-end families, the learned policy solves a larger fraction of instances, reduces maneuver count, and maintains comparable path length and planning time while under the same sensing and control limits. We provide our project as open source at https://github.com/gitagitty/cisDRL-RobotNav.git
RELiQ: Scalable Entanglement Routing via Reinforcement Learning in Quantum Networks
Meuser, Tobias, Weil, Jannis, Lahiri, Aninda, Paraschiv, Marius
Quantum networks are becoming increasingly important because of advancements in quantum computing and quantum sensing, such as recent developments in distributed quantum computing and federated quantum machine learning. Routing entanglement in quantum networks poses several fundamental as well as technical challenges, including the high dynamicity of quantum network links and the probabilistic nature of quantum operations. Consequently, designing hand-crafted heuristics is difficult and often leads to suboptimal performance, especially if global network topology information is unavailable. In this paper, we propose RELiQ, a reinforcement learning-based approach to entanglement routing that only relies on local information and iterative message exchange. Utilizing a graph neural network, RELiQ learns graph representations and avoids overfitting to specific network topologies - a prevalent issue for learning-based approaches. Our approach, trained on random graphs, consistently outperforms existing local information heuristics and learning-based approaches when applied to random and real-world topologies. When compared to global information heuristics, our method achieves similar or superior performance because of its rapid response to topology changes.
Training High-Level Schedulers with Execution-Feedback Reinforcement Learning for Long-Horizon GUI Automation
Deng, Zehao, Ju, Tianjie, Wu, Zheng, Zhang, Zhuosheng, Liu, Gongshen
The rapid development of large vision-language model (VLM) has greatly promoted the research of GUI agent. However, GUI agents still face significant challenges in handling long-horizon tasks. First, single-agent models struggle to balance high-level capabilities and low-level execution capability, facing prevalent issues of responsibility coupling and capability conflicts. Second, agents lack awareness of the task state, leading to progress loss in long-horizon tasks. To address these challenges, we propose a staged execution-feedback reinforcement learning algorithm. Unlike training a unified policy model, we focus on training high-level scheduling models. Specifically, we propose and train two agents: a Coordinator, responsible for the strategic planning and task decomposition; and a State Tracker, responsible for context compression and information management to maintain the task's state and coherence. Based on this, we built the Coordinator-Executor-State Tracker (CES) multi-agent framework, which can be integrated with any low-level Executor model, assisting the Executor in solving long-horizon tasks through task scheduling and state management. Experiments on long-horizon task benchmarks demonstrate that CES significantly enhances the system's planning and state management capabilities. Furthermore, analysis confirms that our trained high-level scheduling module is a generalizable, plug-and-play module that significantly enhances the long-horizon capabilities of various Executors. Code can be available at https://github.com/hehehahi4/CES.
Embedded Universal Predictive Intelligence: a coherent framework for multi-agent learning
Meulemans, Alexander, Nasser, Rajai, Woลczyk, Maciej, Weis, Marissa A., Kobayashi, Seijin, Richards, Blake, Lajoie, Guillaume, Steger, Angelika, Hutter, Marcus, Manyika, James, Saurous, Rif A., Sacramento, Joรฃo, Arcas, Blaise Agรผera y
The standard theory of model-free reinforcement learning assumes that the environment dynamics are stationary and that agents are decoupled from their environment, such that policies are treated as being separate from the world they inhabit. This leads to theoretical challenges in the multi-agent setting where the non-stationarity induced by the learning of other agents demands prospective learning based on prediction models. To accurately model other agents, an agent must account for the fact that those other agents are, in turn, forming beliefs about it to predict its future behavior, motivating agents to model themselves as part of the environment. Here, building upon foundational work on universal artificial intelligence (AIXI), we introduce a mathematical framework for prospective learning and embedded agency centered on self-prediction, where Bayesian RL agents predict both future perceptual inputs and their own actions, and must therefore resolve epistemic uncertainty about themselves as part of the universe they inhabit. We show that in multi-agent settings, self-prediction enables agents to reason about others running similar algorithms, leading to new game-theoretic solution concepts and novel forms of cooperation unattainable by classical decoupled agents. Moreover, we extend the theory of AIXI, and study universally intelligent embedded agents which start from a Solomonoff prior. We show that these idealized agents can form consistent mutual predictions and achieve infinite-order theory of mind, potentially setting a gold standard for embedded multi-agent learning.
BiCQL-ML: A Bi-Level Conservative Q-Learning Framework for Maximum Likelihood Inverse Reinforcement Learning
Offline inverse reinforcement learning (IRL) aims to recover a reward function that explains expert behavior using only fixed demonstration data, without any additional online interaction. We propose BiCQL-ML, a policy-free offline IRL algorithm that jointly optimizes a reward function and a conservative Q-function in a bi-level framework, thereby avoiding explicit policy learning. The method alternates between (i) learning a conservative Q-function via Conservative Q-Learning (CQL) under the current reward, and (ii) updating the reward parameters to maximize the expected Q-values of expert actions while suppressing over-generalization to out-of-distribution actions. This procedure can be viewed as maximum likelihood estimation under a soft value matching principle. We provide theoretical guarantees that BiCQL-ML converges to a reward function under which the expert policy is soft-optimal. Empirically, we show on standard offline RL benchmarks that BiCQL-ML improves both reward recovery and downstream policy performance compared to existing offline IRL baselines.
Energy Efficient Sleep Mode Optimization in 5G mmWave Networks via Multi Agent Deep Reinforcement Learning
Masrur, Saad, Guvenc, Ismail, Perez, David Lopez
Dynamic sleep mode optimization (SMO) in millimeter-wave (mmWave) networks is essential for maximizing energy efficiency (EE) under stringent quality-of-service (QoS) constraints. However, existing optimization and reinforcement learning (RL)-based approaches rely on aggregated, static base station (BS) traffic models that fail to capture non-stationary traffic dynamics and suffer from prohibitively large state-action spaces, limiting their real-world deployment. To address these challenges, this paper proposes a Multi-Agent Deep Reinforcement Learning (MARL) framework employing a Double Deep Q-Network (DDQN), referred to as MARL-DDQN, for adaptive SMO in a 3D urban environment using a time-varying and community-based user equipment (UE) mobility model. Unlike conventional single-agent RL, the proposed MARL-DDQN enables scalable, distributed decision-making with minimal signaling overhead. A realistic BS power consumption model and beamforming are integrated to accurately quantify EE, while QoS is uniquely defined in terms of throughput. The proposed method adaptively learns SMO policies to maximize EE while mitigating inter-cell interference and ensuring throughput fairness. Extensive simulations demonstrate that MARL-DDQN consistently outperforms state-of-the-art SM strategies, including the All On, iterative QoS-aware load-based (IT-QoS-LB), MARL-DDPG, and MARL-PPO, achieving up to 0. 60 Mbit/Joule EE, 8. 5 Mbps 10th-percentile throughput, and satisfying QoS constraints 95 % of the time under dynamic network scenarios. I. Introduction The exponential growth in mobile data demand has necessitated increased spectrum availability and accelerated the expansion of cellular network infrastructure. To address the limitations of the sub-6 GHz spectrum, millimeter wave (mmWave) communications, operating within the 30-300 GHz band, have emerged as a key enabler in fifth-generation (5G) networks. With significantly larger bandwidth availability, mmWave technology presents a viable solution to spectrum scarcity challenges [1]. However, mmWave signals suffer from high propagation loss, atmospheric absorption, and susceptibility to blockages, which severely limit coverage and reliability. To address coverage and growing capacity demands, 5G networks rely on densification, deploying numerous low-power mmWave BSs with inter-site distances of a few hundred meters [1]. These BSs utilize large antenna arrays to enable beamforming and spatial multiplexing, often relying on hybrid analog-digital precoding to reduce hardware complexity [2]. However, the RF chain remains a major source of power consumption, particularly the Analog-to-digital converters (ADCs) and digital-to-analog converters (DACs), whose power scales with sampling rate. Due to the higher frequencies and wider bandwidths of mmWave systems, these components require significantly higher sampling rates than sub-6 GHz systems [3], resulting in substantial energy demands.
Adaptive Dueling Double Deep Q-networks in Uniswap V3 Replication and Extension with Mamba
The report goes through the main steps of replicating and extending the article "Adaptive Liquidity Provision in Uniswap V3 with Deep Reinforcement Learning." The replication part includes how to obtain data from the Uniswap Subgraph, details of the implementation, and comments on the results. After the replication, we propose a new structure based on the original model, which combines Mamba with DDQN and a new reward function. In this new structure, we clean the data again and introduce two new baseline methods for comparison. As a result, although the model has not yet been applied to all datasets, it shows stronger theoretical support than the original model and performs better in the baseline tests.
Hybrid Stackelberg Game and Diffusion-based Auction for Two-tier Agentic AI Task Offloading in Internet of Agents
Zhong, Yue, Tong, Yongju, Kang, Jiawen, Dai, Minghui, Dai, Hong-Ning, Su, Zhou, Niyato, Dusit
The Internet of Agents (IoA) is rapidly gaining prominence as a foundational architecture for interconnected intelligent systems, designed to facilitate seamless discovery, communication, and collaborative reasoning among a vast network of Artificial Intelligence (AI) agents. Powered by Large Language and Vision-Language Models, IoA enables the development of interactive, rational agents capable of complex cooperation, moving far beyond traditional isolated models. IoA involves physical entities, i.e., Wireless Agents (WAs) with limited onboard resources, which need to offload their compute-intensive agentic AI services to nearby servers. Such servers can be Mobile Agents (MAs), e.g., vehicle agents, or Fixed Agents (FAs), e.g., end-side units agents. Given their fixed geographical locations and stable connectivity, FAs can serve as reliable communication gateways and task aggregation points. This stability allows them to effectively coordinate with and offload to an Aerial Agent (AA) tier, which has an advantage not affordable for highly mobile MAs with dynamic connectivity limitations. As such, we propose a two-tier optimization approach. The first tier employs a multi-leader multi-follower Stackelberg game. In the game, MAs and FAs act as the leaders who set resource prices. WAs are the followers to determine task offloading ratios. However, when FAs become overloaded, they can further offload tasks to available aerial resources. Therefore, the second tier introduces a Double Dutch Auction model where overloaded FAs act as the buyers to request resources, and AAs serve as the sellers for resource provision. We then develop a diffusion-based Deep Reinforcement Learning algorithm to solve the model. Numerical results demonstrate the superiority of our proposed scheme in facilitating task offloading.
MedEyes: Learning Dynamic Visual Focus for Medical Progressive Diagnosis
Zhu, Chunzheng, Lin, Yangfang, Chen, Shen, Wang, Yijun, Lin, Jianxin
Accurate medical diagnosis often involves progressive visual focusing and iterative reasoning, characteristics commonly observed in clinical workflows. While recent vision-language models demonstrate promising chain-of-thought (CoT) reasoning capabilities via reinforcement learning with verifiable rewards (RLVR), their purely on-policy learning paradigm tends to reinforce superficially coherent but clinically inaccurate reasoning paths. We propose MedEyes, a novel reinforcement learning framework that dynamically models clinician-style diagnostic reasoning by progressively attending to and interpreting relevant medical image regions. By incorporating off-policy expert guidance, MedEyes converts expert visual search trajectories into structured external behavioral signals, guiding the model toward clinically aligned visual reasoning. We design the Gaze-guided Reasoning Navigator (GRN) to emulate the diagnostic process through a dual-mode exploration strategy, scanning for systematic abnormality localization and drilling for detailed regional analysis. To balance expert imitation and autonomous discovery, we introduce the Confidence Value Sampler (CVS), which employs nucleus sampling and adaptive termination to create diverse yet credible exploration paths. Finally, the dual-stream GRPO optimization framework decouples on-policy and off-policy learning signals, mitigating reward assimilation and entropy collapse. Experiments demonstrate that MedEyes achieves an average performance improvement of +8.5\% across multiple medical VQA benchmarks, validating MedEyes's potential in building interpretable medical AI systems.