Reinforcement Learning
Model-based Lifelong Reinforcement Learning with Bayesian Exploration
We propose a model-based lifelong reinforcement-learning approach that estimates a hierarchical Bayesian posterior distilling the common structure shared across different tasks. The learned posterior combined with a sample-based Bayesian exploration procedure increases the sample efficiency of learning across a family of related tasks. We first derive an analysis of the relationship between the sample complexity and the initialization quality of the posterior in the finite MDP setting. We next scale the approach to continuous-state domains by introducing a Variational Bayesian Lifelong Reinforcement Learning algorithm that can be combined with recent model-based deep RL methods, and that exhibits backward transfer. Experimental results on several challenging domains show that our algorithms achieve both better forward and backward transfer performance than state-of-the-art lifelong RL methods.
Learn to Match with No Regret: Reinforcement Learning in Markov Matching Markets
We study a Markov matching market involving a planner and a set of strategic agents on the two sides of the market.At each step, the agents are presented with a dynamical context, where the contexts determine the utilities. The planner controls the transition of the contexts to maximize the cumulative social welfare, while the agents aim to find a myopic stable matching at each step. The proposed algorithm addresses the coupled challenges of sequential exploration, matching stability, and function approximation. We prove that the algorithm achieves sublinear regret.
Gaussian Approximation for Two-Timescale Linear Stochastic Approximation
Butyrin, Bogdan, Rubtsov, Artemy, Naumov, Alexey, Ulyanov, Vladimir, Samsonov, Sergey
In this paper, we establish non-asymptotic bounds for accuracy of normal approximation for linear two-timescale stochastic approximation (TTSA) algorithms driven by martingale difference or Markov noise. Focusing on both the last iterate and Polyak-Ruppert averaging regimes, we derive bounds for normal approximation in terms of the convex distance between probability distributions. Our analysis reveals a non-trivial interaction between the fast and slow timescales: the normal approximation rate for the last iterate improves as the timescale separation increases, while it decreases in the Polyak-Ruppert averaged setting. We also provide the high-order moment bounds for the error of linear TTSA algorithm, which may be of independent interest.
A Tutorial: An Intuitive Explanation of Offline Reinforcement Learning Theory
Offline reinforcement learning (RL) aims to optimize the return given a fixed dataset of agent trajectories without additional interactions with the environment. While algorithm development has progressed rapidly, significant theoretical advances have also been made in understanding the fundamental challenges of offline RL. However, bridging these theoretical insights with practical algorithm design remains an ongoing challenge. In this survey, we explore key intuitions derived from theoretical work and their implications for offline RL algorithms. We begin by listing the conditions needed for the proofs, including function representation and data coverage assumptions. Function representation conditions tell us what to expect for generalization, and data coverage assumptions describe the quality requirement of the data. We then examine counterexamples, where offline RL is not solvable without an impractically large amount of data. These cases highlight what cannot be achieved for all algorithms and the inherent hardness of offline RL. Building on techniques to mitigate these challenges, we discuss the conditions that are sufficient for offline RL. These conditions are not merely assumptions for theoretical proofs, but they also reveal the limitations of these algorithms and remind us to search for novel solutions when the conditions cannot be satisfied.
Deep Reinforcement Learning with anticipatory reward in LSTM for Collision Avoidance of Mobile Robots
Poulet, Olivier, Guinand, Frรฉdรฉric, Guรฉrin, Franรงois
This article proposes a collision risk anticipation method based on short-term prediction of the agents position. A Long Short-Term Memory (LSTM) model, trained on past trajectories, is used to estimate the next position of each robot. This prediction allows us to define an anticipated collision risk by dynamically modulating the reward of a Deep Q-Learning Network (DQN) agent. The approach is tested in a constrained environment, where two robots move without communication or identifiers. Despite a limited sampling frequency (1 Hz), the results show a significant decrease of the collisions number and a stability improvement. The proposed method, which is computationally inexpensive, appears particularly attractive for implementation on embedded systems.
Learning to Align, Aligning to Learn: A Unified Approach for Self-Optimized Alignment
Wang, Haowen, Yue, Yun, Ye, Zhiling, Zhang, Shuowen, Fan, Lei, Liang, Jiaxin, Jiang, Jiadi, Wei, Cheng, Deng, Jingyuan, Han, Xudong, Li, Ji, Guo, Chunxiao, Wei, Peng, Wang, Jian, Gu, Jinjie
Alignment methodologies have emerged as a critical pathway for enhancing language model alignment capabilities. While SFT (supervised fine-tuning) accelerates convergence through direct token-level loss intervention, its efficacy is constrained by offline policy trajectory. In contrast, RL(reinforcement learning) facilitates exploratory policy optimization, but suffers from low sample efficiency and stringent dependency on high-quality base models. To address these dual challenges, we propose GRAO (Group Relative Alignment Optimization), a unified framework that synergizes the respective strengths of SFT and RL through three key innovations: 1) A multi-sample generation strategy enabling comparative quality assessment via reward feedback; 2) A novel Group Direct Alignment Loss formulation leveraging intra-group relative advantage weighting; 3) Reference-aware parameter updates guided by pairwise preference dynamics. Our theoretical analysis establishes GRAO's convergence guarantees and sample efficiency advantages over conventional approaches. Comprehensive evaluations across complex human alignment tasks demonstrate GRAO's superior performance, achieving 57.70\%,17.65\% 7.95\% and 5.18\% relative improvements over SFT, DPO, PPO and GRPO baselines respectively. This work provides both a theoretically grounded alignment framework and empirical evidence for efficient capability evolution in language models.
Robust Reinforcement Learning over Wireless Networks with Homomorphic State Representations
Talli, Pietro, Mason, Federico, Chiariotti, Federico, Zanella, Andrea
In this work, we address the problem of training Reinforcement Learning (RL) agents over communication networks. The RL paradigm requires the agent to instantaneously perceive the state evolution to infer the effects of its actions on the environment. This is impossible if the agent receives state updates over lossy or delayed wireless systems and thus operates with partial and intermittent information. In recent years, numerous frameworks have been proposed to manage RL with imperfect feedback; however, they often offer specific solutions with a substantial computational burden. To address these limits, we propose a novel architecture, named Homomorphic Robust Remote Reinforcement Learning (HR3L), that enables the training of remote RL agents exchanging observations across a non-ideal wireless channel. HR3L considers two units: the transmitter, which encodes meaningful representations of the environment, and the receiver, which decodes these messages and performs actions to maximize a reward signal. Importantly, HR3L does not require the exchange of gradient information across the wireless channel, allowing for quicker training and a lower communication overhead than state-of-the-art solutions. Experimental results demonstrate that HR3L significantly outperforms baseline methods in terms of sample efficiency and adapts to different communication scenarios, including packet losses, delayed transmissions, and capacity limitations.
TAR-TVG: Enhancing VLMs with Timestamp Anchor-Constrained Reasoning for Temporal Video Grounding
Guo, Chaohong, Mo, Xun, Nie, Yongwei, Xu, Xuemiao, Xu, Chao, Yu, Fei, Long, Chengjiang
Temporal Video Grounding (TVG) aims to precisely localize video segments corresponding to natural language queries, which is a critical capability for long-form video understanding. Although existing reinforcement learning approaches encourage models to generate reasoning chains before predictions, they fail to explicitly constrain the reasoning process to ensure the quality of the final temporal predictions. To address this limitation, we propose Timestamp Anchor-constrained Reasoning for Temporal Video Grounding (TAR-TVG), a novel framework that introduces timestamp anchors within the reasoning process to enforce explicit supervision to the thought content. These anchors serve as intermediate verification points. More importantly, we require each reasoning step to produce increasingly accurate temporal estimations, thereby ensuring that the reasoning process contributes meaningfully to the final prediction. To address the challenge of low-probability anchor generation in models (e.g., Qwen2.5-VL-3B), we develop an efficient self-distillation training strategy: (1) initial GRPO training to collect 30K high-quality reasoning traces containing multiple timestamp anchors, (2) supervised fine-tuning (SFT) on distilled data, and (3) final GRPO optimization on the SFT-enhanced model. This three-stage training strategy enables robust anchor generation while maintaining reasoning quality. Experiments show that our model achieves state-of-the-art performance while producing interpretable, verifiable reasoning chains with progressively refined temporal estimations.
MORE-CLEAR: Multimodal Offline Reinforcement learning for Clinical notes Leveraged Enhanced State Representation
Lim, Yooseok, Jeon, ByoungJun, Park, Seong-A, Lee, Jisoo, Choi, Sae Won, Jeong, Chang Wook, Ryu, Ho-Geol, Lee, Hongyeol, Yang, Hyun-Lim
Sepsis, a life-threatening inflammatory response to infection, causes organ dysfunction, making early detection and optimal management critical. Previous reinforcement learning (RL) approaches to sepsis management rely primarily on structured data, such as lab results or vital signs, and on a dearth of a comprehensive understanding of the patient's condition. In this work, we propose a Multimodal Offline REinforcement learning for Clinical notes Leveraged Enhanced stAte Representation (MORE-CLEAR) framework for sepsis control in intensive care units. MORE-CLEAR employs pre-trained large-scale language models (LLMs) to facilitate the extraction of rich semantic representations from clinical notes, preserving clinical context and improving patient state representation. Gated fusion and cross-modal attention allow dynamic weight adjustment in the context of time and the effective integration of multimodal data. Extensive cross-validation using two public (MIMIC-III and MIMIC-IV) and one private dataset demonstrates that MORE-CLEAR significantly improves estimated survival rate and policy performance compared to single-modal RL approaches. To our knowledge, this is the first to leverage LLM capabilities within a multimodal offline RL for better state representation in medical applications. This approach can potentially expedite the treatment and management of sepsis by enabling reinforcement learning models to propose enhanced actions based on a more comprehensive understanding of patient conditions.
End-to-End Humanoid Robot Safe and Comfortable Locomotion Policy
Wang, Zifan, Yang, Xun, Zhao, Jianzhuang, Zhou, Jiaming, Ma, Teli, Gao, Ziyao, Ajoudani, Arash, Liang, Junwei
The deployment of humanoid robots in unstructured, human-centric environments requires navigation capabilities that extend beyond simple locomotion to include robust perception, provable safety, and socially aware behavior. Current reinforcement learning approaches are often limited by blind controllers that lack environmental awareness or by vision-based systems that fail to perceive complex 3D obstacles. In this work, we present an end-to-end locomotion policy that directly maps raw, spatio-temporal LiDAR point clouds to motor commands, enabling robust navigation in cluttered dynamic scenes. We formulate the control problem as a Constrained Markov Decision Process (CMDP) to formally separate safety from task objectives. Our key contribution is a novel methodology that translates the principles of Control Barrier Functions (CBFs) into costs within the CMDP, allowing a model-free Penalized Proximal Policy Optimization (P3O) to enforce safety constraints during training. Furthermore, we introduce a set of comfort-oriented rewards, grounded in human-robot interaction research, to promote motions that are smooth, predictable, and less intrusive. We demonstrate the efficacy of our framework through a successful sim-to-real transfer to a physical humanoid robot, which exhibits agile and safe navigation around both static and dynamic 3D obstacles.