Reinforcement Learning
Constraint-Aware Reinforcement Learning via Adaptive Action Scaling
Dawood, Murad, Siddiquie, Usama Ahmed, Khorshidi, Shahram, Bennewitz, Maren
Safe reinforcement learning (RL) seeks to mitigate unsafe behaviors that arise from exploration during training by reducing constraint violations while maintaining task performance. Existing approaches typically rely on a single policy to jointly optimize reward and safety, which can cause instability due to conflicting objectives, or they use external safety filters that override actions and require prior system knowledge. In this paper, we propose a modular cost-aware regulator that scales the agent's actions based on predicted constraint violations, preserving exploration through smooth action modulation rather than overriding the policy. The regulator is trained to minimize constraint violations while avoiding degenerate suppression of actions. Our approach integrates seamlessly with off-policy RL methods such as SAC and TD3, and achieves state-of-the-art return-to-cost ratios on Safety Gym locomotion tasks with sparse costs, reducing constraint violations by up to 126 times while increasing returns by over an order of magnitude compared to prior methods.
KnowRL: Teaching Language Models to Know What They Know
Kale, Sahil, Dhami, Devendra Singh
Truly reliable AI requires more than simply scaling up knowledge; it demands the ability to know what it knows and when it does not. Yet recent research shows that even the best LLMs misjudge their own competence in more than one in five cases, making any response born of such internal uncertainty impossible to fully trust. Inspired by self-improvement reinforcement learning techniques that require minimal data, we present a simple but powerful framework KnowRL that strengthens a model's internal understanding of its own feasibility boundaries, enabling safer and more responsible behaviour. Our framework combines two components: (i) introspection, where the model generates and classifies tasks it judges feasible or infeasible, and (ii) consensus-based rewarding, where stability of self-knowledge assessment is reinforced through internal agreement. By using internally generated data, this design strengthens consistency in self-knowledge and entirely avoids costly external supervision. In experiments on LLaMA-3.1-8B and Qwen-2.5-7B, KnowRL steadily improved self-knowledge, validated by both intrinsic self-consistency and extrinsic benchmarking. With nothing more than a small seed set and no external supervision, our method drove gains as high as 28% in accuracy and 12% in F1, outperforming baselines in just a few iterations. Our framework essentially unlocks the untapped capacity of LLMs to self-improve their knowledge awareness, opening the door to reliable, more accountable AI and safer deployment in critical applications. Owing to its simplicity and independence from external effort, we encourage applying this reliability-enhancing process to all future models.
Part II: ROLL Flash -- Accelerating RLVR and Agentic Training with Asynchrony
Lu, Han, Liu, Zichen, Xiong, Shaopan, He, Yancheng, Gao, Wei, Wu, Yanan, Wang, Weixun, Liu, Jiashun, Li, Yang, Zhao, Haizhou, Huang, Ju, Yang, Siran, Li, Xiaoyang, Luo, Yijia, Liu, Zihe, Pan, Ling, Yan, Junchi, Wang, Wei, Su, Wenbo, Wang, Jiamang, Qu, Lin, Zheng, Bo
Synchronous Reinforcement Learning (RL) post-training has emerged as a crucial step for enhancing Large Language Models (LLMs) with diverse capabilities. However, many systems designed to accelerate RL post-training still suffer from low resource utilization and limited scalability. We present ROLL Flash, a system that extends ROLL with native support for asynchronous RL post-training. ROLL Flash is built upon two core design principles: fine-grained parallelism and rollout-train decoupling. Guided by these principles, ROLL Flash provides flexible programming interfaces that enable a fully asynchronous training architecture and support efficient rollout mechanisms, including queue scheduling and environment-level asynchronous execution. Through comprehensive theoretical analysis and extensive experiments, we demonstrate that ROLL Flash significantly improves resource utilization and scalability over synchronous RL post-training. ROLL Flash achieves up to 2.24x speedup on RLVR tasks and 2.72x on agentic tasks, using the same GPU budget as synchronous baselines. Furthermore, we implement several popular off-policy algorithms and verify that asynchronous training can achieve performance on par with synchronous training.
Gym-TORAX: Open-source software for integrating RL with plasma control simulators
Mouchamps, Antoine, Malherbe, Arthur, Bolland, Adrien, Ernst, Damien
This paper presents Gym-TORAX, a Python package enabling the implementation of Reinforcement Learning (RL) environments for simulating plasma dynamics and control in tokamaks. Users define succinctly a set of control actions and observations, and a control objective from which Gym-TORAX creates a Gymnasium environment that wraps TORAX for simulating the plasma dynamics. The objective is formulated through rewards depending on the simulated state of the plasma and control action to optimize specific characteristics of the plasma, such as performance and stability. The resulting environment instance is then compatible with a wide range of RL algorithms and libraries and will facilitate RL research in plasma control. In its current version, one environment is readily available, based on a ramp-up scenario of the International Thermonuclear Experimental Reactor (ITER).
Emergence of hybrid computational dynamics through reinforcement learning
Kononov, Roman A., Pospelov, Nikita A., Anokhin, Konstantin V., Nekorkin, Vladimir V., Maslennikov, Oleg V.
Understanding how learning algorithms shape the computational strategies that emerge in neural networks remains a fundamental challenge in machine intelligence. While network architectures receive extensive attention, the role of the learning paradigm itself in determining emergent dynamics remains largely unexplored. Here we demonstrate that reinforcement learning (RL) and supervised learning (SL) drive recurrent neural networks (RNNs) toward fundamentally different computational solutions when trained on identical decision-making tasks. Through systematic dynamical systems analysis, we reveal that RL spontaneously discovers hybrid attractor architectures, combining stable fixed-point attractors for decision maintenance with quasi-periodic attractors for flexible evidence integration. This contrasts sharply with SL, which converges almost exclusively to simpler fixed-point-only solutions. We further show that RL sculpts functionally balanced neural populations through a powerful form of implicit regularization -- a structural signature that enhances robustness and is conspicuously absent in the more heterogeneous solutions found by SL-trained networks. The prevalence of these complex dynamics in RL is controllably modulated by weight initialization and correlates strongly with performance gains, particularly as task complexity increases. Our results establish the learning algorithm as a primary determinant of emergent computation, revealing how reward-based optimization autonomously discovers sophisticated dynamical mechanisms that are less accessible to direct gradient-based optimization. These findings provide both mechanistic insights into neural computation and actionable principles for designing adaptive AI systems.
Fine-tuning Behavioral Cloning Policies with Preference-Based Reinforcement Learning
Macuglia, Maรซl, Friedrich, Paul, Ramponi, Giorgia
Deploying reinforcement learning (RL) in robotics, industry, and health care is blocked by two obstacles: the difficulty of specifying accurate rewards and the risk of unsafe, data-hungry exploration. We address this by proposing a two-stage framework that first learns a safe initial policy from a reward-free dataset of expert demonstrations, then fine-tunes it online using preference-based human feedback. We provide the first principled analysis of this offline-to-online approach and introduce BRIDGE, a unified algorithm that integrates both signals via an uncertainty-weighted objective. We derive regret bounds that shrink with the number of offline demonstrations, explicitly connecting the quantity of offline data to online sample efficiency. We validate BRIDGE in discrete and continuous control MuJoCo environments, showing it achieves lower regret than both standalone behavioral cloning and online preference-based RL. Our work establishes a theoretical foundation for designing more sample-efficient interactive agents.
Trust Region Reward Optimization and Proximal Inverse Reward Optimization Algorithm
Chen, Yang, Zou, Menglin, Zhang, Jiaqi, Zhang, Yitan, Yang, Junyi, Gendron, Gael, Zhang, Libo, Liu, Jiamou, Witbrock, Michael J.
Inverse Reinforcement Learning (IRL) learns a reward function to explain expert demonstrations. Modern IRL methods often use the adversarial (minimax) formulation that alternates between reward and policy optimization, which often lead to unstable training. Recent non-adversarial IRL approaches improve stability by jointly learning reward and policy via energy-based formulations but lack formal guarantees. This work bridges this gap. We first present a unified view showing canonical non-adversarial methods explicitly or implicitly maximize the likelihood of expert behavior, which is equivalent to minimizing the expected return gap. This insight leads to our main contribution: Trust Region Reward Optimization (TRRO), a framework that guarantees monotonic improvement in this likelihood via a Minorization-Maximization process. We instantiate TRRO into Proximal Inverse Reward Optimization (PIRO), a practical and stable IRL algorithm. Theoretically, TRRO provides the IRL counterpart to the stability guarantees of Trust Region Policy Optimization (TRPO) in forward RL. Empirically, PIRO matches or surpasses state-of-the-art baselines in reward recovery, policy imitation with high sample efficiency on MuJoCo and Gym-Robotics benchmarks and a real-world animal behavior modeling task.
A Primer on SO(3) Action Representations in Deep Reinforcement Learning
Schuck, Martin, Samy, Sherif, Schoellig, Angela P.
Many robotic control tasks require policies to act on orientations, yet the geometry of SO(3) makes this nontrivial. Because SO(3) admits no global, smooth, minimal parameterization, common representations such as Euler angles, quaternions, rotation matrices, and Lie algebra coordinates introduce distinct constraints and failure modes. While these trade-offs are well studied for supervised learning, their implications for actions in reinforcement learning remain unclear. We systematically evaluate SO(3) action representations across three standard continuous control algorithms, PPO, SAC, and TD3, under dense and sparse rewards. We compare how representations shape exploration, interact with entropy regularization, and affect training stability through empirical studies and analyze the implications of different projections for obtaining valid rotations from Euclidean network outputs. Across a suite of robotics benchmarks, we quantify the practical impact of these choices and distill simple, implementation-ready guidelines for selecting and using rotation actions. Our results highlight that representation-induced geometry strongly influences exploration and optimization and show that representing actions as tangent vectors in the local frame yields the most reliable results across algorithms. Accurate reasoning over 3D rotations is a core requirement for machine learning algorithms applied in computer graphics, state estimation and control. In robotics and embodied intelligence, the problem extends to controlling physical orientations through learned actions, e.g., in manipulation policies that command full task-space poses or aerial vehicles that regulate attitude. These tasks rely on trained policies with action spaces including rotations in SO(3). This restriction has led to multiple parameterizations, each with its own tradeoffs (Macdonald, 2011; Barfoot, 2017). Euler angles are minimal and intuitive but suffer from order dependence, angle wrapping, and gimbal-lock singularities. Quaternions are smooth and numerically robust with a simple unit-norm constraint, but double-cover SO(3). Rotation matrices are a smooth and unique mapping, but are heavily over-parameterized and require orthonormalization. Viewing SO(3) as a Lie group, one can use tangent spaces, i.e., the Lie algebra m of skew-symmetric matrices, together with the exponential and logarithm maps to represent orientations. Tangent spaces are locally smooth, but globally exhibit singularities at large angles (Sol ` a et al., 2018). Irrespective of the choice of parameterization, any minimal 3-parameter chart must incur singularities, and global parameterizations that avoid singularities are necessarily redundant and constrained. Applications in deep learning that require reasoning over rotations and orientations have renewed interest in this topic by adding another perspective: irrespective of any mathematical properties, what is the best representation to learn from data in SO(3)?
Unveiling Uncertainty-Aware Autonomous Cooperative Learning Based Planning Strategy
Zhang, Shiyao, Deng, Liwei, Zhang, Shuyu, Yuan, Weijie, Zhang, Hong
Abstract--In future intelligent transportation systems, autonomous cooperative planning (ACP), becomes a promising technique to increase the effectiveness and security of multi-vehicle interactions. However, multiple uncertainties cannot be fully addressed for existing ACP strategies, e.g. T o address these, a novel deep reinforcement learning-based autonomous cooperative planning (DRLACP) framework is proposed to tackle various uncertainties on cooperative motion planning schemes. Specifically, the soft actor-critic (SAC) with the implementation of gate recurrent units (GRUs) is adopted to learn the deterministic optimal time-varying actions with imperfect state information occurred by planning, communication, and perception uncertainties. In addition, the real-time actions of autonomous vehicles (A Vs) are demonstrated via the Car Learning to Act (CARLA) simulation platform. Evaluation results show that the proposed DRLACP learns and performs cooperative planning effectively, which outperforms other baseline methods under different scenarios with imperfect A V state information. Y facilitating communication and interaction among formerly isolated vehicles, multi-vehicle systems have the potential to significantly accelerate task completion in transportation systems, such as platoon formation and collaborative planning operations [1], [2].
Refinery: Active Fine-tuning and Deployment-time Optimization for Contact-Rich Policies
Tang, Bingjie, Akinola, Iretiayo, Xu, Jie, Wen, Bowen, Fox, Dieter, Sukhatme, Gaurav S., Ramos, Fabio, Gupta, Abhishek, Narang, Yashraj
Simulation-based learning has enabled policies for precise, contact-rich tasks (e.g., robotic assembly) to reach high success rates (~80%) under high levels of observation noise and control error. Although such performance may be sufficient for research applications, it falls short of industry standards and makes policy chaining exceptionally brittle. A key limitation is the high variance in individual policy performance across diverse initial conditions. We introduce Refinery, an effective framework that bridges this performance gap, robustifying policy performance across initial conditions. We propose Bayesian Optimization-guided fine-tuning to improve individual policies, and Gaussian Mixture Model-based sampling during deployment to select initializations that maximize execution success. Using Refinery, we improve mean success rates by 10.98% over state-of-the-art methods in simulation-based learning for robotic assembly, reaching 91.51% in simulation and comparable performance in the real world. Furthermore, we demonstrate that these fine-tuned policies can be chained to accomplish long-horizon, multi-part assembly$\unicode{x2013}$successfully assembling up to 8 parts without requiring explicit multi-step training.