Reinforcement Learning
Asymmetric REINFORCE for off-Policy Reinforcement Learning: Balancing positive and negative rewards
Arnal, Charles, Narozniak, Gaëtan, Cabannes, Vivien, Tang, Yunhao, Kempe, Julia, Munos, Remi
Reinforcement learning (RL) is increasingly used to align large language models (LLMs). Off-policy methods offer greater implementation simplicity and data efficiency than on-policy techniques, but often result in suboptimal performance. In this work, we study the intermediate range of algorithms between off-policy RL and supervised fine-tuning by analyzing a simple off-policy REINFORCE algorithm, where the advantage is defined as $A=r-V$, with $r$ a reward and $V$ some tunable baseline. Intuitively, lowering $V$ emphasizes high-reward samples, while raising it penalizes low-reward ones more heavily. We first provide a theoretical analysis of this off-policy REINFORCE algorithm, showing that when the baseline $V$ lower-bounds the expected reward, the algorithm enjoys a policy improvement guarantee. Our analysis reveals that while on-policy updates can safely leverage both positive and negative signals, off-policy updates benefit from focusing more on positive rewards than on negative ones. We validate our findings experimentally in a controlled stochastic bandit setting and through fine-tuning state-of-the-art LLMs on reasoning tasks.
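As a rough illustration of the role of the baseline $V$, the sketch below computes the expected off-policy REINFORCE gradient with advantage $A=r-V$ for a softmax policy on a small bandit. The softmax parameterization, the function names, the step size, and the use of the exact expectation instead of sampled actions are assumptions made for this example; they are not details taken from the paper.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - logits.max()
    exp = np.exp(shifted)
    return exp / exp.sum()

def expected_gradient(logits, behavior, rewards, baseline):
    """Expected off-policy REINFORCE gradient E_{y~mu}[(r(y) - V) * grad log pi(y)].

    For a softmax policy, grad log pi(y) = e_y - pi, so the expectation
    reduces to w - sum(w) * pi with w(y) = mu(y) * (r(y) - V).
    """
    pi = softmax(logits)
    weights = behavior * (rewards - baseline)
    return weights - weights.sum() * pi

def train(rewards, behavior, baseline, steps=200, lr=0.5):
    """Plain gradient ascent from uniform logits (illustrative settings)."""
    logits = np.zeros(len(rewards))
    for _ in range(steps):
        logits = logits + lr * expected_gradient(logits, behavior, rewards, baseline)
    return softmax(logits)
```

For instance, with three arms, rewards `np.array([0.0, 0.5, 1.0])`, a uniform behavior policy, and `baseline=0.0` (below the behavior policy's expected reward of 0.5), the trained policy shifts probability mass toward the higher-reward arms. Lowering `baseline` further makes every weight positive and keeps the result closer to the behavior policy, which is the supervised fine-tuning end of the range the abstract describes.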
Smart Traffic Signals: Comparing MARL and Fixed-Time Strategies
Urban traffic congestion, particularly at intersections, significantly affects travel time, fuel consumption, and emissions. Traditional fixed-time signal control systems often lack the adaptability to effectively manage dynamic traffic patterns. This study explores the application of multi-agent reinforcement learning (MARL) to optimize traffic signal coordination across multiple intersections within a simulated environment. A simulation was developed to model a network of interconnected intersections with randomly generated vehicle flows to reflect realistic traffic variability. A decentralized MARL controller was implemented in which each traffic signal operates as an autonomous agent, making decisions based on local observations and information from neighboring agents. Performance was evaluated against a baseline fixed-time controller using metrics such as average vehicle wait time and overall throughput. The MARL approach demonstrated statistically significant improvements, including reduced average waiting times and improved throughput. These findings suggest that MARL-based dynamic control strategies hold substantial promise to improve urban traffic management efficiency. More research is recommended to address the challenges of scalability and real-world implementation.
Unraveling the Rainbow: can value-based methods schedule?
Corrêa, Arthur, Jesus, Alexandre, Nascimento, Paulo, Silva, Cristóvão, Moniz, Samuel
In this work, we conduct an extensive empirical study of several deep reinforcement learning algorithms on two challenging combinatorial optimization problems: the job-shop and flexible job-shop scheduling problems, both fundamental challenges with multiple industrial applications. Broadly, deep reinforcement learning algorithms fall into two categories: policy-gradient and value-based. While value-based algorithms have achieved notable success in domains such as the Arcade Learning Environment, the combinatorial optimization community has predominantly favored policy-gradient algorithms, often overlooking the potential of value-based alternatives. From our results, value-based algorithms demonstrated a lower variance and a more stable convergence profile compared to policy-gradient ones. Moreover, they achieved superior cross-size and cross-distribution generalization, that is, effectively solving instances that are substantially larger or structurally distinct from those seen during training. Finally, our analysis also suggests that the relative performance of each category of algorithms may be dependent on structural properties of the problem, such as problem flexibility and instance size. Overall, our findings challenge the prevailing assumption that policy-gradient algorithms are inherently superior for combinatorial optimization. We show instead that value-based algorithms can match or even surpass the performance of policy-gradient algorithms, suggesting that they deserve greater attention from the combinatorial optimization community. Our code is openly available at: https://github.com/AJ-Correa/Unraveling-the-Rainbow
Advancing Embodied Intelligence in Robotic-Assisted Endovascular Procedures: A Systematic Review of AI Solutions
Yao, Tianliang, Lu, Bo, Kowarschik, Markus, Yuan, Yixuan, Zhao, Hubin, Ourselin, Sebastien, Althoefer, Kaspar, Ge, Junbo, Qi, Peng
Endovascular procedures have revolutionized vascular disease treatment, yet their manual execution is challenged by the demands for high precision, operator fatigue, and radiation exposure. Robotic systems have emerged as transformative solutions to mitigate these inherent limitations. A pivotal moment has arrived, where a confluence of pressing clinical needs and breakthroughs in AI creates an opportunity for a paradigm shift toward Embodied Intelligence (EI), enabling robots to navigate complex vascular networks and adapt to dynamic physiological conditions. Data-driven approaches, leveraging advanced computer vision, medical image analysis, and machine learning, drive this evolution by enabling real-time vessel segmentation, device tracking, and anatomical landmark detection. Reinforcement learning and imitation learning further enhance navigation strategies and replicate expert techniques. This review systematically analyzes the integration of EI into endovascular robotics, identifying profound systemic challenges such as the heterogeneity in validation standards and the gap between human mimicry and machine-native capabilities. Based on this analysis, a conceptual roadmap is proposed that reframes the ultimate objective away from systems that supplant clinical decision-making. This vision of augmented intelligence, where the clinician's role evolves into that of a high-level supervisor, provides a principled foundation for the future of the field.
Breaking the Safety-Capability Tradeoff: Reinforcement Learning with Verifiable Rewards Maintains Safety Guardrails in LLMs
Cho, Dongkyu Derek, Song, Huan, Chowdhury, Arijit Ghosh, An, Haotian, Wang, Yawei, Thekkanal, Rohit, Sokhandan, Negin, Keshava, Sharlina, Marlowe, Hannah
This degradation persists across standard approaches including supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). While reinforcement learning with verifiable rewards (RLVR) has emerged as a promising alternative that optimizes models on objectively measurable tasks, its safety implications remain unexplored. We present the first comprehensive theoretical and empirical analysis of safety properties in RLVR. Theoretically, we derive upper bounds on safety drift under KL-constrained optimization and prove conditions under which safety degradation is eliminated. Empirically, we conduct extensive experiments across five adversarial safety benchmarks, demonstrating that RLVR can simultaneously enhance reasoning capabilities while maintaining or improving safety guardrails. Our comprehensive ablation studies examine the effects of optimization algorithms, model scale, and task domains. Our findings challenge the prevailing assumption of an inevitable safety-capability trade-off, and establish that a specific training methodology can achieve both objectives simultaneously, providing insights for the safe deployment of reasoning-capable LLMs.
Selecting Belief-State Approximations in Simulators with Latent States
State resetting is a fundamental but often overlooked capability of simulators. It supports sample-based planning by allowing resets to previously encountered simulation states, and enables calibration of simulators using real data by resetting to states observed in real-system traces. While often taken for granted, state resetting in complex simulators can be nontrivial: when the simulator comes with latent variables (states), state resetting requires sampling from the posterior over the latent state given the observable history, a.k.a. the belief state (Silver and Veness, 2010). While exact sampling is often infeasible, many approximate belief-state samplers can be constructed, raising the question of how to select among them using only sampling access to the simulator. In this paper, we show that this problem reduces to a general conditional distribution-selection task and develop a new algorithm and analysis under sampling-only access. Building on this reduction, the belief-state selection problem admits two different formulations: latent state-based selection, which directly targets the conditional distribution of the latent state, and observation-based selection, which targets the induced distribution over the observation. Interestingly, these formulations differ in how their guarantees interact with the downstream roll-out methods: perhaps surprisingly, observation-based selection may fail under the most natural roll-out method (which we call Single-Reset) but enjoys guarantees under the less conventional alternative (which we call Repeated-Reset). Together with discussion on issues such as distribution shift and the choice of sampling policies, our paper reveals a rich landscape of algorithmic choices, theoretical nuances, and open questions, in this seemingly simple problem.
Aligning LLMs Toward Multi-Turn Conversational Outcomes Using Iterative PPO
Jiang, Daniel R., Bhandari, Jalaj, Yang, Yukai, Munos, Rémi, Lu, Tyler
Optimizing large language models (LLMs) for multi-turn conversational outcomes remains a significant challenge, especially in goal-oriented settings like AI marketing or sales agents who facilitate transactions via messaging platforms. The difficulty stems from sparse, long-horizon rewards and the discrepancy between response-level planning and token-level generation. In this technical note, we propose a formal reduction of the multi-turn RL problem into a sequence of single-turn RLHF-style problems. This is achieved by setting a learned multi-turn Q-function as the reward model for the single-turn problem. We demonstrate and prove a key insight: solving this single-turn RL problem with standard token-level PPO is equivalent to a policy improvement step within the multi-turn problem. This insight naturally leads to Iterative PPO, a batch online policy iteration algorithm that alternates between fitting Q-functions from logged conversation trajectories and improving the policy. A major practical advantage is that Iterative PPO directly leverages stable, off-the-shelf single-turn RLHF tools, making it straightforward to implement. Our method occupies a middle ground between fully online and fully offline approaches, retaining the adaptability of online updates while gaining the stability benefits of offline training.
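The alternation described above can be sketched in a small tabular setting. This is only an illustration under stated assumptions: the Q-function is fit by Monte Carlo averaging of final outcomes, unseen state-action pairs default to a value of 0, and the PPO step is replaced by the closed-form maximizer of the KL-regularized objective, $\pi_{\text{new}}(a|s) \propto \pi_{\text{old}}(a|s)\exp(Q(s,a)/\beta)$. The function names, the data layout, and `beta` are hypothetical and do not come from the note.

```python
import math
from collections import defaultdict

def fit_q_monte_carlo(trajectories):
    """Estimate Q(s, a) as the average final outcome over logged visits.

    Each trajectory is (steps, outcome), where steps is a list of
    (state, action) pairs and outcome is the conversation-level reward.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for steps, outcome in trajectories:
        for state, action in steps:
            totals[(state, action)] += outcome
            counts[(state, action)] += 1
    return {key: totals[key] / counts[key] for key in totals}

def improve_policy(policy, q_values, beta=0.5):
    """One KL-regularized improvement step using Q as the single-turn reward.

    Stand-in for the PPO step: uses the closed-form solution
    new(a|s) proportional to old(a|s) * exp(Q(s, a) / beta).
    Unseen (state, action) pairs are assumed to have Q = 0.
    """
    new_policy = {}
    for state, action_probs in policy.items():
        scores = {
            action: prob * math.exp(q_values.get((state, action), 0.0) / beta)
            for action, prob in action_probs.items()
        }
        norm = sum(scores.values())
        new_policy[state] = {action: s / norm for action, s in scores.items()}
    return new_policy
```

One iteration then consists of collecting trajectories with the current policy, calling `fit_q_monte_carlo` on them, and passing the result to `improve_policy`; repeating this loop gives the batch policy iteration structure.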
Predictive Safety Shield for Dyna-Q Reinforcement Learning
Jin, Pin, Krasowski, Hanna, Vanneaux, Elena
Obtaining safety guarantees for reinforcement learning is a major challenge for its applicability to real-world tasks. Safety shields extend standard reinforcement learning and achieve hard safety guarantees. However, existing safety shields commonly use random sampling of safe actions or a fixed fallback controller, thereby disregarding the future performance implications of different safe actions. In this work, we propose a predictive safety shield for model-based reinforcement learning agents in discrete spaces. Our safety shield updates the Q-function locally based on safe predictions, which originate from a safe simulation of the environment model. This shielding approach improves performance while maintaining hard safety guarantees. Our experiments on gridworld environments demonstrate that even short prediction horizons can be sufficient to identify the optimal path. We observe that our approach is robust to distribution shifts, e.g., between simulation and reality, without requiring additional training.
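To make the idea of choosing among safe actions by prediction more concrete, here is a minimal sketch of a safe h-step lookahead on a deterministic gridworld. It is not the authors' algorithm: the grid model, the reward of 1 for reaching the goal, the bootstrapping from current Q-values at the end of the horizon, and all function names are assumptions for illustration. The sketch also assumes that every safe state has at least one safe action.

```python
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def model_step(state, action, size):
    """Deterministic grid model; moves off the grid leave the agent in place."""
    d_row, d_col = ACTIONS[action]
    row = min(max(state[0] + d_row, 0), size - 1)
    col = min(max(state[1] + d_col, 0), size - 1)
    return (row, col)

def safe_actions(state, unsafe, size):
    """Actions whose predicted next cell is not marked unsafe."""
    return [a for a in ACTIONS if model_step(state, a, size) not in unsafe]

def lookahead(state, action, q, unsafe, goal, size, horizon, gamma):
    """Predicted return of `action` followed by safe actions within the horizon."""
    nxt = model_step(state, action, size)
    if nxt == goal:
        return 1.0  # assumed reward; the goal is treated as terminal
    options = safe_actions(nxt, unsafe, size)  # assumed to be non-empty
    if horizon == 1:
        # End of the prediction horizon: bootstrap from current Q estimates.
        tail = max(q.get((nxt, a), 0.0) for a in options)
    else:
        tail = max(
            lookahead(nxt, a, q, unsafe, goal, size, horizon - 1, gamma)
            for a in options
        )
    return gamma * tail

def shielded_action(state, q, unsafe, goal, size, horizon=4, gamma=0.9):
    """Update Q locally for the safe actions and return the best one."""
    candidates = safe_actions(state, unsafe, size)
    values = {
        a: lookahead(state, a, q, unsafe, goal, size, horizon, gamma)
        for a in candidates
    }
    for a, value in values.items():
        q[(state, a)] = value
    return max(candidates, key=lambda a: values[a])
```

Because the returned action is always drawn from `safe_actions`, the shield never selects a move into an unsafe cell under this model, while the lookahead lets it prefer the safe action with the best predicted outcome over a random safe one.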
DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research
Shao, Rulin, Asai, Akari, Shen, Shannon Zejiang, Ivison, Hamish, Kishore, Varsha, Zhuo, Jingming, Zhao, Xinran, Park, Molly, Finlayson, Samuel G., Sontag, David, Murray, Tyler, Min, Sewon, Dasigi, Pradeep, Soldaini, Luca, Brahman, Faeze, Yih, Wen-tau, Wu, Tongshuang, Zettlemoyer, Luke, Kim, Yoon, Hajishirzi, Hannaneh, Koh, Pang Wei
Deep research models perform multi-step research to produce long-form, well-attributed answers. However, most open deep research models are trained on easily verifiable short-form QA tasks via reinforcement learning with verifiable rewards (RLVR), which does not extend to realistic long-form tasks. We address this with Reinforcement Learning with Evolving Rubrics (RLER), in which we construct and maintain rubrics that co-evolve with the policy model during training; this allows the rubrics to incorporate information that the model has newly explored and to provide discriminative, on-policy feedback. Using RLER, we develop Deep Research Tulu (DR Tulu-8B), the first open model that is directly trained for open-ended, long-form deep research. Across four long-form deep research benchmarks in science, healthcare and general domains, DR Tulu substantially outperforms existing open deep research models, and matches or exceeds proprietary deep research systems, while being significantly smaller and cheaper per query. To facilitate future research, we release all data, models, and code, including our new MCP-based agent infrastructure for deep research systems.
X-Nav: Learning End-to-End Cross-Embodiment Navigation for Mobile Robots
Wang, Haitong, Tan, Aaron Hao, Fung, Angus, Nejat, Goldie
Existing navigation methods are primarily designed for specific robot embodiments, limiting their generalizability across diverse robot platforms. In this paper, we introduce X-Nav, a novel framework for end-to-end cross-embodiment navigation where a single unified policy can be deployed across various embodiments for both wheeled and quadrupedal robots. X-Nav consists of two learning stages: 1) multiple expert policies are trained using deep reinforcement learning with privileged observations on a wide range of randomly generated robot embodiments; and 2) a single general policy is distilled from the expert policies via navigation action chunking with transformer (Nav-ACT). The general policy directly maps visual and proprioceptive observations to low-level control commands, enabling generalization to novel robot embodiments. Simulated experiments demonstrated that X-Nav achieved zero-shot transfer to both unseen embodiments and photorealistic environments. A scalability study showed that the performance of X-Nav improves when trained with an increasing number of randomly generated embodiments. An ablation study confirmed the design choices of X-Nav. Furthermore, real-world experiments were conducted to validate the generalizability of X-Nav in real-world environments.