Reinforcement Learning
Oracle-RLAIF: An Improved Fine-Tuning Framework for Multi-modal Video Models through Reinforcement Learning from Ranking Feedback
Shi, Derek, Glatt, Ruben, Klymko, Christine, Mohole, Shubham, Choi, Hongjun, Kushwaha, Shashank, Sakla, Sam, da Silva, Felipe Leno
Recent advances in large video-language models (VLMs) rely on extensive fine-tuning techniques that strengthen alignment between textual and visual comprehension. Leading pipelines typically pair supervised fine-tuning (SFT) with reinforcement learning from preference data to enhance video comprehension. However, as VLMs scale in parameter size, so does the cost of gathering enough human feedback. To make fine-tuning more cost-effective, recent frameworks explore reinforcement learning with AI feedback (RLAIF), which replace human preference with AI as a judge. Current RLAIF frameworks rely on a specialized reward model trained with video narratives to create calibrated scalar rewards -- an expensive and restrictive pipeline. We propose Oracle-RLAIF, a novel framework that replaces the trained reward model with a more general Oracle ranker which acts as a drop-in model ranking candidate model responses rather than scoring them. Alongside Oracle-RLAIF, we introduce $GRPO_{rank}$, a novel rank-based loss function based on Group Relative Policy Optimization (GRPO) that directly optimizes ordinal feedback with rank-aware advantages. Empirically, we demonstrate that Oracle-RLAIF consistently outperforms leading VLMs using existing fine-tuning methods when evaluated across various video comprehension benchmarks. Oracle-RLAIF paves the path to creating flexible and data-efficient frameworks for aligning large multi-modal video models with reinforcement learning from rank rather than score.
Octax: Accelerated CHIP-8 Arcade Environments for Reinforcement Learning in JAX
Radji, Waris, Michel, Thomas, Piteau, Hector
Modern reinforcement learning (RL) research (Sutton & Barto, 2018) demands extensive experimentation to achieve statistical validity, yet computational constraints severely limit experimental scale. RL papers routinely report results with fewer than five random seeds due to prohibitive training costs (Henderson et al., 2018; Colas et al., 2018; Agarwal et al., 2021; Mathieu et al., 2023; Gardner et al., 2025). While understandable from a practical standpoint, this undersampling undermines statistical reliability and impedes algorithmic progress. Environment execution creates this bottleneck: while deep learning has embraced end-to-end GPU acceleration, RL environments remain predominantly CPU-bound. Originally designed under severe hardware constraints, classic arcade games represent a solution for scalable RL experimentation. The Atari Learning Environment (ALE) (Bellemare et al., 2013) has established itself as a standard RL benchmark, although existing implementations remain fundamentally CPU-bound. As noted by Obando-Ceron & Castro (2020), the Rainbow paper (Hessel et al., 2018) required 34,200 GPU hours (equivalent to 1,425 days) of experiments, a computational cost that is prohibitively high for small research laboratories. In this paper, we propose an alternative approach for training RL agents in environments with mechanisms similar to ALE, with significantly reduced computational cost.
MTRec: Learning to Align with User Preferences via Mental Reward Models
Zhao, Mengchen, Gao, Yifan, Hou, Yaqing, Li, Xiangyang, Gu, Pengjie, Dong, Zhenhua, Tang, Ruiming, Cai, Yi
Recommendation models are predominantly trained using implicit user feedback, since explicit feedback is often costly to obtain. However, implicit feedback, such as clicks, does not always reflect users' real preferences. For example, a user might click on a news article because of its attractive headline, but end up feeling uncomfortable after reading the content. In the absence of explicit feedback, such erroneous implicit signals may severely mislead recommender systems. In this paper, we propose MTRec, a novel sequential recommendation framework designed to align with real user preferences by uncovering their internal satisfaction on recommended items. Specifically, we introduce a mental reward model to quantify user satisfaction and propose a distributional inverse reinforcement learning approach to learn it. The learned mental reward model is then used to guide recommendation models to better align with users' real preferences. Our experiments show that MTRec brings significant improvements to a variety of recommendation models. We also deploy MTRec on an industrial short video platform and observe a 7 percent increase in average user viewing time.
Towards Provable Emergence of In-Context Reinforcement Learning
Wang, Jiuqi, Chandra, Rohan, Zhang, Shangtong
Typically, a modern reinforcement learning (RL) agent solves a task by updating its neural network parameters to adapt its policy to the task. Recently, it has been observed that some RL agents can solve a wide range of new out-of-distribution tasks without parameter updates after pretraining on some task distribution. When evaluated in a new task, instead of making parameter updates, the pretrained agent conditions its policy on additional input called the context, e.g., the agent's interaction history in the new task. The agent's performance increases as the information in the context increases, with the agent's parameters fixed. This phenomenon is typically called in-context RL (ICRL). The pretrained parameters of the agent network enable the remarkable ICRL phenomenon. However, many ICRL works perform the pretraining with standard RL algorithms. This raises the central question this paper aims to address: Why can the RL pretraining algorithm generate network parameters that enable ICRL? We hypothesize that the parameters capable of ICRL are minimizers of the pretraining loss. This work provides initial support for this hypothesis through a case study. In particular, we prove that when a Transformer is pretrained for policy evaluation, one of the global minimizers of the pretraining loss can enable in-context temporal difference learning.
STORI: A Benchmark and Taxonomy for Stochastic Environments
Barsainyan, Aryan Amit, Lim, Jing Yu, Liu, Dianbo
Reinforcement learning (RL) techniques have achieved impressive performance on simulated benchmarks such as Atari100k, yet recent advances remain largely confined to simulation and show limited transfer to real-world domains. A central obstacle is environmental stochasticity, as real systems involve noisy observations, unpredictable dynamics, and non-stationary conditions that undermine the stability of current methods. Existing benchmarks rarely capture these uncertainties and favor simplified settings where algorithms can be tuned to succeed. The absence of a well-defined taxonomy of stochasticity further complicates evaluation, as robustness to one type of stochastic perturbation, such as sticky actions, does not guarantee robustness to other forms of uncertainty. To address this critical gap, we introduce STORI (STOchastic-ataRI), a benchmark that systematically incorporates diverse stochastic effects and enables rigorous evaluation of RL techniques under different forms of uncertainty. We propose a comprehensive five-type taxonomy of environmental stochasticity and demonstrate systematic vulnerabilities in state-of-the-art model-based RL algorithms through targeted evaluation of DreamerV3 and STORM. Our findings reveal that world models dramatically underestimate environmental variance, struggle with action corruption, and exhibit unreliable dynamics under partial observability. We release the code and benchmark publicly at https://github.com/ARY2260/stori, providing a unified framework for developing more robust RL systems.
Graph-Reward-SQL: Execution-Free Reinforcement Learning for Text-to-SQL via Graph Matching and Stepwise Reward
Weng, Han, Wu, Puzhen, Cui, Longjie, Zhan, Yi, Liu, Boyi, Song, Yuanfeng, Zeng, Dun, Yang, Yingxiang, Zhang, Qianru, Huang, Dong, Yin, Xiaoming, Sun, Yang, Chen, Xing
Reinforcement learning (RL) has been widely adopted to enhance the performance of large language models (LLMs) on Text-to-SQL tasks. However, existing methods often rely on execution-based or LLM-based Bradley-Terry reward models. The former suffers from high execution latency caused by repeated database calls, whereas the latter imposes substantial GPU memory overhead, both of which significantly hinder the efficiency and scalability of RL pipelines. To this end, we propose a novel reward model framework for RL-based Text-to-SQL named Graph-Reward-SQL, which employs the GMNScore outcome reward model. We leverage SQL graph representations to provide accurate reward signals while significantly reducing time cost and GPU memory usage. Building on this foundation, we further introduce StepRTM, a stepwise reward model that provides intermediate supervision over Common Table Expression (CTE) subqueries. This encourages both functional correctness and readability of SQL. Extensive comparative and ablation experiments on standard benchmarks, including Spider and BIRD, demonstrate that our method consistently outperforms existing reward models.
Use the Online Network If You Can: Towards Fast and Stable Reinforcement Learning
Hendawy, Ahmed, Metternich, Henrik, Vincent, Thรฉo, Kallel, Mahdi, Peters, Jan, D'Eramo, Carlo
The use of target networks is a popular approach for estimating value functions in deep Reinforcement Learning (RL). While effective, the target network remains a compromise solution that preserves stability at the cost of slowly moving targets, thus delaying learning. Conversely, using the online network as a bootstrapped target is intuitively appealing, albeit well-known to lead to unstable learning. In this work, we aim to obtain the best out of both worlds by introducing a novel update rule that computes the target using the MINimum estimate between the Target and Online network, giving rise to our method, MINTO. Through this simple, yet effective modification, we show that MINTO enables faster and stable value function learning, by mitigating the potential overestimation bias of using the online network for bootstrapping. Notably, MINTO can be seamlessly integrated into a wide range of value-based and actor-critic algorithms with a negligible cost. We evaluate MINTO extensively across diverse benchmarks, spanning online and of-fline RL, as well as discrete and continuous action spaces. Across all benchmarks, MINTO consistently improves performance, demonstrating its broad applicability and effectiveness. Reinforcement Learning (RL) has demonstrated exceptional performance and achieved major breakthroughs across a diverse spectrum of decision-making challenges. Noteworthy applications include learning complex locomotion skills (Haarnoja et al., 2018b; Rudin et al., 2022) and enabling sophisticated, real-world capabilities such as robotic manipulation (Andrychowicz et al., 2020; Lu et al., 2025). The foundation of this success lies primarily in Deep RL, initiated by the introduction of the Deep Q-Network (DQN) (Mnih et al., 2013), which marked the first successful application of deep neural networks in RL. To make that happen, Mnih et al. (2013) introduce various techniques to mitigate mainly the deadly triad issue (V an Hasselt et al., 2018) due to the usage of function approximators, off-policy data, and target bootstrapping.
A Benchmark Study of Deep Reinforcement Learning Algorithms for the Container Stowage Planning Problem
Huang, Yunqi, Chennakeshava, Nishith, Carras, Alexis, Neverov, Vladislav, Liu, Wei, Plaat, Aske, Fan, Yingjie
Container stowage planning (CSPP) is a critical component of maritime transportation and terminal operations, directly affecting supply chain efficiency. Owing to its complexity, CSPP has traditionally relied on human expertise. While reinforcement learning (RL) has recently been applied to CSPP, systematic benchmark comparisons across different algorithms remain limited. To address this gap, we develop a Gym environment that captures the fundamental features of CSPP and extend it to include crane scheduling in both multi-agent and single-agent formulations. Within this framework, we evaluate five RL algorithms: DQN, QR-DQN, A2C, PPO, and TRPO under multiple scenarios of varying complexity. The results reveal distinct performance gaps with increasing complexity, underscoring the importance of algorithm choice and problem formulation for CSPP.
A Recipe for Efficient Sim-to-Real Transfer in Manipulation with Online Imitation-Pretrained World Models
Wang, Yilin, Li, Shangzhe, Niu, Haoyi, Huang, Zhiao, Zhang, Weitong, Su, Hao
We are interested in solving the problem of imitation learning with a limited amount of real-world expert data. Existing offline imitation methods often struggle with poor data coverage and severe performance degradation. We propose a solution that leverages robot simulators to achieve online imitation learning. Our sim-to-real framework is based on world models and combines online imitation pretraining with offline finetuning. By leveraging online interactions, our approach alleviates the data coverage limitations of offline methods, leading to improved robustness and reduced performance degradation during finetuning. It also enhances generalization during domain transfer. Our empirical results demonstrate its effectiveness, improving success rates by at least 31.7% in sim-to-sim transfer and 23.3% in sim-to-real transfer over existing offline imitation learning baselines.
Beyond Imitation: Recovering Dense Rewards from Demonstrations
Li, Jiangnan, Vu, Thuy-Trang, Abbasnejad, Ehsan, Haffari, Gholamreza
Conventionally, supervised fine-tuning (SFT) is treated as a simple imitation learning process that only trains a policy to imitate expert behavior on demonstration datasets. In this work, we challenge this view by establishing a fundamental equivalence between SFT and Inverse Reinforcement Learning. We prove that the SFT objective is a special case of Inverse Q-Learning, which implies that the SFT process does not just learn a policy, but also an implicit, dense, token-level reward model that explains the expert demonstrations. We then show how to recover this dense reward signal directly from the SFT model by formulating a baseline-relative reward function. The availability of such a dense reward model offers numerous benefits, providing granular credit assignment for each token generated. We demonstrate one key application by using these recovered rewards to further improve the policy with reinforcement learning. Our method, Dense-Path REINFORCE, consistently outperforms the original SFT models on instruction-following benchmarks. This work reframes SFT not merely as policy imitation but as a powerful reward learning mechanism, opening new possibilities for leveraging expert demonstrations.