Offline Reinforcement Learning with Reverse Model-based Imagination

Neural Information Processing Systems

After the bidirectional models and rollout policies are well trained, we utilize them to generate imaginary trajectories, conducting double checks and admitting only high-confidence transitions.

However, in many real-world applications, collecting sufficient exploratory interactions is usually impractical, because online data collection can be costly or even dangerous, such as in healthcare [4] and autonomous driving [5]. To address this challenge, offline RL [6, 7] develops a new learning paradigm that trains RL agents only with pre-collected offline datasets and thus can abstract away the cost of online exploration [8-17].
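The "double check" idea in the snippet above can be illustrated with a small sketch. The function names, the distance-based agreement criterion, and the threshold are my assumptions for illustration, not the paper's actual method: a transition imagined by one model is admitted only if the model for the opposite direction reconstructs a consistent state.

```python
import numpy as np

def double_check_transitions(forward_model, backward_model, states, actions,
                             threshold=0.1):
    """Sketch of a double-check filter for imagined transitions.

    A transition (s, a, s') proposed by the forward model is admitted only
    if the backward model, run from s', reconstructs a state close to s,
    i.e. the two directions agree, marking the transition high-confidence.
    """
    admitted = []
    for s, a in zip(states, actions):
        s_next = forward_model(s, a)        # imagine one step forward
        s_back = backward_model(s_next, a)  # trace the same step backward
        if np.linalg.norm(s_back - s) < threshold:
            admitted.append((s, a, s_next))
    return admitted
```

When the two models disagree (reconstruction error above the threshold), the transition is simply dropped rather than added to the imagined dataset.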




Stabilizing Reinforcement Learning with LLMs: Formulation and Practices

Zheng, Chujie, Dang, Kai, Yu, Bowen, Li, Mingze, Jiang, Huiqiang, Lin, Junrong, Liu, Yuqiong, Lin, Hao, Wu, Chencan, Hu, Feng, Yang, An, Zhou, Jingren, Lin, Junyang

arXiv.org Artificial Intelligence

This paper proposes a novel formulation for reinforcement learning (RL) with large language models, explaining why and under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods such as REINFORCE. Specifically, through a first-order approximation, we show that this surrogate becomes increasingly valid only when both the training-inference discrepancy and policy staleness are minimized. This insight provides a principled explanation for the crucial role of several widely adopted techniques in stabilizing RL training, including importance sampling correction, clipping, and particularly Routing Replay for Mixture-of-Experts (MoE) models. Through extensive experiments with a 30B MoE model totaling hundreds of thousands of GPU hours, we show that for on-policy training, the basic policy gradient algorithm with importance sampling correction achieves the highest training stability. When off-policy updates are introduced to accelerate convergence, combining clipping and Routing Replay becomes essential to mitigate the instability caused by policy staleness. Notably, once training is stabilized, prolonged optimization consistently yields comparable final performance regardless of cold-start initialization. We hope that the shared insights and the developed recipes for stable RL training will facilitate future research.
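The techniques the abstract names, importance sampling correction and clipping over a token-level surrogate, can be sketched as a PPO-style clipped objective. This is a generic illustration under my own assumptions (per-token log-probabilities from the training and behavior policies, a sequence-level advantage broadcast to every token), not the paper's exact recipe:

```python
import numpy as np

def clipped_token_pg_loss(logp_train, logp_behavior, advantages, clip_eps=0.2):
    """Token-level surrogate loss with importance correction and clipping.

    logp_train:    per-token log-probs under the current training policy
    logp_behavior: per-token log-probs under the policy that generated the
                   rollout (e.g. the inference engine); their mismatch is
                   the training-inference discrepancy
    advantages:    per-token advantages (a sequence-level reward broadcast
                   to every token of that sequence)
    """
    ratio = np.exp(logp_train - logp_behavior)  # importance weights
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # pessimistic surrogate: clipping limits the update when the training
    # policy has drifted (staleness) from the behavior policy
    return -np.mean(np.minimum(unclipped, clipped))
```

In the fully on-policy case the ratio is 1 and the objective reduces to plain REINFORCE with importance correction; clipping only bites once off-policy staleness pushes the ratio outside the trust interval.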




Lagrangian Relaxation for Multi-Action Partially Observable Restless Bandits: Heuristic Policies and Indexability

Meshram, Rahul, Kaza, Kesav

arXiv.org Artificial Intelligence

Partially observable restless multi-armed bandits have found numerous applications, including recommendation systems, communication systems, public-healthcare outreach systems, and operations research. We study multi-action partially observable restless multi-armed bandits, a generalization of the classical restless multi-armed bandit problem: 1) each bandit has finitely many states, and the current state is not observable; 2) each bandit has finitely many actions, and in particular we assume that more than two actions are available for each bandit. We motivate our problem with the application of public-health intervention planning. We describe the model and formulate a long-term discounted optimization problem, where the state of each bandit evolves according to a Markov process and this evolution is action dependent. The state of a bandit is not observable, but one of finitely many feedback signals is observable. Each bandit yields a reward based on the action taken on that bandit. The agent is assumed to have a budget constraint. The bandits are assumed to be independent; however, they are weakly coupled at the agent through the budget constraint. We first analyze the Lagrangian bound method for our partially observable restless bandits. The computation of optimal value functions for finite-state, finite-action POMDPs is non-trivial; hence, the computation of Lagrangian bounds is also challenging. We describe approximations for the computation of Lagrangian bounds using point-based value iteration (PBVI) and an online rollout policy. We further present various properties of the value functions and provide theoretical insights on PBVI and the online rollout policy. We study heuristic policies for multi-action PORMABs. Finally, we present Whittle index policies and discuss their limitations in our model.
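Since the state of an arm is hidden and only a finite feedback signal is observed, any of the policies above must operate on a belief state. A minimal sketch of the standard Bayes filter for one arm follows; the tensor layout (`P[a, s, s']`, `O[a, s', y]`) is my assumption for illustration and not the paper's notation:

```python
import numpy as np

def belief_update(belief, action, signal, P, O):
    """Bayes filter for one partially observable bandit arm.

    belief: current distribution over the arm's hidden states, shape (S,)
    P:      transition tensor, P[a, s, s'] = Pr(s' | s, a)
    O:      observation tensor, O[a, s', y] = Pr(signal y | s', a)
    """
    predicted = belief @ P[action]             # predict: sum_s P(s'|s,a) b(s)
    unnorm = predicted * O[action][:, signal]  # correct with signal likelihood
    return unnorm / unnorm.sum()               # renormalize to a distribution
```

Both PBVI and an online rollout policy would work on top of such beliefs: PBVI maintains value estimates at a finite set of belief points, while a rollout policy simulates action sequences forward from the current belief using this update.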