Reinforcement Learning
Human-in-the-Loop Bandwidth Estimation for Quality of Experience Optimization in Real-Time Video Communication
Khairy, Sami, Mittag, Gabriel, Gopal, Vishak, Cutler, Ross
The quality of experience (QoE) delivered by video conferencing systems is significantly influenced by accurately estimating the time-varying available bandwidth between the sender and receiver. Bandwidth estimation for real-time communications remains an open challenge due to rapidly evolving network architectures, increasingly complex protocol stacks, and the difficulty of defining QoE metrics that reliably improve user experience. In this work, we propose a deployed, human-in-the-loop, data-driven framework for bandwidth estimation to address these challenges. Our approach begins with training objective QoE reward models derived from subjective user evaluations to measure audio and video quality in real-time video conferencing systems. Subsequently, we collect roughly $1$M network traces with objective QoE rewards from real-world Microsoft Teams calls to curate a bandwidth estimation training dataset. We then introduce a novel distributional offline reinforcement learning (RL) algorithm to train a neural-network-based bandwidth estimator aimed at improving QoE for users. Our real-world A/B test demonstrates that the proposed approach reduces the subjective poor call ratio by $11.41\%$ compared to the baseline bandwidth estimator. Furthermore, the proposed offline RL algorithm is benchmarked on D4RL tasks to demonstrate its generalization beyond bandwidth estimation.
Rethinking the Role of Dynamic Sparse Training for Scalable Deep Reinforcement Learning
Ma, Guozheng, Li, Lu, Wang, Zilin, Wang, Haoyu, Hu, Shengchao, Rutkowski, Leszek, Tao, Dacheng
Scaling neural networks has driven breakthrough advances in machine learning, yet this paradigm fails in deep reinforcement learning (DRL), where larger models often degrade performance due to unique optimization pathologies such as plasticity loss. While recent works show that dynamically adapting network topology during training can mitigate these issues, existing studies have three critical limitations: (1) applying uniform dynamic training strategies across all modules despite encoder, critic, and actor following distinct learning paradigms, (2) focusing evaluation on basic architectures without clarifying the relative importance and interaction between dynamic training and architectural improvements, and (3) lacking systematic comparison between different dynamic approaches including sparse-to-sparse, dense-to-sparse, and sparse-to-dense. Through comprehensive investigation across modules and architectures, we reveal that dynamic sparse training strategies provide module-specific benefits that complement the primary scalability foundation established by architectural improvements. We finally distill these insights into Module-Specific Training (MST), a practical framework that further exploits the benefits of architectural improvements and demonstrates substantial scalability gains across diverse RL algorithms without algorithmic modifications.
Efficient Restarts in Non-Stationary Model-Free Reinforcement Learning
Nonaka, Hiroshi, Ambrozak, Simon, Miskala-Dinc, Sofia R., Ercole, Amedeo, Prins, Aviva
In this work, we propose three efficient restart paradigms for model-free non-stationary reinforcement learning (RL). We identify two core issues with the restart design of Mao et al. (2022)'s RestartQ-UCB algorithm: (1) complete forgetting, where all the information learned about an environment is lost after a restart, and (2) scheduled restarts, in which restarts occur only at predefined timings, regardless of the incompatibility of the policy with the current environment dynamics. We introduce three approaches, which we call partial, adaptive, and selective restarts to modify the algorithms RestartQ-UCB and RANDOMIZEDQ (Wang et al., 2025). We find near-optimal empirical performance in multiple different environments, decreasing dynamic regret by up to $91$% relative to RestartQ-UCB.
Robust Adversarial Reinforcement Learning in Stochastic Games via Sequence Modeling
Tang, Xiaohang, Cheng, Zhuowen, Kumar, Satyabrat
The Transformer, a highly expressive architecture for sequence modeling, has recently been adapted to solve sequential decision-making, most notably through the Decision Transformer (DT), which learns policies by conditioning on desired returns. Yet, the adversarial robustness of reinforcement learning methods based on sequence modeling remains largely unexplored. Here we introduce the Conservative Adversarially Robust Decision Transformer (CART), to our knowledge the first framework designed to enhance the robustness of DT in adversarial stochastic games. We formulate the interaction between the protagonist and the adversary at each stage as a stage game, where the payoff is defined as the expected maximum value over subsequent states, thereby explicitly incorporating stochastic state transitions. By conditioning Transformer policies on the NashQ value derived from these stage games, CART generates policy that are simultaneously less exploitable (adversarially robust) and conservative to transition uncertainty. Empirically, CART achieves more accurate minimax value estimation and consistently attains superior worst-case returns across a range of adversarial stochastic games.
GAR: Generative Adversarial Reinforcement Learning for Formal Theorem Proving
Wang, Ruida, Yao, Jiarui, Pan, Rui, Diao, Shizhe, Zhang, Tong
Solving math problems through verifiable languages such as Lean has significantly impacted both the mathematics and computer science communities. Current state-of-the-art models are often trained with expensive online Reinforcement Learning (RL) or expert iteration. However, these approaches rely on fixed problem sets, which causes inefficient training and limits the model to tackle complex problems. To overcome these limitations, we propose GAR: Generative Adversarial Reinforcement learning, a comprehensive RL training framework that jointly trains the problem composer and solver in an adversarial loop. GAR introduces an implicit curriculum learning mechanism, which aligns task difficulty with the prover's evolving capability. It thereby improves the training efficiency and enables stronger performance of proving advanced theorems. Experiments show that with GAR training, Goedel-Prover-V2-8B and DeepSeek-Prover-V2-7B achieve an average relative improvement in pass@32 of 4.20% on MiniF2F-Test benchmark, while DeepSeek-Prover-V2's pass@32 on ProofNet-Test increases from 22.58% to 25.81%. Beyond formal proving, GAR establishes a general RL paradigm for co-evolution of problem generation and solving under verifiable environments.
Autonomous vehicles need social awareness to find optima in multi-agent reinforcement learning routing games
Psarou, Anastasia, Gorczyca, ลukasz, Gaweล, Dominik, Kucharski, Rafaล
Previous work has shown that when multiple selfish Autonomous Vehicles (AVs) are introduced to future cities and start learning optimal routing strategies using Multi-Agent Reinforcement Learning (MARL), they may destabilize traffic systems, as they would require a significant amount of time to converge to the optimal solution, equivalent to years of real-world commuting. We demonstrate that moving beyond the selfish component in the reward significantly relieves this issue. If each AV, apart from minimizing its own travel time, aims to reduce its impact on the system, this will be beneficial not only for the system-wide performance but also for each individual player in this routing game. By introducing an intrinsic reward signal based on the marginal cost matrix, we significantly reduce training time and achieve convergence more reliably. Marginal cost quantifies the impact of each individual action (route-choice) on the system (total travel time). Including it as one of the components of the reward can reduce the degree of non-stationarity by aligning agents' objectives. Notably, the proposed counterfactual formulation preserves the system's equilibria and avoids oscillations. Our experiments show that training MARL algorithms with our novel reward formulation enables the agents to converge to the optimal solution, whereas the baseline algorithms fail to do so. We show these effects in both a toy network and the real-world network of Saint-Arnoult. Our results optimistically indicate that social awareness (i.e., including marginal costs in routing decisions) improves both the system-wide and individual performance of future urban systems with AVs.
Towards Safe Maneuvering of Double-Ackermann-Steering Robots with a Soft Actor-Critic Framework
Deflesselle, Kohio, Daniel, Mรฉlodie, Magassouba, Aly, Aranda, Miguel, Ly, Olivier
We present a deep reinforcement learning framework based on Soft Actor-Critic (SAC) for safe and precise maneuvering of double-Ackermann-steering mobile robots (DASMRs). Unlike holonomic or simpler non-holonomic robots such as differential-drive robots, DASMRs face strong kinematic constraints that make classical planners brittle in cluttered environments. Our framework leverages the Hindsight Experience Replay (HER) and the CrossQ overlay to encourage maneuvering efficiency while avoiding obstacles. Simulation results with a heavy four-wheel-steering rover show that the learned policy can robustly reach up to 97% of target positions while avoiding obstacles. Our framework does not rely on handcrafted trajectories or expert demonstrations.
Near-Optimal Second-Order Guarantees for Model-Based Adversarial Imitation Learning
Li, Shangzhe, Zhou, Dongruo, Zhang, Weitong
We study online adversarial imitation learning (AIL), where an agent learns from offline expert demonstrations and interacts with the environment online without access to rewards. Despite strong empirical results, the benefits of online interaction and the impact of stochasticity remain poorly understood. We address these gaps by introducing a model-based AIL algorithm (MB-AIL) and establish its horizon-free, second-order sample-complexity guarantees under general function approximations for both expert data and reward-free interactions. These second-order bounds provide an instance-dependent result that can scale with the variance of returns under the relevant policies and therefore tighten as the system approaches determinism. Together with second-order, information-theoretic lower bounds on a newly constructed hard-instance family, we show that MB-AIL attains minimax-optimal sample complexity for online interaction (up to logarithmic factors) with limited expert demonstrations and matches the lower bound for expert demonstrations in terms of the dependence on horizon $H$, precision $ฮต$ and the policy variance $ฯ^2$. Experiments further validate our theoretical findings and demonstrate that a practical implementation of MB-AIL matches or surpasses the sample efficiency of existing methods.
DSPO: Stable and Efficient Policy Optimization for Agentic Search and Reasoning
Gu, Chenyang, Pu, Yewen, Yang, Bruce, Li, Xiaofan, Gao, Huan
Enhancing LLMs with the ability to actively search external knowledge is crucial for complex and real-world tasks. Current approaches either rely on prompting to elicit the model's innate agent capabilities, or suffer from performance ceilings and collapse when applying RL to complex interactive tasks, leaving their true agentic potential untapped. To address this, we introduce \textbf{D}ynamic-filter \textbf{S}equence-level \textbf{P}olicy \textbf{O}ptimization (DSPO), an improved RL algorithm designed for robust agent training through sequence-level optimization and dynamic sample filtering. We train our model purely through RL to interleave multi-turn search and reasoning, obviating the need for supervised demonstration data. Across multiple QA benchmarks, our 7B model improves over a comparable previous work by \textbf{34.1\%}, and even outperforms the 14B model from previous work in complex multihop QA such as HotpotQA by nearly \textbf{9\% relative}, maintaining exceptional training stability.
Agent Learning via Early Experience
Zhang, Kai, Chen, Xiangchao, Liu, Bo, Xue, Tianci, Liao, Zeyi, Liu, Zhihan, Wang, Xiyao, Ning, Yuting, Chen, Zhaorun, Fu, Xiaohan, Xie, Jian, Sun, Yuxuan, Gou, Boyu, Qi, Qi, Meng, Zihang, Yang, Jianwei, Zhang, Ning, Li, Xian, Shah, Ashish, Huynh, Dat, Li, Hengduo, Yang, Zi, Cao, Sara, Jang, Lawrence, Zhou, Shuyan, Zhu, Jiacheng, Sun, Huan, Weston, Jason, Su, Yu, Wu, Yifan
A long-term goal of language agents is to learn and improve through their own experience, ultimately outperforming humans in complex, real-world tasks. However, training agents from experience data with reinforcement learning remains difficult in many environments, which either lack verifiable rewards (e.g., websites) or require inefficient long-horizon rollouts (e.g., multi-turn tool use). As a result, most current agents rely on supervised fine-tuning on expert data, which is challenging to scale and generalizes poorly. This limitation stems from the nature of expert demonstrations: they capture only a narrow range of scenarios and expose the agent to limited environment diversity. We address this limitation with a middle-ground paradigm we call early experience: interaction data generated by the agent's own actions, where the resulting future states serve as supervision without reward signals. Within this paradigm we study two strategies of using such data: (1) Implicit world modeling, which uses collected states to ground the policy in environment dynamics; and (2) Self-reflection, where the agent learns from its suboptimal actions to improve reasoning and decision-making. We evaluate across eight diverse environments and multiple model families. Our approaches consistently improve effectiveness and out-of-domain generalization, highlighting the value of early experience. Moreover, in environments with verifiable rewards, our results provide promising signals that early experience offers a strong foundation for subsequent reinforcement learning, positioning it as a practical bridge between imitation learning and fully experience-driven agents.