Reinforcement Learning
Implicit Updates for Average-Reward Temporal Difference Learning
Kim, Hwanwoo, Cho, Dongkyu Derek, Laber, Eric
Temporal difference (TD) learning is a cornerstone of reinforcement learning. In the average-reward setting, standard TD($ฮป$) is highly sensitive to the choice of step-size and thus requires careful tuning to maintain numerical stability. We introduce average-reward implicit TD($ฮป$), which employs an implicit fixed point update to provide data-adaptive stabilization while preserving the per iteration computational complexity of standard average-reward TD($ฮป$). In contrast to prior finite-time analyses of average-reward TD($ฮป$), which impose restrictive step-size conditions, we establish finite-time error bounds for the implicit variant under substantially weaker step-size requirements. Empirically, average-reward implicit TD($ฮป$) operates reliably over a much broader range of step-sizes and exhibits markedly improved numerical stability. This enables more efficient policy evaluation and policy learning, highlighting its effectiveness as a robust alternative to average-reward TD($ฮป$).
Prior-Aligned Meta-RL: Thompson Sampling with Learned Priors and Guarantees in Finite-Horizon MDPs
Zhou, Runlin, Chen, Chixiang, Chen, Elynn
We study meta-reinforcement learning in finite-horizon MDPs where related tasks share similar structures in their optimal action-value functions. Specifically, we posit a linear representation $Q^*_h(s,a)=ฮฆ_h(s,a)\,ฮธ^{(k)}_h$ and place a Gaussian meta-prior $ \mathcal{N}(ฮธ^*_h,ฮฃ^*_h)$ over the task-specific parameters $ฮธ^{(k)}_h$. Building on randomized value functions, we propose two Thompson-style algorithms: (i) MTSRL, which learns only the prior mean and performs posterior sampling with the learned mean and known covariance; and (ii) $\text{MTSRL}^{+}$, which additionally estimates the covariance and employs prior widening to control finite-sample estimation error. Further, we develop a prior-alignment technique that couples the posterior under the learned prior with a meta-oracle that knows the true prior, yielding meta-regret guarantees: we match prior-independent Thompson sampling in the small-task regime and strictly improve with more tasks once the prior is learned. Concretely, for known covariance we obtain $\tilde{O}(H^{4}S^{3/2}\sqrt{ANK})$ meta-regret, and with learned covariance $\tilde{O}(H^{4}S^{3/2}\sqrt{AN^3K})$; both recover a better behavior than prior-independent after $K \gtrsim \tilde{O}(H^2)$ and $K \gtrsim \tilde{O}(N^2H^2)$, respectively. Simulations on a stateful recommendation environment (with feature and prior misspecification) show that after brief exploration, MTSRL/MTSRL\(^+\) track the meta-oracle and substantially outperform prior-independent RL and bandit-only meta-baselines. Our results give the first meta-regret guarantees for Thompson-style RL with learned Q-priors, and provide practical recipes (warm-start via RLSVI, OLS aggregation, covariance widening) for experiment-rich settings.
Adaptive Reinforcement Learning for Dynamic Configuration Allocation in Pre-Production Testing
Ensuring reliability in modern software systems requires rigorous pre-production testing across highly heterogeneous and evolving environments. Because exhaustive evaluation is infeasible, practitioners must decide how to allocate limited testing resources across configurations where failure probabilities may drift over time. Existing combinatorial optimization approaches are static, ad hoc, and poorly suited to such non-stationary settings. We introduce a novel reinforcement learning (RL) framework that recasts configuration allocation as a sequential decision-making problem. Our method is the first to integrate Q-learning with a hybrid reward design that fuses simulated outcomes and real-time feedback, enabling both sample efficiency and robustness. In addition, we develop an adaptive online-offline training scheme that allows the agent to quickly track abrupt probability shifts while maintaining long-run stability. Extensive simulation studies demonstrate that our approach consistently outperforms static and optimization-based baselines, approaching oracle performance. This work establishes RL as a powerful new paradigm for adaptive configuration allocation, advancing beyond traditional methods and offering broad applicability to dynamic testing and resource scheduling domains.
Learning to Crawl: Latent Model-Based Reinforcement Learning for Soft Robotic Adaptive Locomotion
Gzenda, Vaughn, Chhabra, Robin
Soft robotic crawlers are mobile robots that utilize soft body deformability and compliance to achieve locomotion through surface contact. Designing control strategies for such systems is challenging due to model inaccuracies, sensor noise, and the need to discover locomotor gaits. In this work, we present a model-based reinforcement learning (MB-RL) framework in which latent dynamics inferred from onboard sensors serve as a predictive model that guides an actor-critic algorithm to optimize locomotor policies. We evaluate the framework on a minimal crawler model in simulation using inertial measurement units and time-of-flight sensors as observations. The learned latent dynamics enable short-horizon motion prediction while the actor-critic discovers effective locomotor policies. This approach highlights the potential of latent-dynamics MB-RL for enabling embodied soft robotic adaptive locomotion based solely on noisy sensor feedback.
Deep Reinforcement Learning for Urban Air Quality Management: Multi-Objective Optimization of Pollution Mitigation Booth Placement in Metropolitan Environments
Rajesh, Kirtan, Kumar, Suvidha Rupesh
This is the preprint version of the article published in IEEE Access vol. 13, pp. 146503--146526, 2025, doi:10.1109/ACCESS.2025.3599541. Please cite the published version. Urban air pollution remains a pressing global concern, particularly in densely populated and traffic-intensive metropolitan areas like Delhi, where exposure to harmful pollutants severely impacts public health. Delhi, being one of the most polluted cities globally, experiences chronic air quality issues due to vehicular emissions, industrial activities, and construction dust, which exacerbate its already fragile atmospheric conditions. Traditional pollution mitigation strategies, such as static air purifying installations, often fail to maximize their impact due to suboptimal placement and limited adaptability to dynamic urban environments. This study presents a novel deep reinforcement learning (DRL) framework to optimize the placement of air purification booths to improve the air quality index (AQI) in the city of Delhi. We employ Proximal Policy Optimization (PPO), a state-of-the-art reinforcement learning algorithm, to iteratively learn and identify high-impact locations based on multiple spatial and environmental factors, including population density, traffic patterns, industrial influence, and green space constraints. Our approach is benchmarked against conventional placement strategies, including random and greedy AQI-based methods, using multi-dimensional performance evaluation metrics such as AQI improvement, spatial coverage, population and traffic impact, and spatial entropy.
Multi-Task Reinforcement Learning with Language-Encoded Gated Policy Networks
Multi-task reinforcement learning often relies on task metadata -- such as brief natural-language descriptions -- to guide behavior across diverse objectives. We present Lexical Policy Networks (LEXPOL), a language-conditioned mixture-of-policies architecture for multi-task RL. LEXPOL encodes task metadata with a text encoder and uses a learned gating module to select or blend among multiple sub-policies, enabling end-to-end training across tasks. On MetaWorld benchmarks, LEXPOL matches or exceeds strong multi-task baselines in success rate and sample efficiency, without task-specific retraining. To analyze the mechanism, we further study settings with fixed expert policies obtained independently of the gate and show that the learned language gate composes these experts to produce behaviors appropriate to novel task descriptions and unseen task combinations. These results indicate that natural-language metadata can effectively index and recombine reusable skills within a single policy.
Medical Vision Language Models as Policies for Robotic Surgery
Muppidi, Akshay, Radfar, Martin
Abstract--Vision-based Proximal Policy Optimization (PPO) struggles with visual observation-based robotic laparoscopic surgical tasks due to the high-dimensional nature of visual input, the sparsity of rewards in surgical environments, and the difficulty of extracting task-relevant features from raw visual data. We introduce a simple approach integrating MedFlamingo, a medical domain-specific Vision-Language Model, with PPO. Our method is evaluated on five diverse laparoscopic surgery task environments in LapGym, using only endoscopic visual observations. MedFlamingo PPO outperforms and converges faster compared to both standard vision-based PPO and OpenFlamingo PPO baselines, achieving task success rates exceeding 70% across all environments, with improvements ranging from 66.67% to 1114.29% compared to baseline. By processing task observations and instructions once per episode to generate high-level planning tokens, our method efficiently combines medical expertise with real-time visual feedback. Our results highlight the value of specialized medical knowledge in robotic surgical planning and decision-making.
VideoMiner: Iteratively Grounding Key Frames of Hour-Long Videos via Tree-based Group Relative Policy Optimization
Cao, Xinye, Guo, Hongcan, Qian, Jiawen, Nan, Guoshun, Wang, Chao, Pan, Yuqi, Hou, Tianhao, Wang, Xiaojuan, Gao, Yutong
Understanding hour-long videos with multi-modal large language models (MM-LLMs) enriches the landscape of human-centered AI applications. However, for end-to-end video understanding with LLMs, uniformly sampling video frames results in LLMs being overwhelmed by a vast amount of irrelevant information as video length increases. Existing hierarchical key frame extraction methods improve the accuracy of video understanding but still face two critical challenges. 1) How can the interference of extensive redundant information in long videos be mitigated? 2) How can a model dynamically adapt to complex hierarchical structures while accurately identifying key frames? To address these issues, we propose VideoMiner, which iteratively segments, captions, and clusters long videos, forming a hierarchical tree structure. The proposed VideoMiner progresses from long videos to events to frames while preserving temporal coherence, effectively addressing the first challenge. To precisely locate key frames, we introduce T-GRPO, a tree-based group relative policy optimization in reinforcement learning method that guides the exploration of the VideoMiner. The proposed T-GRPO is specifically designed for tree structures, integrating spatiotemporal information at the event level while being guided by the question, thus solving the second challenge. We achieve superior performance in all long-video understanding tasks and uncover several interesting insights. Our proposed T-GRPO surprisingly incentivizes the model to spontaneously generate a reasoning chain. Additionally, the designed tree growth auxin dynamically adjusts the expansion depth, obtaining accuracy and efficiency gains. The code is publicly available at https://github.com/caoxinye/VideoMiner.
From Learning to Mastery: Achieving Safe and Efficient Real-World Autonomous Driving with Human-In-The-Loop Reinforcement Learning
Zeqiao, Li, Yijing, Wang, Haoyu, Wang, Zheng, Li, Peng, Li, Wenfei, Liu, Zhiqiang, Zuo
Autonomous driving with reinforcement learning (RL) has significant potential. However, applying RL in real-world settings remains challenging due to the need for safe, efficient, and robust learning. Incorporating human expertise into the learning process can help overcome these challenges by reducing risky exploration and improving sample efficiency. In this work, we propose a reward-free, active human-in-the-loop learning method called Human-Guided Distributional Soft Actor-Critic (H-DSAC). Our method combines Proxy Value Propagation (PVP) and Distributional Soft Actor-Critic (DSAC) to enable efficient and safe training in real-world environments. The key innovation is the construction of a distributed proxy value function within the DSAC framework. This function encodes human intent by assigning higher expected returns to expert demonstrations and penalizing actions that require human intervention. By extrapolating these labels to unlabeled states, the policy is effectively guided toward expert-like behavior. With a well-designed state space, our method achieves real-world driving policy learning within practical training times. Results from both simulation and real-world experiments demonstrate that our framework enables safe, robust, and sample-efficient learning for autonomous driving.