Reinforcement Learning
Mildly Conservative Regularized Evaluation for Offline Reinforcement Learning
--Offline reinforcement learning (RL) seeks to learn optimal policies from static datasets without further environment interaction. A key challenge is the distribution shift between the learned and behavior policies, leading to out-of-distribution (OOD) actions and overestimation. T o prevent gross overestimation, the value function must remain conservative; however, excessive conservatism may hinder performance improvement. T o address this, we propose the mildly conservative regularized evaluation (MCRE) framework, which balances conservatism and performance by combining temporal difference (TD) error with a behavior cloning term in the Bellman backup. Building on this, we develop the mildly conservative regularized Q-learning (MCRQ) algorithm, which integrates MCRE into an off-policy actor-critic framework. Experiments show that MCRQ outperforms strong baselines and state-of-the-art offline RL algorithms on benchmark datasets. EINFORCEMENT learning (RL) has achieved success across various domains. In classical online RL, agents learn optimal policies through real-time interaction. However, in real-world settings, continuous interaction is often impractical due to data collection challenges, safety concerns, and high costs.
Latent Policy Barrier: Learning Robust Visuomotor Policies by Staying In-Distribution
Visuomotor policies trained via behavior cloning are vulnerable to covariate shift, where small deviations from expert trajectories can compound into failure. Common strategies to mitigate this issue involve expanding the training distribution through human-in-the-loop corrections or synthetic data augmentation. However, these approaches are often labor-intensive, rely on strong task assumptions, or compromise the quality of imitation. We introduce Latent Policy Barrier, a framework for robust visuomotor policy learning. Inspired by Control Barrier Functions, LPB treats the latent embeddings of expert demonstrations as an implicit barrier separating safe, in-distribution states from unsafe, out-of-distribution (OOD) ones. Our approach decouples the role of precise expert imitation and OOD recovery into two separate modules: a base diffusion policy solely on expert data, and a dynamics model trained on both expert and suboptimal policy rollout data. At inference time, the dynamics model predicts future latent states and optimizes them to stay within the expert distribution. Both simulated and real-world experiments show that LPB improves both policy robustness and data efficiency, enabling reliable manipulation from limited expert data and without additional human correction or annotation.
Adversarial Attacks on Reinforcement Learning-based Medical Questionnaire Systems: Input-level Perturbation Strategies and Medical Constraint Validation
RL-based medical questionnaire systems have shown great potential in medical scenarios. However, their safety and robustness remain unresolved. This study performs a comprehensive evaluation on adversarial attack methods to identify and analyze their potential vulnerabilities. We formulate the diagnosis process as a Markov Decision Process (MDP), where the state is the patient responses and unasked questions, and the action is either to ask a question or to make a diagnosis. We implemented six prevailing major attack methods, including the Fast Gradient Signed Method (FGSM), Projected Gradient Descent (PGD), Carlini & Wagner Attack (C&W) attack, Basic Iterative Method (BIM), DeepFool, and AutoAttack, with seven epsilon values each. To ensure the generated adversarial examples remain clinically plausible, we developed a comprehensive medical validation framework consisting of 247 medical constraints, including physiological bounds, symptom correlations, and conditional medical constraints. We achieved a 97.6% success rate in generating clinically plausible adversarial samples. We performed our experiment on the National Health Interview Survey (NHIS) dataset (https://www.cdc.gov/nchs/nhis/), which consists of 182,630 samples, to predict the participant's 4-year mortality rate. We evaluated our attacks on the AdaptiveFS framework proposed in arXiv:2004.00994. Our results show that adversarial attacks could significantly impact the diagnostic accuracy, with attack success rates ranging from 33.08% (FGSM) to 64.70% (AutoAttack). Our work has demonstrated that even under strict medical constraints on the input, such RL-based medical questionnaire systems still show significant vulnerabilities.
Domain-driven Metrics for Reinforcement Learning: A Case Study on Epidemic Control using Agent-based Simulation
Gaur, Rishabh, Deshkar, Gaurav, Kshirsagar, Jayanta, Hayatnagarkar, Harshal, Venugopalan, Janani
For the development and optimization of agent-based models (ABMs) and rational agent-based models (RABMs), optimization algorithms such as reinforcement learning are extensively used. However, assessing the performance of RL-based ABMs and RABMS models is challenging due to the complexity and stochasticity of the modeled systems, and the lack of well-standardized metrics for comparing RL algorithms. In this study, we are developing domain-driven metrics for RL, while building on state-of-the-art metrics. We demonstrate our ``Domain-driven-RL-metrics'' using policy optimization on a rational ABM disease modeling case study to model masking behavior, vaccination, and lockdown in a pandemic. Our results show the use of domain-driven rewards in conjunction with traditional and state-of-the-art metrics for a few different simulation scenarios such as the differential availability of masks.
HALO: Hindsight-Augmented Learning for Online Auto-Bidding
Dong, Pusen, Cao, Chenglong, Zhou, Xinyu, You, Jirong, Xu, Linhe, Xu, Feifan, Yuan, Shuo
Digital advertising platforms operate millisecond-level auctions through Real-Time Bidding (RTB) systems, where advertisers compete for ad impressions through algorithmic bids. This dynamic mechanism enables precise audience targeting but introduces profound operational complexity due to advertiser heterogeneity: budgets and ROI targets span orders of magnitude across advertisers, from individual merchants to multinational brands. This diversity creates a demanding adaptation landscape for Multi-Constraint Bidding (MCB). Traditional auto-bidding solutions fail in this environment due to two critical flaws: 1) severe sample inefficiency, where failed explorations under specific constraints yield no transferable knowledge for new budget-ROI combinations, and 2) limited generalization under constraint shifts, as they ignore physical relationships between constraints and bidding coefficients. To address this, we propose HALO: Hindsight-Augmented Learning for Online Auto-Bidding. HALO introduces a theoretically grounded hindsight mechanism that re-purposes all explorations into training data for arbitrary constraint configuration via trajectory reorientation. Further, it employs B-spline functional representation, enabling continuous, derivative-aware bid mapping across constraint spaces. HALO ensures robust adaptation even when budget/ROI requirements differ drastically from training scenarios. Industrial dataset evaluations demonstrate the superiority of HALO in handling multi-scale constraints, reducing constraint violations while improving GMV .
SuperRL: Reinforcement Learning with Supervision to Boost Language Model Reasoning
Liu, Yihao, Li, Shuocheng, Cao, Lang, Xie, Yuhang, Zhou, Mengyu, Dong, Haoyu, Ma, Xiaojun, Han, Shi, Zhang, Dongmei
Large language models are increasingly used for complex reasoning tasks where high-quality offline data such as expert-annotated solutions and distilled reasoning traces are often available. However, in environments with sparse rewards, reinforcement learning struggles to sample successful trajectories, leading to inefficient learning. At the same time, these offline trajectories that represent correct reasoning paths are not utilized by standard on-policy reinforcement learning methods. We introduce SuperRL, a unified training framework that adaptively alternates between RL and SFT. Whenever every rollout for a given instance receives zero reward, indicating the absence of a learning signal, SuperRL falls back to SFT on the curated offline data. Extensive experiments across diverse reasoning benchmarks show that SuperRL surpasses vanilla RL by delivering higher sample efficiency, stronger generalization, and improved robustness under sparse rewards.
Periodic evaluation of defined-contribution pension fund: A dynamic risk measure approach
He, Wanting, Li, Wenyuan, Wei, Yunran
This paper introduces an innovative framework for the periodic evaluation of defined-contribution pension funds. The performance of the pension fund is evaluated not only at retirement, but also within the interim periods. In contrast to the traditional literature, we set the dynamic risk measure as the criterion and manage the tail risk of the pension fund dynamically. To effectively interact with the stochastic environment, a model-free reinforcement learning algorithm is proposed to search for optimal investment and insurance strategies. Using U.S. data, we calibrate pension members' mortality rates and enhance mortality projections through a Lee-Carter model. Our numerical results indicate that periodic evaluations lead to more risk-averse strategies, while mortality improvements encourage more risk-seeking behaviors.
High-Order Error Bounds for Markovian LSA with Richardson-Romberg Extrapolation
Levin, Ilya, Naumov, Alexey, Samsonov, Sergey
In this paper, we study the bias and high-order error bounds of the Linear Stochastic Approximation (LSA) algorithm with Polyak-Ruppert (PR) averaging under Markovian noise. We focus on the version of the algorithm with constant step size $α$ and propose a novel decomposition of the bias via a linearization technique. We analyze the structure of the bias and show that the leading-order term is linear in $α$ and cannot be eliminated by PR averaging. To address this, we apply the Richardson-Romberg (RR) extrapolation procedure, which effectively cancels the leading bias term. We derive high-order moment bounds for the RR iterates and show that the leading error term aligns with the asymptotically optimal covariance matrix of the vanilla averaged LSA iterates.
Towards Generalizable Safety in Crowd Navigation via Conformal Uncertainty Handling
Yao, Jianpeng, Zhang, Xiaopan, Xia, Yu, Wang, Zejin, Roy-Chowdhury, Amit K., Li, Jiachen
Mobile robots navigating in crowds trained using reinforcement learning are known to suffer performance degradation when faced with out-of-distribution scenarios. We propose that by properly accounting for the uncertainties of pedestrians, a robot can learn safe navigation policies that are robust to distribution shifts. Our method augments agent observations with prediction uncertainty estimates generated by adaptive conformal inference, and it uses these estimates to guide the agent's behavior through constrained reinforcement learning. The system helps regulate the agent's actions and enables it to adapt to distribution shifts. In the in-distribution setting, our approach achieves a 96.93% success rate, which is over 8.80% higher than the previous state-of-the-art baselines with over 3.72 times fewer collisions and 2.43 times fewer intrusions into ground-truth human future trajectories. In three out-of-distribution scenarios, our method shows much stronger robustness when facing distribution shifts in velocity variations, policy changes, and transitions from individual to group dynamics. We deploy our method on a real robot, and experiments show that the robot makes safe and robust decisions when interacting with both sparse and dense crowds. Our code and videos are available on https://gen-safe-nav.github.io/.