Mildly Conservative Regularized Evaluation for Offline Reinforcement Learning
Offline reinforcement learning (RL) seeks to learn optimal policies from static datasets without further environment interaction. A key challenge is the distribution shift between the learned policy and the behavior policy, which leads to out-of-distribution (OOD) actions and value overestimation. To prevent gross overestimation, the value function must remain conservative; however, excessive conservatism can hinder performance improvement. To address this, we propose the mildly conservative regularized evaluation (MCRE) framework, which balances conservatism and performance by combining the temporal difference (TD) error with a behavior cloning term in the Bellman backup. Building on this, we develop the mildly conservative regularized Q-learning (MCRQ) algorithm, which integrates MCRE into an off-policy actor-critic framework. Experiments show that MCRQ outperforms strong baselines and state-of-the-art offline RL algorithms on benchmark datasets.

Reinforcement learning (RL) has achieved success across various domains. In classical online RL, agents learn optimal policies through real-time interaction with the environment. However, in real-world settings, continuous interaction is often impractical due to data collection challenges, safety concerns, and high costs.
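The abstract only sketches the idea, so below is a minimal PyTorch sketch of what a critic update that combines a TD error with a behavior-cloning term inside the Bellman backup could look like. The function name `mcre_critic_loss`, the blend weight `eta`, the squared-error form of the BC penalty, and the assumption that the dataset stores next actions are all illustrative assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def mcre_critic_loss(critic, target_critic, actor, batch,
                     gamma=0.99, eta=0.1):
    """Hypothetical sketch of a mildly conservative regularized
    evaluation: the Bellman target is the usual TD bootstrap minus a
    behavior-cloning penalty that shrinks the value of policy actions
    far from the dataset's actions. `eta` and the squared-error BC
    form are illustrative assumptions, not taken from the paper.
    """
    # batch of offline transitions; assumes next actions are stored
    s, a, r, s2, a2_data, done = batch
    with torch.no_grad():
        a2_pi = actor(s2)  # policy's action at the next state
        # BC term: squared distance from the dataset's next action
        bc_penalty = ((a2_pi - a2_data) ** 2).sum(-1, keepdim=True)
        # mildly conservative backup: the bootstrapped value is reduced
        # in proportion to how far the policy strays from the data
        target = r + gamma * (1.0 - done) * (
            target_critic(s2, a2_pi) - eta * bc_penalty
        )
    q = critic(s, a)
    # standard TD regression toward the regularized target
    return F.mse_loss(q, target)
```

One way to read this design choice: placing the BC penalty inside the bootstrap target (rather than in the actor loss, as in TD3+BC-style methods) makes the evaluation itself mildly conservative, since policy actions that stray from the data receive discounted value estimates while in-distribution actions are backed up nearly unchanged.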
arXiv.org Artificial Intelligence
Aug-11-2025