Mildly Conservative Regularized Evaluation for Offline Reinforcement Learning
Offline reinforcement learning (RL) seeks to learn optimal policies from static datasets without further environment interaction. A key challenge is the distribution shift between the learned policy and the behavior policy, which leads to out-of-distribution (OOD) actions and value overestimation. To prevent gross overestimation, the value function must remain conservative; however, excessive conservatism can hinder performance improvement. To address this, we propose the mildly conservative regularized evaluation (MCRE) framework, which balances conservatism and performance by combining the temporal difference (TD) error with a behavior cloning term in the Bellman backup. Building on this, we develop the mildly conservative regularized Q-learning (MCRQ) algorithm, which integrates MCRE into an off-policy actor-critic framework. Experiments show that MCRQ outperforms strong baselines and state-of-the-art offline RL algorithms on benchmark datasets.

Reinforcement learning (RL) has achieved success across various domains. In classical online RL, agents learn optimal policies through real-time interaction with the environment. However, in real-world settings, continuous interaction is often impractical due to data collection challenges, safety concerns, and high costs.
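The abstract only sketches the idea, so below is a minimal PyTorch sketch of what a critic update that combines a TD error with a behavior-cloning term inside the Bellman backup could look like. The function name `mcre_critic_loss`, the blend weight `eta`, the squared-error form of the BC penalty, and the assumption that the dataset stores next actions are all illustrative assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def mcre_critic_loss(critic, target_critic, actor, batch,
                     gamma=0.99, eta=0.1):
    """Hypothetical sketch of a mildly conservative regularized
    evaluation: the Bellman target is the usual TD bootstrap minus a
    behavior-cloning penalty that shrinks the value of policy actions
    far from the dataset's actions. `eta` and the squared-error BC
    form are illustrative assumptions, not taken from the paper.
    """
    # batch of offline transitions; assumes next actions are stored
    s, a, r, s2, a2_data, done = batch
    with torch.no_grad():
        a2_pi = actor(s2)  # policy's action at the next state
        # BC term: squared distance from the dataset's next action
        bc_penalty = ((a2_pi - a2_data) ** 2).sum(-1, keepdim=True)
        # mildly conservative backup: the bootstrapped value is reduced
        # in proportion to how far the policy strays from the data
        target = r + gamma * (1.0 - done) * (
            target_critic(s2, a2_pi) - eta * bc_penalty
        )
    q = critic(s, a)
    # standard TD regression toward the regularized target
    return F.mse_loss(q, target)
```

One way to read this design choice: placing the BC penalty inside the bootstrap target (rather than in the actor loss, as in TD3+BC-style methods) makes the evaluation itself mildly conservative, since policy actions that stray from the data receive discounted value estimates while in-distribution actions are backed up nearly unchanged.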
arXiv.org Artificial Intelligence
Aug-11-2025