Penalizing Infeasible Actions and Reward Scaling in Reinforcement Learning with Offline Data

Kim, Jeonghye, Shin, Yongjae, Jung, Whiyoung, Hong, Sunghoon, Yoon, Deunsol, Sung, Youngchul, Lee, Kanghoon, Lim, Woohyung

Aug-20-2025–arXiv.org Artificial Intelligence

Reinforcement learning with offline data suffers from Q-value extrapolation errors. We first demonstrate that linear extrapolation of the Q-function beyond the data range is particularly problematic. To mitigate it, we propose guiding Q-values to decrease gradually outside the data range, which is achieved through reward scaling with layer normalization (RS-LN) and a penalization mechanism for infeasible actions (PA). Combining RS-LN and PA yields a new algorithm, PARS. We evaluate PARS across a range of tasks on the D4RL benchmark, where it outperforms state-of-the-art algorithms in both offline training and online fine-tuning, with notable success in the challenging AntMaze Ultra task.
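As a rough illustration of how the two ingredients named in the abstract could enter a critic loss, the sketch below combines a TD term on scaled rewards with a penalty that regresses Q toward a low value on actions sampled outside the feasible range. It is a minimal sketch, not the authors' implementation: the tiny network, the names `reward_scale`, `q_min`, `pa_weight`, `guard`, and `width`, their default values, and the way infeasible actions are sampled are all assumptions made for this example.

```python
import math
import random


def layer_norm(h, eps=1e-5):
    """Normalize one hidden vector to zero mean and unit variance."""
    mean = sum(h) / len(h)
    var = sum((x - mean) ** 2 for x in h) / len(h)
    return [(x - mean) / math.sqrt(var + eps) for x in h]


def init_params(rng, in_dim, hidden_dim):
    """Random weights for a one-hidden-layer critic (illustrative only)."""
    return {
        "W1": [[rng.gauss(0, 0.5) for _ in range(in_dim)] for _ in range(hidden_dim)],
        "b1": [0.0] * hidden_dim,
        "w2": [rng.gauss(0, 0.5) for _ in range(hidden_dim)],
        "b2": 0.0,
    }


def q_value(params, state, action):
    """Critic forward pass: linear -> LayerNorm -> ReLU -> linear."""
    x = list(state) + list(action)
    pre = [sum(w * xi for w, xi in zip(row, x)) + b
           for row, b in zip(params["W1"], params["b1"])]
    hidden = [max(v, 0.0) for v in layer_norm(pre)]
    return sum(w * h for w, h in zip(params["w2"], hidden)) + params["b2"]


def sample_infeasible_action(rng, dim, limit=1.0, guard=1.0, width=1.0):
    """Draw an action with every coordinate outside [-limit, limit].

    The guard gap and band width are assumptions of this sketch.
    """
    lo = limit + guard
    return [rng.choice((-1.0, 1.0)) * rng.uniform(lo, lo + width) for _ in range(dim)]


def critic_loss(params, batch, rng, reward_scale=10.0, gamma=0.99,
                q_min=0.0, pa_weight=0.1):
    """Return (total, td, pa): TD loss on scaled rewards plus infeasible-action penalty.

    Each batch item is (state, action, reward, next_q, done), where next_q is a
    precomputed bootstrap value (e.g. from a target network, not modeled here).
    """
    td, pa = 0.0, 0.0
    for state, action, reward, next_q, done in batch:
        # Reward scaling: the reward is multiplied before forming the TD target.
        target = reward_scale * reward + gamma * (1.0 - done) * next_q
        td += (q_value(params, state, action) - target) ** 2
        # Penalization: pull Q toward q_min on an action outside the feasible box.
        bad = sample_infeasible_action(rng, len(action))
        pa += (q_value(params, state, bad) - q_min) ** 2
    n = len(batch)
    return td / n + pa_weight * pa / n, td / n, pa / n
```

A call might look like `critic_loss(init_params(random.Random(0), 3, 4), batch, random.Random(1))` with a batch of `(state, action, reward, next_q, done)` tuples. In practice the loss would be minimized with an autodiff framework; the pure-Python version here only evaluates it.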

artificial intelligence, machine learning, reinforcement learning

Genre:
- Research Report > New Finding (0.92)

Technology:
- Information Technology > Artificial Intelligence > Machine Learning
  - Statistical Learning (1.00)
  - Reinforcement Learning (1.00)
  - Neural Networks (1.00)