Q-Distribution guided Q-learning for offline reinforcement learning: Uncertainty penalized Q-value via consistency model
Neural Information Processing Systems
"Distribution shift" is the main obstacle to the success of offline reinforcement learning. A learning policy may take actions beyond the behavior policy's knowledge, referred to as Out-of-Distribution (OOD) actions. The Q-values for these OOD actions can be easily overestimated. As a result, the learning policy is biased by using incorrect Q-value estimates. One common approach to avoid Q-value overestimation is to make a pessimistic adjustment.