Reviews: Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction
– Neural Information Processing Systems
Summary: This paper proposes a new algorithm that helps stabilize off-policy Q-learning. The idea is to introduce approximate Bellman updates in which the backup is computed over actions constrained to lie in the support of the training data distribution. The paper identifies bootstrapping error as the main source of instability: the bootstrapping process may back up values from actions that do not lie in the training data distribution, and these errors accumulate through repeated updates. This work shows a way to mitigate this issue.
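The core idea can be illustrated with a minimal tabular sketch (illustrative only; the setup and names are my own, not the paper's actual algorithm): the max in the Bellman target is restricted to actions actually observed in the dataset for the next state, so the target never bootstraps from out-of-distribution actions.

```python
GAMMA, LR = 0.9, 0.5
N_STATES, N_ACTIONS = 4, 2

# Hypothetical offline dataset of (s, a, r, s') transitions; action
# coverage is partial, so naive max-over-all-actions would bootstrap
# from actions the data never shows.
dataset = [(0, 0, 1.0, 1), (1, 1, 0.0, 2), (2, 0, 1.0, 3), (3, 1, 0.0, 0)]

# Support set: the actions actually seen in each state.
support = {s: set() for s in range(N_STATES)}
for s, a, _, _ in dataset:
    support[s].add(a)

Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

def constrained_target(r, s_next):
    # Max restricted to in-support actions; fall back to the immediate
    # reward if no action was ever observed in s_next.
    acts = support[s_next]
    return r + GAMMA * max(Q[s_next][a] for a in acts) if acts else r

# Repeated approximate Bellman updates over the fixed dataset.
for _ in range(200):
    for s, a, r, s_next in dataset:
        Q[s][a] += LR * (constrained_target(r, s_next) - Q[s][a])
```

Because the backup only queries Q at in-support actions, value estimates for unseen actions (which may be arbitrarily wrong under function approximation) never contaminate the targets.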
Jan-26-2025, 21:33:36 GMT
- Technology: