Reviews: A Family of Robust Stochastic Operators for Reinforcement Learning
–Neural Information Processing Systems
SUMMARY: The paper considers the problem of designing a Bellman-like operator with certain properties: 1) Optimality preserving property: The greedy policy of the converged action-value function be the optimal policy. The motivation for the action-gap increasing property comes from the result of Farahmand [12] that shows that the distribution of the action-gap is a factor in the convergence to the optimal policy. Roughly speaking, when the action-gap is large, errors in estimating the action-value function Q becomes less important. The result is that we might converge to the optimal policy even though the estimated action-value function is far from the optimal one. Bellemare et al. [5] propose some operators that have these properties.
Neural Information Processing Systems
Jan-26-2025, 05:22:13 GMT