Reviews: Bridging the Gap Between Value and Policy Based Reinforcement Learning
Neural Information Processing Systems
SUMMARY: The paper considers the entropy-regularized discounted Markov Decision Process (MDP) and establishes the relations among the optimal value function, action-value function, and policy. Moreover, it shows that the optimal value function and policy satisfy a temporal consistency in the form of a Bellman-like equation (Theorem 1), which extends to an n-step version (Corollary 2). The paper introduces Path Consistency Learning (PCL), which enforces this temporal consistency and is essentially a Bellman residual minimization procedure (Section 5).

SUMMARY OF EVALUATION:
Quality: Parts of the paper are sound (Sections 3 and 4); parts are not (Section 5).
Clarity: The paper is well-written.
Originality: Some results appear to be novel, but similar ideas and analyses have been proposed before.
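For reference, a sketch of the consistencies the summary describes, assuming deterministic dynamics s' = f(s, a), entropy weight tau > 0, and discount gamma; the notation is mine, not copied from the paper:

% Single-step temporal consistency (Theorem 1): for every action a,
% the optimal value and policy of the entropy-regularized MDP satisfy
V^*(s) = r(s, a) - \tau \log \pi^*(a \mid s) + \gamma V^*(s').
% The n-step extension (Corollary 2) telescopes along a path s_1, a_1, \dots, s_t:
V^*(s_1) = \gamma^{t-1} V^*(s_t)
  + \sum_{i=1}^{t-1} \gamma^{i-1} \bigl[ r(s_i, a_i) - \tau \log \pi^*(a_i \mid s_i) \bigr].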
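To make "Bellman residual minimization" concrete, below is a minimal sketch of the squared path-consistency residual in PyTorch. The function name pcl_loss and its argument layout are illustrative assumptions, not code from the paper; minimizing this residual jointly over the value and policy parameters is what the review characterizes as Bellman residual minimization.

import torch

def pcl_loss(values, log_pis, rewards, gamma=0.99, tau=0.01):
    """Squared path-consistency residual over a length-d sub-trajectory.

    values:  tensor [d+1], V_phi(s_t), ..., V_phi(s_{t+d})
    log_pis: tensor [d],   log pi_theta(a_{t+i} | s_{t+i})
    rewards: tensor [d],   r(s_{t+i}, a_{t+i})
    """
    d = rewards.shape[0]
    discounts = gamma ** torch.arange(d, dtype=rewards.dtype)
    # Discounted sum of entropy-regularized rewards along the path.
    path_return = (discounts * (rewards - tau * log_pis)).sum()
    # Consistency residual: -V(s_t) + gamma^d V(s_{t+d}) + path return.
    residual = -values[0] + (gamma ** d) * values[-1] + path_return
    return 0.5 * residual ** 2

# Example: a length-3 sub-trajectory with 4 state values (dummy data).
values = torch.randn(4, requires_grad=True)
log_pis = torch.log(torch.rand(3))
rewards = torch.randn(3)
loss = pcl_loss(values, log_pis, rewards)
loss.backward()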