Reviews: Bridging the Gap Between Value and Policy Based Reinforcement Learning
Neural Information Processing Systems
SUMMARY: The paper considers the entropy-regularized discounted Markov Decision Process (MDP) and establishes the relations among the optimal value function, action-value function, and policy. Moreover, it shows that the optimal value function and policy satisfy a temporal consistency in the form of a Bellman-like equation (Theorem 1), which extends to an n-step version (Corollary 2). The paper introduces Path Consistency Learning (PCL), which enforces this temporal consistency and is essentially a Bellman residual minimization procedure (Section 5).

SUMMARY OF EVALUATION:
Quality: Parts of the paper are sound (Sections 3 and 4); parts are not (Section 5).
Clarity: The paper is well-written.
Originality: Some results appear to be novel, but similar ideas and analyses have been proposed before.
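For reference, a sketch of the consistencies the summary describes, assuming deterministic dynamics s' = f(s, a), entropy weight tau > 0, and discount gamma; the notation is mine, not copied from the paper:

% Single-step temporal consistency (Theorem 1): for every action a,
% the optimal value and policy of the entropy-regularized MDP satisfy
V^*(s) = r(s, a) - \tau \log \pi^*(a \mid s) + \gamma V^*(s').
% The n-step extension (Corollary 2) telescopes along a path s_1, a_1, \dots, s_t:
V^*(s_1) = \gamma^{t-1} V^*(s_t)
  + \sum_{i=1}^{t-1} \gamma^{i-1} \bigl[ r(s_i, a_i) - \tau \log \pi^*(a_i \mid s_i) \bigr].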
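To make "Bellman residual minimization" concrete, below is a minimal sketch of the squared path-consistency residual in PyTorch. The function name pcl_loss and its argument layout are illustrative assumptions, not code from the paper; minimizing this residual jointly over the value and policy parameters is what the review characterizes as Bellman residual minimization.

import torch

def pcl_loss(values, log_pis, rewards, gamma=0.99, tau=0.01):
    """Squared path-consistency residual over a length-d sub-trajectory.

    values:  tensor [d+1], V_phi(s_t), ..., V_phi(s_{t+d})
    log_pis: tensor [d],   log pi_theta(a_{t+i} | s_{t+i})
    rewards: tensor [d],   r(s_{t+i}, a_{t+i})
    """
    d = rewards.shape[0]
    discounts = gamma ** torch.arange(d, dtype=rewards.dtype)
    # Discounted sum of entropy-regularized rewards along the path.
    path_return = (discounts * (rewards - tau * log_pis)).sum()
    # Consistency residual: -V(s_t) + gamma^d V(s_{t+d}) + path return.
    residual = -values[0] + (gamma ** d) * values[-1] + path_return
    return 0.5 * residual ** 2

# Example: a length-3 sub-trajectory with 4 state values (dummy data).
values = torch.randn(4, requires_grad=True)
log_pis = torch.log(torch.rand(3))
rewards = torch.randn(3)
loss = pcl_loss(values, log_pis, rewards)
loss.backward()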