Improving Policies without Measuring Merits
Dayan, Peter, Singh, Satinder P.
–Neural Information Processing Systems, Dec-31-1996
Performing policy iteration in dynamic programming should only require knowledge of relative rather than absolute measures of the utility of actions (Werbos, 1991) - what Baird (1993) calls the advantages of actions at states. Nevertheless, most existing methods in dynamic programming (including Baird's) compute some form of absolute utility function. For smooth problems, advantages satisfy two differential consistency conditions (including the requirement that they be free of curl), and we show that enforcing these can lead to appropriate policy improvement solely in terms of advantages.

1 Introduction

In deciding how to change a policy at a state, an agent only needs to know the differences (called advantages) between the total return based on taking each action a for one step and then following the policy forever after, and the total return based on always following the policy (the conventional value of the state under the policy). The advantages are like differentials - they do not depend on the local levels of the total return. Indeed, Werbos (1991) used these facts in defining Dual Heuristic Programming (DHP), which learns the derivatives of these total returns with respect to the state.
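For concreteness, a sketch of the quantities involved in standard discrete-state MDP notation (the paper itself treats smooth, continuous problems, so this notation is illustrative rather than the authors' own): the value of state x under policy pi, the one-step action value, and the advantage are

\[
Q^{\pi}(x,a) = r(x,a) + \gamma \sum_{y} P(y \mid x, a)\, V^{\pi}(y),
\qquad
A^{\pi}(x,a) = Q^{\pi}(x,a) - V^{\pi}(x).
\]

Since V^pi(x) does not depend on a, argmax_a Q^pi(x,a) = argmax_a A^pi(x,a): greedy policy improvement at a state needs only the advantages, which is the relative-versus-absolute point made above.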