A Convergent Form of Approximate Policy Iteration
Perkins, Theodore J., Precup, Doina
Neural Information Processing Systems
We study a new, model-free form of approximate policy iteration which uses Sarsa updates with linear state-action value function approximation for policy evaluation, and a "policy improvement operator" to generate a new policy based on the learned state-action values. We prove that if the policy improvement operator produces ε-soft policies and is Lipschitz continuous in the action values, with a constant that is not too large, then the approximate policy iteration algorithm converges to a unique solution from any initial policy. To our knowledge, this is the first convergence result for any form of approximate policy iteration under similar computational-resource assumptions.
Dec-31-2003
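The algorithm the abstract describes can be sketched in a few lines: Sarsa(0) updates on a linear state-action value function, with actions drawn from a policy improvement operator that is both ε-soft (every action keeps probability at least ε/|A|) and Lipschitz continuous in the action values (here a bounded-temperature softmax). The tiny two-state MDP, the one-hot features, and all constants below are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

N_S, N_A = 2, 2   # toy MDP: 2 states, 2 actions (assumed for illustration)
GAMMA = 0.9       # discount factor
EPS = 0.1         # epsilon-soft floor: every action prob >= EPS / N_A
TAU = 1.0         # softmax temperature; the operator's Lipschitz constant
                  # scales like 1/TAU, so larger TAU means a smaller constant

# Deterministic toy dynamics: in each state, one action stays put and
# pays reward 1, the other switches states and pays 0.
P = np.array([[0, 1], [0, 1]])          # next state = P[s, a]
R = np.array([[1.0, 0.0], [0.0, 1.0]])  # reward for taking a in s

def features(s, a):
    """One-hot state-action features (a special case of linear approximation)."""
    phi = np.zeros(N_S * N_A)
    phi[s * N_A + a] = 1.0
    return phi

def improvement_operator(q_row):
    """Epsilon-soft softmax over one state's action values: Lipschitz in
    the values with constant O(1/TAU), and every probability >= EPS / N_A."""
    z = np.exp((q_row - q_row.max()) / TAU)
    soft = z / z.sum()
    return (1.0 - EPS) * soft + EPS / N_A

def sample_action(w, s):
    q_row = np.array([w @ features(s, a) for a in range(N_A)])
    return rng.choice(N_A, p=improvement_operator(q_row))

w = np.zeros(N_S * N_A)  # weights of the linear value function
alpha = 0.1              # step size
s = 0
a = sample_action(w, s)
for _ in range(5000):    # on-policy Sarsa(0) with the soft policy
    s2, r = P[s, a], R[s, a]
    a2 = sample_action(w, s2)
    td = r + GAMMA * (w @ features(s2, a2)) - w @ features(s, a)
    w += alpha * td * features(s, a)
    s, a = s2, a2

print(np.round(w.reshape(N_S, N_A), 2))  # learned Q(s, a) table
```

Because the features are one-hot, this reduces to tabular Sarsa, but the same loop applies to any linear feature map; the paper's result concerns when the outer evaluate-then-improve iteration with such a soft, Lipschitz operator converges.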