On-line Policy Improvement using Monte-Carlo Search
Tesauro, Gerald, Galperin, Gregory R.
Neural Information Processing Systems
Policy iteration is known to have rapid and robust convergence properties, and for Markov tasks with lookup-table state-space representations, it is guaranteed to converge to the optimal policy. In typical uses of policy iteration, the policy improvement step is an extensive off-line procedure. For example, in dynamic programming, one performs a sweep through all states in the state space. Reinforcement learning provides another approach to policy improvement; recently, several authors have investigated using RL in conjunction with nonlinear function approximators to represent the value functions and/or policies (Tesauro, 1992; Crites and Barto, 1996; Zhang and Dietterich, 1996). These studies are based on following actual state-space trajectories rather than sweeps through the full state space, but are still too slow to compute improved policies in real time.
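The kind of on-line improvement the title refers to can be sketched as follows: at the current state, estimate each action's value by Monte-Carlo rollouts of the base policy, then play the best-scoring action. This is a minimal sketch, not the authors' exact algorithm; the simulator and policy interfaces (`simulate_step`, `actions`, `base_policy`) are hypothetical names assumed for illustration.

```python
# Assumed (hypothetical) interfaces:
#   simulate_step(state, action) -> (next_state, reward, done), sampling one transition
#   actions(state)               -> list of legal actions
#   base_policy(state)           -> the action the base policy would take

def rollout_return(state, base_policy, simulate_step, max_depth=200):
    """Estimate the return of following base_policy from `state`."""
    total, done, depth = 0.0, False, 0
    while not done and depth < max_depth:
        state, reward, done = simulate_step(state, base_policy(state))
        total += reward
        depth += 1
    return total

def improved_action(state, base_policy, simulate_step, actions, n_rollouts=100):
    """One on-line policy-improvement step: choose the action with the best
    Monte-Carlo Q-estimate under base_policy, computed for the current state
    only, instead of an off-line sweep over the full state space."""
    best_action, best_q = None, float("-inf")
    for a in actions(state):
        q = 0.0
        for _ in range(n_rollouts):
            # Try action `a` once, then follow the base policy to the end.
            s, r, done = simulate_step(state, a)
            q += r if done else r + rollout_return(s, base_policy, simulate_step)
        q /= n_rollouts
        if q > best_q:
            best_action, best_q = a, q
    return best_action
```

Because each decision touches only the current state and its rollouts, the computation is local and parallelizable, which is what makes real-time policy improvement feasible where a full dynamic-programming sweep is not.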
Dec-31-1997