Dual Policy Iteration
Sun, Wen, Gordon, Geoffrey J., Boots, Byron, Bagnell, J.
Neural Information Processing Systems
Recently, a novel class of Approximate Policy Iteration (API) algorithms has demonstrated impressive practical performance (e.g., ExIt from [1], AlphaGo-Zero from [2]). This new family of algorithms maintains, and alternately optimizes, two policies: a fast, reactive policy (e.g., a deep neural network) deployed at test time, and a slow, non-reactive policy (e.g., tree search) that can plan multiple steps ahead. The reactive policy is updated under supervision from the non-reactive policy, while the non-reactive policy is improved with guidance from the reactive policy. In this work we study this Dual Policy Iteration (DPI) strategy in an alternating optimization framework and provide a convergence analysis that extends existing API theory. We also develop a special instance of this framework which reduces the update of non-reactive policies to model-based optimal control using learned local models, and provides a theoretically sound way of unifying model-free and model-based RL approaches with unknown dynamics. We demonstrate the efficacy of our approach on various continuous control Markov Decision Processes.
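The alternating scheme the abstract describes can be illustrated with a toy sketch (this is an illustrative assumption, not the paper's algorithm: a small chain MDP where the "slow" policy plans by finite-horizon lookahead on the known model, and the "fast" policy is a tabular scorer trained by supervised imitation of the slow policy's choices):

```python
import numpy as np

# Toy DPI-style loop (illustrative only, not the paper's method):
# chain MDP with a goal at the right end; ACTIONS move left/right.
N_STATES, GOAL, HORIZON = 8, 7, 3
ACTIONS = (-1, +1)

def step(s, a):
    return min(max(s + a, 0), N_STATES - 1)

def reward(s):
    return 1.0 if s == GOAL else 0.0

def slow_policy(s, logits):
    # "Non-reactive" policy: exhaustive depth-HORIZON lookahead;
    # ties are broken toward the fast policy's preferred action,
    # i.e., the slow policy is guided by the reactive one.
    def best_return(state, depth):
        if depth == 0:
            return 0.0
        return max(reward(step(state, a)) + best_return(step(state, a), depth - 1)
                   for a in ACTIONS)
    vals = [reward(step(s, a)) + best_return(step(s, a), HORIZON - 1)
            for a in ACTIONS]
    if vals[0] == vals[1]:
        return int(np.argmax(logits[s]))
    return int(np.argmax(vals))

# "Fast" reactive policy: a table of action scores per state.
logits = np.zeros((N_STATES, len(ACTIONS)))

for _ in range(20):  # alternating optimization of the two policies
    for s in range(N_STATES):
        a_star = slow_policy(s, logits)  # slow policy improves on fast
        logits[s, a_star] += 1.0         # fast policy imitates slow (supervised)

fast_actions = logits.argmax(axis=1)  # greedy reactive policy after training
```

Within the lookahead horizon of the goal, the fast policy learns to move right; states too far from the goal see tied lookahead values, which is where a full DPI instantiation would rely on exploration and learned models rather than this fixed-depth search.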
Dec-31-2018