Actor-Critic based Improper Reinforcement Learning
Zaki, Mohammadi, Mohan, Avinash, Gopalan, Aditya, Mannor, Shie
–arXiv.org Artificial Intelligence
We consider an improper reinforcement learning setting where a learner is given $M$ base controllers for an unknown Markov decision process, and wishes to combine them optimally to produce a potentially new controller that can outperform each of the base ones. This can be useful in tuning across controllers, learnt possibly in mismatched or simulated environments, to obtain a good controller for a given target environment with relatively few trials. Towards this, we propose two algorithms: (1) a Policy Gradient-based approach; and (2) an algorithm that can switch between a simple Actor-Critic (AC) based scheme and a Natural Actor-Critic (NAC) scheme depending on the available information. Both algorithms operate over a class of improper mixtures of the given controllers. For the first case, we derive convergence rate guarantees assuming access to a gradient oracle. For the AC-based approach we provide convergence rate guarantees to a stationary point in the basic AC case and to a global optimum in the NAC case. Numerical results on (i) the standard control theoretic benchmark of stabilizing an cartpole; and (ii) a constrained queueing task show that our improper policy optimization algorithm can stabilize the system even when the base policies at its disposal are unstable.
arXiv.org Artificial Intelligence
Jul-19-2022
- Country:
- North America > United States
- New York > New York County
- New York City (0.04)
- Massachusetts > Middlesex County
- Belmont (0.04)
- New York > New York County
- Europe
- United Kingdom > England
- Cambridgeshire > Cambridge (0.04)
- Hungary > Budapest
- Budapest (0.04)
- France
- Île-de-France > Paris
- Paris (0.04)
- Hauts-de-France > Nord
- Lille (0.04)
- Île-de-France > Paris
- United Kingdom > England
- Asia > Middle East
- North America > United States
- Genre:
- Research Report > New Finding (0.67)
- Industry:
- Leisure & Entertainment (0.67)