On the Global Convergence of Actor-Critic: A Case for Linear Quadratic Regulator with Ergodic Cost

Zhuoran Yang, Yongxin Chen, Mingyi Hong, Zhaoran Wang

arXiv.org Machine Learning 

Compared with the classical policy gradient algorithm (Williams, 1992), actor-critic tracks the action-value function (critic) in policy gradient in an online manner, and alternately updates the policy (actor) and the critic. On the one hand, the online update of the critic significantly reduces the variance of the policy gradient and hence leads to faster convergence. On the other hand, it also introduces algorithmic instability, which is often observed in practice (Islam et al., 2017) and parallels the notoriously unstable training of generative adversarial networks (Pfau and Vinyals, 2016). Such instability of actor-critic originates from several intertwining challenges, including (i) function approximation of the actor and critic, (ii) improper choice of stepsizes, (iii) the noise arising from stochastic approximation, (iv) the asynchrony between the actor and critic, and (v) possibly off-policy data used in the update of the critic. As a result, the convergence of actor-critic remains much less well understood than that of policy gradient, which itself remains open. Consequently, the practical use of actor-critic often lacks theoretical guidance. In this paper, we aim to theoretically understand the algorithmic instability of actor-critic.
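To make the alternating scheme concrete, here is a minimal actor-critic sketch on a scalar LQR instance: a Gaussian linear policy u = -k·x (actor) and a linear action-value estimate with quadratic features (critic), updated online in alternation. All specifics below (the dynamics a, b; cost weights q, r; discount factor; step sizes; the SARSA-style TD critic) are illustrative assumptions for a toy example, not the algorithm or analysis of the paper.

```python
import numpy as np

# Toy actor-critic on a scalar LQR problem, illustrating the alternating
# actor/critic updates described above. Every constant here is a
# hypothetical choice for readability, not taken from the paper.

rng = np.random.default_rng(0)

a, b = 0.9, 1.0            # dynamics: x' = a*x + b*u + process noise
q, r = 1.0, 1.0            # stage cost: c(x, u) = q*x^2 + r*u^2
gamma, sigma = 0.95, 0.3   # discount factor; exploration std of the policy

def features(x, u):
    """Quadratic features for the linear critic Q_w(x, u) = w . phi(x, u)."""
    return np.array([x * x, x * u, u * u])

k = 0.5                    # actor parameter: policy mean is u = -k*x
w = np.zeros(3)            # critic parameters of the action-value estimate

alpha_actor = 1e-3         # actor moves on a slower timescale than the
alpha_critic = 1e-2        # critic, a standard way to tame their asynchrony

x = rng.normal()
u = -k * x + sigma * rng.normal()
for t in range(50_000):
    cost = q * x * x + r * u * u
    x_next = a * x + b * u + 0.1 * rng.normal()
    u_next = -k * x_next + sigma * rng.normal()

    # Critic: online (SARSA-style) semi-gradient TD(0) update that tracks
    # the action-value function of the current policy.
    phi = features(x, u)
    delta = cost + gamma * w @ features(x_next, u_next) - w @ phi  # TD error
    w += alpha_critic * delta * phi

    # Actor: policy-gradient step using the critic's value Q_w(x, u) in
    # place of a Monte Carlo return (subtracting a baseline would further
    # reduce variance). We descend because we minimize cost.
    # grad_k log pi(u | x) = -(u + k*x) * x / sigma^2 for the Gaussian policy.
    grad_log_pi = -(u + k * x) * x / sigma**2
    k -= alpha_actor * (w @ phi) * grad_log_pi

    x, u = x_next, u_next

print(f"learned feedback gain k = {k:.3f}")
# For reference, the discounted-optimal gain for these constants is ~0.52
# (from the scalar discounted Riccati equation).
```

Note that the critic step size is an order of magnitude larger than the actor's; this two-timescale choice, which lets the critic roughly track the current policy's action-value function between actor updates, is a common way to mitigate the asynchrony issue in (iv) above.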
