Global Optimality of Single-Timescale Actor-Critic under Continuous State-Action Space: A Study on Linear Quadratic Regulator
Xuyang Chen, Jingliang Duan, Lin Zhao
arXiv.org Artificial Intelligence
In addition to the policy update, AC methods employ a parallel critic update that bootstraps the Q-value for policy gradient estimation, which often yields reduced variance and faster convergence in training. Despite this empirical success, theoretical analysis of AC in its most practical form remains challenging. Existing works mostly focus on either the double-loop or the two-timescale variants. In double-loop AC, the actor is updated in the outer loop only after the critic has taken sufficiently many inner-loop steps to estimate the Q-value accurately [Yang et al., 2019; Kumar et al., 2019; Wang et al., 2019]. Hence, the convergence of the critic is decoupled from that of the actor, and the analysis separates into a policy evaluation sub-problem in the inner loop and a perturbed gradient descent in the outer loop. In two-timescale AC, the actor and the critic are updated simultaneously in each iteration using stepsizes on different timescales: the actor stepsize (denoted by α_t in the sequel) is typically smaller than the critic stepsize (denoted by β_t in the sequel), with their ratio going to zero as the iteration number goes to infinity (i.e., lim_{t→∞} α_t/β_t = 0). This two-timescale separation allows the critic to track the correct Q-value asymptotically.
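To make the two-timescale update scheme concrete, the following is a minimal Python sketch on a scalar LQR instance. It is an illustration of the generic structure, not the paper's algorithm: the dynamics constants, the Gaussian exploration policy, and the stepsize exponents (α_t ∝ t^{-0.9} for the actor, β_t ∝ t^{-0.6} for the critic, so α_t/β_t → 0) are all assumptions chosen for the example.

```python
import numpy as np

# Scalar LQR: x' = a*x + b*u + noise, stage cost c(x, u) = q*x^2 + r*u^2.
# Linear policy u = -k*x with Gaussian exploration; the critic fits
# V(x) ≈ w * x^2. All constants below are illustrative, not from the paper.
a, b, q, r, gamma = 0.9, 0.5, 1.0, 0.1, 0.95

def lqr_step(x, u, rng):
    cost = q * x**2 + r * u**2
    x_next = a * x + b * u + 0.1 * rng.standard_normal()
    return cost, x_next

rng = np.random.default_rng(0)
k, w = 0.0, 0.0      # actor (gain) and critic (value) parameters
x = rng.standard_normal()
sigma = 0.3          # exploration noise scale of the Gaussian policy

for t in range(1, 50_001):
    # Two-timescale stepsizes: the actor moves on the slower timescale,
    # and alpha_t / beta_t -> 0 as t -> infinity.
    alpha_t = 0.5 / t**0.9   # actor stepsize
    beta_t = 0.5 / t**0.6    # critic stepsize

    u = -k * x + sigma * rng.standard_normal()
    cost, x_next = lqr_step(x, u, rng)

    # Critic: one TD(0) semi-gradient step on V(x) ≈ w * x^2,
    # taken simultaneously with the actor update (no inner loop).
    td_error = cost + gamma * w * x_next**2 - w * x**2
    w += beta_t * td_error * x**2

    # Actor: policy-gradient step using the TD error as the advantage
    # estimate; for the Gaussian policy, d(log pi)/dk = -(u + k*x)*x / sigma^2.
    score = -(u + k * x) * x / sigma**2
    k -= alpha_t * td_error * score  # descend, since we minimize cost

    x = x_next
```

The contrast with double-loop AC is visible in the loop body: here the critic takes a single TD step per actor step, and the accuracy of the Q-value estimate is driven by the stepsize separation rather than by running the critic to convergence in an inner loop.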
May-9-2025