Weak Convergence Analysis of Online Neural Actor-Critic Algorithms
Lam, Samuel Chun-Hei, Sirignano, Justin, Wang, Ziheng
Neural network actor-critic algorithms are among the most popular methods in deep reinforcement learning. One neural network (the "actor") is trained to select the policy, while another neural network (the "critic") is simultaneously trained to learn the value function under the actor's policy. Specifically, the actor selects an action; given that action, a state transition occurs according to a Markov stochastic process and a reward (a measure of success or failure) is observed. The critic must learn to approximate the value function, i.e., the solution to the Bellman equation, given the actor's policy. Given the critic's estimate of the value function of the current policy, the actor is then updated to improve the value function (i.e., to increase the expected reward). Actor-critic algorithms are well-established methods in reinforcement learning [17, 15]; the key recent advance is the use of (deep) neural networks as the actor and critic, with parameters trained by gradient descent methods [26, 10, 25, 2, 29]. Analysis of neural network actor-critic algorithms is challenging due to: (1) the non-convexity of the neural networks; (2) the complex feedback loop between the actor and critic (the actor determines the sequence of data samples used to train the critic, and the critic is in turn used to train the actor); and (3) the simultaneous online updates of both the actor and critic, which mean that (3A) the distribution of the data, which depends upon the actor, constantly evolves in time and (3B) the actor is updated with a noisy, biased estimate of the value function.
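The online feedback loop described above can be illustrated with a minimal sketch. It is not the algorithm analyzed in the paper: for brevity it assumes a hypothetical two-state, two-action MDP, replaces both neural networks with tabular parameters (`theta` for the actor's softmax logits, `V` for a state-value critic), and uses the one-step temporal-difference error as the noisy, biased signal for the actor update. The function name `run_actor_critic`, the transition matrix `P`, the reward table `R`, and all learning rates are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def run_actor_critic(n_steps=20000, gamma=0.9, lr_critic=0.1, lr_actor=0.01, seed=0):
    rng = np.random.default_rng(seed)
    n_states, n_actions = 2, 2
    # Hypothetical toy MDP. P[s, a] is the next-state distribution:
    # action 0 mostly stays in the current state, action 1 mostly switches.
    P = np.array([[[0.9, 0.1], [0.1, 0.9]],
                  [[0.1, 0.9], [0.9, 0.1]]])
    # R[s, a] is the reward: action 1 always pays 1, action 0 pays 0.
    R = np.array([[0.0, 1.0],
                  [0.0, 1.0]])

    theta = np.zeros((n_states, n_actions))  # actor: softmax logits per state
    V = np.zeros(n_states)                   # critic: state-value estimates
    s = 0
    for _ in range(n_steps):
        # Actor selects an action from its current policy.
        pi = softmax(theta[s])
        a = rng.choice(n_actions, p=pi)
        # Environment: Markov state transition and observed reward.
        s_next = rng.choice(n_states, p=P[s, a])
        r = R[s, a]
        # Critic: one-step TD update toward the Bellman target.
        td_error = r + gamma * V[s_next] - V[s]
        V[s] += lr_critic * td_error
        # Actor: policy-gradient step using the critic's TD error.
        # For a softmax policy, grad of log pi(a|s) w.r.t. the logits is onehot(a) - pi.
        grad_log_pi = -pi
        grad_log_pi[a] += 1.0
        theta[s] += lr_actor * td_error * grad_log_pi
        s = s_next
    return theta, V

if __name__ == "__main__":
    theta, V = run_actor_critic()
    for s in range(theta.shape[0]):
        print("state", s, "policy", softmax(theta[s]), "value", V[s])
```

Both updates occur on every step from a single sample, so the critic is always learning the value of a policy that is itself changing, and the actor is always driven by an imperfect critic. This coupling is the source of difficulties (2) and (3) listed above.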
Mar-25-2024