Finite-Sample Analysis of Off-Policy Natural Actor-Critic Algorithm

Sajad Khodadadian, Zaiwei Chen, Siva Theja Maguluri

arXiv.org, Machine Learning

Reinforcement Learning (RL) is a paradigm in which an agent aims to maximize its cumulative reward by searching for an optimal policy in an environment modeled as a Markov Decision Process (MDP) (Sutton and Barto, 2018). RL algorithms have achieved tremendous success in a wide range of applications, such as self-driving cars with Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2015) and AlphaGo in the game of Go (Silver et al., 2016). RL algorithms can be categorized into value space methods, such as Q-learning (Watkins and Dayan, 1992) and TD-learning (Sutton, 1988), and policy space methods, such as actor-critic (AC) (Konda and Tsitsiklis, 2000). Despite great empirical successes (Bahdanau et al., 2016; Wang et al., 2016), the finite-sample convergence of AC-type algorithms is not completely characterized theoretically. An AC algorithm can be thought of as a form of generalized policy iteration (Puterman, 1995) and consists of two phases, namely the actor and the critic. The objective of the actor is to improve the policy, while the critic aims to evaluate the performance of a specific policy. A step of the actor can be thought of as a step of Stochastic Gradient Ascent (Bottou et al., 2018) with preconditioning.
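
To make the two-phase structure concrete, here is a minimal sketch of a tabular natural actor-critic loop, not the off-policy algorithm analyzed in the paper: it assumes a small finite MDP exposed through a Gym-style `reset()` / `step(a)` interface, an on-policy critic, and a softmax policy, and all function names, step sizes, and the episode handling are illustrative assumptions. The critic runs a TD(0)/SARSA-style update toward Q of the current policy, while the actor applies the preconditioned (natural-gradient) ascent step, which for a tabular softmax policy reduces to adding the estimated action values to the policy preferences.

```python
import numpy as np

# Illustrative sketch only: a tabular, on-policy natural actor-critic loop.
# The environment interface (reset/step), step sizes, and iteration budget
# are assumptions, not the paper's exact off-policy setting.

def softmax_policy(theta, s):
    # Boltzmann (softmax) policy over actions for state s.
    prefs = theta[s] - np.max(theta[s])
    probs = np.exp(prefs)
    return probs / probs.sum()

def natural_actor_critic(env, num_states, num_actions,
                         gamma=0.99, alpha_critic=0.1, alpha_actor=0.01,
                         num_iterations=10_000):
    theta = np.zeros((num_states, num_actions))  # actor: policy preferences
    Q = np.zeros((num_states, num_actions))      # critic: estimate of Q^pi

    s = env.reset()
    for _ in range(num_iterations):
        probs = softmax_policy(theta, s)
        a = np.random.choice(num_actions, p=probs)
        s_next, r, done, _ = env.step(a)

        # Critic: one SARSA-style TD(0) step toward Q^pi of the current policy.
        a_next = np.random.choice(num_actions, p=softmax_policy(theta, s_next))
        td_target = r + (0.0 if done else gamma * Q[s_next, a_next])
        Q[s, a] += alpha_critic * (td_target - Q[s, a])

        # Actor: natural policy gradient step. With a tabular softmax policy,
        # preconditioning by the (inverse) Fisher information reduces the
        # update to adding the estimated action values to the preferences.
        theta[s] += alpha_actor * Q[s]

        s = env.reset() if done else s_next
    return theta, Q
```

The per-state actor update above uses Q rather than the advantage; since subtracting the state value shifts all preferences of a state by the same constant, the resulting softmax policy is unchanged.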
