Logarithmic Regret for Reinforcement Learning with Linear Function Approximation

He, Jiafan, Zhou, Dongruo, Gu, Quanquan

arXiv.org Machine Learning 

Designing efficient algorithms that learn and plan in sequential decision-making tasks with large state and action spaces has become a central task of modern reinforcement learning (RL) in recent years. RL often assumes the environment as a Markov Decision Process (MDP), described by a tuple of state space, action space, reward function, and transition probability function. Due to a large number of possible states and actions, traditional tabular reinforcement learning methods such as Q-learning (Watkins, 1989), which directly access each state-action pair, are computationally intractable. A common approach to cope with high-dimensional state and action spaces is to utilize feature mappings such as linear functions or neural networks to map states and actions to a low-dimensional space. Recently, a large body of literature has been devoted to provide regret bounds for online RL with linear function approximation. These works can be divided into two main categories. The first category of works is of model-free style, which directly parameterizes the action-value function as a linear function of some given feature mapping. For instance, Jin et al. (2020) studied the episodic MDPs with linear MDP assumption, which assumes that both transition probability function and reward function can be represented as a linear function of a given feature mapping.

Duplicate Docs Excel Report

Title
None found

Similar Docs  Excel Report  more

TitleSimilaritySource
None found