A General Control-Theoretic Approach for Reinforcement Learning: Theory and Algorithms

Chen, Weiqin, Squillante, Mark S., Wu, Chai Wah, Paternain, Santiago

arXiv.org Artificial Intelligence 

For many years now, reinforcement learning (RL) has succeeded in solving a wide variety of decision-making and control problems, including in robotics [1, 2, 3, 4, 5]. Generally speaking, model-free methods [6, 7] often suffer from high sample complexity, requiring an inordinate number of samples and making them unsuitable for robotic applications where collecting large amounts of data is time-consuming, costly, and potentially dangerous for the system and its surroundings [8, 9, 10, 11, 12]. Model-based RL methods, on the other hand, have demonstrated significantly reduced sample complexity and have outperformed model-free approaches on various problems of decision making under uncertainty (see, e.g., [13, 14]). However, such model-based approaches can suffer from the difficulty of learning an appropriate model and from worse asymptotic performance than model-free approaches, owing to model bias: they inherently assume that the learned system dynamics model accurately represents the true system environment (see, e.g., [15, 16, 17]).

In this paper we propose a novel form of RL that seeks to directly learn an optimal control policy for a general underlying (unknown) dynamical system and to directly apply the corresponding learned optimal control policy within that system. This approach stands in strong contrast to many traditional model-based RL methods that, after learning a system dynamics model which is often of high complexity and dimensionality, use this model to compute an approximate solution of a corresponding (stochastic) dynamic programming problem, often by applying model predictive control (see, e.g., [18]). Our control-based RL (CBRL) approach instead directly learns the unknown parameters that derive, through control-theoretic means, an optimal control policy function from a family of control policy functions, often of much lower complexity and dimensionality, from which the optimal control policy is directly obtained.

The theoretical foundation and analysis of our CBRL approach are presented within the context of a general Markov decision process (MDP) framework. This framework extends the family of policies associated with the classical Bellman operator to a family of control-policy functions that map a vector of (unknown) parameters from a corresponding parameter set to a control policy which is optimal under those parameters, and it extends the domain of these control policies from a single state to span all (or a large subset of) states, with the (unknown) parameter vector encoding the global and local information that needs to be learned. Within the context of this MDP framework and our general CBRL approach, we establish theoretical results on convergence and optimality with respect to (w.r.t.) a CBRL contraction operator, analogous to the Bellman operator.
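To make the contrast concrete, the display below gives a minimal sketch for a standard discounted MDP with state space $\mathcal{S}$, action space $\mathcal{A}$, reward $r$, transition kernel $P$, and discount factor $\gamma \in (0,1)$, where $\pi_{\theta}$ denotes a control-policy function indexed by a parameter vector $\theta$ from a parameter set $\Theta$. The first line is the classical Bellman operator; the second is an illustrative parameterized analogue consistent with the verbal description above, and should be read as an assumption rather than the paper's precise definition of the CBRL operator.

$$(T v)(s) \;=\; \max_{a \in \mathcal{A}} \Big[ r(s,a) \;+\; \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a)\, v(s') \Big], \qquad s \in \mathcal{S},$$

$$(T_{\Theta} v)(s) \;=\; \sup_{\theta \in \Theta} \Big[ r\big(s, \pi_{\theta}(s)\big) \;+\; \gamma \sum_{s' \in \mathcal{S}} P\big(s' \mid s, \pi_{\theta}(s)\big)\, v(s') \Big], \qquad s \in \mathcal{S}.$$

Whereas $T$ optimizes over actions separately at each state, the parameterized operator optimizes over $\theta \in \Theta$, so a single learned parameter vector induces a control policy $\pi_{\theta}$ across all (or a large subset of) states; the convergence and optimality results in the paper are stated with respect to the CBRL analogue of such an operator.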
