Logistic $Q$-Learning

Joan Bas-Serrano, Sebastian Curi, Andreas Krause, Gergely Neu

arXiv.org Artificial Intelligence 

We propose a new reinforcement learning algorithm derived from a regularized linear-programming formulation of optimal control in MDPs. The method is closely related to the classic Relative Entropy Policy Search (REPS) algorithm of Peters et al. (2010), with the key difference that our method introduces a Q-function that enables efficient exact model-free implementation. The main feature of our algorithm (called Q-REPS) is a convex loss function for policy evaluation that serves as a theoretically sound alternative to the widely used squared Bellman error.

While REPS is elegantly derived from a principled linear-programming (LP) formulation of optimal control in MDPs, it has the serious shortcoming that its faithful implementation requires access to the true MDP for both the policy evaluation and improvement steps, even at deployment time. The usual way to address this limitation is to use an empirical approximation to the policy evaluation step and to project the policy from the improvement step into a parametric space (Deisenroth et al., 2013), losing all the theoretical guarantees of REPS in the process.
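To make the abstract's starting point concrete, the following is a minimal LaTeX sketch of the occupancy-measure LP behind REPS and Q-REPS and its relative-entropy regularization. The notation here ($\mu$ for the occupancy measure, $P$ for the transition kernel, $\mu_0$ for a reference measure, $\eta > 0$ for the regularization strength) is assumed for illustration; the paper's exact constraints, divergence terms, and discounting conventions may differ.

% Illustrative sketch only: the occupancy-measure LP for MDP control and a
% REPS-style relative-entropy regularization. Notation (mu, P, r, mu_0, eta)
% is assumed, not taken from the listing above; details follow the paper.
\documentclass{article}
\usepackage{amsmath}
\begin{document}
The classic linear program over state--action occupancy measures $\mu$
(average-reward setting, for simplicity):
\begin{align*}
  \max_{\mu \ge 0}\;& \sum_{s,a} \mu(s,a)\, r(s,a)\\
  \text{s.t.}\;& \sum_{a'} \mu(s',a') = \sum_{s,a} P(s' \mid s,a)\, \mu(s,a)
    \quad \forall s', \qquad \sum_{s,a} \mu(s,a) = 1.
\end{align*}
REPS softens the objective with a relative-entropy penalty toward a
reference measure $\mu_0$ (e.g., the previous iterate):
\[
  \max_{\mu \ge 0}\; \sum_{s,a} \mu(s,a)\, r(s,a)
    - \frac{1}{\eta}\, D\!\left(\mu \,\middle\|\, \mu_0\right),
\]
whose optimizer reweights $\mu_0$ multiplicatively,
$\mu(s,a) \propto \mu_0(s,a)\, e^{\eta\, \delta(s,a)}$ for an
advantage-like quantity $\delta$. In Q-REPS a Q-function takes this
role, giving the softmax policy update
$\pi_{k+1}(a \mid s) \propto \pi_k(a \mid s)\, e^{\eta\, Q_k(s,a)}$.
\end{document}

The practical point, echoing the REPS critique above: in vanilla REPS the quantity $\delta$ involves expectations under the true transition kernel $P$, so a faithful implementation needs the model even at deployment time; a Q-function, by contrast, can be fit from sampled transitions, which is what enables the exact model-free implementation the abstract refers to.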
