A general Markov decision process formalism for action-state entropy-regularized reward maximization

Grytskyy, Dmytro, Ramírez-Ruiz, Jorge, Moreno-Bote, Rubén

arXiv.org Artificial Intelligence 

It is well known that classical reinforcement learning, understood as learning from external rewards, has severe limitations. While it has been posited that reward is "enough" to learn any behavior [1], agents interacting with the real world often have access only to sparse rewards. Many approaches have been proposed to overcome this limitation by endowing agents with additional signals to be optimized alongside the rewards. These include minimizing surprise by refining predictions [2-7], novelty seeking by visiting states with low visit counts [8-10], generating actions that lead to predictable transitions (empowerment) [11-13], or seeking pure state entropy [14] and related forms of pure exploration objectives [3, 15-19], to name a few.

A popular choice for augmenting the reward signal, and the one we focus on in this paper, is entropy regularization [20-28]. The idea is that, all else being equal, the agent is driven to visit states and take actions that make its behavior as random as possible (pure entropy regularization, e.g., [25]), or is penalized for having a policy that departs strongly from a default policy (KL regularization, e.g., [20]). This type of regularization can lead to better exploration [14], more variable and realistic behaviors [29], more efficient learning [25, 30], and more robust solutions [21] against noise and adversarial attacks [19] than classical reinforcement learning algorithms. While the above approaches all use entropy as a regularizer of the reward optimization problem, the specific type of entropy regularizer varies widely across studies, and as a result the approaches and their solutions are fragmented. For instance, some use pure action entropy regularization [24-26, 31], others employ purely state entropy [14], others take advantage of KL action regularization [23, 28, 32], and yet others combine action and state entropy in balanced [22, 33] or arbitrary ways [29].
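For concreteness, the two most common regularized objectives mentioned above can be written as discounted returns augmented with an entropy or KL term. The notation below, including the temperature $\alpha$ and the default policy $\pi_0$, is illustrative and not necessarily that of the paper:

$$ J_{\mathrm{H}}(\pi) = \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} \left( r(s_t, a_t) + \alpha\, \mathcal{H}\bigl(\pi(\cdot \mid s_t)\bigr) \right) \right], \qquad \mathcal{H}\bigl(\pi(\cdot \mid s)\bigr) = -\sum_{a} \pi(a \mid s) \log \pi(a \mid s), $$

$$ J_{\mathrm{KL}}(\pi) = \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} \left( r(s_t, a_t) - \alpha\, D_{\mathrm{KL}}\bigl(\pi(\cdot \mid s_t) \,\|\, \pi_0(\cdot \mid s_t)\bigr) \right) \right]. $$

Note that with a uniform default policy the two coincide up to a constant, since $D_{\mathrm{KL}}\bigl(\pi(\cdot \mid s) \,\|\, \mathrm{unif}\bigr) = \log|\mathcal{A}| - \mathcal{H}\bigl(\pi(\cdot \mid s)\bigr)$, which is one reason the two forms of regularization are often treated together.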
