 maximum entropy reinforcement learning


A Diffusion Model Framework for Maximum Entropy Reinforcement Learning

Sanokowski, Sebastian, Patil, Kaustubh, Knoll, Alois

arXiv.org Machine Learning

Diffusion models have achieved remarkable success in data-driven learning and in sampling from complex, unnormalized target distributions. Building on this progress, we reinterpret Maximum Entropy Reinforcement Learning (MaxEntRL) as a diffusion model-based sampling problem. We tackle this problem by minimizing the reverse Kullback-Leibler (KL) divergence between the diffusion policy and the optimal policy distribution using a tractable upper bound. By applying the policy gradient theorem to this objective, we derive a modified surrogate objective for MaxEntRL that incorporates diffusion dynamics in a principled way. This leads to simple diffusion-based variants of Soft Actor-Critic (SAC), Proximal Policy Optimization (PPO) and Wasserstein Policy Optimization (WPO), termed DiffSAC, DiffPPO and DiffWPO. All of these methods require only minor implementation changes to their base algorithm. We find that on standard continuous control benchmarks, DiffSAC, DiffPPO and DiffWPO achieve better returns and higher sample efficiency than SAC and PPO.
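The reverse-KL objective described in this abstract can be sketched on a toy discrete problem, where the optimal MaxEnt policy is the Boltzmann distribution exp(Q/α)/Z. This is only an illustrative sketch: a plain categorical policy stands in for the paper's diffusion policy, and the Q-function and learning rate are made-up toy values.

```python
import numpy as np

# Toy 1-D bandit: Q-values over a discretized action grid.
# The optimal MaxEnt policy is pi*(a) ∝ exp(Q(a)/alpha).
# We fit a categorical policy by gradient descent on the reverse KL
# D = KL(pi || pi*), which (up to the constant log Z) is E_pi[log pi - Q/alpha].

alpha = 0.5
actions = np.linspace(-2.0, 2.0, 41)
Q = -(actions - 1.0) ** 2          # toy Q-function peaked at a = 1

target = np.exp(Q / alpha)
target /= target.sum()             # pi*, the Boltzmann target

logits = np.zeros_like(actions)    # policy parameters (categorical logits)
lr = 0.5
for _ in range(5000):
    pi = np.exp(logits - logits.max())
    pi /= pi.sum()
    ratio = np.log(pi) - np.log(target)
    kl = np.sum(pi * ratio)
    # exact gradient of the reverse KL w.r.t. the logits:
    # dD/dlogit_j = pi_j * (log pi_j - log pi*_j - D)
    logits -= lr * pi * (ratio - kl)

pi = np.exp(logits - logits.max())
pi /= pi.sum()
kl = np.sum(pi * (np.log(pi) - np.log(target)))
```

After training, `pi` concentrates around a = 1 and the reverse KL is close to zero; the paper's contribution is doing this with a diffusion-model policy via a tractable upper bound rather than with an explicit categorical distribution.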


Maximum Entropy Reinforcement Learning via Energy-Based Normalizing Flow

Neural Information Processing Systems

Existing Maximum-Entropy (MaxEnt) Reinforcement Learning (RL) methods for continuous action spaces are typically formulated based on actor-critic frameworks and optimized through alternating steps of policy evaluation and policy improvement. In the policy evaluation steps, the critic is updated to capture the soft Q-function. In the policy improvement steps, the actor is adjusted in accordance with the updated soft Q-function. In this paper, we introduce a new MaxEnt RL framework modeled using Energy-Based Normalizing Flows (EBFlow). Our method enables the calculation of the soft value function used in the policy evaluation target without Monte Carlo approximation.


Average-Reward Maximum Entropy Reinforcement Learning for Global Policy in Double Pendulum Tasks

Choe, Jean Seong Bjorn, Choi, Bumkyu, Kim, Jong-kook

arXiv.org Artificial Intelligence

This report presents our reinforcement learning-based approach for the swing-up and stabilisation tasks of the acrobot and pendubot, tailored specifically to the updated guidelines of the 3rd AI Olympics at ICRA 2025. Building upon our previously developed Average-Reward Entropy Advantage Policy Optimization (AR-EAPO) algorithm, we refined our solution to effectively address the new competition scenarios and evaluation metrics. Extensive simulations validate that our controller robustly manages these revised tasks, demonstrating adaptability and effectiveness within the updated framework. Building upon prior competitions at IJCAI 2023 [3] and IROS 2024 [4], the current edition places particular emphasis on global policy robustness, requiring reliable swing-up and stabilisation from arbitrary initial configurations under significantly increased external disturbances. The competition maintains its use of two different configurations: the acrobot, characterised by an inactive shoulder joint, and the pendubot, with an inactive elbow joint.


Agent Teaming in Mixed-Motive Situations – an AAAI Fall symposium

AIHub

Professor Subbarao Kambhampati's (Arizona State University) keynote discussed the dual nature of mental modeling in cooperation and competition. The importance of obfuscatory behavior, controlled observability planning, and the use of explanations for model reconciliation was emphasized, particularly regarding trust-building in human-robot interactions. Professor Gita Sukthankar's (University of Central Florida) talk focused on challenges in teamwork, using a case study on software engineering teams. Innovative techniques for distinguishing effective teams from ineffective ones were explored, setting the stage for discussions on the complexities of mixed-motive scenarios. Dr Marc Steinberg (Office of Naval Research) moderated an interactive discussion exploring research challenges in mixed-motive teams, including modeling humans, experimental setups, and measuring and assessing mixed-motive situations.


Count-Based Temperature Scheduling for Maximum Entropy Reinforcement Learning

Hu, Dailin, Abbeel, Pieter, Fox, Roy

arXiv.org Artificial Intelligence

Maximum Entropy Reinforcement Learning (MaxEnt RL) algorithms such as Soft Q-Learning (SQL) and Soft Actor-Critic trade off reward and policy entropy, which has the potential to improve training stability and robustness. Most MaxEnt RL methods, however, use a constant tradeoff coefficient (temperature), contrary to the intuition that the temperature should be high early in training to avoid overfitting to noisy value estimates and decrease later in training as we increasingly trust high value estimates to truly lead to good rewards. Moreover, our confidence in value estimates is state-dependent, increasing every time we use more evidence to update an estimate. In this paper, we present a simple state-based temperature scheduling approach, and instantiate it for SQL as Count-Based Soft Q-Learning (CBSQL). We evaluate our approach on a toy domain as well as in several Atari 2600 domains and show promising results.
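The state-based scheduling idea can be sketched as a per-state temperature that decays with the visitation count N(s), so rarely visited states keep a high temperature while well-explored states cool down. The 1/√N decay rate and the class interface below are illustrative assumptions, not necessarily the exact schedule used in CBSQL.

```python
import math
from collections import defaultdict

class CountBasedTemperature:
    """Per-state temperature schedule alpha(s) = alpha0 / sqrt(max(N(s), 1)).

    N(s) is a simple visitation counter; the decay rate is an illustrative
    choice, not the specific schedule from the CBSQL paper.
    """

    def __init__(self, alpha0=1.0):
        self.alpha0 = alpha0
        self.counts = defaultdict(int)   # N(s), zero for unseen states

    def update(self, state):
        # Call once per visit to `state` (e.g. per environment step).
        self.counts[state] += 1

    def temperature(self, state):
        # Unseen states get the full initial temperature alpha0.
        return self.alpha0 / math.sqrt(max(self.counts[state], 1))
```

In a soft Q-learning loop, `temperature(s)` would replace the constant α in the soft Bellman backup, so the entropy bonus shrinks precisely where value estimates have been updated most often.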