Maximum Entropy


Maximum Entropy Reinforcement Learning with Diffusion Policy

arXiv.org Artificial Intelligence

The Soft Actor-Critic (SAC) algorithm with a Gaussian policy has become a mainstream implementation for realizing the Maximum Entropy Reinforcement Learning (MaxEnt RL) objective, which incorporates entropy maximization to encourage exploration and enhance policy robustness. While the Gaussian policy performs well on simpler tasks, its exploration capacity and potential performance in complex multi-goal RL environments are limited by its inherent unimodality. In this paper, we employ the diffusion model, a powerful generative model capable of capturing complex multimodal distributions, as the policy representation to fulfill the MaxEnt RL objective, developing a method named MaxEnt RL with Diffusion Policy (MaxEntDP). Our method enables efficient exploration and brings the policy closer to the optimal MaxEnt policy. Experimental results on Mujoco benchmarks show that MaxEntDP outperforms the Gaussian policy and other generative models within the MaxEnt RL framework, and performs comparably to other state-of-the-art diffusion-based online RL algorithms. Our code is available at https://github.com/diffusionyes/MaxEntDP.
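For context, the MaxEnt RL objective referred to here is the standard entropy-augmented return (conventional notation, not notation taken from the paper): the policy is trained to maximize expected reward plus a temperature-weighted policy entropy at every visited state,

\[
J(\pi) = \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t} \gamma^{t}\Big(r(s_t, a_t) + \alpha\,\mathcal{H}\big(\pi(\cdot \mid s_t)\big)\Big)\right],
\qquad
\mathcal{H}\big(\pi(\cdot \mid s)\big) = -\,\mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[\log \pi(a \mid s)\big].
\]

A unimodal Gaussian policy can only increase this entropy by widening its single mode, whereas a multimodal policy can keep several distinct high-reward action modes alive, which is the motivation for the diffusion parameterization.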


Appendix for "Learning Neural Set Functions Under the Optimal Subset Oracle": B Derivations, B.1 Derivations of the Maximum Entropy Distribution

Neural Information Processing Systems

The probabilistic greedy model (PGM) solves optimization (1) with a differentiable extension of the greedy maximization algorithm (Tschiatschek et al., 2018). Because the resulting set mass function depends on the order in which elements are selected, Tschiatschek et al. (2018) construct the final set mass function by enumerating all possible permutations. However, maximizing the log-likelihood of (14) is prohibitively expensive and unscalable due to the exponential time complexity of enumerating all permutations, and although one can apply a Monte Carlo approximation over sampled permutations to reduce this cost, the resulting estimate of the log-likelihood is biased and noisy.

B.1 Derivations of the Maximum Entropy Distribution

The first step in solving problem (2) is to construct a proper set mass function p. A natural question is: what is the most appropriate set mass function? Generally, we prefer a model that assumes nothing about what is unknown. More formally, we should choose the most "uniform" distribution, i.e., the one that maximizes the Shannon entropy H(p) = −Σ_S p(S) log p(S). This principle is known as the "noninformative prior" (Jeffreys, 1946) and has been widely applied in many physical systems (Jaynes, 1957a,b). It turns out that the energy-based model is the only distribution with maximum entropy. More specifically, the following theorem holds (Theorem 1).
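As a sketch of the standard argument behind a result of this kind (generic notation, not the paper's exact statement; E(S) is an illustrative energy function and c a fixed expected-energy level), maximizing the Shannon entropy subject to normalization and a constraint on the expected energy yields the energy-based (Gibbs) form via Lagrange multipliers:

\[
\max_{p}\; -\sum_{S} p(S)\log p(S)
\quad \text{s.t.} \quad \sum_{S} p(S) = 1, \qquad \sum_{S} p(S)\,E(S) = c .
\]

Setting the derivative of the Lagrangian with respect to each p(S) to zero gives

\[
-\log p(S) - 1 + \lambda + \mu E(S) = 0
\;\;\Longrightarrow\;\;
p(S) = \frac{\exp(\mu E(S))}{\sum_{S'} \exp(\mu E(S'))},
\]

which, once the multiplier is absorbed into the energy, is exactly the energy-based model p(S) ∝ exp(−E(S)).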


DIME: Diffusion-Based Maximum Entropy Reinforcement Learning

arXiv.org Artificial Intelligence

Maximum entropy reinforcement learning (MaxEnt-RL) has become the standard approach to RL due to its beneficial exploration properties. Traditionally, policies are parameterized using Gaussian distributions, which significantly limits their representational capacity. Diffusion-based policies offer a more expressive alternative, yet integrating them into MaxEnt-RL poses challenges--primarily due to the intractability of computing their marginal entropy. To overcome this, we propose Diffusion-Based Maximum Entropy RL (DIME). DIME leverages recent advances in approximate inference with diffusion models to derive a lower bound on the maximum entropy objective. Additionally, we propose a policy iteration scheme that provably converges to the optimal diffusion policy. Our method enables the use of expressive diffusion-based policies while retaining the principled exploration benefits of MaxEnt-RL, significantly outperforming other diffusion-based methods on challenging high-dimensional control benchmarks. It is also competitive with state-of-the-art non-diffusion-based RL methods while requiring fewer algorithmic design choices and smaller update-to-data ratios, reducing computational complexity.
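One generic way to see why such a lower bound is available (illustrative only; the bound actually derived in the paper may differ in form): write the entropy of the final denoised action a_0 via the chain rule over the whole reverse (denoising) chain a_{0:T}, and replace the intractable conditional entropy by a cross-entropy against any tractable auxiliary model r,

\[
\mathcal{H}\big(\pi(a_0 \mid s)\big)
= \mathcal{H}\big(\pi(a_{0:T} \mid s)\big) - \mathcal{H}\big(\pi(a_{1:T} \mid a_0, s)\big)
\;\ge\; \mathcal{H}\big(\pi(a_{0:T} \mid s)\big) + \mathbb{E}_{\pi(a_{0:T} \mid s)}\big[\log r(a_{1:T} \mid a_0, s)\big],
\]

where the inequality holds because a conditional entropy never exceeds the corresponding cross-entropy (non-negativity of the KL divergence), and the joint entropy of the reverse chain decomposes into a sum of Gaussian conditional entropies that are available in closed form along each sampled trajectory.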


Evidence on the Regularisation Properties of Maximum-Entropy Reinforcement Learning

arXiv.org Artificial Intelligence

The generalisation and robustness properties of policies learnt through Maximum-Entropy Reinforcement Learning are investigated on chaotic dynamical systems with Gaussian noise on the observable. First, the robustness of entropy-regularised policies under noise contamination of the agent's observations is observed. Second, notions from statistical learning theory, such as complexity measures on the learnt model, are borrowed to explain and predict the phenomenon. Results show the existence of a relationship between entropy-regularised policy optimisation and robustness to noise, which can be described by the chosen complexity measures.


Maximum-Entropy Adversarial Data Augmentation for Improved Generalization and Robustness: Supplementary Material

Neural Information Processing Systems

To bound the deviation of the entropy estimates, we use McDiarmid's inequality [13], in a manner similar to [1]. For this, we must bound the change in value of each of the entropy estimates when a single instance in S is arbitrarily changed. A useful and easily proven inequality in that regard is the following: for any natural m, any a ∈ [0, 1 − 1/m], and any ε ≤ 1/m,

\[
\big|(a + \epsilon)\log(a + \epsilon) - a\log(a)\big| \;\le\; \frac{\log(m)}{m}. \qquad (1)
\]

With this inequality, a careful application of McDiarmid's inequality leads to the following lemma: for any δ ∈ (0, 1), with probability at least 1 − δ over the sample set,

\[
\big|\hat{H}(T) - \mathbb{E}[\hat{H}(T)]\big| \;\le\; |T|\,\log(m)\,\sqrt{\frac{\log(2/\delta)}{2m}} .
\]

First, we bound the change caused by a single replacement in Ĥ(T).
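For reference, this is the generic form of McDiarmid's inequality being instantiated (standard statement; the constants specific to this setting come from the per-instance bound above): if f satisfies the bounded-differences property |f(x_1, …, x_i, …, x_m) − f(x_1, …, x_i', …, x_m)| ≤ c_i for every coordinate i, then for all t > 0,

\[
\Pr\big(|f(X_1,\dots,X_m) - \mathbb{E}[f(X_1,\dots,X_m)]| \ge t\big) \;\le\; 2\exp\!\left(-\frac{2t^2}{\sum_{i=1}^{m} c_i^2}\right).
\]

Taking every c_i equal to the per-replacement change of Ĥ(T) (inequality (1) bounds each of the |T| summands by log(m)/m, so c_i ≤ |T| log(m)/m), setting the right-hand side to δ and solving for t recovers a deviation bound of the stated form.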


Review for NeurIPS paper: Maximum-Entropy Adversarial Data Augmentation for Improved Generalization and Robustness

Neural Information Processing Systems

Weaknesses: It is not clear what the main technical contributions of the paper are. The paper oversells its theoretical results, and the motivation for the proposed regularizer is weak. The paper misrepresents its contributions in terms of cosmetic theorems and lemmas. See the points (i)-(iv) below. In the Appendix (Line 70), it is written that "After extending it to the case when Y is a deterministic function of X, we get the bound in Theorem 3".


Review for NeurIPS paper: Maximum-Entropy Adversarial Data Augmentation for Improved Generalization and Robustness

Neural Information Processing Systems

The paper was extensively discussed among the reviewers. The final outcome was that all the reviewers agreed that the theoretical part of the paper is not significantly novel and the authors have to rewrite that part (please see the updated reviews); however, the approach is novel and the experimental part is strong. To evaluate the experimental part further, a new reviewer was added after the rebuttal who has a good understanding of the experimental side of the topic of adversarial data augmentation. The new reviewer confirmed that the usefulness of the entropy-based regularization term toward providing robustness against unseen shifts is significant.



Review for NeurIPS paper: A Maximum-Entropy Approach to Off-Policy Evaluation in Average-Reward MDPs

Neural Information Processing Systems

Correctness: The main technical content seems to be correct. I have the following questions, though: When using the linear assumption for the reward and the dynamics, the feature selection/setting is crucial. It is also mentioned that, to relax the linear assumption, features can be pre-trained. What would be the recommended way to pre-train them? If the assumptions are violated, how would that affect the results in practice?


A Maximum-Entropy Approach to Off-Policy Evaluation in Average-Reward MDPs

Neural Information Processing Systems

This work focuses on off-policy evaluation (OPE) with function approximation in infinite-horizon undiscounted Markov decision processes (MDPs). For MDPs that are ergodic and linear (i.e.