Goto

Collaborating Authors

 Maximum Entropy


Maximum Entropy Reinforcement Learning with Diffusion Policy

arXiv.org Artificial Intelligence

The Soft Actor-Critic (SAC) algorithm with a Gaussian policy has become a mainstream implementation for realizing the Maximum Entropy Reinforcement Learning (MaxEnt RL) objective, which incorporates entropy maximization to encourage exploration and enhance policy robustness. While the Gaussian policy performs well on simpler tasks, its exploration capacity and potential performance in complex multi-goal RL environments are limited by its inherent unimodality. In this paper, we employ the diffusion model, a powerful generative model capable of capturing complex multimodal distributions, as the policy representation to fulfill the MaxEnt RL objective, developing a method named MaxEnt RL with Diffusion Policy (MaxEntDP). Our method enables efficient exploration and brings the policy closer to the optimal MaxEnt policy. Experimental results on Mujoco benchmarks show that MaxEntDP outperforms the Gaussian policy and other generative models within the MaxEnt RL framework, and performs comparably to other state-of-the-art diffusion-based online RL algorithms. Our code is available at https://github.com/diffusionyes/MaxEntDP.


DIME:Diffusion-Based Maximum Entropy Reinforcement Learning

arXiv.org Artificial Intelligence

Maximum entropy reinforcement learning (MaxEnt-RL) has become the standard approach to RL due to its beneficial exploration properties. Traditionally, policies are parameterized using Gaussian distributions, which significantly limits their representational capacity. Diffusion-based policies offer a more expressive alternative, yet integrating them into MaxEnt-RL poses challenges--primarily due to the intractability of computing their marginal entropy. To overcome this, we propose Diffusion-Based Maximum Entropy RL (DIME). DIME leverages recent advances in approximate inference with diffusion models to derive a lower bound on the maximum entropy objective. Additionally, we propose a policy iteration scheme that provably converges to the optimal diffusion policy. Our method enables the use of expressive diffusion-based policies while retaining the principled exploration benefits of MaxEnt-RL, significantly outperforming other diffusion-based methods on challenging high-dimensional control benchmarks. It is also competitive with state-of-the-art non-diffusion based RL methods while requiring fewer algorithmic design choices and smaller update-to-data ratios, reducing computational complexity.


Evidence on the Regularisation Properties of Maximum-Entropy Reinforcement Learning

arXiv.org Artificial Intelligence

The generalisation and robustness properties of policies learnt through Maximum-Entropy Reinforcement Learning are investigated on chaotic dynamical systems with Gaussian noise on the observable. First, the robustness under noise contamination of the agent's observation of entropy regularised policies is observed. Second, notions of statistical learning theory, such as complexity measures on the learnt model, are borrowed to explain and predict the phenomenon. Results show the existence of a relationship between entropy-regularised policy optimisation and robustness to noise, which can be described by the chosen complexity measures.


Review for NeurIPS paper: Maximum-Entropy Adversarial Data Augmentation for Improved Generalization and Robustness

Neural Information Processing Systems

Weaknesses: It is not clear what are the main technical contributions of the paper. The paper oversells it theoretical results and the motivation for the proposed regularizer is weak. The paper misrepresents its contributions in terms of cosmetic theorems and lemma. See the points (i) - (iv) below. In the Appendix Line 70, it is written that After extending it to the case when Y is a deterministic function of X, we get the bound in Theorem 3''.


Review for NeurIPS paper: Maximum-Entropy Adversarial Data Augmentation for Improved Generalization and Robustness

Neural Information Processing Systems

The paper was extensively discussed among the reviewers. The final outcome was that all the reviewers agreed that the theoretical part of the paper is not significantly novel and the authors have to rewrite that part (please see the updated reviews), however, the approach is novel and experimental part is strong. To evaluate the experimental part further, a new reviewer was added after the rebuttal who has a good understanding on the experimental side of the topic of adversarial data augmentation. The new reviewer confirmed that the usefulness of the entropy-based regularization term toward providing robustness against unseen shifts is significant.


Review for NeurIPS paper: A Maximum-Entropy Approach to Off-Policy Evaluation in Average-Reward MDPs

Neural Information Processing Systems

Correctness: The main technical content seems to be correct. I have the following questions though: When using the linear assumption for the reward and the dynamics, the feature selection/setting is crutial. To relax the linear assumption, it is also mentioned, features can be pre-trained. What would be the recommended way to pre-learn it? For possible violation of the assumptions, how it would affect the results in practice?


Review for NeurIPS paper: A Maximum-Entropy Approach to Off-Policy Evaluation in Average-Reward MDPs

Neural Information Processing Systems

This is a borderline paper. The paper is technically sound and addressing OPE in average-reward setting is an important problem. Despite that the work is an extension of Duan and Wang (for discounted setting) to the average-reward setting, the algorithm is somewhat different, as Duan and Wang uses FQE whereas the current paper performs stationary-distribution estimation. That said, there are a few weaknesses that the paper should try to address or at least discuss: 1. The entropy maximization is a novel algorithmic element which does not appear in previous approaches in the discounted setting.


Reviews: Maximum Entropy Monte-Carlo Planning

Neural Information Processing Systems

This paper proposes a new MCTS algorithm, Maximum Entropy for Tree Search (MENTS), which combines the maximum entropy policy optimization framework with MCTS for more efficient online planning in sequential decision problems. The main idea is to replace the Monte Carlo value estimate with the softmax value estimate as in the maximum entropy policy optimization framework, such that the state value can be estimated and back-propagated more efficiently in the search tree. Another main novelty is that it proposes an optimal algorithm, Empirical Exponential Weight (E2W), to be the tree policy to do more exploration. It shows that MENTS can achieve an exponential convergence rate towards finding the optimal action at the root of the tree, which is much faster than the polynomial convergence rate of the UCT method. The experimental results also demonstrate that MENTS performs significantly better than UCT in terms of sample efficiency, in both synthetic problems and Atari games.


Reviews: Maximum Entropy Monte-Carlo Planning

Neural Information Processing Systems

This paper presents an appealing idea to combine current max-entropy methods in RL with Monte-Carlo Tree Search. A theoretical result shows improved rate of convergence, while empirical results show improved sample efficiency. The initial reviews were quite positive; I only noted a small number of issues mentioned in the reviews of R1 and R3. In our discussions after reading the author feedback, R3 noted that some of his concerns have not been addressed. R2 replied, saying that these concerns are relatively minor and can be addressed in the final version.


Self-Supervised Learning via Maximum Entropy Coding

Neural Information Processing Systems

A mainstream type of current self-supervised learning methods pursues a general-purpose representation that can be well transferred to downstream tasks, typically by optimizing on a given pretext task such as instance discrimination. In this work, we argue that existing pretext tasks inevitably introduce biases into the learned representation, which in turn leads to biased transfer performance on various downstream tasks. To cope with this issue, we propose Maximum Entropy Coding (MEC), a more principled objective that explicitly optimizes on the structure of the representation, so that the learned representation is less biased and thus generalizes better to unseen downstream tasks. Inspired by the principle of maximum entropy in information theory, we hypothesize that a generalizable representation should be the one that admits the maximum entropy among all plausible representations. To make the objective end-to-end trainable, we propose to leverage the minimal coding length in lossy data coding as a computationally tractable surrogate for the entropy, and further derive a scalable reformulation of the objective that allows fast computation.