Maximum Entropy
Review for NeurIPS paper: A Maximum-Entropy Approach to Off-Policy Evaluation in Average-Reward MDPs
This is a borderline paper. The paper is technically sound, and addressing OPE in the average-reward setting is an important problem. Although the work is an extension of Duan and Wang (for the discounted setting) to the average-reward setting, the algorithm is somewhat different: Duan and Wang use FQE, whereas the current paper performs stationary-distribution estimation. That said, there are a few weaknesses that the paper should try to address or at least discuss: 1. The entropy maximization is a novel algorithmic element which does not appear in previous approaches in the discounted setting.
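To make the reviewed approach concrete, the following is a hedged sketch of what a maximum-entropy, stationary-distribution formulation of average-reward OPE can look like; the feature map φ, the constraint set, and the estimator notation are illustrative assumptions rather than the paper's exact formulation.

\[
\begin{aligned}
\max_{d \ge 0}\quad & -\sum_{s,a} d(s,a)\,\log d(s,a)\\
\text{s.t.}\quad & \sum_{s,a} d(s,a) = 1,\\
& \sum_{s,a} d(s,a)\,\Big(\mathbb{E}_{s' \sim P(\cdot\mid s,a),\; a' \sim \pi(\cdot\mid s')}\big[\phi(s',a')\big] - \phi(s,a)\Big) = 0,
\end{aligned}
\]

with the average reward of the target policy then estimated as \( \hat{J}_\pi = \sum_{s,a} d(s,a)\, r(s,a) \); in practice the expectations and rewards are replaced by their empirical counterparts computed from the off-policy data.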
Distributional Policy Evaluation: a Maximum Entropy approach to Representation Learning
The Maximum Entropy (Max-Ent) framework has been effectively employed in a variety of Reinforcement Learning (RL) tasks. In this paper, we first propose a novel Max-Ent framework for policy evaluation in a distributional RL setting, named Distributional Maximum Entropy Policy Evaluation (D-Max-Ent PE). We derive a generalization-error bound that depends on the complexity of the representation employed, showing that this framework can explicitly take into account the features used to represent the state space while evaluating a policy. Then, we exploit these favorable properties to drive the representation learning of the state space in a Structural Risk Minimization fashion. We employ state-aggregation functions as feature functions and we specialize the D-Max-Ent approach into an algorithm, named D-Max-Ent Progressive Factorization, which constructs a progressively finer-grained representation of the state space by balancing the trade-off between preserving information (bias) and reducing the effective number of states, i.e., the complexity of the representation space (variance). Finally, we report the results of some illustrative numerical simulations, showing that the proposed algorithm matches the expected theoretical behavior and highlighting the relationship between aggregations and sample regimes.
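As a hedged illustration of the Structural Risk Minimization step described in the abstract, one can picture the choice among progressively finer state aggregations as minimizing an empirical evaluation error plus a complexity penalty; the symbols below (Φ_k for an aggregation, \(\widehat{\mathcal{E}}_n\) for the empirical error, C for the penalty) are generic placeholders, not the paper's notation.

\[
\widehat{\Phi} \in \operatorname*{arg\,min}_{\Phi_k \,\in\, \{\Phi_1, \Phi_2, \dots\}} \Big\{ \widehat{\mathcal{E}}_n(\Phi_k) + C\big(|\Phi_k|, n\big) \Big\},
\]

where each Φ_k aggregates the state space into |Φ_k| groups, \(\widehat{\mathcal{E}}_n(\Phi_k)\) measures how well the aggregated representation preserves the information needed to evaluate the policy on n samples (bias), and C(|Φ_k|, n) grows with the effective number of states and shrinks with n (variance).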
Reviews: Maximum Entropy Monte-Carlo Planning
This paper proposes a new MCTS algorithm, Maximum Entropy for Tree Search (MENTS), which combines the maximum entropy policy optimization framework with MCTS for more efficient online planning in sequential decision problems. The main idea is to replace the Monte Carlo value estimate with the softmax value estimate from the maximum entropy policy optimization framework, so that state values can be estimated and back-propagated more efficiently in the search tree. Another main novelty is that it proposes an optimal algorithm, Empirical Exponential Weight (E2W), as the tree policy to encourage more exploration. It shows that MENTS achieves an exponential convergence rate towards finding the optimal action at the root of the tree, which is much faster than the polynomial convergence rate of the UCT method. The experimental results also demonstrate that MENTS performs significantly better than UCT in terms of sample efficiency, in both synthetic problems and Atari games.
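For reference, the softmax value estimate that replaces the Monte Carlo average is the log-sum-exp of the action values at a temperature τ; the notation below is the standard one from maximum entropy policy optimization rather than copied from the paper.

\[
V_{\mathrm{sft}}(s) \;=\; \tau \log \sum_{a} \exp\!\big(Q_{\mathrm{sft}}(s,a)/\tau\big),
\qquad
Q_{\mathrm{sft}}(s,a) \;=\; r(s,a) + \mathbb{E}_{s' \mid s,a}\big[V_{\mathrm{sft}}(s')\big],
\]

so that each backup propagates a soft maximum over actions instead of an empirical mean of rollout returns.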
Maximum Entropy Monte-Carlo Planning
Chenjun Xiao, Ruitong Huang, Jincheng Mei, Dale Schuurmans, Martin Müller
We develop a new algorithm for online planning in large scale sequential decision problems that improves upon the worst case efficiency of UCT. The idea is to augment Monte-Carlo Tree Search (MCTS) with maximum entropy policy optimization, evaluating each search node by softmax values back-propagated from simulation. To establish the effectiveness of this approach, we first investigate the single-step decision problem, stochastic softmax bandits, and show that softmax values can be estimated at an optimal convergence rate in terms of mean squared error. We then extend this approach to general sequential decision making by developing a general MCTS algorithm, Maximum Entropy for Tree Search (MENTS). We prove that the probability of MENTS failing to identify the best decision at the root decays exponentially, which fundamentally improves the polynomial convergence rate of UCT. Our experimental results also demonstrate that MENTS is more sample efficient than UCT in both synthetic problems and Atari 2600 games.
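The following is a minimal, self-contained sketch of the two ingredients the abstract describes: a softmax (log-sum-exp) value backup and an E2W-style tree policy that mixes the softmax policy with uniform exploration. The node structure, the exploration constant eps, its decay schedule, and the deterministic-transition assumption in the backup are illustrative choices, not the authors' exact implementation.

```python
import math
import random

class Node:
    """A tree node holding soft Q-value estimates for each action."""
    def __init__(self, actions, tau):
        self.tau = tau                       # temperature of the softmax
        self.q_sft = {a: 0.0 for a in actions}
        self.visits = 0

    def soft_value(self):
        # V_sft(s) = tau * log sum_a exp(Q_sft(s, a) / tau), computed stably
        m = max(self.q_sft.values())
        return m + self.tau * math.log(
            sum(math.exp((q - m) / self.tau) for q in self.q_sft.values())
        )

    def softmax_policy(self):
        # pi(a | s) proportional to exp(Q_sft(s, a) / tau)
        v = self.soft_value()
        return {a: math.exp((q - v) / self.tau) for a, q in self.q_sft.items()}

def e2w_action(node, eps=1.0):
    """E2W-style tree policy: mix the softmax policy with uniform exploration,
    with an exploration weight that decays as the node is visited more often."""
    n_actions = len(node.q_sft)
    lam = eps * n_actions / max(math.log(node.visits + 2), 1e-8)
    lam = min(lam, 1.0)
    probs = node.softmax_policy()
    mixed = {a: (1 - lam) * p + lam / n_actions for a, p in probs.items()}
    actions, weights = zip(*mixed.items())
    return random.choices(actions, weights=weights, k=1)[0]

def backup(path, leaf_value):
    """Propagate a softmax value estimate from a simulated leaf to the root.
    `path` is a list of (node, action, reward) triples from root to leaf."""
    value = leaf_value
    for node, action, reward in reversed(path):
        node.visits += 1
        node.q_sft[action] = reward + value   # assumes deterministic transitions
        value = node.soft_value()
```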
Reviews: Maximum Entropy Monte-Carlo Planning
This paper presents an appealing idea to combine current max-entropy methods in RL with Monte-Carlo Tree Search. A theoretical result shows improved rate of convergence, while empirical results show improved sample efficiency. The initial reviews were quite positive; I only noted a small number of issues mentioned in the reviews of R1 and R3. In our discussions after reading the author feedback, R3 noted that some of his concerns have not been addressed. R2 replied, saying that these concerns are relatively minor and can be addressed in the final version.
A Maximum entropy calculation. Under the constraint on the borders, τ_x and τ_y
Average pixel displacement magnitude δ: We derive here the large-c asymptotic behavior of δ (Eq. 3). The asymptotic relations for τ that are reported in the main text are computed in a similar fashion. In Fig. 8, we check the agreement between the asymptotic prediction and empirical measurements. If δ ≲ 1, our results strongly depend on the choice of interpolation method.
Condition for diffeomorphism in the (T, c) plane: For a given value of c, there exists a temperature scale beyond which the transformation is no longer injective, affecting the topology of the image and creating spurious boundaries; see Fig. 9a-c for an illustration.
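As a hedged numerical companion to the two checks above, measuring the average pixel displacement δ empirically and testing whether a sampled transformation is still a diffeomorphism, here is a small sketch that works on a displacement field given on a pixel grid. The way the placeholder field is sampled (smoothed Gaussian noise scaled by a "temperature" T) is an assumption for illustration only, not the paper's construction.

```python
import numpy as np

def average_displacement(tau_x, tau_y):
    """Empirical average pixel displacement magnitude delta,
    given the two components of the displacement field on a grid."""
    return float(np.mean(np.sqrt(tau_x**2 + tau_y**2)))

def is_diffeomorphism(tau_x, tau_y):
    """Check injectivity of x -> x + tau(x) on the grid by requiring a
    positive Jacobian determinant of the deformed coordinates everywhere."""
    n = tau_x.shape[0]
    ys, xs = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    u, v = xs + tau_x, ys + tau_y           # deformed coordinates
    du_dy, du_dx = np.gradient(u)
    dv_dy, dv_dx = np.gradient(v)
    jac_det = du_dx * dv_dy - du_dy * dv_dx
    return bool(np.all(jac_det > 0))

def sample_field(n=32, T=1.0, seed=0):
    """Placeholder displacement field: smoothed Gaussian noise whose amplitude
    grows with a 'temperature' T; only meant to exercise the checks above."""
    rng = np.random.default_rng(seed)
    k = np.ones((5, 5)) / 25.0
    smooth = lambda z: np.real(np.fft.ifft2(np.fft.fft2(z) * np.fft.fft2(k, s=z.shape)))
    return (np.sqrt(T) * smooth(rng.standard_normal((n, n))),
            np.sqrt(T) * smooth(rng.standard_normal((n, n))))

tau_x, tau_y = sample_field(T=0.5)
print(average_displacement(tau_x, tau_y), is_diffeomorphism(tau_x, tau_y))
```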
A Bounded rationality, maximum entropy, and Boltzmann-rational policies
Given the constraint that the human's expected reward is satisfactory, how should we pick a distribution to model the human's choices? The principle of maximum entropy [52] gives us a guide. If we want to encode no extra information in the distribution, then we ought to pick the distribution that maximizes entropy subject to the constraint on the satisficing threshold.
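As a worked sketch of this argument (with generic notation, not necessarily the paper's): maximizing entropy over action distributions subject to a lower bound on expected reward yields a Boltzmann distribution over rewards.

\[
\max_{p}\ -\sum_{a} p(a)\log p(a)
\quad\text{s.t.}\quad \sum_a p(a)\,R(a) \ge \bar R,\quad \sum_a p(a)=1,\quad p \ge 0,
\]

and the stationarity condition of the Lagrangian gives

\[
p^\star(a) \;=\; \frac{\exp\big(\beta R(a)\big)}{\sum_{a'}\exp\big(\beta R(a')\big)},
\]

where the inverse temperature β ≥ 0 is the multiplier on the satisficing constraint: β = 0 recovers the uniform distribution, while β → ∞ recovers exact reward maximization.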
Self-Supervised Learning via Maximum Entropy Coding
A mainstream type of current self-supervised learning methods pursues a general-purpose representation that can be well transferred to downstream tasks, typically by optimizing on a given pretext task such as instance discrimination. In this work, we argue that existing pretext tasks inevitably introduce biases into the learned representation, which in turn leads to biased transfer performance on various downstream tasks. To cope with this issue, we propose Maximum Entropy Coding (MEC), a more principled objective that explicitly optimizes on the structure of the representation, so that the learned representation is less biased and thus generalizes better to unseen downstream tasks. Inspired by the principle of maximum entropy in information theory, we hypothesize that a generalizable representation should be the one that admits the maximum entropy among all plausible representations. To make the objective end-to-end trainable, we propose to leverage the minimal coding length in lossy data coding as a computationally tractable surrogate for the entropy, and further derive a scalable reformulation of the objective that allows fast computation.
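To make the "minimal coding length as a surrogate for entropy" idea concrete, here is a hedged sketch of a log-determinant coding-length objective computed on a batch of L2-normalized features. The per-batch covariance form and the d/(nε²) scaling follow the classical lossy-coding-rate formula and are assumptions here; the paper's full objective additionally uses two augmented views and a Taylor-expanded reformulation for scalability.

```python
import torch
import torch.nn.functional as F

def coding_length(z: torch.Tensor, eps: float = 0.05) -> torch.Tensor:
    """Lossy-coding-length surrogate for the entropy of a batch of features.

    z: (batch, dim) representations; eps: allowed distortion.
    Returns a scalar to *maximize* (use its negative as a training loss).
    """
    n, d = z.shape
    z = F.normalize(z, dim=1)                    # work with unit-norm features
    cov = z.T @ z / n                            # (dim, dim) empirical covariance, PSD
    scale = d / (n * eps ** 2)                   # distortion-dependent scaling (assumed)
    mat = torch.eye(d, device=z.device, dtype=z.dtype) + scale * cov
    # Coding length ~ 0.5 * (n + d) * logdet(I + scale * cov); well defined since PSD + I
    return 0.5 * (n + d) * torch.logdet(mat)

# usage sketch (encoder and augment are placeholders):
# z = encoder(augment(images))
# loss = -coding_length(z)
# loss.backward()
```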
Data-Driven Priors in the Maximum Entropy on the Mean Method for Linear Inverse Problems
King-Roskamp, Matthew, Choksi, Rustum, Hoheisel, Tim
We establish the theoretical framework for implementing the maximum entropy on the mean (MEM) method for linear inverse problems in the setting of approximate (data-driven) priors. We prove a.s. convergence for empirical means and further develop general estimates for the difference between the MEM solutions with different priors $\mu$ and $\nu$, based upon the epigraphical distance between their respective log-moment generating functions. These estimates allow us to establish a rate of convergence in expectation for empirical means. We illustrate our results with denoising on the MNIST and Fashion-MNIST data sets.
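For context, here is a hedged sketch of the MEM formulation the abstract refers to, in generic notation: given a prior μ on the unknown x and linear measurements b ≈ Ax, MEM picks the distribution closest to μ in KL divergence among those whose mean explains the data, and returns its mean; the dual is governed by the log-moment generating function of μ, which is why replacing μ by a data-driven prior ν is analyzed through that function.

\[
\hat{x}_{\mathrm{MEM}} = \mathbb{E}_{Q^\star}[x],
\qquad
Q^\star \in \operatorname*{arg\,min}_{Q \ll \mu}\ \big\{ \mathrm{KL}(Q \,\|\, \mu) \;:\; A\,\mathbb{E}_Q[x] = b \big\},
\]

with dual problem

\[
\sup_{\lambda}\ \big\{ \langle \lambda, b \rangle - \Lambda_\mu\!\big(A^{\top}\lambda\big) \big\},
\qquad
\Lambda_\mu(y) = \log \int e^{\langle y, x \rangle}\, d\mu(x)
\]

(the exact-constraint version is shown for simplicity).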
Off-Policy Maximum Entropy RL with Future State and Action Visitation Measures
Bolland, Adrien, Lambrechts, Gaspard, Ernst, Damien
We introduce a new maximum entropy reinforcement learning framework based on the distribution of states and actions visited by a policy. More precisely, an intrinsic reward function is added to the reward function of the Markov decision process that is to be controlled. For each state and action, this intrinsic reward is the relative entropy of the discounted distribution of states and actions (or of features of these states and actions) visited during the next time steps. We first prove that an optimal exploration policy, which maximizes the expected discounted sum of intrinsic rewards, also maximizes a lower bound on the state-action value function of the decision process under some assumptions. We also prove that the visitation distribution used in the intrinsic reward definition is the fixed point of a contraction operator. We then describe how to adapt existing algorithms to learn this fixed point and compute the intrinsic rewards so as to enhance exploration. Finally, a new practical off-policy maximum entropy reinforcement learning algorithm is introduced. Empirically, the resulting exploration policies achieve good state-action space coverage, and high-performing control policies are computed efficiently.
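A hedged sketch of the construction described above, in generic notation: let d^π_γ(· | s, a) be the discounted distribution of future states and actions (or of their features) when starting from (s, a) and following π; the intrinsic reward is its entropy (the abstract speaks of a relative entropy, whose reference measure is left implicit here), and it is added to the task reward with a trade-off weight λ that is an assumption of this sketch.

\[
d^{\pi}_{\gamma}(s',a' \mid s,a) \;=\; (1-\gamma)\sum_{t \ge 1}\gamma^{\,t-1}\,\Pr\!\big(s_t = s',\, a_t = a' \;\big|\; s_0 = s,\, a_0 = a,\, \pi\big),
\qquad
r_{\mathrm{int}}(s,a) \;=\; \mathcal{H}\!\big(d^{\pi}_{\gamma}(\cdot \mid s,a)\big),
\]

and the augmented objective is \( \mathbb{E}_{\pi}\big[\sum_{t \ge 0} \gamma^{t}\,\big(r(s_t,a_t) + \lambda\, r_{\mathrm{int}}(s_t,a_t)\big)\big] \), whose intrinsic part is what the optimal exploration policy maximizes.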