AITopics | maxent rl

Collaborating Authors

maxent rl

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

5ffaa9f5182c2a36843f438bb1fdbdea-Supplemental.pdf

Neural Information Processing SystemsFeb-8-2026, 23:14:55 GMT

experiment, objective, rpc, (16 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)

Add feedback

A More Analysis

Neural Information Processing SystemsAug-18-2025, 11:23:53 GMT

This section describes how the objective for the encoder, model, and policy (Eq. The remaining difference between this objective and Eq. 5 is that the Q value term is scaled by This prior cannot be predicted from prior observations. Maximum entropy (MaxEnt) RL is a special case of our compression objective. In practice we perform gradient steps using the Adam [24] optimizer. An optimal agent must balance these information costs against the value of information gained from these observations.

artificial intelligence, experiment, machine learning, (18 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.34)

Add feedback

Rewriting History with Inverse RL: Hindsight Inference for Policy Improvement Benjamin Eysenbach

Neural Information Processing SystemsAug-15-2025, 16:31:18 GMT

Several prior works have found that relabeling past experience with different reward functions can improve sample efficiency.

inverse rl, reward function, trajectory, (14 more...)

Neural Information Processing Systems

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > Canada (0.04)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

5ffaa9f5182c2a36843f438bb1fdbdea-Supplemental.pdf

Neural Information Processing SystemsAug-14-2025, 19:03:05 GMT

experiment, objective, rpc, (16 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.94)

Add feedback

Rectifying Reinforcement Learning for Reward Matching

He, Haoran, Bengio, Emmanuel, Cai, Qingpeng, Pan, Ling

arXiv.org Artificial IntelligenceJun-4-2024

The Generative Flow Network (GFlowNet) is a probabilistic framework in which an agent learns a stochastic policy and flow functions to sample objects with probability proportional to an unnormalized reward function. GFlowNets share a strong resemblance to reinforcement learning (RL), that typically aims to maximize reward, due to their sequential decision-making processes. Recent works have studied connections between GFlowNets and maximum entropy (MaxEnt) RL, which modifies the standard objective of RL agents by learning an entropy-regularized objective. However, a critical theoretical gap persists: despite the apparent similarities in their sequential decision-making nature, a direct link between GFlowNets and standard RL has yet to be discovered, while bridging this gap could further unlock the potential of both fields. In this paper, we establish a new connection between GFlowNets and policy evaluation for a uniform policy. Surprisingly, we find that the resulting value function for the uniform policy has a close relationship to the flows in GFlowNets. Leveraging these insights, we further propose a novel rectified policy evaluation (RPE) algorithm, which achieves the same reward-matching effect as GFlowNets, offering a new perspective. We compare RPE, MaxEnt RL, and GFlowNets in a number of benchmarks, and show that RPE achieves competitive results compared to previous approaches. This work sheds light on the previously unexplored connection between (non-MaxEnt) RL and GFlowNets, potentially opening new avenues for future research in both fields.

arxiv preprint arxiv, gflownet, policy evaluation, (12 more...)

arXiv.org Artificial Intelligence

2406.02213

Country:

Europe > Austria > Vienna (0.04)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
Asia > Middle East > Jordan (0.04)
(2 more...)

Genre: Research Report > New Finding (0.46)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Discrete Probabilistic Inference as Control in Multi-path Environments

Deleu, Tristan, Nouri, Padideh, Malkin, Nikolay, Precup, Doina, Bengio, Yoshua

arXiv.org Artificial IntelligenceFeb-15-2024

We consider the problem of sampling from a discrete and structured distribution as a sequential decision problem, where the objective is to find a stochastic policy such that objects are sampled at the end of this sequential process proportionally to some predefined reward. While we could use maximum entropy Reinforcement Learning (MaxEnt RL) to solve this problem for some distributions, it has been shown that in general, the distribution over states induced by the optimal policy may be biased in cases where there are multiple ways to generate the same object. To address this issue, Generative Flow Networks (GFlowNets) learn a stochastic policy that samples objects proportionally to their reward by approximately enforcing a conservation of flows across the whole Markov Decision Process (MDP). In this paper, we extend recent methods correcting the reward in order to guarantee that the marginal distribution induced by the optimal MaxEnt RL policy is proportional to the original reward, regardless of the structure of the underlying MDP. We also prove that some flow-matching objectives found in the GFlowNet literature are in fact equivalent to well-established MaxEnt RL algorithms with a corrected reward. Finally, we study empirically the performance of multiple MaxEnt RL and GFlowNet algorithms on multiple problems involving sampling from discrete distributions.

gflownet, international conference, objective, (15 more...)

arXiv.org Artificial Intelligence

2402.10309

Country:

North America > Canada > Quebec > Montreal (0.14)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
Asia > China > Ningxia Hui Autonomous Region > Yinchuan (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.88)
(2 more...)

Add feedback

SAAC: Safe Reinforcement Learning as an Adversarial Game of Actor-Critics

Flet-Berliac, Yannis, Basu, Debabrota

arXiv.org Artificial IntelligenceJul-5-2022

Although Reinforcement Learning (RL) is effective for sequential decision-making problems under uncertainty, it still fails to thrive in real-world systems where risk or safety is a binding constraint. In this paper, we formulate the RL problem with safety constraints as a non-zero-sum game. While deployed with maximum entropy RL, this formulation leads to a safe adversarially guided soft actor-critic framework, called SAAC. In SAAC, the adversary aims to break the safety constraint while the RL agent aims to maximize the constrained value function given the adversary's policy. The safety constraint on the agent's value function manifests only as a repulsion term between the agent's and the adversary's policies. Unlike previous approaches, SAAC can address different safety criteria such as safe exploration, mean-variance risk sensitivity, and CVaR-like coherent risk sensitivity. We illustrate the design of the adversary for these constraints. Then, in each of these variations, we show the agent differentiates itself from the adversary's unsafe actions in addition to learning to solve the task. Finally, for challenging continuous control tasks, we demonstrate that SAAC achieves faster convergence, better efficiency, and fewer failures to satisfy the safety constraints than risk-averse distributional RL and risk-neutral soft actor-critic algorithms.

adversary, constraint, saac, (12 more...)

arXiv.org Artificial Intelligence

2204.09424

Country:

North America > United States > New York (0.04)
North America > United States > California > Santa Clara County > Stanford (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
Europe > France > Hauts-de-France > Nord > Lille (0.04)

Genre: Research Report (0.50)

Industry: Leisure & Entertainment > Games > Computer Games (0.41)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Closed-Form Analytical Results for Maximum Entropy Reinforcement Learning

Arriojas, Argenis, Tiomkin, Stas, Kulkarni, Rahul V.

arXiv.org Machine LearningJun-7-2021

We introduce a mapping between Maximum Entropy Reinforcement Learning (MaxEnt RL) and Markovian processes conditioned on rare events. In the long time limit, this mapping allows us to derive analytical expressions for the optimal policy, dynamics and initial state distributions for the general case of stochastic dynamics in MaxEnt RL. We find that soft-$\mathcal{Q}$ functions in MaxEnt RL can be obtained from the Perron-Frobenius eigenvalue and the corresponding left eigenvector of a regular, non-negative matrix derived from the underlying Markov Decision Process (MDP). The results derived lead to novel algorithms for model-based and model-free MaxEnt RL, which we validate by numerical simulations. The mapping established in this work opens further avenues for the application of novel analytical and computational approaches to problems in MaxEnt RL. We make our code available at: https://github.com/argearriojas/maxent-rl-mdp-scripts

eigenvector, maxent rl, trajectory distribution, (14 more...)

arXiv.org Machine Learning

2106.03931

Country:

North America > United States (0.08)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Maximum Entropy (0.61)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.48)

Add feedback

Maximum entropy RL (provably) solves some robust RL problems

AIHubApr-14-2021, 13:09:00 GMT

Nearly all real-world applications of reinforcement learning involve some degree of shift between the training environment and the testing environment. However, prior work has observed that even small shifts in the environment cause most RL algorithms to perform markedly worse. As we aim to scale reinforcement learning algorithms and apply them in the real world, it is increasingly important to learn policies that are robust to changes in the environment. Broadly, prior approaches to handling distribution shift in RL aim to maximize performance in either the average case or the worst case. While these methods have been successfully applied to a number of areas (e.g., self-driving cars, robot locomotion and manipulation), their success rests critically on the design of the distribution of environments.

algorithm, maxent rl, rl algorithm, (13 more...)

AIHub

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Maximum Entropy (0.41)

Add feedback

Filters

Collaborating Authors

maxent rl

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

e9f85782949743dcc42079e629332b5f-Supplemental.pdf

5ffaa9f5182c2a36843f438bb1fdbdea-Supplemental.pdf

A More Analysis

Rewriting History with Inverse RL: Hindsight Inference for Policy Improvement Benjamin Eysenbach

5ffaa9f5182c2a36843f438bb1fdbdea-Supplemental.pdf

Rectifying Reinforcement Learning for Reward Matching

Discrete Probabilistic Inference as Control in Multi-path Environments

SAAC: Safe Reinforcement Learning as an Adversarial Game of Actor-Critics

Closed-Form Analytical Results for Maximum Entropy Reinforcement Learning

Maximum entropy RL (provably) solves some robust RL problems