potential-based reward
Maximum Expected Hitting Cost of a Markov Decision Process and Informativeness of Rewards
We propose a new complexity measure for Markov decision processes (MDPs), the maximum expected hitting cost (MEHC). This measure tightens the closely related notion of diameter [JOA10] by accounting for the reward structure. We show that this parameter replaces diameter in the upper bound on the optimal value span of an extended MDP, thus refining the associated upper bounds on the regret of several UCRL2-like algorithms. Furthermore, we show that potential-based reward shaping [NHR99] can induce equivalent reward functions with varying informativeness, as measured by MEHC. By analyzing the change in the maximum expected hitting cost, this work presents a formal understanding of the effect of potential-based reward shaping on regret (and sample complexity) in the undiscounted average-reward setting. We further establish that shaping can reduce or increase MEHC by at most a factor of two in a large class of MDPs with finite MEHC and unsaturated optimal average rewards.
- North America > Canada > Ontario > Toronto (0.14)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)
- Asia > Macao (0.14)
- Asia > China > Zhejiang Province > Hangzhou (0.04)
- Asia > China > Hong Kong (0.04)
- Africa > Ethiopia > Addis Ababa > Addis Ababa (0.04)
- Europe > United Kingdom > England > Greater London > London (0.04)
- Europe > Germany > Saarland > Saarbrücken (0.04)
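The shaping referenced throughout these abstracts is the classic potential-based form of Ng, Harada, and Russell [NHR99]: the agent receives r(s, a, s') + γΦ(s') − Φ(s) for an arbitrary state potential Φ, which leaves the optimal policy unchanged. A minimal sketch follows; the function names are illustrative, and γ = 1 corresponds to the undiscounted average-reward setting studied in the MEHC paper above.

```python
# Minimal sketch of potential-based reward shaping [NHR99].
# The shaped reward is r'(s, a, s') = r(s, a, s') + gamma * phi(s') - phi(s),
# which preserves optimal policies for any choice of potential phi.
from typing import Callable, Hashable

State = Hashable

def shape_reward(
    r: float,                       # original reward for transition (s, a, s')
    s: State,
    s_next: State,
    phi: Callable[[State], float],  # designer-chosen potential (illustrative)
    gamma: float = 1.0,             # 1.0 in the undiscounted setting
) -> float:
    return r + gamma * phi(s_next) - phi(s)
```

In the MEHC paper's terms, different choices of Φ yield equivalent reward functions of varying informativeness; the result quoted above bounds how much such a choice can change MEHC (a factor of at most two, under the stated conditions).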
Artificial Generals Intelligence: Mastering Generals.io with Reinforcement Learning
We introduce a real-time strategy game environment based on Generals.io, a game with thousands of weekly active players. Our environment is fully compatible with Gymnasium and PettingZoo and is capable of running thousands of frames per second on commodity hardware. We also present a reference agent, trained with supervised pre-training and self-play, which reached the top 0.003% of the 1v1 human leaderboard after only 36 hours on a single H100 GPU. To accelerate learning, we incorporate potential-based reward shaping and memory features. Our contributions of a modular RTS benchmark and a competitive baseline agent provide an accessible yet challenging platform for advancing multi-agent reinforcement learning research. The documented code, together with examples and tutorials, is available at https://github.com/strakam/generals-bots.
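Since the environment is PettingZoo-compatible, interaction follows the standard AEC protocol. The sketch below runs a random-policy episode; the environment constructor is left as a parameter because the repository's actual entry point should be taken from its tutorials rather than assumed here.

```python
# Random-agent episode under the standard PettingZoo AEC protocol.
# Only generic PettingZoo calls are used; how to construct the
# generals-bots environment itself should be taken from the repo's docs.
from pettingzoo.utils.env import AECEnv

def run_random_episode(env: AECEnv, seed: int = 0) -> None:
    env.reset(seed=seed)
    for agent in env.agent_iter():
        observation, reward, termination, truncation, info = env.last()
        if termination or truncation:
            action = None  # PettingZoo convention for finished agents
        else:
            action = env.action_space(agent).sample()  # random policy
        env.step(action)
    env.close()
```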
Towards Improving Reward Design in RL: A Reward Alignment Metric for RL Practitioners
Muslimani, Calarina, Johnstonbaugh, Kerrick, Chandramouli, Suyog, Booth, Serena, Knox, W. Bradley, Taylor, Matthew E.
Reinforcement learning agents are fundamentally limited by the quality of the reward functions they learn from, yet reward design is often overlooked under the assumption that a well-defined reward is readily available. However, in practice, designing rewards is difficult, and even when specified, evaluating their correctness is equally problematic: how do we know if a reward function is correctly specified? In our work, we address these challenges by focusing on reward alignment: assessing whether a reward function accurately encodes the preferences of a human stakeholder. As a concrete measure of reward alignment, we introduce the Trajectory Alignment Coefficient to quantify the similarity between a human stakeholder's ranking of trajectory distributions and those induced by a given reward function. We show that the Trajectory Alignment Coefficient exhibits desirable properties, such as not requiring access to a ground-truth reward, invariance to potential-based reward shaping, and applicability to online RL. Additionally, in an 11-person user study of RL practitioners, we found that access to the Trajectory Alignment Coefficient during reward selection led to statistically significant improvements. Compared to relying only on reward functions, our metric reduced cognitive workload by 1.5x, was preferred by 82% of users, and increased the success rate of selecting reward functions that produced performant policies by 41%.
- North America > Canada > Alberta (0.14)
- Asia > Middle East > Jordan (0.04)
- Questionnaire & Opinion Survey (1.00)
- Research Report > Experimental Study (0.93)
- Research Report > New Finding (0.93)
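The paper's exact construction compares rankings over trajectory distributions; as a simplified stand-in, one can rank individual trajectories by their return under a candidate reward and correlate that ranking with the stakeholder's using Kendall's tau, as sketched below (the helper names are hypothetical).

```python
# Simplified stand-in for the Trajectory Alignment Coefficient: Kendall's
# tau between a human ranking of trajectories and the ranking induced by a
# candidate reward function. The paper's coefficient is defined over
# trajectory *distributions*; this sketch ranks single trajectories.
from scipy.stats import kendalltau

def alignment_sketch(human_ranks, trajectories, reward_fn):
    """human_ranks[i]: stakeholder's rank of trajectories[i] (lower = better);
    reward_fn: maps a (state, action) pair to a float."""
    returns = [sum(reward_fn(s, a) for s, a in traj) for traj in trajectories]
    # A higher return should match a better (numerically lower) human rank,
    # so negate the returns before correlating.
    tau, _ = kendalltau(human_ranks, [-g for g in returns])
    return tau  # 1.0 = fully aligned, -1.0 = fully misaligned
```

The claimed invariance to potential-based shaping is visible in this simplified setting: with γ = 1, a common start state, and the usual convention Φ(terminal) = 0, shaping shifts every trajectory's return by the same constant, leaving the induced ranking, and hence the coefficient, unchanged.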
Vision-Based Generic Potential Function for Policy Alignment in Multi-Agent Reinforcement Learning
Ma, Hao, Wang, Shijie, Pu, Zhiqiang, Zhao, Siyao, Ai, Xiaolin
Guiding the policy of multi-agent reinforcement learning to align with human common sense is a difficult problem, largely due to the complexity of modeling common sense as a reward, especially in complex and long-horizon multi-agent tasks. Recent works have shown the effectiveness of reward shaping, such as potential-based rewards, in enhancing policy alignment. Existing works, however, primarily rely on experts to design rule-based rewards, which are often labor-intensive and lack a high-level semantic understanding of common sense. To solve this problem, we propose a hierarchical vision-based reward shaping method. At the bottom layer, a visual-language model (VLM) serves as a generic potential function, guiding the policy to align with human common sense through its intrinsic semantic understanding. To help the policy adapt to uncertainty and changes in long-horizon tasks, the top layer features an adaptive skill selection module based on a visual large language model (vLLM). The module uses instructions, video replays, and training records to dynamically select a suitable potential function from a pre-designed pool. Moreover, our method is theoretically proven to preserve the optimal policy. Extensive experiments conducted in the Google Research Football environment demonstrate that our method not only achieves a higher win rate but also effectively aligns the policy with human common sense.
- Leisure & Entertainment > Sports > Soccer (1.00)
- Leisure & Entertainment > Sports > Football (0.68)
- Leisure & Entertainment > Games (0.68)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.68)
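Concretely, the bottom layer amounts to plugging a learned scorer into the standard shaping term. The sketch below is schematic: `vlm_alignment_score` is a hypothetical stand-in for whatever VLM scores how well a rendered state matches a common-sense instruction, and the potential-based form is what yields the quoted optimal-policy preservation.

```python
# Schematic VLM-as-potential shaping: score each rendered state against an
# instruction and use the score as the potential phi in the standard
# potential-based form, which preserves the optimal policy.
from typing import Any, Callable

def vlm_shaped_reward(
    r: float,
    frame: Any,          # rendered observation for state s
    next_frame: Any,     # rendered observation for state s'
    vlm_alignment_score: Callable[[Any, str], float],  # hypothetical scorer
    instruction: str,    # e.g., a common-sense behavioral description
    gamma: float = 0.99,
) -> float:
    phi_s = vlm_alignment_score(frame, instruction)
    phi_next = vlm_alignment_score(next_frame, instruction)
    return r + gamma * phi_next - phi_s
```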
$TAR^2$: Temporal-Agent Reward Redistribution for Optimal Policy Preservation in Multi-Agent Reinforcement Learning
Kapoor, Aditya, Tessera, Kale-ab, Baranwal, Mayank, Khadilkar, Harshad, Albrecht, Stefano, Sun, Mingfei
In cooperative multi-agent reinforcement learning (MARL), learning effective policies is challenging when global rewards are sparse and delayed. This difficulty arises from the need to assign credit across both agents and time steps, a problem that existing methods often fail to address in episodic, long-horizon tasks. We propose Temporal-Agent Reward Redistribution ($TAR^2$), a novel approach that decomposes sparse global rewards into agent-specific, time-step-specific components, thereby providing more frequent and accurate feedback for policy learning. Theoretically, we show that $TAR^2$ (i) aligns with potential-based reward shaping, preserving the same optimal policies as the original environment, and (ii) maintains policy gradient update directions identical to those under the original sparse reward, ensuring unbiased credit signals. Empirical results on two challenging benchmarks, SMACLite and Google Research Football, demonstrate that $TAR^2$ significantly stabilizes and accelerates convergence, outperforming strong baselines like AREL and STAS in both learning speed and final performance. These findings establish $TAR^2$ as a principled and practical solution for agent-temporal credit assignment in sparse-reward multi-agent systems.
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.88)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)
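A schematic of the redistribution idea, not the paper's architecture: a learned scorer assigns a credit weight to every (timestep, agent) pair, and the weights are normalized so the dense rewards sum exactly to the original episodic reward, which is what keeps the credit signal unbiased.

```python
# Schematic temporal-agent reward redistribution: split one sparse episodic
# reward across timesteps and agents with softmax-normalized weights, so
# the episode's total reward is exactly preserved.
import numpy as np

def redistribute(episode_reward: float, scores: np.ndarray) -> np.ndarray:
    """scores: shape (T, n_agents), unnormalized credit scores (e.g., from a
    learned model); returns dense rewards of the same shape summing to
    episode_reward."""
    weights = np.exp(scores - scores.max())  # stable softmax numerator
    weights /= weights.sum()
    return episode_reward * weights

# Uniform scores spread a reward of 1.0 evenly: each of 3*2 entries gets 1/6.
dense = redistribute(1.0, np.zeros((3, 2)))
```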