reward redistribution
Interpretable Reward Redistribution in Reinforcement Learning: A Causal Approach
A major challenge in reinforcement learning is to determine which state-action pairs are responsible for delayed future rewards. Reward redistribution offers a solution by re-assigning credit to each time step from observed sequences. While most current approaches construct the reward redistribution in an uninterpretable manner, we propose to explicitly model the contributions of state and action from a causal perspective, yielding an interpretable reward redistribution while preserving policy invariance. In this paper, we start by studying the role of causal generative models in reward redistribution, characterizing the generation of Markovian rewards and the trajectory-wise long-term return, and then propose a framework, called Generative Return Decomposition (GRD), for policy optimization in delayed-reward scenarios. Specifically, GRD first identifies the unobservable Markovian rewards and the causal relations in the generative process. GRD then uses the identified causal generative model to form a compact representation and trains the policy over the most favorable subspace of the agent's state space. Theoretically, we show that the unobservable Markovian reward function is identifiable, as are the underlying causal structure and causal models.
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Europe > Germany > North Rhine-Westphalia > Upper Bavaria > Munich (0.04)
- Asia > Middle East > Jordan (0.04)
- (5 more...)
- Education (0.46)
- Leisure & Entertainment > Games (0.30)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.34)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
- North America > United States > California > Alameda County > Berkeley (0.04)
- Europe > Netherlands > North Brabant > Eindhoven (0.04)
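The return-decomposition constraint at the heart of the GRD abstract can be made concrete with a small sketch: a network predicts the unobservable per-step Markovian reward from each state-action pair and is trained so that the predicted rewards sum to the observed trajectory-wise return. The code below is a minimal reading of that idea, not the authors' implementation; `RewardModel` and `decomposition_loss` are hypothetical names, and the causal-structure identification that distinguishes GRD is omitted entirely.

```python
# A minimal sketch of the return-decomposition constraint described in the
# GRD abstract, not the authors' implementation. `RewardModel` and
# `decomposition_loss` are hypothetical names; the causal-structure
# identification step is omitted.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Predicts the unobservable per-step Markovian reward from (s_t, a_t)."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, states: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([states, actions], dim=-1)).squeeze(-1)

def decomposition_loss(model, states, actions, episodic_return):
    # The trajectory-wise return is modeled as the sum of the predicted
    # Markovian rewards; the loss penalizes mismatch with the observed return.
    return (model(states, actions).sum() - episodic_return) ** 2

# Toy trajectory of length T = 10 with a return observed only at the end.
T, s_dim, a_dim = 10, 4, 2
model = RewardModel(s_dim, a_dim)
loss = decomposition_loss(model, torch.randn(T, s_dim),
                          torch.randn(T, a_dim), torch.tensor(3.0))
loss.backward()
```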
Likelihood Reward Redistribution
In many practical reinforcement learning scenarios, feedback is provided only at the end of a long horizon, leading to sparse and delayed rewards. Existing reward redistribution methods typically assume that per-step rewards are independent, thus overlooking interdependencies among state-action pairs. In this paper, we propose a \emph{Likelihood Reward Redistribution} (LRR) framework that addresses this issue by modeling each per-step reward with a parametric probability distribution whose parameters depend on the state-action pair. By maximizing the likelihood of the observed episodic return via a leave-one-out (LOO) strategy that leverages the entire trajectory, our framework inherently introduces an uncertainty regularization term into the surrogate objective. Moreover, we show that the conventional mean squared error (MSE) loss for reward redistribution emerges as a special case of our likelihood framework when the uncertainty is held fixed under a Gaussian distribution. When integrated with an off-policy algorithm such as Soft Actor-Critic, LRR yields dense and informative reward signals, resulting in superior sample efficiency and policy performance on Box2D and MuJoCo benchmarks.
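As a rough illustration of the likelihood objective described in the abstract above, consider the Gaussian case: each step predicts a reward mean and log-variance, and the leave-one-out target for step $t$ is whatever the observed episodic return implies once every other step is explained by its predicted mean. The sketch below is one plausible reading of the abstract, not the paper's exact objective; all names are hypothetical.

```python
# A minimal Gaussian instantiation of the LRR likelihood idea. The
# leave-one-out target construction here is one plausible reading of the
# abstract, not the paper's exact objective; all names are hypothetical.
import torch

def lrr_gaussian_nll(mu: torch.Tensor, log_var: torch.Tensor,
                     episodic_return: torch.Tensor) -> torch.Tensor:
    """mu, log_var: per-step predicted reward mean / log-variance, shape (T,)."""
    # Leave-one-out target for step t: what the episodic return implies for
    # r_t once every other step is explained by its predicted mean.
    loo_target = episodic_return - (mu.sum() - mu)          # shape (T,)
    var = log_var.exp()
    # Gaussian negative log-likelihood; the 0.5 * log_var term acts as the
    # uncertainty regularizer the abstract mentions.
    nll = 0.5 * log_var + (loo_target - mu) ** 2 / (2.0 * var)
    return nll.mean()

T = 8
mu = torch.randn(T, requires_grad=True)
log_var = torch.zeros(T)                                    # fixed variance
loss = lrr_gaussian_nll(mu, log_var, torch.tensor(2.0))
loss.backward()
```

With `log_var` frozen at a constant, the uncertainty term contributes no gradient and the objective collapses to a scaled per-step MSE, matching the special case the abstract notes.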
$TAR^2$: Temporal-Agent Reward Redistribution for Optimal Policy Preservation in Multi-Agent Reinforcement Learning
Kapoor, Aditya, Tessera, Kale-ab, Baranwal, Mayank, Khadilkar, Harshad, Albrecht, Stefano, Sun, Mingfei
In cooperative multi-agent reinforcement learning (MARL), learning effective policies is challenging when global rewards are sparse and delayed. This difficulty arises from the need to assign credit across both agents and time steps, a problem that existing methods often fail to address in episodic, long-horizon tasks. We propose Temporal-Agent Reward Redistribution ($TAR^2$), a novel approach that decomposes sparse global rewards into agent-specific, time-step-specific components, thereby providing more frequent and accurate feedback for policy learning. Theoretically, we show that $TAR^2$ (i) aligns with potential-based reward shaping, preserving the same optimal policies as the original environment, and (ii) maintains policy gradient update directions identical to those under the original sparse reward, ensuring unbiased credit signals. Empirical results on two challenging benchmarks, SMACLite and Google Research Football, demonstrate that $TAR^2$ significantly stabilizes and accelerates convergence, outperforming strong baselines like AREL and STAS in both learning speed and final performance. These findings establish $TAR^2$ as a principled and practical solution for agent-temporal credit assignment in sparse-reward multi-agent systems.
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.66)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)
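The core mechanical requirement in the $TAR^2$ abstract, splitting a sparse global return across both time steps and agents without losing reward mass, can be sketched as a normalized weighting over all (time step, agent) pairs. The snippet below is a hypothetical minimal sketch of that redistribution pattern, not the $TAR^2$ architecture; the learned scoring network and the theoretical guarantees are out of scope here.

```python
# A minimal sketch of temporal-agent reward redistribution: learned scores
# over every (time step, agent) pair are normalized so the dense rewards sum
# exactly to the sparse global return. Hypothetical, not the TAR^2 model.
import torch

def redistribute(scores: torch.Tensor, episodic_return: torch.Tensor) -> torch.Tensor:
    """scores: (T, n_agents) unnormalized credit logits for each pair."""
    weights = torch.softmax(scores.flatten(), dim=0).view_as(scores)
    return weights * episodic_return            # dense rewards, (T, n_agents)

T, n_agents = 6, 3
scores = torch.randn(T, n_agents)
dense = redistribute(scores, torch.tensor(5.0))
assert torch.isclose(dense.sum(), torch.tensor(5.0))  # reward mass preserved
```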
Agent-Temporal Credit Assignment for Optimal Policy Preservation in Sparse Multi-Agent Reinforcement Learning
Kapoor, Aditya, Swamy, Sushant, Tessera, Kale-ab, Baranwal, Mayank, Sun, Mingfei, Khadilkar, Harshad, Albrecht, Stefano V.
In multi-agent environments, agents often struggle to learn optimal policies due to sparse or delayed global rewards, particularly in long-horizon tasks where it is challenging to evaluate actions at intermediate time steps. We introduce Temporal-Agent Reward Redistribution (TAR$^2$), a novel approach designed to address the agent-temporal credit assignment problem by redistributing sparse rewards both temporally and across agents. TAR$^2$ decomposes sparse global rewards into time-step-specific rewards and calculates agent-specific contributions to these rewards. We theoretically prove that TAR$^2$ is equivalent to potential-based reward shaping, ensuring that the optimal policy remains unchanged. Empirical results demonstrate that TAR$^2$ stabilizes and accelerates the learning process. Additionally, we show that when TAR$^2$ is integrated with single-agent reinforcement learning algorithms, it performs as well as or better than traditional multi-agent reinforcement learning methods.
- Asia > India > Maharashtra > Mumbai (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Overview (0.88)
- Research Report > New Finding (0.34)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.69)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.47)
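Both $TAR^2$ abstracts lean on the equivalence to potential-based reward shaping for policy preservation, and the property itself is easy to check numerically. With shaping term $F(s, s') = \gamma \Phi(s') - \Phi(s)$, every trajectory's discounted return shifts by a constant determined only by the start and terminal potentials, so the ranking of policies, and hence the optimal policy, is unchanged. The snippet below verifies this identity for an arbitrary stand-in potential; it illustrates the standard shaping result, not $TAR^2$'s specific construction.

```python
# Numerical check of the potential-based shaping identity the abstracts
# invoke. phi is an arbitrary stand-in potential, not part of either paper.
import numpy as np

rng = np.random.default_rng(0)
gamma, T = 0.99, 20
states = rng.standard_normal(T + 1)           # toy scalar states s_0 .. s_T
rewards = rng.standard_normal(T)              # original rewards r_0 .. r_{T-1}
phi = lambda s: 3.0 * np.tanh(s)              # arbitrary potential function

# Shaped reward: r_t + gamma * phi(s_{t+1}) - phi(s_t)
shaped = rewards + gamma * phi(states[1:]) - phi(states[:-1])
discounts = gamma ** np.arange(T)
G, G_shaped = discounts @ rewards, discounts @ shaped

# The gap depends only on the start/terminal potentials, not on the policy.
expected_gap = gamma ** T * phi(states[-1]) - phi(states[0])
assert np.isclose(G_shaped - G, expected_gap)
```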