reward redistribution
Interpretable Reward Redistribution in Reinforcement Learning: A Causal Approach
A major challenge in reinforcement learning is to determine which state-action pairs are responsible for delayed future rewards. Reward redistribution offers a solution by re-assigning credit to each time step from observed sequences. While most current approaches construct the reward redistribution in an uninterpretable manner, we propose to explicitly model the contributions of state and action from a causal perspective, yielding an interpretable reward redistribution while preserving policy invariance. In this paper, we start by studying the role of causal generative models in reward redistribution, characterizing the generation of Markovian rewards and the trajectory-wise long-term return, and then propose a framework, called Generative Return Decomposition (GRD), for policy optimization in delayed-reward scenarios. Specifically, GRD first identifies the unobservable Markovian rewards and the causal relations in the generative process. GRD then uses the identified causal generative model to form a compact representation and trains the policy over the most favorable subspace of the agent's state space. Theoretically, we show that the unobservable Markovian reward function is identifiable, as are the underlying causal structure and causal models.
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Europe > Germany > North Rhine-Westphalia > Upper Bavaria > Munich (0.04)
- Asia > Middle East > Jordan (0.04)
- (5 more...)
- Education (0.46)
- Leisure & Entertainment > Games (0.30)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.34)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
- North America > United States > California > Alameda County > Berkeley (0.04)
- Europe > Netherlands > North Brabant > Eindhoven (0.04)
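The return-decomposition constraint at the heart of the GRD abstract can be made concrete with a small sketch: a network predicts the unobservable per-step Markovian reward from each state-action pair and is trained so that the predicted rewards sum to the observed trajectory-wise return. The code below is a minimal reading of that idea, not the authors' implementation; `RewardModel` and `decomposition_loss` are hypothetical names, and the causal-structure identification that distinguishes GRD is omitted entirely.

```python
# A minimal sketch of the return-decomposition constraint described in the
# GRD abstract, not the authors' implementation. `RewardModel` and
# `decomposition_loss` are hypothetical names; the causal-structure
# identification step is omitted.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Predicts the unobservable per-step Markovian reward from (s_t, a_t)."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, states: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([states, actions], dim=-1)).squeeze(-1)

def decomposition_loss(model, states, actions, episodic_return):
    # The trajectory-wise return is modeled as the sum of the predicted
    # Markovian rewards; the loss penalizes mismatch with the observed return.
    return (model(states, actions).sum() - episodic_return) ** 2

# Toy trajectory of length T = 10 with a return observed only at the end.
T, s_dim, a_dim = 10, 4, 2
model = RewardModel(s_dim, a_dim)
loss = decomposition_loss(model, torch.randn(T, s_dim),
                          torch.randn(T, a_dim), torch.tensor(3.0))
loss.backward()
```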
Likelihood Reward Redistribution
In many practical reinforcement learning scenarios, feedback is provided only at the end of a long horizon, leading to sparse and delayed rewards. Existing reward redistribution methods typically assume that per-step rewards are independent, thus overlooking interdependencies among state-action pairs. In this paper, we propose a \emph{Likelihood Reward Redistribution} (LRR) framework that addresses this issue by modeling each per-step reward with a parametric probability distribution whose parameters depend on the state-action pair. By maximizing the likelihood of the observed episodic return via a leave-one-out (LOO) strategy that leverages the entire trajectory, our framework inherently introduces an uncertainty regularization term into the surrogate objective. Moreover, we show that the conventional mean squared error (MSE) loss for reward redistribution emerges as a special case of our likelihood framework when the uncertainty is held fixed under a Gaussian distribution. When integrated with an off-policy algorithm such as Soft Actor-Critic, LRR yields dense and informative reward signals, resulting in superior sample efficiency and policy performance on Box2D and MuJoCo benchmarks.
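As a rough illustration of the likelihood objective described in the abstract above, consider the Gaussian case: each step predicts a reward mean and log-variance, and the leave-one-out target for step $t$ is whatever the observed episodic return implies once every other step is explained by its predicted mean. The sketch below is one plausible reading of the abstract, not the paper's exact objective; all names are hypothetical.

```python
# A minimal Gaussian instantiation of the LRR likelihood idea. The
# leave-one-out target construction here is one plausible reading of the
# abstract, not the paper's exact objective; all names are hypothetical.
import torch

def lrr_gaussian_nll(mu: torch.Tensor, log_var: torch.Tensor,
                     episodic_return: torch.Tensor) -> torch.Tensor:
    """mu, log_var: per-step predicted reward mean / log-variance, shape (T,)."""
    # Leave-one-out target for step t: what the episodic return implies for
    # r_t once every other step is explained by its predicted mean.
    loo_target = episodic_return - (mu.sum() - mu)          # shape (T,)
    var = log_var.exp()
    # Gaussian negative log-likelihood; the 0.5 * log_var term acts as the
    # uncertainty regularizer the abstract mentions.
    nll = 0.5 * log_var + (loo_target - mu) ** 2 / (2.0 * var)
    return nll.mean()

T = 8
mu = torch.randn(T, requires_grad=True)
log_var = torch.zeros(T)                                    # fixed variance
loss = lrr_gaussian_nll(mu, log_var, torch.tensor(2.0))
loss.backward()
```

With `log_var` frozen at a constant, the uncertainty term contributes no gradient and the objective collapses to a scaled per-step MSE, matching the special case the abstract notes.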
$TAR^2$: Temporal-Agent Reward Redistribution for Optimal Policy Preservation in Multi-Agent Reinforcement Learning
Kapoor, Aditya, Tessera, Kale-ab, Baranwal, Mayank, Khadilkar, Harshad, Albrecht, Stefano, Sun, Mingfei
In cooperative multi-agent reinforcement learning (MARL), learning effective policies is challenging when global rewards are sparse and delayed. This difficulty arises from the need to assign credit across both agents and time steps, a problem that existing methods often fail to address in episodic, long-horizon tasks. We propose Temporal-Agent Reward Redistribution ($TAR^2$), a novel approach that decomposes sparse global rewards into agent-specific, time-step-specific components, thereby providing more frequent and accurate feedback for policy learning. Theoretically, we show that $TAR^2$ (i) aligns with potential-based reward shaping, preserving the same optimal policies as the original environment, and (ii) maintains policy gradient update directions identical to those under the original sparse reward, ensuring unbiased credit signals. Empirical results on two challenging benchmarks, SMACLite and Google Research Football, demonstrate that $TAR^2$ significantly stabilizes and accelerates convergence, outperforming strong baselines like AREL and STAS in both learning speed and final performance. These findings establish $TAR^2$ as a principled and practical solution for agent-temporal credit assignment in sparse-reward multi-agent systems.
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.66)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)
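The core mechanical requirement in the $TAR^2$ abstract, splitting a sparse global return across both time steps and agents without losing reward mass, can be sketched as a normalized weighting over all (time step, agent) pairs. The snippet below is a hypothetical minimal sketch of that redistribution pattern, not the $TAR^2$ architecture; the learned scoring network and the theoretical guarantees are out of scope here.

```python
# A minimal sketch of temporal-agent reward redistribution: learned scores
# over every (time step, agent) pair are normalized so the dense rewards sum
# exactly to the sparse global return. Hypothetical, not the TAR^2 model.
import torch

def redistribute(scores: torch.Tensor, episodic_return: torch.Tensor) -> torch.Tensor:
    """scores: (T, n_agents) unnormalized credit logits for each pair."""
    weights = torch.softmax(scores.flatten(), dim=0).view_as(scores)
    return weights * episodic_return            # dense rewards, (T, n_agents)

T, n_agents = 6, 3
scores = torch.randn(T, n_agents)
dense = redistribute(scores, torch.tensor(5.0))
assert torch.isclose(dense.sum(), torch.tensor(5.0))  # reward mass preserved
```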
Agent-Temporal Credit Assignment for Optimal Policy Preservation in Sparse Multi-Agent Reinforcement Learning
Kapoor, Aditya, Swamy, Sushant, Tessera, Kale-ab, Baranwal, Mayank, Sun, Mingfei, Khadilkar, Harshad, Albrecht, Stefano V.
In multi-agent environments, agents often struggle to learn optimal policies due to sparse or delayed global rewards, particularly in long-horizon tasks where it is challenging to evaluate actions at intermediate time steps. We introduce Temporal-Agent Reward Redistribution (TAR$^2$), a novel approach designed to address the agent-temporal credit assignment problem by redistributing sparse rewards both temporally and across agents. TAR$^2$ decomposes sparse global rewards into time-step-specific rewards and calculates agent-specific contributions to these rewards. We theoretically prove that TAR$^2$ is equivalent to potential-based reward shaping, ensuring that the optimal policy remains unchanged. Empirical results demonstrate that TAR$^2$ stabilizes and accelerates the learning process. Additionally, we show that when TAR$^2$ is integrated with single-agent reinforcement learning algorithms, it performs as well as or better than traditional multi-agent reinforcement learning methods.
- Asia > India > Maharashtra > Mumbai (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Overview (0.88)
- Research Report > New Finding (0.34)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.69)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.47)
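Both $TAR^2$ abstracts lean on the equivalence to potential-based reward shaping for policy preservation, and the property itself is easy to check numerically. With shaping term $F(s, s') = \gamma \Phi(s') - \Phi(s)$, every trajectory's discounted return shifts by a constant determined only by the start and terminal potentials, so the ranking of policies, and hence the optimal policy, is unchanged. The snippet below verifies this identity for an arbitrary stand-in potential; it illustrates the standard shaping result, not $TAR^2$'s specific construction.

```python
# Numerical check of the potential-based shaping identity the abstracts
# invoke. phi is an arbitrary stand-in potential, not part of either paper.
import numpy as np

rng = np.random.default_rng(0)
gamma, T = 0.99, 20
states = rng.standard_normal(T + 1)           # toy scalar states s_0 .. s_T
rewards = rng.standard_normal(T)              # original rewards r_0 .. r_{T-1}
phi = lambda s: 3.0 * np.tanh(s)              # arbitrary potential function

# Shaped reward: r_t + gamma * phi(s_{t+1}) - phi(s_t)
shaped = rewards + gamma * phi(states[1:]) - phi(states[:-1])
discounts = gamma ** np.arange(T)
G, G_shaped = discounts @ rewards, discounts @ shaped

# The gap depends only on the start/terminal potentials, not on the policy.
expected_gap = gamma ** T * phi(states[-1]) - phi(states[0])
assert np.isclose(G_shaped - G, expected_gap)
```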