td-error
- Research Report > New Finding (0.93)
- Overview (0.67)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)
… solid [R1, R3, R4], our experimental results valuable [R2, R3, R4], and our paper well-written [R1, R3, R4] …
We only included a single environment (Pusher-v2) in the main paper in order to save space. We will incorporate the suggested references into the paper. See also "About multi-step rollouts". … The reviewer suggests that the paper should first "show that minimizing the TD-error is not …" … Notice, however, that despite being commonly used and thought of as "intuitive", … Furthermore, Figure 1 indeed shows that minimizing the TD-error can lead to a critic that is far from the ideal one. We did not write that "model-based RL has no advantage in terms of sample-efficiency than model-free RL".
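For reference, the quantity under discussion is the standard one-step TD-error of a critic $Q_\theta$ with a target network $Q_{\bar{\theta}}$; the usual critic loss minimizes its square. Because the bootstrapped target reuses the critic's own estimates, driving the TD-error to zero on the observed data does not by itself guarantee that the critic is close to the true value function:

```latex
\delta_t = r_t + \gamma \, Q_{\bar{\theta}}(s_{t+1}, a_{t+1}) - Q_{\theta}(s_t, a_t),
\qquad
\mathcal{L}(\theta) = \mathbb{E}\big[\delta_t^2\big]
```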
Learning to Explore in Diverse Reward Settings via Temporal-Difference-Error Maximization
Griesbach, Sebastian, D'Eramo, Carlo
Numerous heuristics and advanced approaches have been proposed for exploration under different settings in deep reinforcement learning. Noise-based exploration generally fares well with dense, shaped rewards, while bonus-based exploration fares well with sparse rewards. However, these methods usually require additional tuning to deal with undesirable reward settings by adjusting hyperparameters and noise distributions. Rewards that actively discourage exploration, i.e., with an action cost and no other dense signal to follow, can pose a major challenge. We propose a novel exploration method, Stable Error-seeking Exploration (SEE), that is robust across dense, sparse, and exploration-adverse reward settings. To this end, we revisit the idea of maximizing the TD-error as a separate objective. Our method introduces three design choices to mitigate instability caused by far-off-policy learning, the conflict of interest of maximizing the cumulative TD-error in an episodic setting, and the non-stationary nature of TD-errors. SEE can be combined with off-policy algorithms without modifying the optimization pipeline of the original objective. In our experimental analysis, we show that a Soft Actor-Critic agent augmented with SEE performs robustly across three diverse reward settings in a variety of tasks without hyperparameter adjustments.
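The abstract describes maximizing the exploitation critic's TD-error as a separate exploration objective. As a rough illustration of that idea only (the names `td_error_reward`, `critic`, and `target_critic` below are assumptions, not the authors' SEE code), a hedged sketch of turning the critic's TD-error into an exploration reward might look like:

```python
import torch

def td_error_reward(critic, target_critic, batch, gamma=0.99):
    """Absolute one-step TD-error of the exploitation critic.

    Used here as the reward signal for a separate exploration policy; this is
    a minimal sketch of the general idea, not the authors' implementation.
    """
    s, a, r, s_next, a_next, done = batch  # tensors sampled from the replay buffer
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * target_critic(s_next, a_next)
        delta = target - critic(s, a)
    return delta.abs()  # exploration reward: magnitude of the critic's surprise
```

In such a setup, the exploitation agent (e.g., SAC) would keep training on the environment reward unchanged, which is consistent with the abstract's claim that SEE leaves the original optimization pipeline intact.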
- North America > United States > California > Los Angeles County > Long Beach (0.14)
- Europe > United Kingdom > England > Greater London > London (0.14)
- Europe > Austria > Vienna (0.14)
- Research Report > New Finding (0.93)
- Overview (0.67)
- Information Technology > Artificial Intelligence > Robots (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)
Deterministic Exploration via Stationary Bellman Error Maximization
Griesbach, Sebastian, D'Eramo, Carlo
Exploration is a crucial and distinctive aspect of reinforcement learning (RL) that remains a fundamental open problem. Several methods have been proposed to tackle this challenge. Commonly used methods inject random noise directly into the actions, indirectly via entropy maximization, or add intrinsic rewards that encourage the agent to steer toward novel regions of the state space. Another idea, explored in prior work, is to use the Bellman error as a separate optimization objective for exploration. In this paper, we introduce three modifications to stabilize the latter and arrive at a deterministic exploration policy. Our separate exploration agent is informed about the state of the exploitation agent, enabling it to account for previous experiences. Further components are introduced to make the exploration objective agnostic to the episode length and to mitigate the instability introduced by far-off-policy learning. Our experimental results show that our approach can outperform $\varepsilon$-greedy in dense and sparse reward settings.
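To picture what a deterministic exploration policy driven by Bellman-error maximization could look like, the sketch below pairs a deterministic actor with an assumed `error_critic` that estimates the Bellman error reachable from a state. The network sizes and the DDPG-style update are illustrative assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class ExplorationActor(nn.Module):
    """Deterministic policy whose sole objective is to reach states where the
    exploitation agent's Bellman error is expected to be large (sketch only)."""

    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # actions in [-1, 1]
        )

    def forward(self, state):
        return self.net(state)


def exploration_actor_loss(error_critic, actor, states):
    # DDPG-style deterministic policy gradient: ascend the error-critic's
    # estimate of the Bellman error obtainable from these states.
    return -error_critic(states, actor(states)).mean()
```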
- North America > United States > Massachusetts (0.14)
- Europe > United Kingdom > England (0.14)
- Europe > Germany > Bavaria (0.14)
- Asia > Japan (0.14)
- Energy > Oil & Gas (0.47)
- Leisure & Entertainment > Games (0.46)
Reviews: Transfer of Value Functions via Variational Methods
Update:
-----------
I had a look at the author response: it seems reasonable and contains a lot of additional information and experiments that address my main concerns with the paper. Had these comparisons been part of the paper in the first place, I would have voted for acceptance. I am now a bit on the fence: the paper could be accepted, but it would require a major revision. I will engage in discussion with the other reviewers, and ultimately the AC has to decide whether such big changes to the experimental section are acceptable within the review process.
Original review:
---------------------
The paper presents a method for transfer learning via a variational inference formulation in a reinforcement learning (RL) setting. The proposed approach is sound, novel, and interesting, and could be widely applicable (it makes no overly restrictive assumptions on the form of the learned (Q-)value function).
DIFFER: Decomposing Individual Reward for Fair Experience Replay in Multi-Agent Reinforcement Learning
Hu, Xunhan, Zhao, Jian, Zhou, Wengang, Feng, Ruili, Li, Houqiang
Cooperative multi-agent reinforcement learning (MARL) is a challenging task, as agents must learn complex and diverse individual strategies from a shared team reward. However, existing methods struggle to distinguish and exploit important individual experiences, as they lack an effective way to decompose the team reward into individual rewards. To address this challenge, we propose DIFFER, a powerful theoretical framework for decomposing individual rewards to enable fair experience replay in MARL. By enforcing the invariance of network gradients, we establish a partial differential equation whose solution yields the underlying individual reward function. The individual TD-error can then be computed from the resulting closed-form individual rewards, indicating the importance of each piece of experience for the learning task and guiding the training process. Our method is elegantly equivalent to the original learning framework when individual experiences are homogeneous, while adapting to achieve greater efficiency and fairness when diversity is observed. Our extensive experiments on popular benchmarks validate the effectiveness of our theory and method, demonstrating significant improvements in learning efficiency and fairness.
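The "fair experience replay" step can be pictured as prioritized sampling driven by the decomposed per-agent TD-errors. The snippet below is a generic prioritized-replay-style sketch under assumed hyperparameters (`alpha`, `eps`), not DIFFER's exact scheme:

```python
import numpy as np

def replay_probabilities(individual_td_errors, alpha=0.6, eps=1e-6):
    """Turn per-agent TD-error magnitudes into a sampling distribution over
    individual experiences (generic prioritized-replay sketch)."""
    priorities = (np.abs(np.asarray(individual_td_errors)) + eps) ** alpha
    return priorities / priorities.sum()
```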
- Transportation (0.47)
- Information Technology (0.46)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.35)