AITopics | guidance reward

Collaborating Authors

guidance reward

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Learning Guidance Rewards with Trajectory-space Smoothing

Neural Information Processing SystemsDec-23-2025, 17:38:47 GMT

Long-term temporal credit assignment is an important challenge in deep reinforcement learning (RL). It refers to the ability of the agent to attribute actions to consequences that may occur after a long time interval. Existing policy-gradient and Q-learning algorithms typically rely on dense environmental rewards that provide rich short-term supervision and help with credit assignment. However, they struggle to solve tasks with delays between an action and the corresponding rewarding feedback. To make credit assignment easier, recent works have proposed algorithms to learn dense guidance rewards that could be used in place of the sparse or delayed environmental rewards. This paper is in the same vein -- starting with a surrogate RL objective that involves smoothing in the trajectory-space, we arrive at a new algorithm for learning guidance rewards. We show that the guidance rewards have an intuitive interpretation, and can be obtained without training any additional neural networks. Due to the ease of integration, we use the guidance rewards in a few popular algorithms (Q-learning, Actor-Critic, Distributional-RL) and present results in single-agent and multi-agent tasks that elucidate the benefit of our approach when the environmental rewards are sparse or delayed.

artificial intelligence, machine learning, reinforcement learning, (7 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

0912d0f15f1394268c66639e39b26215-Supplemental.pdf

Neural Information Processing SystemsNov-13-2025, 07:53:51 GMT

guidance reward, machine learning, reinforcement learning, (17 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Robots (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.30)

Add feedback

0912d0f15f1394268c66639e39b26215-Paper.pdf

Neural Information Processing SystemsNov-13-2025, 07:53:44 GMT

guidance reward, machine learning, reinforcement learning, (17 more...)

Neural Information Processing Systems

Country:

North America > United States > Illinois (0.04)
North America > Canada (0.04)

Industry:

Media > Television (0.46)
Leisure & Entertainment (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

report the final policy performance (mean std) over the seeds. Due to space constraints, we omit the learning curves

Neural Information Processing SystemsNov-13-2025, 07:53:32 GMT

We thank all the reviewers for their constructive feedback on improving the paper. Q. Are exploration and credit assignment (due to delayed rewards) the same? We agree that it's important to clarify this distinction and We'll include this in the revision. Q. Unintended output in provided Q. IRCR if there are indeed dense rewards? We have added a distributional variant of SAC (EXP .

artificial intelligence, guidance reward, machine learning, (16 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.41)
Information Technology > Artificial Intelligence > Robots (0.31)

Add feedback

0912d0f15f1394268c66639e39b26215-Supplemental.pdf

Neural Information Processing SystemsOct-1-2025, 23:52:04 GMT

guidance reward, machine learning, reinforcement learning, (17 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Robots (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.30)

Add feedback

0912d0f15f1394268c66639e39b26215-Paper.pdf

Neural Information Processing SystemsOct-1-2025, 23:51:57 GMT

guidance reward, machine learning, reinforcement learning, (17 more...)

Neural Information Processing Systems

Country:

North America > United States > Illinois (0.04)
North America > Canada (0.04)

Industry:

Media > Television (0.46)
Leisure & Entertainment (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

report the final policy performance (mean std) over the seeds. Due to space constraints, we omit the learning curves

Neural Information Processing SystemsOct-1-2025, 23:51:46 GMT

artificial intelligence, guidance reward, machine learning, (16 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.41)
Information Technology > Artificial Intelligence > Robots (0.31)

Add feedback

Natural Humanoid Robot Locomotion with Generative Motion Prior

Zhang, Haodong, Zhang, Liang, Chen, Zhenghan, Chen, Lu, Wang, Yue, Xiong, Rong

arXiv.org Artificial IntelligenceMar-11-2025

Natural and lifelike locomotion remains a fundamental challenge for humanoid robots to interact with human society. However, previous methods either neglect motion naturalness or rely on unstable and ambiguous style rewards. In this paper, we propose a novel Generative Motion Prior (GMP) that provides fine-grained motion-level supervision for the task of natural humanoid robot locomotion. To leverage natural human motions, we first employ whole-body motion retargeting to effectively transfer them to the robot. Subsequently, we train a generative model offline to predict future natural reference motions for the robot based on a conditional variational auto-encoder. During policy training, the generative motion prior serves as a frozen online motion generator, delivering precise and comprehensive supervision at the trajectory level, including joint angles and keypoint positions. The generative motion prior significantly enhances training stability and improves interpretability by offering detailed and dense guidance throughout the learning process. Experimental results in both simulation and real-world environments demonstrate that our method achieves superior motion naturalness compared to existing approaches. Project page can be found at https://sites.google.com/view/humanoid-gmp

humanoid robot, locomotion, robot, (13 more...)

arXiv.org Artificial Intelligence

2503.09015

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Robots > Humanoid Robots (1.00)
Information Technology > Artificial Intelligence > Robots > Locomotion (0.85)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Policy Optimization with Smooth Guidance Rewards Learned from Sparse-Reward Demonstrations

Wang, Guojian, Wu, Faguo, Zhang, Xiao, Chen, Tianyuan

arXiv.org Artificial IntelligenceDec-30-2023

The sparsity of reward feedback remains a challenging problem in online deep reinforcement learning (DRL). Previous approaches have utilized temporal credit assignment (CA) to achieve impressive results in multiple hard tasks. However, many CA methods relied on complex architectures or introduced sensitive hyperparameters to estimate the impact of state-action pairs. Meanwhile, the premise of the feasibility of CA methods is to obtain trajectories with sparse rewards, which can be troublesome in sparse-reward environments with large state spaces. To tackle these problems, we propose a simple and efficient algorithm called Policy Optimization with Smooth Guidance (POSG) that leverages a small set of sparse-reward demonstrations to make reliable and effective long-term credit assignments while efficiently facilitating exploration. The key idea is that the relative impact of state-action pairs can be indirectly estimated using offline demonstrations rather than directly leveraging the sparse reward trajectories generated by the agent. Specifically, we first obtain the trajectory importance by considering both the trajectory-level distance to demonstrations and the returns of the relevant trajectories. Then, the guidance reward is calculated for each state-action pair by smoothly averaging the importance of the trajectories through it, merging the demonstration's distribution and reward information. We theoretically analyze the performance improvement bound caused by smooth guidance rewards and derive a new worst-case lower bound on the performance improvement. Extensive results demonstrate POSG's significant advantages in control performance and convergence speed compared to benchmark DRL algorithms. Notably, the specific metrics and quantifiable results are investigated to demonstrate the superiority of POSG.

machine learning, reinforcement learning, trajectory, (16 more...)

arXiv.org Artificial Intelligence

2401.00162

Country:

Asia > China > Beijing > Beijing (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.66)

Industry: Leisure & Entertainment (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Learning Guidance Rewards with Trajectory-space Smoothing

Gangwani, Tanmay, Zhou, Yuan, Peng, Jian

arXiv.org Machine LearningOct-23-2020

Long-term temporal credit assignment is an important challenge in deep reinforcement learning (RL). It refers to the ability of the agent to attribute actions to consequences that may occur after a long time interval. Existing policy-gradient and Q-learning algorithms typically rely on dense environmental rewards that provide rich short-term supervision and help with credit assignment. However, they struggle to solve tasks with delays between an action and the corresponding rewarding feedback. To make credit assignment easier, recent works have proposed algorithms to learn dense "guidance" rewards that could be used in place of the sparse or delayed environmental rewards. This paper is in the same vein -- starting with a surrogate RL objective that involves smoothing in the trajectory-space, we arrive at a new algorithm for learning guidance rewards. We show that the guidance rewards have an intuitive interpretation, and can be obtained without training any additional neural networks. Due to the ease of integration, we use the guidance rewards in a few popular algorithms (Q-learning, Actor-Critic, Distributional-RL) and present results in single-agent and multi-agent tasks that elucidate the benefit of our approach when the environmental rewards are sparse or delayed.

guidance reward, machine learning, reinforcement learning, (17 more...)

arXiv.org Machine Learning

2010.12718

Country: