A Related Work
This group of approaches reuses policies learned on source tasks for target tasks. A series of studies directly exploits the smoothness of optimal values across tasks with function approximators.

Figure 9: The performance profiles [2, 15] of inference with GPI and constrained GPI on Reacher.

For its use in the zero-shot transfer problem, we first set four fixed goal locations at (0.1, 0.0), (0.0, 0.1), (… Our first observation is that while the transferred agents perform comparably on some tasks, constrained GPI makes a significant difference on others, especially on the "Harsh" target tasks whose task vectors contain many 1's as elements.
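As background for the caption above: generalized policy improvement (GPI) selects, in each state, the action that is greedy with respect to the maximum of the source policies' value estimates, a_GPI(s) ∈ argmax_a max_i Q_i(s, a); with successor features, Q_i(s, a) = ψ_i(s, a) · w for a target task vector w. Below is a minimal NumPy sketch of this selection rule; the array shapes and function name are illustrative assumptions, and the additional constraints on transferred value estimates that distinguish constrained GPI are omitted here.

    import numpy as np

    def gpi_action(psi, w, actions):
        """GPI action selection over a set of source policies.

        psi:     array of shape (n_policies, n_actions, d) holding the
                 successor features psi_i(s, a) of each source policy i,
                 evaluated at the current state s.
        w:       target-task vector of shape (d,).
        actions: candidate actions, indexed along axis 1 of psi.
        """
        # Q_i(s, a) = psi_i(s, a) . w for every source policy i.
        q = psi @ w                    # shape (n_policies, n_actions)
        # GPI acts greedily w.r.t. the max over source policies.
        q_gpi = q.max(axis=0)          # shape (n_actions,)
        return actions[int(np.argmax(q_gpi))]

For example, with two source policies, four discrete actions, and d = 3 feature dimensions, gpi_action(psi, w, list(range(4))) returns the index of the GPI-greedy action for the current state.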
Actor-Critic Reinforcement Learning with Phased Actor
Ruofan Wu, Junmin Zhong, Jennie Si
Policy gradient methods in actor-critic reinforcement learning (RL) have become perhaps the most promising approaches to solving continuous optimal control problems. However, the trial-and-error nature of RL and the inherent randomness associated with solution approximations cause variations in the learned optimal values and policies. This has significantly hindered their deployment in real-life applications where control responses must meet dynamic performance criteria deterministically. Here we propose a novel phased actor in actor-critic (PAAC) method, aiming to improve policy gradient estimation and thus the quality of the control policy. Specifically, PAAC accounts for both the $Q$ value and the TD error in its actor update. We prove qualitative properties of PAAC for learning convergence of the value and policy, solution optimality, and stability of system dynamics, and we additionally show variance reduction in policy gradient estimation. PAAC performance is systematically and quantitatively evaluated using the DeepMind Control Suite (DMC). Results show that PAAC yields significant improvements in total cost, learning variance, robustness, learning speed, and success rate. As PAAC can be piggybacked onto general policy gradient learning frameworks, we select well-known methods such as direct heuristic dynamic programming (dHDP), deep deterministic policy gradient (DDPG), and their variants to demonstrate its effectiveness. Consequently, we provide a unified view of these related policy gradient algorithms.
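Since the abstract describes PAAC as piggybacking onto policy gradient frameworks such as DDPG and dHDP, the following PyTorch-style sketch shows, purely schematically, how a $Q$-value signal and a TD-error signal can both enter an actor loss. The phase criterion, the TD-weighted form, and all names here are illustrative assumptions, not the paper's actual update rule.

    import torch

    def phased_actor_loss(actor, critic, target_critic, batch,
                          gamma=0.99, use_q_phase=True):
        """Illustrative actor loss with a Q-value and a TD-error phase.

        The phase criterion and exact combination used by PAAC are
        defined in the paper; this sketch only shows the two
        ingredients the abstract names entering an actor update.
        """
        s, a, r, s2, done = (batch[k] for k in ("s", "a", "r", "s2", "done"))
        q_pi = critic(s, actor(s))        # differentiable in actor params
        if use_q_phase:
            return -q_pi.mean()           # DDPG-style phase: ascend Q
        with torch.no_grad():             # TD error of the stored action
            td = r + gamma * (1 - done) * target_critic(s2, actor(s2)) \
                 - critic(s, a)
        return -(td * q_pi).mean()        # TD-weighted phase (illustrative)

In the Q-value phase the actor simply ascends the critic's estimate, as in DDPG; the TD-weighted phase is one plausible way to let the Bellman residual modulate the update, which is where a variance-reduction effect in the gradient estimate would be sought.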