A Related Work
This group of approaches reuses policies learned on source tasks for target tasks. A series of studies directly exploits the smoothness of optimal values across tasks with function approximators.

Figure 9: The performance profiles [2, 15] of inference with GPI and constrained GPI on Reacher.

For its use in the zero-shot transfer problem, we first set four fixed goal locations at (0.1, 0.0), (0.0, 0.1), (… Our first observation is that while the transferred agents perform comparably on some tasks, constrained GPI makes a significant difference on others, especially on the "Harsh" target tasks whose task vectors contain many 1's as elements.
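As background for the caption above: generalized policy improvement (GPI) selects, in each state, the action that is greedy with respect to the maximum of the source policies' value estimates, a_GPI(s) ∈ argmax_a max_i Q_i(s, a); with successor features, Q_i(s, a) = ψ_i(s, a) · w for a target task vector w. Below is a minimal NumPy sketch of this selection rule; the array shapes and function name are illustrative assumptions, and the additional constraints on transferred value estimates that distinguish constrained GPI are omitted here.

    import numpy as np

    def gpi_action(psi, w, actions):
        """GPI action selection over a set of source policies.

        psi:     array of shape (n_policies, n_actions, d) holding the
                 successor features psi_i(s, a) of each source policy i,
                 evaluated at the current state s.
        w:       target-task vector of shape (d,).
        actions: candidate actions, indexed along axis 1 of psi.
        """
        # Q_i(s, a) = psi_i(s, a) . w for every source policy i.
        q = psi @ w                    # shape (n_policies, n_actions)
        # GPI acts greedily w.r.t. the max over source policies.
        q_gpi = q.max(axis=0)          # shape (n_actions,)
        return actions[int(np.argmax(q_gpi))]

For example, with two source policies, four discrete actions, and d = 3 feature dimensions, gpi_action(psi, w, list(range(4))) returns the index of the GPI-greedy action for the current state.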
Actor-Critic Reinforcement Learning with Phased Actor
Ruofan Wu, Junmin Zhong, Jennie Si
Policy gradient methods in actor-critic reinforcement learning (RL) have become perhaps the most promising approaches to solving continuous optimal control problems. However, the trial-and-error nature of RL and the inherent randomness associated with solution approximations cause variations in the learned optimal values and policies. This has significantly hindered their deployment in real-life applications where control responses must meet dynamic performance criteria deterministically. Here we propose a novel phased actor in actor-critic (PAAC) method, aiming to improve policy gradient estimation and thus the quality of the control policy. Specifically, PAAC accounts for both the $Q$ value and the TD error in its actor update. We prove qualitative properties of PAAC for learning convergence of the value and policy, solution optimality, and stability of system dynamics, and we additionally show variance reduction in policy gradient estimation. PAAC performance is systematically and quantitatively evaluated using the DeepMind Control Suite (DMC). Results show that PAAC yields significant improvements in total cost, learning variance, robustness, learning speed, and success rate. As PAAC can be piggybacked onto general policy gradient learning frameworks, we select well-known methods such as direct heuristic dynamic programming (dHDP), deep deterministic policy gradient (DDPG), and their variants to demonstrate its effectiveness. Consequently, we provide a unified view of these related policy gradient algorithms.
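Since the abstract describes PAAC as piggybacking onto policy gradient frameworks such as DDPG and dHDP, the following PyTorch-style sketch shows, purely schematically, how a $Q$-value signal and a TD-error signal can both enter an actor loss. The phase criterion, the TD-weighted form, and all names here are illustrative assumptions, not the paper's actual update rule.

    import torch

    def phased_actor_loss(actor, critic, target_critic, batch,
                          gamma=0.99, use_q_phase=True):
        """Illustrative actor loss with a Q-value and a TD-error phase.

        The phase criterion and exact combination used by PAAC are
        defined in the paper; this sketch only shows the two
        ingredients the abstract names entering an actor update.
        """
        s, a, r, s2, done = (batch[k] for k in ("s", "a", "r", "s2", "done"))
        q_pi = critic(s, actor(s))        # differentiable in actor params
        if use_q_phase:
            return -q_pi.mean()           # DDPG-style phase: ascend Q
        with torch.no_grad():             # TD error of the stored action
            td = r + gamma * (1 - done) * target_critic(s2, actor(s2)) \
                 - critic(s, a)
        return -(td * q_pi).mean()        # TD-weighted phase (illustrative)

In the Q-value phase the actor simply ascends the critic's estimate, as in DDPG; the TD-weighted phase is one plausible way to let the Bellman residual modulate the update, which is where a variance-reduction effect in the gradient estimate would be sought.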