DNA: Proximal Policy Optimization with a Dual Network Architecture
Neural Information Processing Systems
This paper explores the problem of simultaneously learning a value function and a policy in deep actor-critic reinforcement learning models. We find that the common practice of learning these functions jointly is suboptimal due to an order-of-magnitude difference in noise levels between the two tasks. Instead, we show that learning these tasks independently, but with a constrained distillation phase, significantly improves performance. Furthermore, we find that policy gradient noise levels decrease when using a lower-variance return estimate.
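The constrained distillation phase described above can be sketched as a loss that fits the value target while a KL penalty keeps the policy close to its pre-distillation distribution. This is a minimal illustrative sketch, not the paper's implementation: the function name `distillation_loss`, the mean-squared value error, and the penalty weight `beta` are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q, eps=1e-8):
    """KL divergence KL(p || q) over the last axis."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def distillation_loss(policy_logits, old_policy_probs,
                      value_pred, value_target, beta=1.0):
    # Hypothetical sketch of a constrained distillation objective:
    # fit the value target, while a KL term penalizes drift of the
    # policy from its pre-distillation action distribution.
    # `beta` is an assumed penalty weight, not the paper's setting.
    value_err = np.mean((value_pred - value_target) ** 2)
    constraint = np.mean(kl(old_policy_probs, softmax(policy_logits)))
    return value_err + beta * constraint
```

If the policy has not moved from its pre-distillation distribution, the KL term is zero and the loss reduces to the value error alone; as the policy drifts, `beta` controls how strongly distillation is constrained.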