DNA: Proximal Policy Optimization with a Dual Network Architecture

Feb-6-2026, 08:00:04 GMT–Neural Information Processing Systems

This paper explores the problem of simultaneously learning a value function and policy in deep actor-critic reinforcement learning models. We find that the common practice of learning these functions jointly is sub-optimal due to an order-of-magnitude difference in noise levels between the two tasks. Instead, we show that learning these tasks independently, but with a constrained distillation phase, significantly improves performance. Furthermore, we find that policy gradient noise levels decrease when using a lower-variance return estimate.
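The abstract does not spell out the distillation objective, so the following is only a minimal sketch of what a "constrained" distillation step could look like. It assumes the policy network carries an auxiliary value output that is regressed toward the separate value network's predictions, while a KL penalty keeps the policy close to its pre-distillation action distribution. The names `distillation_loss`, `aux_values`, `target_values`, and `beta` are hypothetical and chosen for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of action logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(old_logits, new_logits):
    """KL(old || new) between two categorical policies given as logits."""
    lp_old = log_softmax(old_logits)
    lp_new = log_softmax(new_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp_old, lp_new))


def distillation_loss(aux_values, target_values, old_logits, new_logits, beta=1.0):
    """Assumed objective: value regression plus a policy-drift penalty.

    aux_values    -- value estimates from the policy network's auxiliary output
    target_values -- estimates from the independently trained value network
    old_logits    -- per-state policy logits recorded before distillation
    new_logits    -- per-state policy logits after the current update
    beta          -- hypothetical weight on the KL constraint
    """
    n = len(target_values)
    mse = sum((v - t) ** 2 for v, t in zip(aux_values, target_values)) / n
    kl = sum(kl_divergence(o, c) for o, c in zip(old_logits, new_logits)) / n
    return mse + beta * kl


targets = [1.0, 0.5]
old = [[0.0, 1.0], [2.0, 0.0]]
unchanged = distillation_loss(targets, targets, old, old)
drifted = distillation_loss(targets, targets, old, [[1.0, 0.0], [2.0, 0.0]])
```

In this sketch the loss is zero when the auxiliary values already match the targets and the policy has not moved, and it grows as soon as the policy's action distribution drifts, which is the sense in which the distillation is constrained.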

artificial intelligence, machine learning, proceedings

