Mitigating Suboptimality of Deterministic Policy Gradients in Complex Q-functions
Ayush Jain, Norio Kosaka, Xinhu Li, Kyung-Min Kim, Erdem Bıyık, Joseph J. Lim
In reinforcement learning, off-policy actor-critic approaches like DDPG and TD3 are based on the deterministic policy gradient. In this framework, the Q-function is trained from off-policy environment data and the actor (policy) is trained to maximize the Q-function via gradient ascent. We observe that in complex tasks like dexterous manipulation and restricted locomotion, the Q-value is a complex function of action with several local optima or discontinuities. This landscape is difficult for gradient ascent to traverse and leaves the actor prone to getting stuck in local optima. To address this, we introduce a new actor architecture that combines two simple insights: (i) use multiple actors and select the Q-value-maximizing action among them, and (ii) learn surrogates to the Q-function that are simpler to optimize with gradient-based methods. We evaluate on tasks including restricted locomotion, dexterous manipulation, and recommender systems with large discrete action spaces, and show that our actor finds optimal actions more frequently and outperforms alternative actor architectures.
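To make insight (i) concrete, below is a minimal PyTorch sketch, not the authors' implementation: the module names (`MultiActor`, `Critic`) and hyperparameters are hypothetical. It maintains several deterministic actors that each receive the standard DPG-style update, and at action-selection time keeps whichever proposal the critic scores highest. The learned surrogate Q-functions of insight (ii) are omitted here.

```python
import torch
import torch.nn as nn


def mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )


class MultiActor(nn.Module):
    """K independent deterministic actors; action selection is argmax over Q."""

    def __init__(self, obs_dim, act_dim, num_actors=4):
        super().__init__()
        self.actors = nn.ModuleList(mlp(obs_dim, act_dim) for _ in range(num_actors))

    def forward(self, obs):
        # Each actor proposes a squashed action: shape (num_actors, batch, act_dim).
        return torch.stack([torch.tanh(a(obs)) for a in self.actors])

    def act(self, obs, critic):
        # Evaluate every actor's proposal with the critic and keep the best one.
        candidates = self.forward(obs)                                 # (K, B, A)
        q_values = torch.stack([critic(obs, a) for a in candidates])   # (K, B, 1)
        best = q_values.squeeze(-1).argmax(dim=0)                      # (B,)
        return candidates[best, torch.arange(obs.shape[0])]            # (B, A)


class Critic(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.q = mlp(obs_dim + act_dim, 1)

    def forward(self, obs, act):
        return self.q(torch.cat([obs, act], dim=-1))


# Toy usage with random data (dimensions are illustrative).
obs_dim, act_dim, batch = 17, 6, 32
actor, critic = MultiActor(obs_dim, act_dim), Critic(obs_dim, act_dim)
opt = torch.optim.Adam(actor.parameters(), lr=3e-4)

obs = torch.randn(batch, obs_dim)
# DPG-style update: each actor ascends the critic's Q-value at its own action.
actor_loss = -torch.stack([critic(obs, a) for a in actor(obs)]).mean()
opt.zero_grad()
actor_loss.backward()
opt.step()

with torch.no_grad():
    print(actor.act(obs, critic).shape)  # torch.Size([32, 6])
```

The argmax over actor proposals is used only for action selection, so it needs no gradient; each actor is still trained purely by gradient ascent on the critic, which is what allows the ensemble to cover different local optima of a complex Q-landscape.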
Oct-15-2024
- Country:
- North America > United States > California (0.28)
- Genre:
- Research Report (0.82)
- Industry:
- Government (0.93)
- Leisure & Entertainment (0.67)