MADE: Exploration via Maximizing Deviation from Explored Regions

May-26-2025, 19:33:49 GMT–Neural Information Processing Systems

In online reinforcement learning (RL), efficient exploration remains particularly challenging in high-dimensional environments with sparse rewards. In low-dimensional environments, where tabular parameterization is possible, count-based upper confidence bound (UCB) exploration methods achieve minimax near-optimal rates. However, it remains unclear how to implement UCB efficiently in realistic RL tasks that involve non-linear function approximation. To address this, we propose a new exploration approach that maximizes the deviation of the next policy's occupancy from the explored regions. We add this deviation term as an adaptive regularizer to the standard RL objective to balance exploration and exploitation.
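
As a rough illustration of the idea in the abstract, the following tabular sketch scores a candidate policy's occupancy by its expected reward plus a weighted deviation from the explored regions. The abstract does not give the form of the deviation measure, so the square-root ratio used here, the count smoothing, the coefficient `tau`, and all function names are assumptions made for illustration, not the paper's definitions.

```python
import math

# Hypothetical helpers: names and formulas are illustrative assumptions,
# not taken from the paper.

def coverage_from_counts(counts, smoothing=1.0):
    """Turn visitation counts into a smoothed coverage distribution."""
    smoothed = [c + smoothing for c in counts]
    total = sum(smoothed)
    return [c / total for c in smoothed]

def deviation(occupancy, coverage):
    """Assumed deviation measure: sum of sqrt(occupancy / coverage)."""
    return sum(math.sqrt(d / rho) for d, rho in zip(occupancy, coverage))

def regularized_objective(occupancy, rewards, coverage, tau):
    """Expected reward plus tau times the deviation from explored regions."""
    expected_reward = sum(d * r for d, r in zip(occupancy, rewards))
    return expected_reward + tau * deviation(occupancy, coverage)

# Six state-action pairs; the last two have never been visited.
counts = [50, 30, 15, 5, 0, 0]
rewards = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
coverage = coverage_from_counts(counts)

# Two candidate occupancies: one stays in visited regions, one spreads out.
stay = [0.6, 0.4, 0.0, 0.0, 0.0, 0.0]
explore = [0.2, 0.1, 0.2, 0.1, 0.2, 0.2]

for tau in (0.0, 0.1):
    best = max((stay, explore),
               key=lambda d: regularized_objective(d, rewards, coverage, tau))
    print(tau, "stay" if best is stay else "explore")
```

With `tau = 0` the objective reduces to expected reward and prefers the occupancy that stays in visited regions; with a positive `tau` the occupancy that places mass on rarely visited pairs scores higher. An adaptive scheme would vary `tau` during training, which this sketch leaves out.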

artificial intelligence, machine learning, reinforcement learning

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning (0.62)
  - Machine Learning > Reinforcement Learning (0.42)