
 exploration bonus


analysis of Algorithm

Neural Information Processing Systems

In this section, we provide a convergence rate analysis for Algorithm 1. Similar to Hazan et al. [36], Algorithm 1 has access to an approximate density oracle and an approximate planner, defined below.

Visitation density oracle: We assume access to an approximate density estimator that takes in a policy $\pi$ and a density approximation error $\epsilon_d \ge 0$ as inputs and returns $\hat{d}_\pi$ such that $\|d_\pi - \hat{d}_\pi\|_\infty \le \epsilon_d$.

Approximate planning oracle: We assume access to an approximate planner that, given any MDP $M$ and error tolerance $\epsilon_p \ge 0$, returns a policy $\pi$ such that $J_M(\pi) \ge \max_{\pi'} J_M(\pi') - \epsilon_p$.

A.1 Proof of Theorem 1

We first give the following proposition that captures certain properties of the proposed objective. The proof is postponed to the end of this section. [...]

Taking the above proposition as given for the moment, we prove Theorem 1 following steps similar to those of Hazan et al. [36, Theorem 4.1]. Since $\pi_k$ returned by the approximate planning oracle is an $\epsilon_p$-optimal policy in $M_k$, we have $(1-\gamma)^{-1}\langle d_{\pi_k}, r_k\rangle \ge (1-\gamma)^{-1}\langle d_\pi, r_k\rangle - \epsilon_p$ for any policy $\pi$, including $\pi^\star$. Therefore, [...]

It is straightforward to check that appropriate choices of the tolerances $\epsilon_p$ and $\epsilon_d$, the remaining algorithm parameters, and the number of iterations $K$ yield the claim of Theorem 1.

Remark 2. Since the temperature parameter in Proposition 1 goes to zero as $k$ increases, one can show that the expected value of the policy returned by Algorithm 1 converges to the maximum performance $J(\pi^\star)$.
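To make the two oracles concrete, here is a minimal sketch for a small tabular MDP. It is an illustration under stated assumptions, not the construction used in the paper: the function names, the state-only reward vector `r`, the initial distribution `mu`, and the discount `gamma` are all hypothetical. The density oracle truncates the discounted visitation series, and the planner is plain value iteration with a standard stopping rule.

```python
def policy_transition(P, pi):
    """State-to-state kernel under pi: P_pi[s][t] = sum_a pi[s][a] * P[s][a][t]."""
    n, m = len(P), len(P[0])
    return [[sum(pi[s][a] * P[s][a][t] for a in range(m)) for t in range(n)]
            for s in range(n)]


def visitation_density(P, pi, mu, gamma, eps_d=1e-10):
    """Approximate d_pi(s) = (1 - gamma) * sum_t gamma^t * Pr(s_t = s).

    The series is truncated once the unaccumulated weight gamma^t drops
    to eps_d or below, so every entry is within eps_d of the exact density.
    """
    P_pi = policy_transition(P, pi)
    n = len(mu)
    state_dist = list(mu)      # distribution of s_t, starting at t = 0
    d = [0.0] * n
    weight = 1.0 - gamma       # (1 - gamma) * gamma^t
    remaining = 1.0            # gamma^t: total weight not yet accumulated
    while remaining > eps_d:
        for s in range(n):
            d[s] += weight * state_dist[s]
        state_dist = [sum(state_dist[s] * P_pi[s][t] for s in range(n))
                      for t in range(n)]
        weight *= gamma
        remaining *= gamma
    return d


def plan(P, r, gamma, eps_p=1e-8):
    """Value iteration for 0 < gamma < 1; returns a greedy policy as one-hot rows.

    Stopping when successive iterates differ by at most
    eps_p * (1 - gamma) / (2 * gamma) is the textbook criterion under which
    the greedy policy is eps_p-optimal.
    """
    n, m = len(P), len(P[0])

    def backup(V):
        return [[r[s] + gamma * sum(P[s][a][t] * V[t] for t in range(n))
                 for a in range(m)] for s in range(n)]

    V = [0.0] * n
    threshold = eps_p * (1.0 - gamma) / (2.0 * gamma)
    while True:
        V_new = [max(q) for q in backup(V)]
        delta = max(abs(x - y) for x, y in zip(V_new, V))
        V = V_new
        if delta <= threshold:
            break
    Q = backup(V)
    policy = []
    for s in range(n):
        best = max(range(m), key=lambda a: Q[s][a])
        policy.append([1.0 if a == best else 0.0 for a in range(m)])
    return policy


def expected_return(d, r, gamma):
    """J = <d, r> / (1 - gamma) for a state-only reward vector r."""
    return sum(ds * rs for ds, rs in zip(d, r)) / (1.0 - gamma)
```

With these pieces, the return of the planned policy can be compared against any other policy through the inner product of its visitation density with the reward, which is the form of the comparison the proof relies on.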



Exploration by Learning Diverse Skills through Successor State Representations

Neural Information Processing Systems

The ability to perform different skills can encourage agents to explore. In this work, we aim to construct a set of diverse skills that uniformly cover the state space. We propose a formalization of this search for diverse skills, building on a previous definition based on the mutual information between states and skills. We consider the distribution of states reached by a policy conditioned on each skill and leverage the successor state representation to maximize the difference between these skill distributions. We call this approach LEADS: Learning Diverse Skills through Successor State Representations. We demonstrate our approach on a set of maze navigation and robotic control tasks which show that our method is capable of constructing a diverse set of skills which exhaustively cover the state space without relying on reward or exploration bonuses. Our findings demonstrate that this new formalization promotes more robust and efficient exploration by combining mutual information maximization and exploration bonuses.
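As a rough illustration of the quantity this line of work builds on, the mutual information between states and skills can be computed directly when the per-skill state distributions are known and the state space is small. The sketch below assumes exactly that tabular setting; it does not use a successor state representation and is not the LEADS estimator. The function name and the uniform default over skills are assumptions made for the example.

```python
import math


def mutual_information(state_dists, skill_probs=None):
    """I(S; Z) = sum_z p(z) * sum_s p(s|z) * log(p(s|z) / p(s)), in nats.

    state_dists[z][s] is the probability of reaching state s under skill z.
    skill_probs defaults to a uniform distribution over skills.
    """
    k = len(state_dists)
    n = len(state_dists[0])
    if skill_probs is None:
        skill_probs = [1.0 / k] * k
    # Marginal state distribution p(s) obtained by mixing over skills.
    marginal = [sum(skill_probs[z] * state_dists[z][s] for z in range(k))
                for s in range(n)]
    mi = 0.0
    for z in range(k):
        for s in range(n):
            p = state_dists[z][s]
            if p > 0.0:  # terms with p(s|z) = 0 contribute nothing
                mi += skill_probs[z] * p * math.log(p / marginal[s])
    return mi
```

Skills whose state distributions do not overlap attain the maximum value, the log of the number of skills under a uniform skill prior, while skills that all visit the same states give zero.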


Rethinking Exploration in Reinforcement Learning with Effective Metric-Based Exploration Bonus

Neural Information Processing Systems

Enhancing exploration in reinforcement learning (RL) through the incorporation of intrinsic rewards, specifically by leveraging *state discrepancy* measures within various metric spaces as exploration bonuses, has emerged as a prevalent strategy to encourage agents to visit novel states. The critical factor lies in how to quantify the difference between adjacent states as *novelty* for promoting effective exploration. Nonetheless, existing methods that evaluate state discrepancy in the latent space under the $L_1$ or $L_2$ norm often depend on count-based episodic terms as scaling factors for exploration bonuses, significantly limiting their scalability. Additionally, methods that utilize the bisimulation metric for evaluating state discrepancies face a theory-practice gap due to improper approximations in metric learning, and struggle in particular with *hard exploration* tasks. To overcome these challenges, we introduce the **E**ffective **M**etric-based **E**xploration-bonus (EME). EME critically examines and addresses the inherent limitations and approximation inaccuracies of current metric-based state discrepancy methods for exploration, proposing a robust metric for state discrepancy evaluation backed by comprehensive theoretical analysis. Furthermore, we propose a diversity-enhanced scaling factor, integrated into the exploration bonus and dynamically adjusted by the variance of predictions from an ensemble of reward models, thereby enhancing exploration effectiveness in particularly challenging scenarios. Extensive experiments are conducted on hard exploration tasks within Atari games, Minigrid, Robosuite, and Habitat, which illustrate our method's scalability to various scenarios.
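The idea of scaling a discrepancy-based bonus by ensemble disagreement can be sketched in a few lines. This is only an illustration under loud assumptions: the Euclidean distance between latent vectors stands in for the metric the paper proposes, and the multiplicative form `1 + variance` is a hypothetical choice, not the scaling factor defined by EME. All function names are made up for the example.

```python
import math


def ensemble_variance(predictions):
    """Population variance of scalar predictions from an ensemble of models."""
    mean = sum(predictions) / len(predictions)
    return sum((p - mean) ** 2 for p in predictions) / len(predictions)


def scaled_bonus(z, z_next, reward_predictions):
    """Bonus = (1 + ensemble variance) * distance between adjacent latent states.

    z, z_next: latent vectors of two consecutive states (hypothetical encoder output).
    reward_predictions: one predicted reward per ensemble member for this transition.
    """
    # Stand-in discrepancy: Euclidean distance, NOT the metric proposed in the paper.
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(z, z_next)))
    # Assumed scaling form: disagreement among ensemble members enlarges the bonus.
    scale = 1.0 + ensemble_variance(reward_predictions)
    return scale * distance
```

When the ensemble members agree, the bonus reduces to the raw discrepancy; where they disagree, the same discrepancy is weighted more heavily, which is the qualitative behavior the abstract describes.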