Meta-Learning to Explore via Memory Density Feedback
–arXiv.org Artificial Intelligence
Exploration algorithms for reinforcement learning typically replace or augment the reward function with an additional "intrinsic" reward that trains the agent to seek previously unseen states of the environment. Here, we consider an exploration algorithm that exploits meta-learning, or learning to learn, so that the agent learns to maximize its exploration progress within a single episode, even between epochs of training. The agent learns a policy that aims to minimize the probability density of new observations with respect to all of its memories. In addition, it receives evaluations of the current observation's density as feedback and retains that feedback in a recurrent network. By remembering trajectories of density, the agent learns to navigate a complex and growing landscape of familiarity in real time, allowing it to maximize its exploration progress even in completely novel states of the environment for which its policy has not been trained.

Introduction

In reinforcement learning (RL), exploration refers to algorithms that induce an agent to observe as much of a given task as possible. Most RL algorithms include some form of random exploration, such as an epsilon-greedy policy or an auxiliary objective that maximizes the policy's entropy. These mechanisms are necessary for the agent to find rewarding states and expand its policy, but they often fall short when rewards are sparsely distributed, that is, when reaching them requires non-obvious and improbable sequences of actions.
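The core mechanism described above, treating low probability density under the agent's memory set as an intrinsic reward signal, can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's implementation: it uses a simple Gaussian kernel density estimate over a flat memory buffer (the `DensityExplorer` class, its `bandwidth` parameter, and the log-density reward form are all illustrative choices, and the recurrent feedback network is omitted).

```python
import numpy as np

class DensityExplorer:
    """Toy density-based intrinsic reward (illustrative sketch only)."""

    def __init__(self, bandwidth=0.5):
        self.memories = []          # stored past observations
        self.bandwidth = bandwidth  # Gaussian kernel width (assumed hyperparameter)

    def density(self, obs):
        # Kernel density estimate of obs with respect to all memories.
        if not self.memories:
            return 0.0
        mem = np.stack(self.memories)
        sq_dists = np.sum((mem - obs) ** 2, axis=1)
        kernels = np.exp(-sq_dists / (2 * self.bandwidth ** 2))
        return float(kernels.mean())

    def intrinsic_reward(self, obs):
        # Low density under the memory set -> high novelty reward.
        obs = np.asarray(obs, dtype=float)
        d = self.density(obs)
        self.memories.append(obs)
        return -np.log(d + 1e-8)
```

Revisiting a remembered observation yields a lower intrinsic reward than visiting a distant, novel one, which is the gradient the exploration policy would follow.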
Mar-4-2025