Exploiting the Replay Memory Before Exploring the Environment: Enhancing Reinforcement Learning Through Empirical MDP Iteration

May-31-2025, 15:37:12 GMT–Neural Information Processing Systems

Reinforcement learning (RL) algorithms are typically based on optimizing a Markov Decision Process (MDP) using the optimal Bellman equation. Recent studies have revealed that focusing the optimization of Bellman equations solely on in-sample actions tends to result in more stable optimization, especially in the presence of function approximation. Upon on these findings, in this paper, we propose an Empirical MDP Iteration (EMIT) framework. For each of these empirical MDPs, it learns an estimated Q-function denoted as \widehat{Q} . The key strength is that by restricting the Bellman update to in-sample bootstrapping, each empirical MDP converges to a unique optimal \widehat{Q} function.

algorithm, empirical mdp iteration, enhancing reinforcement learning, (8 more...)

Neural Information Processing Systems

May-31-2025, 15:37:12 GMT

Conferences Web Page

Add feedback

Technology:
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)