Switching the Loss Reduces the Cost in Batch Reinforcement Learning

Alex Ayoub, Kaiwen Wang, Vincent Liu, Samuel Robertson, James McInerney, Dawen Liang, Nathan Kallus, Csaba Szepesvári

arXiv.org Artificial Intelligence 

In offline reinforcement learning (RL), also known as batch RL, we often want agents that learn how to achieve a goal from a fixed dataset using as few samples as possible. A standard approach in this setting is fitted Q-iteration (FQI) [Ernst et al., 2005], which iteratively minimizes the regression error on the batch dataset. In this work we propose a simple and principled improvement to FQI, using log-loss (FQI-log), and prove that it can achieve a much faster convergence rate. In particular, the number of samples it requires to learn a near-optimal policy scales with the cost of the optimal policy, leading to a so-called small-cost bound, the RL analogue of a small-loss bound in supervised learning. We highlight that FQI-log is the first computationally efficient batch RL algorithm to achieve a small-cost bound.
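
The abstract does not spell out the update rule, so the following is a minimal sketch of how swapping the squared loss for the log-loss in FQI might look, assuming normalized costs in [0, 1], a discounted cost-minimization objective, and a sigmoid-parameterized linear Q-function. The function name `fqi_log`, the `featurize` argument, and the inner gradient-descent loop are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def fqi_log(dataset, n_actions, n_features, featurize, gamma=0.99,
            n_iters=50, lr=0.5, gd_steps=200):
    """Sketch of fitted Q-iteration with log-loss (FQI-log).

    `dataset` is a list of transitions (s, a, c, s_next) with costs c in [0, 1].
    Q-values are parameterized as sigmoid(W[a] @ featurize(s)) so they stay in
    (0, 1), matching the normalized-cost setting assumed here.
    """
    W = np.zeros((n_actions, n_features))

    phi = np.array([featurize(s) for (s, a, c, s_next) in dataset])
    phi_next = np.array([featurize(s_next) for (s, a, c, s_next) in dataset])
    acts = np.array([a for (s, a, c, s_next) in dataset])
    costs = np.array([c for (s, a, c, s_next) in dataset])

    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    for _ in range(n_iters):
        # Bellman targets: immediate cost plus discounted cost-to-go of the
        # greedy (cost-minimizing) action, clipped into (0, 1) for the log-loss.
        q_next = sigmoid(phi_next @ W.T)            # (n_samples, n_actions)
        targets = np.clip(costs + gamma * q_next.min(axis=1), 1e-6, 1 - 1e-6)

        # Regress Q(s, a) onto the targets with binary cross-entropy (log-loss)
        # instead of the squared loss used by standard FQI.
        for _ in range(gd_steps):
            preds = sigmoid(np.einsum('ij,ij->i', phi, W[acts]))
            grad_logits = preds - targets           # d(log-loss)/d(logit)
            for a in range(n_actions):
                mask = acts == a
                if mask.any():
                    W[a] -= lr * (grad_logits[mask, None] * phi[mask]).mean(axis=0)

    return W  # act greedily: argmin_a sigmoid(W[a] @ featurize(s))
```

The only change relative to standard FQI is the loss used in the regression step; the target construction and the greedy policy extraction are unchanged, which is why the authors describe the method as a simple drop-in modification.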
