Switching the Loss Reduces the Cost in Batch Reinforcement Learning
Alex Ayoub, Kaiwen Wang, Vincent Liu, Samuel Robertson, James McInerney, Dawen Liang, Nathan Kallus, Csaba Szepesvári
arXiv.org Artificial Intelligence
In offline reinforcement learning (RL), also known as batch RL, we often want agents that learn how to achieve a goal from a fixed dataset using as few samples as possible. A standard approach in this setting is fitted Q-iteration (FQI) [Ernst et al., 2005], which iteratively minimizes the regression error on the batch dataset. In this work we propose a simple and principled improvement to FQI, using log-loss (FQI-log), and prove that it can achieve a much faster convergence rate. In particular, the number of samples it requires to learn a near-optimal policy scales with the cost of the optimal policy, leading to a so-called small-cost bound, the RL analogue of a small-loss bound in supervised learning. We highlight that FQI-log is the first computationally efficient batch RL algorithm to achieve a small-cost bound.
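The single change FQI-log makes to standard FQI is the regression loss used in each iteration: Bellman targets are fit with log-loss (binary cross-entropy) rather than squared error. The sketch below illustrates that loop under some simplifying assumptions of ours, not the paper's implementation: a finite action set, a feature map phi(s, a), costs normalized to [0, 1], and a sigmoid-parameterized Q-function so targets stay in [0, 1]; all names and hyperparameters are illustrative.

```python
# Minimal sketch of fitted Q-iteration with log-loss (FQI-log).
# Assumptions (ours, not the paper's): costs in [0, 1], finite actions,
# Q_w(s, a) = sigmoid(w . phi(s, a)) so values and targets stay in [0, 1].
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fqi_log(dataset, phi, n_actions, gamma=0.99, n_iters=50, lr=0.5, epochs=200):
    """dataset: list of (s, a, c, s_next) transitions with cost c in [0, 1]."""
    d = phi(dataset[0][0], dataset[0][1]).shape[0]
    w = np.zeros(d)  # parameters of Q_w(s, a) = sigmoid(w . phi(s, a))

    for _ in range(n_iters):
        # Bellman targets, scaled by (1 - gamma) to keep them in [0, 1]
        # (a normalization choice of this sketch).
        X, y = [], []
        for s, a, c, s_next in dataset:
            q_next = min(sigmoid(w @ phi(s_next, b)) for b in range(n_actions))
            X.append(phi(s, a))
            y.append((1 - gamma) * c + gamma * q_next)
        X, y = np.array(X), np.array(y)

        # Regression step with log-loss (binary cross-entropy) instead of
        # squared error -- the one change that distinguishes FQI-log from FQI.
        for _ in range(epochs):
            p = sigmoid(X @ w)
            grad = X.T @ (p - y) / len(y)  # gradient of the mean log-loss
            w -= lr * grad

    # Greedy (cost-minimizing) policy with respect to the final Q estimate.
    def policy(s):
        return int(np.argmin([sigmoid(w @ phi(s, b)) for b in range(n_actions)]))
    return policy
```

Everything else (the batch dataset, the iterative target construction, the greedy policy extraction) is unchanged from standard FQI, which is what makes the switch computationally cheap.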
Mar-12-2024