Kernelized Reinforcement Learning with Order Optimal Regret Bounds
Vakili, Sattar; Olkhovskaya, Julia
Reinforcement learning (RL) has shown empirical success in various real-world settings with complex models and large state-action spaces. The existing analytical results, however, typically focus on settings with a small number of state-action pairs or on simple models such as linearly modeled state-action value functions. To derive RL policies that efficiently handle large state-action spaces with more general value functions, some recent works have considered nonlinear function approximation using kernel ridge regression. We propose $\pi$-KRVI, an optimistic modification of least-squares value iteration for the setting where the state-action value function lies in a reproducing kernel Hilbert space (RKHS). We prove the first order-optimal regret guarantees under a general setting. Our results show an improvement over the state of the art that is polynomial in the number of episodes. In particular, with highly non-smooth kernels (such as the Neural Tangent kernel or some Matérn kernels) the existing results lead to trivial (superlinear in the number of episodes) regret bounds. We show a sublinear regret bound that is order-optimal in the case of Matérn kernels, where a lower bound on regret is known.
Nov-28-2023
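
As context for the abstract above, here is a minimal sketch of the kernel ridge regression machinery that optimistic least-squares value iteration methods build on: fit the state-action value function on observed data, then add an exploration bonus proportional to the predictive standard deviation. This is an illustration only, not the paper's $\pi$-KRVI algorithm; the Matérn-5/2 kernel choice, the `KernelRidgeUCB` name, and the `lam`/`beta`/`lengthscale` parameters are all assumptions made for the sketch.

```python
# Illustrative sketch (not the paper's pi-KRVI): kernel ridge regression
# over state-action features with a UCB-style optimistic bonus, the
# standard construction underlying optimistic kernelized value iteration.
import numpy as np
from scipy.spatial.distance import cdist


def matern_52(X, Y, lengthscale=1.0):
    """Matérn-5/2 kernel between rows of X and rows of Y."""
    d = cdist(X, Y) / lengthscale
    return (1.0 + np.sqrt(5) * d + 5.0 * d**2 / 3.0) * np.exp(-np.sqrt(5) * d)


class KernelRidgeUCB:
    """Kernel ridge regression with an optimistic (UCB) value estimate.

    lam:  ridge regularizer; beta: confidence-width multiplier.
    Both are illustrative hyperparameters, not values from the paper.
    """

    def __init__(self, lam=1.0, beta=2.0, lengthscale=1.0):
        self.lam, self.beta, self.lengthscale = lam, beta, lengthscale

    def fit(self, Z, y):
        """Z: n x d array of state-action features; y: n regression targets."""
        self.Z = Z
        A = matern_52(Z, Z, self.lengthscale) + self.lam * np.eye(len(Z))
        self.alpha = np.linalg.solve(A, y)  # kernel ridge weights
        self.A_inv = np.linalg.inv(A)
        return self

    def optimistic_q(self, z):
        """Upper confidence bound on Q at query points z (m x d)."""
        k = matern_52(z, self.Z, self.lengthscale)  # m x n cross-kernel
        mean = k @ self.alpha
        # Predictive variance: k(z, z) - k^T (K + lam I)^{-1} k, per query point.
        var = matern_52(z, z, self.lengthscale).diagonal() - np.einsum(
            "ij,jk,ik->i", k, self.A_inv, k
        )
        return mean + self.beta * np.sqrt(np.maximum(var, 0.0))


# Toy usage with random features and noisy targets.
rng = np.random.default_rng(0)
Z = rng.standard_normal((50, 3))
y = np.sin(Z).sum(axis=1) + 0.1 * rng.standard_normal(50)
ucb = KernelRidgeUCB().fit(Z, y).optimistic_q(rng.standard_normal((5, 3)))
```

In an episodic setting, one such regression would typically be fit per step of the horizon, with targets given by the observed reward plus the optimistic value of the next state.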