Tractable and Provably Efficient Distributional Reinforcement Learning with General Value Function Approximation
Cho, Taehyun, Han, Seungyub, Lee, Kyungjae, Ju, Seokhun, Kim, Dohyeong, Lee, Jungwoo
Distributional reinforcement learning improves performance by effectively capturing environmental stochasticity, but a comprehensive theoretical understanding of its effectiveness remains elusive. In this paper, we present a regret analysis for distributional reinforcement learning with general value function approximation in a finite episodic Markov decision process setting. We first introduce a key notion of Bellman unbiasedness for a tractable and exactly learnable update via statistical functional dynamic programming. Our theoretical results show that approximating the infinite-dimensional return distribution with a finite number of moment functionals is the only method to learn the statistical information unbiasedly, including nonlinear statistical functionals. Second, we propose a provably efficient algorithm, $\texttt{SF-LSVI}$, achieving a regret bound of $\tilde{O}(d_E H^{\frac{3}{2}}\sqrt{K})$ where $H$ is the horizon, $K$ is the number of episodes, and $d_E$ is the eluder dimension of a function class.
Jul-30-2024
- Country:
- Europe > United Kingdom
- England
- Greater London > London (0.04)
- Cambridgeshire > Cambridge (0.04)
- England
- Asia
- Middle East > Jordan (0.04)
- South Korea > Seoul
- Seoul (0.04)
- Europe > United Kingdom
- Genre:
- Research Report > New Finding (0.34)