The Benefits of Being Distributional: Small-Loss Bounds for Reinforcement Learning
Neural Information Processing Systems
As a warmup, we propose a distributional contextual bandit (DistCB) algorithm, which we show enjoys small-loss regret bounds and empirically outperforms the state of the art on three real-world tasks. For online RL, we propose a DistRL algorithm that constructs confidence sets using maximum likelihood estimation.
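The abstract mentions confidence sets built from maximum likelihood estimation. A common way to realize this (a minimal sketch, not the paper's actual construction) is to fit every model in a finite class and keep those whose total log-likelihood is within a slack `beta` of the maximizer; the names `mle_confidence_set` and the Bernoulli models below are illustrative assumptions.

```python
import math

def mle_confidence_set(models, data, beta):
    """Return the names of models whose total log-likelihood on `data`
    is within `beta` of the best model's log-likelihood.
    `models` maps a name to a per-sample log-likelihood function.
    (Illustrative helper, not from the paper.)"""
    # Total log-likelihood of each candidate model on the dataset.
    ll = {name: sum(f(x) for x in data) for name, f in models.items()}
    best = max(ll.values())
    # Keep every model that is statistically close to the MLE.
    return {name for name, v in ll.items() if v >= best - beta}

# Two hypothetical Bernoulli models over binary observations.
models = {
    "p=0.5": lambda x: math.log(0.5),
    "p=0.9": lambda x: math.log(0.9) if x == 1 else math.log(0.1),
}
data = [1, 1, 1, 1]  # four successes strongly favor p=0.9
```

With a tight slack (`beta=1.0`) only the MLE survives; a loose slack (`beta=5.0`) retains both models, which is the mechanism that lets an online algorithm act optimistically over all plausible models.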