How Does Variance Shape the Regret in Contextual Bandits?

Dec-26-2025, 16:30:59 GMT–Neural Information Processing Systems

We consider realizable contextual bandits with general function approximation, investigating how small reward variance can lead to better-than-minimax regret bounds. Unlike in minimax regret bounds, we show that the eluder dimension $d_{\text{elu}}$$-$a measure of the complexity of the function class$-$plays a crucial role in variance-dependent bounds. We consider two types of adversary: (1) Weak adversary: The adversary sets the reward variance before observing the learner's action. In this setting, we prove that a regret of $\Omega( \sqrt{ \min (A, d_{\text{elu}}) \Lambda } + d_{\text{elu}})$ is unavoidable when $d_{\text{elu}} \leq \sqrt{A T}$, where $A$ is the number of actions, $T$ is the total number of rounds, and $\Lambda$ is the total variance over $T$ rounds. For the $A\leq d_{\text{elu}}$ regime, we derive a nearly matching upper bound $\tilde{O}( \sqrt{ A\Lambda } + d_{\text{elu} })$ for the special case where the variance is revealed at the beginning of each round.

artificial intelligence, elu, machine learning, (11 more...)

Neural Information Processing Systems

Dec-26-2025, 16:30:59 GMT

Conferences Web Page

Add feedback

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning (0.62)
  - Representation & Reasoning (0.80)