AITopics | learning theory

We study contextual bandits in the stochastic i.i.d.\ setting, where a learner observes contexts drawn from an unknown distribution, selects actions from a finite set $A$, and aims to identify an approximately optimal policy from a given class based on bandit feedback. Motivated by bandit multiclass classification with zero-one rewards, we focus on the \emph{$s$-sparse} setting in which, for every context, the reward vector has $L_1$-norm at most $s \ll |A|$. Our main result is the design of algorithms that, with high probability, output an $ε$-optimal policy compared to policy class $Π$ using $\tilde{O} ((s/ε^2 + |A|/ε)\log |Π|/δ)$ samples. We extend this bound to general Natarajan classes and complement it with a matching lower bound (up to logarithmic factors), thereby closing a substantial gap left by prior work (Erez et al., 2024, 2025), which incurred an additional $Θ(|A|^9)$ dependence. We obtain these results via two complementary approaches. First, we analyze contextual bandits through the lens of contextual decision making with structured observations, designing an exploration-by-optimization algorithm whose sample complexity is governed by the \emph{decision-estimation coefficient} (DEC; Foster et al., 2021, 2022). We show that, with $s$-sparse rewards, the induced model class admits a sharp DEC bound that scales with $s$ and directly yields the optimal rate. Since this approach is largely information-theoretic and involves solving complex min-max optimization problems, we also develop a second, more specialized algorithmic method based on a low-variance exploration technique. This approach leads to concrete, tractable algorithms and naturally extends to contextual combinatorial semi-bandits, leading to improved sample complexity guarantees for bandit multiclass list classification.

artificial intelligence, data mining, machine learning, (16 more...)

arXiv.org Machine Learning

2605.29645

Country: Asia > Middle East > Israel (0.28)

Genre: Research Report (0.64)

Industry: Education > Educational Setting > Online (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Data Science > Data Mining (0.93)

Add feedback

Improved Model-based Reinforcement Learning with Smooth Kernels

Long, Kun, Li, Yuqiang, Wu, Xianyi

arXiv.org Machine LearningMay-11-2026

For continuous state-action space scenarios, classical reinforcement learning (RL) theory predominantly focuses on low-rank Markov decision processes (MDPs), which provide sample-efficient guarantees at the expense of restrictive structural assumptions. Kernel smoothing model-based approaches offer a promising alternative paradigm that instead leverages the smoothness of the MDP and employs non-parametric kernel smoothing estimates of transition dynamics. This paper proposes a new kernel-smoothing model-based approach for online reinforcement learning in finite-horizon settings under Lipschitz continuity assumptions on the MDP. By incorporating a Bernstein-style exploration bonus into the kernel smoothing framework, our method achieves a regret bound which improves upon the state-of-the-art regret bound in its dependence on the horizon. The theoretical advancement relies on a delicate analysis of the synergy between Bernstein-style bonuses and kernel smoothing, where a new tight Bernstein-type concentration inequality for martingales may be of independent interest.

artificial intelligence, machine learning, reinforcement learning, (13 more...)

arXiv.org Machine Learning

2605.07218

Country: Asia (0.28)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.48)

Add feedback

Self-Normalized Martingales and Uniform Regret Bounds for Linear Regression

Chen, Fan, Qian, Jian, Rakhlin, Alexander, Zhivotovskiy, Nikita

arXiv.org Machine LearningMay-5-2026

Self-normalized martingale inequalities lie at the heart of confidence ellipsoids for online least squares and, more broadly, many bandit and reinforcement-learning results. Yet existing vector and scalar results typically rely on bounded covariates and an explicit regularization matrix, producing bounds that are \emph{not scale-invariant}: although the self-normalized quantity is scale-invariant by definition, its standard upper bounds are not. We characterize when scale-invariant upper bounds on self-normalized martingales are possible. Without further assumptions, we prove that nontrivial scale-invariant bounds exist only in dimension $d=1$; moreover, in $d=1$ we obtain $O(\log T)$ scale-invariant self-normalized bounds without any assumptions on the covariates. In contrast, for $d>1$ we show that no nontrivial scale-invariant bound can hold in full generality. We then connect this dichotomy to \emph{doubly-uniform} regret in online linear regression (i.e., regret bounds that are simultaneously independent of the covariate scale and the comparator norm) and use it to resolve the open question of Gaillard, Gerchinovitz, Huard, and Stoltz, \emph{``Uniform regret bounds over $\mathbb{R}^d$ for the sequential linear regression problem with the square loss''} (ALT 2019): in $d=1$ we give an explicit algorithm with $O(\log T)$ doubly-uniform regret, whereas for $d>1$ sublinear doubly-uniform regret is impossible. Finally, under a natural \emph{smoothness} condition (bounded Radon--Nikodym derivatives of the conditional covariate laws with respect to a fixed base measure), we recover sublinear regret for $d>1$ without bounded covariates and derive a self-normalized concentration inequality free of the usual regularization penalties, yielding arguably a first natural scale-invariant bound for adaptive, non-i.i.d. vector martingales.

artificial intelligence, machine learning, martingale, (19 more...)

arXiv.org Machine Learning

2605.01628

Country: Europe (0.28)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (1.00)

Add feedback

d54e440c92affd396117e161bbab5e78-Paper-Conference.pdf

Neural Information Processing SystemsMay-1-2026, 04:47:33 GMT

artificial intelligence, data mining, machine learning, (20 more...)

Neural Information Processing Systems

Country: North America > United States > Maryland (0.14)

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)
Information Technology > Data Science > Data Mining > Big Data (0.49)

Add feedback

1fc6c343d8dbb4c369ab6e04225f5a65-Paper-Conference.pdf

Neural Information Processing SystemsMay-1-2026, 01:52:12 GMT

artificial intelligence, machine learning, proceedings, (14 more...)

Neural Information Processing Systems

Country: Asia > China (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

15a50c8ba6a0002a2fa7e5d8c0a40bd9-Supplemental.pdf

Neural Information Processing SystemsMay-1-2026, 01:50:19 GMT

artificial intelligence, data mining, machine learning, (15 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Data Science > Data Mining > Big Data (0.47)

Add feedback

An Exploration-by-Optimization Approach to Best of Both Worlds in Linear Bandits

Neural Information Processing SystemsApr-30-2026, 02:09:01 GMT

In this paper, we consider how to construct best-of-both-worlds linear bandit algorithms that achieve nearly optimal performance for both stochastic and adversarial environments. For this purpose, we show that a natural approach referred to as exploration by optimization [Lattimore and Szepesvári, 2020b] works well.

artificial intelligence, data mining, machine learning, (18 more...)

Neural Information Processing Systems

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Data Science > Data Mining > Big Data (0.69)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.64)

Add feedback

d9b564716709357b4bccec9fc9ad04d2-Paper-Conference.pdf

Neural Information Processing SystemsApr-29-2026, 23:33:16 GMT

artificial intelligence, data mining, machine learning, (19 more...)

Neural Information Processing Systems

Country: Europe (0.27)

Genre: Research Report (0.45)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Data Science > Data Mining (0.68)

Add feedback

On the Complexity of Adversarial Decision Making

Neural Information Processing SystemsApr-28-2026, 05:27:41 GMT

A central problem in online learning and decision making--from bandits to reinforcement learning--is to understand what modeling assumptions lead to sampleefficient learning guarantees. We consider a general adversarial decision making framework that encompasses (structured) bandit problems with adversarial rewards and reinforcement learning problems with adversarial dynamics. Our main result is to show--via new upper and lower bounds--that the Decision-Estimation Coefficient, a complexity measure introduced by Foster et al. [17] in the stochastic counterpart to our setting, is necessary and sufficient to obtain low regret for adversarial decision making. However, compared to the stochastic setting, one must apply the Decision-Estimation Coefficient to the convex hull of the class of models (or, hypotheses) under consideration. This establishes that the price of accommodating adversarial rewards or dynamics is governed by the behavior of the model class under convexification, and recovers a number of existing results--both positive and negative. En route to obtaining these guarantees, we provide new structural results that connect the Decision-Estimation Coefficient to variants of other well-known complexity measures, including the Information Ratio of Russo and Van Roy [47] and the Exploration-by-Optimization objective of Lattimore and György [32].

data mining, machine learning, reinforcement learning, (17 more...)

Neural Information Processing Systems

Country: North America > United States (0.68)

Genre: Research Report > New Finding (0.47)

Industry: Education (0.54)

Technology: