AITopics | bandit feedback

Collaborating Authors

bandit feedback

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Uncoupled and Convergent Learning in Monotone Games under Bandit Feedback

Neural Information Processing SystemsJun-22-2026, 23:22:11 GMT

We study the problem of no-regret learning algorithms for general monotone and smooth games and their last-iterate convergence properties. Specifically, we investigate the problem under bandit feedback and strongly uncoupled dynamics, which allows modular development of the multi-player system that applies to a wide range of real applications. We propose a mirror-descent-based algorithm, which converges in O(T 1/4)under Bregman divergence and is also no-regret, where the choice of Bregman divergence is determined by the convexity of the game. The result is achieved by the dedicated use of two regularizations and the analysis of the fixed point thereof. The convergence rate is further improved to O(T 1/2)in the case of strongly monotone games. Motivated by practical tasks where the game evolves over time, the algorithm is extended to time-varying monotone games. We provide the first non-asymptotic result in converging monotone games and give improved results for equilibrium tracking games.

artificial intelligence, machine learning, monotone game, (16 more...)

Neural Information Processing Systems

Country: Asia > China (0.28)

Genre: Research Report > Experimental Study (0.93)

Industry: Leisure & Entertainment > Games (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Non-Uniform Multiclass Learning with Bandit Feedback

Neural Information Processing SystemsJun-22-2026, 22:33:06 GMT

We study the problem of multiclass learning with bandit feedback in both the i.i.d.

artificial intelligence, bandit feedback, machine learning, (15 more...)

Neural Information Processing Systems

Genre:

Research Report > Experimental Study (1.00)
Overview (0.92)

Industry: Education > Educational Setting > Online (0.49)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Computational Learning Theory (0.69)

Add feedback

bd5c3c51db72a6614bb71ce5318a78d0-Paper-Conference.pdf

Neural Information Processing SystemsJun-22-2026, 12:06:56 GMT

We study online decision making problems under resource constraints, where both reward and cost functions are drawn from distributions that may change adversarially over time. We focus on two canonical settings: (i) online resource allocation where rewards and costs are observed before action selection, and (ii)online learning with resource constraints where they are observed after action selection, under full feedback or bandit feedback. It is well known that achieving sublinear regret in these settings is impossible when reward and cost distributions may change arbitrarily over time. To address this challenge, we analyze a framework in which the learner is guided by a spending plan--a sequence prescribing expected resource usage across rounds. We design general (primal-)dual methods that achieve sublinear regret with respect to baselines that follow the spending plan. Crucially, the performance of our algorithms improves when the spending plan ensures a well-balanced distribution of the budget across rounds. We additionally provide a robust variant of our methods to handle worst-case scenarios where the spending plan is highly imbalanced. To conclude, we study the regret of our algorithms when competing against benchmarks that deviate from the prescribed spending plan.

artificial intelligence, machine learning, regret minimizer, (19 more...)

Neural Information Processing Systems

Country: North America > United States (0.27)

Genre: Research Report > Experimental Study (1.00)

Industry:

Information Technology (0.67)
Education > Educational Setting > Online (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Constraint-Based Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Comparing Uniform Price and Discriminatory Multi-Unit Auctions through Regret Minimization

Neural Information Processing SystemsJun-18-2026, 12:57:52 GMT

Repeated multi-unit auctions, where a seller allocates multiple identical items over many rounds, are common mechanisms in electricity markets and treasury auctions. We compare the two predominant formats: uniform-price and discriminatory auctions, focusing on the perspective of a single bidder learning to bid against stochastic adversaries. We characterize the learning difficulty in each format, showing that the regret scales similarly for both auction formats under both fullinformation and bandit feedback, as Θ( T)and Θ(T2/3), respectively. However, analysis beyond worst-case regret reveals structural differences: uniform-price auctions may admit faster learning rates, with regret scaling as Θ( T)in settings where discriminatory auctions remain at Θ(T2/3). Finally, we provide a specific analysis for auctions in which the other participants are symmetric and have unitdemand, and show that in these instances, a similar regret rate separation appears.

artificial intelligence, auction, machine learning, (20 more...)

Neural Information Processing Systems

Country: Europe > United Kingdom > England (0.28)

Genre: Research Report > Experimental Study (1.00)

Industry:

Energy (0.54)
Banking & Finance (0.45)

Technology:

Information Technology > Game Theory (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Uniform Wrappers: Bridging Concave to Quadratizable Functions in Online Optimization

Neural Information Processing SystemsJun-17-2026, 10:22:11 GMT

This paper presents novel contributions to the field of online optimization, particularly focusing on the adaptation of algorithms from concave optimization to more challenging classes of functions. Key contributions include the introduction of uniform wrappers, a class of meta-algorithms that could be used for algorithmic conversions such as converting algorithms for convex optimization into those for quadratizable optimization. Moreover, we propose a guideline that, given a base algorithm Afor concave optimization and a uniform wrapper W, describes how to convert a proof of the regret bound of A in the concave setting into a proof of the regret bound of W(A)for quadratizable setting. Through this framework, the paper demonstrates improved regret guarantees for various classes of DR-submodular functions under zeroth-order feedback. Furthermore, the paper extends zeroth-order online algorithms to bandit feedback and offline counterparts, achieving notable improvements in regret/sample complexity compared to existing approaches.

algorithm, artificial intelligence, machine learning, (16 more...)

Neural Information Processing Systems

Country: North America (0.46)

Genre: Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.46)

Add feedback

Bandit and Delayed Feedback in Online Structured Prediction

Neural Information Processing SystemsJun-15-2026, 17:31:58 GMT

Online structured prediction is a task of sequentially predicting outputs with complex structures based on inputs and past observations, encompassing online classification. Recent studies showed that in the full-information setting, we can achieve finite bounds on the surrogate regret, i.e., the extra target loss relative to the best possible surrogate loss. In practice, however, full-information feedback is often unrealistic as it requires immediate access to the whole structure of complex outputs. Motivated by this, we propose algorithms that work with less demanding feedback, bandit and delayed feedback. For bandit feedback, by using a standard inverseweighted gradient estimator, we achieve a surrogate regret bound of O( KT) for the time horizon T and the size of the output set K. However, K can be extremely large when outputs are highly complex, resulting in an undesirable bound. To address this issue, we propose another algorithm that achieves a surrogate regret bound of O(T2/3), which is independent of K. This is achieved with a carefully designed pseudo-inverse matrix estimator. Furthermore, we numerically compare the performance of these algorithms, as well as existing ones. Regarding delayed feedback, we provide algorithms and regret analyses that cover various scenarios, including full-information and bandit feedback, as well as fixed and variable delays.

artificial intelligence, inductive learning, machine learning, (17 more...)

Neural Information Processing Systems

Country: Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Education > Educational Setting > Online (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Supervised Learning (0.85)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.71)

Add feedback

Adapting to Stochastic and Adversarial Losses in Episodic MDPs with Aggregate Bandit Feedback

Neural Information Processing SystemsJun-14-2026, 14:10:46 GMT

We study online learning in finite-horizon episodic Markov decision processes (MDPs) under the challenging aggregate bandit feedback model, where the learner observes only the cumulative loss incurred in each episode, rather than individual losses at each state-action pair. While prior work in this setting has focused exclusively on worst-case analysis, we initiate the study of best-of-both-worlds (BOBW) algorithms that achieve low regret in both stochastic and adversarial environments. We propose the first BOBW algorithms for episodic tabular MDPs with aggregate bandit feedback. In the case of known transitions, our algorithms achieve O(logT) regret in stochastic settings and O( T) regret in adversarial ones. Importantly, we also establish matching lower bounds, showing the optimality of our algorithms in this setting. We further extend our approach to unknowntransition settings by incorporating confidence-based techniques. Our results rely on a combination of FTRL over occupancy measures, self-bounding techniques, and new loss estimators inspired by recent advances in online shortest path problems. Along the way, we also provide the first individual-gap-dependent lower bounds and demonstrate near-optimal BOBW algorithms for shortest path problems with bandit feedback.

algorithm, artificial intelligence, machine learning, (17 more...)

Neural Information Processing Systems

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.65)

Industry: Education > Educational Setting > Online (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.71)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.45)

Add feedback

Non-Uniform Multiclass Learning with Bandit Feedback

Neural Information Processing SystemsJun-14-2026, 05:38:21 GMT

We study the problem of multiclass learning with bandit feedback in both the i.i.d.

adversarial online model, artificial intelligence, machine learning, (11 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.42)

Add feedback

Bandit and Delayed Feedback in Online Structured Prediction

Neural Information Processing SystemsJun-11-2026, 06:36:57 GMT

artificial intelligence, machine learning, proceedings, (6 more...)

Neural Information Processing Systems

Genre: Research Report (0.59)

Industry: Education > Educational Setting > Online (0.59)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Capacity-Constrained Online Convex Optimization with Delayed Feedback

Ryabchenko, Alexander, Attias, Idan, Roy, Daniel M.

arXiv.org Machine LearningJun-11-2026

Online learning with delayed feedback typically assumes that the learner can track all pending rounds until their feedback arrives. In practice, tracking resources are finite, and feedback from untracked rounds is permanently lost. In this paper, we study delayed online convex optimization (OCO) under a hard capacity constraint, where at most $C$ pending rounds can be tracked at any time. To model delay information, we introduce a semi-clairvoyant model that refines the clairvoyant assumption from prior work: rather than requiring delays to be known at prediction time, the learner observes delay expirations online, consistent with the classical unconstrained delayed setting. Our approach proceeds via a reduction to a novel ``delayed and weighted'' OCO problem, using a scheduler that randomizes tracking decisions and importance-weights the resulting observations. For this base problem, we propose and analyze Delayed-Weighted FTRL and its bandit analogue, establishing regret bounds that explicitly characterize the interaction between time-varying weights and delayed feedback. Combining these base learners with our schedulers yields the first regret guarantees for capacity-constrained OCO under convex and strongly convex losses, for both first-order and bandit feedback. For first-order feedback, capacity $C = Ω(\log T)$ suffices to recover standard delayed OCO rates up to logarithmic factors. For bandit feedback, the regret rates are modulated by powers of $(1 + σ_{\text{max}}/C)$, where $σ_{\text{max}}$ is the maximum number of pending observations at any time. This allows the regret bound to degrade gracefully when $C < σ_{\text{max}}$, while remaining sublinear.

artificial intelligence, convex optimization, machine learning, (14 more...)

arXiv.org Machine Learning

2606.11711

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback