Goto

Collaborating Authors

 bandit problem






Verification Based Solution for Structured MAB Problems

Neural Information Processing Systems

We consider the problem of finding the best arm in a stochastic Multi-armed Bandit (MAB) game and propose a general framework based on verification that applies to multiple well-motivated generalizations of the classic MAB problem. In these generalizations, additional structure is known in advance, causing the task of verifying the optimality of a candidate to be easier than discovering the best arm. Our results are focused on the scenario where the failure probability must be very low; we essentially show that in this high confidence regime, identifying the best arm is as easy as the task of verification. We demonstrate the effectiveness of our framework by applying it, and matching or improving the state-of-the art results in the problems of: Linear bandits, Dueling bandits with the Condorcet assumption, Copeland dueling bandits, Unimodal bandits and Graphical bandits.


Multi-Agent Learning with Heterogeneous Linear Contextual Bandits

Neural Information Processing Systems

As trained intelligent systems become increasingly pervasive, multi-agent learning has emerged as a popular framework for studying complex interactions between autonomous agents. Yet, a formal understanding of how and when learners in heterogeneous environments benefit from sharing their respective experiences is still in its infancy. In this paper, we seek answers to these questions in the context of linear contextual bandits. We present a novel distributed learning algorithm based on the upper confidence bound (UCB) algorithm, which we refer to as H-LINUCB, wherein agents cooperatively minimize the group regret under the coordination of a central server. In the setting where the level of heterogeneity or dissimilarity across the environments is known to the agents, we show that H-LINUCB is provably optimal in regimes where the tasks are highly similar or highly dissimilar.



Context-lumpable stochastic bandits

Neural Information Processing Systems

We consider a contextual bandit problem with S contexts and K actions. In each round t = 1,2,... the learner observes a random context and chooses an action based on its past experience. The learner then observes a random reward whose mean is a function of the context and the action for the round. Under the assumption that the contexts can be lumped into r min{S,K}groups such that the mean reward for the various actions is the same for any two contexts that are in the same group, we give an algorithm that outputs an ε-optimal policy after using at most eO(r(S+K)/ε2) samples with high probability and provide a matching Ω(r(S + K)/ε2) lower bound. In the regret minimization setting, we give an algorithm whose cumulative regret up to time T is bounded by eO( p r3(S+K)T). To the best of our knowledge, we are the first to show the near-optimal sample complexity in the PAC setting and eO( p poly(r)(S+K)T)minimax regret in the online setting for this problem. We also show our algorithms can be applied to more general low-rank bandits and get improved regret bounds in some scenarios.


An Exploration-by-Optimization Approach to Best of Both Worlds in Linear Bandits

Neural Information Processing Systems

In this paper, we consider how to construct best-of-both-worlds linear bandit algorithms that achieve nearly optimal performance for both stochastic and adversarial environments. For this purpose, we show that a natural approach referred to as exploration by optimization [Lattimore and Szepesvári, 2020b] works well.


On the Complexity of Adversarial Decision Making

Neural Information Processing Systems

A central problem in online learning and decision making--from bandits to reinforcement learning--is to understand what modeling assumptions lead to sampleefficient learning guarantees. We consider a general adversarial decision making framework that encompasses (structured) bandit problems with adversarial rewards and reinforcement learning problems with adversarial dynamics. Our main result is to show--via new upper and lower bounds--that the Decision-Estimation Coefficient, a complexity measure introduced by Foster et al. [17] in the stochastic counterpart to our setting, is necessary and sufficient to obtain low regret for adversarial decision making. However, compared to the stochastic setting, one must apply the Decision-Estimation Coefficient to the convex hull of the class of models (or, hypotheses) under consideration. This establishes that the price of accommodating adversarial rewards or dynamics is governed by the behavior of the model class under convexification, and recovers a number of existing results--both positive and negative. En route to obtaining these guarantees, we provide new structural results that connect the Decision-Estimation Coefficient to variants of other well-known complexity measures, including the Information Ratio of Russo and Van Roy [47] and the Exploration-by-Optimization objective of Lattimore and György [32].