Big Data
Improved Bayes Regret Bounds for Multi-Task Hierarchical Bayesian Bandit Algorithms
Hierarchical Bayesian bandit refers to the multi-task bandit problem in which the bandit tasks are assumed to be drawn from the same distribution. In this work, we provide improved Bayes regret bounds for hierarchical Bayesian bandit algorithms in the multi-task linear bandit and semi-bandit settings. For the multi-task linear bandit, we first analyze the preexisting hierarchical Thompson sampling (HierTS) algorithm and improve its gap-independent Bayes regret bound from O(m√(n log n log(mn))) to O(m√(n log n)) in the case of an infinite action set, with m being the number of tasks and n the number of iterations per task. In the case of a finite action set, we propose a novel hierarchical Bayesian bandit algorithm, named hierarchical BayesUCB (HierBayesUCB), that achieves the logarithmic but gap-dependent regret bound O(m log(mn) log n) under mild assumptions. All of the above regret bounds hold in many variants of the hierarchical Bayesian linear bandit problem, including when the tasks are solved sequentially or concurrently. Furthermore, we extend the aforementioned HierTS and HierBayesUCB algorithms to the multi-task combinatorial semi-bandit setting. Concretely, our combinatorial HierTS algorithm attains a Bayes regret bound of O(m√(n log n)), comparable to the latest known one. Moreover, our combinatorial HierBayesUCB yields a sharper Bayes regret bound of O(m log(mn) log n). Experiments are conducted to validate the soundness of our theoretical results for multi-task bandit algorithms.
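To make the hierarchical structure concrete, below is a minimal sketch of the hierarchical Thompson sampling idea in the simpler multi-task K-armed Gaussian setting (the paper's HierTS operates on linear and semi-bandit models and is not reproduced here); the callback `rewards_fn` and the variance parameters `sigma`, `sigma0`, `sigma_q` are illustrative assumptions.

```python
import numpy as np

def hier_ts(rewards_fn, m, n, K, sigma=1.0, sigma0=1.0, sigma_q=1.0, seed=0):
    """Hierarchical TS for m Gaussian K-armed tasks run concurrently for n rounds:
    arm means theta_{s,a} ~ N(mu_a, sigma0^2) share a hyper-mean mu_a ~ N(0, sigma_q^2)."""
    rng = np.random.default_rng(seed)
    counts = np.zeros((m, K))      # pulls of arm a in task s
    sums = np.zeros((m, K))        # summed rewards of arm a in task s
    for _ in range(n):
        for s in range(m):
            theta = np.empty(K)
            for a in range(K):
                # Hyper-posterior of mu_a: each task's sample mean has
                # marginal variance sigma0^2 + sigma^2 / n_{s,a}.
                played = counts[:, a] > 0
                w = 1.0 / (sigma0**2 + sigma**2 / counts[played, a])
                prec_q = 1.0 / sigma_q**2 + w.sum()
                mean_q = (w * sums[played, a] / counts[played, a]).sum() / prec_q
                mu_a = rng.normal(mean_q, prec_q**-0.5)
                # Task-level posterior of theta_{s,a} given the sampled mu_a.
                prec = 1.0 / sigma0**2 + counts[s, a] / sigma**2
                mean = (mu_a / sigma0**2 + sums[s, a] / sigma**2) / prec
                theta[a] = rng.normal(mean, prec**-0.5)
            arm = int(np.argmax(theta))        # act greedily on the sample
            sums[s, arm] += rewards_fn(s, arm)
            counts[s, arm] += 1
    return sums, counts
```

Intuitively, the hyper-posterior lets each task warm-start from the other tasks' data, which is the effect the improved bounds quantify.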
We would like to thank the reviewers for their insightful reviews. The primary weakness that several reviewers brought up was that the methods and analysis were straightforward, even as the reviewers were happy with the motivation and practicality of our work. We believe that our novelty is in proposing Thompson sampling latent bandit algorithms using offline-learned graphical models.
Reviewer #1. "algorithm... is not designed keeping short horizons in mind": Our algorithms quickly personalize by …
Reviewer #2. "suffers an exploration-exploitation tradeoff": You are correct in noting that our algorithm depends on … "unified analyses cannot cover instance-dependent bounds": We derive Bayes regret bounds, which contain an expectation …
Reviewer #3. Thank you for your detailed corrections! We will update the paper with your clarifications. "the available epsilon-bounds are wildly pessimistic": You are correct in noting that our regret bounds require that the … Updating our regret bounds to reflect this is a future line of work.
Stochastic contextual bandits with graph feedback: from independence number to MAS number
Yuxiao Wen, Yanjun Han, Zhengyuan Zhou (New York University)
We consider contextual bandits with graph feedback, a class of interactive learning problems with richer structure than vanilla contextual bandits, where taking an action reveals the rewards of all neighboring actions in the feedback graph under all contexts. Unlike the multi-armed bandit setting, where a growing literature has painted a near-complete picture of graph feedback, much remains unexplored in the contextual counterpart.
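For intuition, the following is a minimal sketch of how graph feedback changes a standard bandit update in the non-contextual case (the paper's contextual algorithms are more involved); `pull(a)`, which returns the rewards revealed for a's closed neighborhood, is an assumed interface.

```python
import numpy as np

def ucb_with_graph_feedback(pull, K, T):
    """UCB where pulling arm a also reveals the rewards of its graph neighbors.
    pull(a) -> dict {arm: reward} over the closed neighborhood of a."""
    counts = np.zeros(K)
    sums = np.zeros(K)
    for t in range(1, T + 1):
        safe = np.maximum(counts, 1)
        ucb = np.where(counts > 0, sums / safe + np.sqrt(2 * np.log(t) / safe), np.inf)
        a = int(np.argmax(ucb))
        for b, r in pull(a).items():   # side observations shrink exploration cost
            counts[b] += 1
            sums[b] += r
    return sums / np.maximum(counts, 1)
```

With a complete graph this degenerates to full information; with an empty graph it is plain UCB, and graph quantities such as the independence number interpolate between the two extremes.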
Adapting to Misspecification in Contextual Bandits
A major research direction in contextual bandits is to develop algorithms that are computationally efficient, yet support flexible, general-purpose function approximation. Algorithms based on modeling rewards have shown strong empirical performance, yet typically require a well-specified model, and can fail when this assumption does not hold. Can we design algorithms that are efficient and flexible, yet degrade gracefully in the face of model misspecification? We introduce a new family of oracle-efficient algorithms for ε-misspecified contextual bandits that adapt to unknown model misspecification--both for finite and infinite action settings. Given access to an online oracle for square loss regression, our algorithm attains optimal regret and--in particular--optimal dependence on the misspecification level, with no prior knowledge. Specializing to linear contextual bandits with infinite actions in d dimensions, we obtain the first algorithm that achieves the optimal Õ(d√T + ε√d·T) regret bound for unknown ε. On a conceptual level, our results are enabled by a new optimization-based perspective on the regression oracle reduction framework of Foster and Rakhlin [21], which we believe will be useful more broadly.
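The Foster-Rakhlin reduction referenced here selects actions by inverse-gap weighting around the regression oracle's predictions. Below is a minimal sketch of that sampling rule for the well-specified finite-action case (the paper's contribution, an optimization-based variant that adapts to unknown ε, is not reproduced); the learning-rate parameter `gamma` is set by the analysis.

```python
import numpy as np

def inverse_gap_weighting(yhat, gamma):
    """SquareCB-style action distribution: given oracle reward predictions yhat,
    play p(a) proportional to 1 / (K + gamma * gap(a)), with the leftover
    probability mass on the empirically best action."""
    yhat = np.asarray(yhat, dtype=float)
    K = len(yhat)
    best = int(np.argmax(yhat))
    p = 1.0 / (K + gamma * (yhat[best] - yhat))
    p[best] = 0.0
    p[best] = 1.0 - p.sum()        # remaining mass goes to the greedy action
    return p
```

An action is then drawn from p, and the observed (context, action, reward) triple is fed back to the online regression oracle.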
Anonymous Bandits for Multi-User Systems
In this work, we present and study a new framework for online learning in systems with multiple users that provide user anonymity. Specifically, we extend the notion of bandits to obey the standard k-anonymity constraint by requiring each observation to be an aggregation of rewards for at least k users. This provides a simple yet effective framework where one can learn a clustering of users in an online fashion without observing any user's individual decision. We initiate the study of anonymous bandits and provide the first sublinear regret algorithms and lower bounds for this setting.
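As an illustration of the observation model (not of the paper's clustering algorithm), here is a sketch of one k-anonymous round; `select_arm` and `pull` are hypothetical callables.

```python
import numpy as np

def k_anonymous_round(select_arm, users, pull, k, seed=0):
    """Group users into batches of size >= k, play one arm per batch, and
    observe only each batch's aggregate reward, never a user's own reward."""
    rng = np.random.default_rng(seed)
    users = list(users)
    assert len(users) >= k, "k-anonymity needs at least k users"
    rng.shuffle(users)
    n_batches = len(users) // k                  # every batch then has >= k users
    batches = [users[i::n_batches] for i in range(n_batches)]
    observations = []
    for batch in batches:
        arm = select_arm(batch)                  # one decision for the whole batch
        agg = sum(pull(u, arm) for u in batch)   # only the sum is revealed
        observations.append((arm, len(batch), agg))
    return observations
```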
Homomorphic Matrix Completion
In recommendation systems, global positioning, system identification, and mobile social networks, it is a fundamental routine that a server completes a low-rank matrix from an observed subset of its entries. However, sending data to a cloud server raises data privacy concerns due to eavesdropping attacks and the single-point failure problem; e.g., the Netflix Prize contest was canceled after a privacy lawsuit. In this paper, we propose a homomorphic matrix completion algorithm for privacy-preserving purposes. First, we formulate a homomorphic matrix completion problem in which a server performs matrix completion on ciphertexts, and propose an encryption scheme that is fast and easy to implement. Second, we prove that the proposed scheme satisfies the homomorphism property: decrypting the recovered matrix on ciphertexts yields the target matrix (on plaintexts). Third, we prove that the proposed scheme satisfies an (ε, δ)-differential privacy property.
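The server-side primitive is ordinary low-rank matrix completion; the paper's claim is that it can be run unchanged on ciphertexts. As a stand-in for that primitive (the encryption scheme itself is not reproduced), here is a minimal SoftImpute-style iteration; `lam` and `iters` are illustrative.

```python
import numpy as np

def soft_impute(M_obs, mask, lam=1.0, iters=200):
    """Complete a low-rank matrix from observed entries via iterative
    soft-thresholded SVD. M_obs holds observed values (0 elsewhere);
    mask is 1.0 where an entry is observed."""
    X = np.zeros_like(M_obs)
    for _ in range(iters):
        filled = mask * M_obs + (1.0 - mask) * X   # keep data, impute the rest
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        s = np.maximum(s - lam, 0.0)               # singular-value shrinkage
        X = (U * s) @ Vt
    return X
```

The homomorphism property in the abstract says exactly that running such a routine on the encrypted matrix and then decrypting recovers the completion of the plaintext matrix.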
An Empirical Process Approach to the Union Bound: Practical Algorithms for Combinatorial and Linear Bandits
This paper proposes near-optimal algorithms for the pure-exploration linear bandit problem in the fixed confidence and fixed budget settings. Leveraging ideas from the theory of suprema of empirical processes, we provide an algorithm whose sample complexity scales with the geometry of the instance and avoids an explicit union bound over the number of arms. Unlike previous approaches, which sample based on minimizing a worst-case variance (e.g., G-optimal design), we define an experimental design objective based on the Gaussian width of the underlying arm set. We provide a novel lower bound in terms of this objective that highlights its fundamental role in the sample complexity. The sample complexity of our fixed confidence algorithm matches this lower bound, and in addition is computationally efficient for combinatorial classes.
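The Gaussian width that drives the design objective is easy to estimate for a finite arm set; a minimal Monte Carlo sketch (the full experimental design and sampling algorithm are not reproduced):

```python
import numpy as np

def gaussian_width(arms, n_samples=10_000, seed=0):
    """Monte Carlo estimate of E[max_{a in A} <g, a>], g ~ N(0, I_d),
    for a finite arm set given as an (|A|, d) array."""
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((n_samples, arms.shape[1]))
    return (g @ arms.T).max(axis=1).mean()
```

Unlike a union bound over |A| arms, this quantity reflects the geometry of the set: highly correlated arms add little to the width.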
We would like to thank the reviewers for their time. To understand this result, consider k arms of dimension d < k. … problems to be compatible with multiple adversarial bandit algorithms, which allows us to obtain previously unattainable … This is also the first result in the literature that can combine multiple types of model selection.
Reviewer 1. "The selection of the range": The regret is multiplied by at most a factor of the number of bases M, which … In the paper we choose the largest ε to be 100,000, but in practice such a large ε is unreasonable.
Reviewer 2. Thank you for your comments.
Online Sign Identification: Minimization of the Number of Errors in Thresholding Bandits
In the fixed budget thresholding bandit problem, an algorithm sequentially allocates a budgeted number of samples to different distributions. It then predicts whether the mean of each distribution is larger or smaller than a given threshold. We introduce a large family of algorithms (containing most existing relevant ones), inspired by the Frank-Wolfe algorithm, and provide a thorough yet generic analysis of their performance. This allows us to construct new explicit algorithms, for a broad class of problems, whose losses are within a small constant factor of the non-adaptive oracle ones. Quite interestingly, we observe that adaptive methods empirically greatly outperform non-adaptive oracles, an uncommon behavior in standard online learning settings such as regret minimization. We explain this surprising phenomenon on an insightful toy problem.
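One classic member of the index-based family studied here is the APT rule of Locatelli et al., which repeatedly samples the distribution whose sign is currently least certain. A minimal sketch, with `pull(a)` an assumed sampling interface:

```python
import numpy as np

def apt(pull, K, tau, budget, eps=0.0):
    """APT-style fixed-budget thresholding: pull the arm minimizing the index
    sqrt(T_i) * (|mean_i - tau| + eps), then report each arm's predicted sign."""
    counts = np.ones(K)                                   # one initial pull per arm
    sums = np.array([pull(a) for a in range(K)], dtype=float)
    for _ in range(budget - K):
        index = np.sqrt(counts) * (np.abs(sums / counts - tau) + eps)
        a = int(np.argmin(index))                         # least certain sign
        sums[a] += pull(a)
        counts[a] += 1
    return np.sign(sums / counts - tau)                   # +1 above, -1 below tau
```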