 bayesucb




Finite-Time Logarithmic Bayes Regret Upper Bounds

Atsidakou, Alexia, Kveton, Branislav, Katariya, Sumeet, Caramanis, Constantine, Sanghavi, Sujay

arXiv.org Machine Learning

We derive the first finite-time logarithmic Bayes regret upper bounds for Bayesian bandits. In Gaussian bandits, we obtain $O(c_\Delta \log n)$ and $O(c_h \log^2 n)$ bounds for an upper confidence bound algorithm, where $c_h$ and $c_\Delta$ are constants depending on the prior distribution and the gaps of random bandit instances sampled from it, respectively. The latter bound asymptotically matches the lower bound of Lai (1987). Our proofs are a major technical departure from prior works, while being simple and general. To show the generality of our techniques, we apply them to linear bandits. Our results provide insights into the value of the prior in the Bayesian setting, both in the objective and as side information given to the learner. They significantly improve upon existing $\tilde{O}(\sqrt{n})$ bounds, which have become standard in the literature despite the existing lower bounds.
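
As a rough illustration of the kind of upper confidence bound algorithm this abstract refers to, the sketch below runs a Bayesian UCB rule on a Gaussian bandit. It is not taken from the paper: the known noise variance, the shared Gaussian prior on each arm mean, the index form (posterior mean plus $\sqrt{2 \sigma_{\text{post}}^2 \log(1/\delta)}$), the choice $\delta = 1/n$, and all function names are assumptions made for this example.

```python
import math
import random


def posterior(prior_mean, prior_var, noise_var, pulls, reward_sum):
    """Gaussian posterior (mean, variance) of an arm mean after `pulls` rewards."""
    precision = 1.0 / prior_var + pulls / noise_var
    post_var = 1.0 / precision
    post_mean = post_var * (prior_mean / prior_var + reward_sum / noise_var)
    return post_mean, post_var


def bayes_ucb_index(post_mean, post_var, delta):
    """Assumed index: posterior mean plus a posterior-width confidence bonus."""
    return post_mean + math.sqrt(2.0 * post_var * math.log(1.0 / delta))


def run_bayes_ucb(true_means, prior_mean, prior_var, noise_var, n, seed=0):
    """Play n rounds and return how often each arm was pulled."""
    rng = random.Random(seed)
    k = len(true_means)
    pulls = [0] * k
    sums = [0.0] * k
    delta = 1.0 / n  # assumed confidence level for this sketch
    for _ in range(n):
        indices = []
        for a in range(k):
            m, v = posterior(prior_mean, prior_var, noise_var, pulls[a], sums[a])
            indices.append(bayes_ucb_index(m, v, delta))
        arm = max(range(k), key=indices.__getitem__)
        reward = rng.gauss(true_means[arm], math.sqrt(noise_var))
        pulls[arm] += 1
        sums[arm] += reward
    return pulls


pulls = run_bayes_ucb([0.0, 2.0], prior_mean=0.0, prior_var=1.0,
                      noise_var=1.0, n=500)
```

With a prior variance that is small relative to the noise, the posterior width shrinks faster, so the prior acts as side information that reduces exploration in this sketch.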


Cost-Efficient Online Decision Making: A Combinatorial Multi-Armed Bandit Approach

Rahbar, Arman, Åkerblom, Niklas, Chehreghani, Morteza Haghir

arXiv.org Artificial Intelligence

Online decision making plays a crucial role in numerous real-world applications. In many scenarios, the decision is made based on performing a sequence of tests on the incoming data points. However, performing all tests can be expensive and is not always possible. In this paper, we provide a novel formulation of the online decision making problem based on combinatorial multi-armed bandits and take the cost of performing tests into account. Based on this formulation, we provide a new framework for cost-efficient online decision making which can utilize posterior sampling or BayesUCB for exploration. We provide a rigorous theoretical analysis for our framework and present various experimental results that demonstrate its applicability to real-world problems.


A Contextual Combinatorial Semi-Bandit Approach to Network Bottleneck Identification

Hoseini, Fazeleh, Åkerblom, Niklas, Chehreghani, Morteza Haghir

arXiv.org Artificial Intelligence

Bottleneck identification is an essential task in network analysis with numerous important applications, such as traffic planning and road network management. For example, in a road network, the road segment with the highest cost is described as a path-specific bottleneck on a path between a source node and a destination node. The cost or weight can be defined according to specific criteria, such as travel time, energy consumption, etc. The aim is to find a path which minimizes the bottleneck among all paths connecting the source and destination nodes. Bottleneck identification can thus be characterized, in a given road network graph, as finding a path with the smallest maximum edge weight among the paths connecting the source node and the destination node, i.e., finding the minimax edge. By negating the edge weights, bottleneck identification can also be viewed as the widest path problem or the maximum capacity path problem [20].
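
The minimax formulation above can be sketched with a Dijkstra-style search in which the cost of a path is its largest edge weight instead of the sum of its weights. This is a minimal offline sketch under the assumption that all edge weights are known and the graph is directed. The function name `minimax_path` and the toy `roads` graph are hypothetical, and this is not the bandit method of the paper, which has to learn the weights.

```python
import heapq


def minimax_path(graph, source, target):
    """Return (bottleneck, path) minimizing the largest edge weight on the path.

    `graph` maps each node to a list of (neighbor, weight) pairs.
    Returns None if `target` is unreachable from `source`.
    """
    best = {source: float("-inf")}  # smallest known bottleneck per node
    parent = {source: None}
    heap = [(float("-inf"), source)]
    while heap:
        bottleneck, node = heapq.heappop(heap)
        if bottleneck > best.get(node, float("inf")):
            continue  # stale heap entry
        if node == target:
            break
        for neighbor, weight in graph.get(node, []):
            # The bottleneck of the extended path is the larger of the two.
            candidate = max(bottleneck, weight)
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                parent[neighbor] = node
                heapq.heappush(heap, (candidate, neighbor))
    if target not in best:
        return None
    path = []
    node = target
    while node is not None:
        path.append(node)
        node = parent[node]
    return best[target], path[::-1]


# Toy road network: s -> a -> t has bottleneck 4, s -> b -> t has bottleneck 3.
roads = {
    "s": [("a", 4), ("b", 2)],
    "a": [("t", 1)],
    "b": [("t", 3)],
    "t": [],
}
result = minimax_path(roads, "s", "t")  # (3, ['s', 'b', 't'])
```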


Strategies for Safe Multi-Armed Bandits with Logarithmic Regret and Risk

Chen, Tianrui, Gangrade, Aditya, Saligrama, Venkatesh

arXiv.org Machine Learning

We investigate a natural but surprisingly unstudied approach to the multi-armed bandit problem under safety risk constraints. Each arm is associated with an unknown law on safety risks and rewards, and the learner's goal is to maximise reward whilst not playing unsafe arms, as determined by a given threshold on the mean risk. We formulate a pseudo-regret for this setting that enforces this safety constraint in a per-round way by softly penalising any violation, regardless of any reward gained through that violation. This has practical relevance to scenarios such as clinical trials, where one must maintain safety for each round rather than in an aggregated sense. We describe doubly optimistic strategies for this scenario, which maintain optimistic indices for both safety risk and reward. We show that schemes based on both frequentist and Bayesian indices satisfy tight gap-dependent logarithmic regret bounds, and further that these play unsafe arms only logarithmically many times in total. This theoretical analysis is complemented by simulation studies demonstrating the effectiveness of the proposed schemes, and probing the domains in which their use is appropriate.
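
One way to read "doubly optimistic" is sketched below: an arm stays eligible as long as an optimistic (lower) estimate of its mean risk is within the threshold, and among eligible arms the one with the highest optimistic (upper) estimate of reward is played. The frequentist confidence width $\sqrt{2 \log t / N}$, the fallback rule, and the name `doubly_optimistic_choice` are assumptions for illustration and may differ from the indices analysed in the paper.

```python
import math


def doubly_optimistic_choice(stats, threshold, t):
    """Pick an arm given per-arm (pulls, mean_reward, mean_risk) at round t >= 2.

    Arms whose optimistic risk estimate exceeds `threshold` are skipped;
    the remaining arm with the largest optimistic reward estimate is chosen.
    """
    best_arm, best_index = None, float("-inf")
    safest_arm, safest_risk = None, float("inf")
    for arm, (pulls, mean_reward, mean_risk) in enumerate(stats):
        if pulls == 0:
            return arm  # play every arm once before using indices
        width = math.sqrt(2.0 * math.log(t) / pulls)  # assumed width
        risk_lower = mean_risk - width      # optimistic about safety
        reward_upper = mean_reward + width  # optimistic about reward
        if risk_lower < safest_risk:
            safest_arm, safest_risk = arm, risk_lower
        if risk_lower > threshold:
            continue  # confidently unsafe under this width
        if reward_upper > best_index:
            best_arm, best_index = arm, reward_upper
    # Assumed fallback: if no arm looks plausibly safe, play the least risky one.
    return best_arm if best_arm is not None else safest_arm


# Arm 0 has the best reward but is clearly above the risk threshold of 0.3.
stats = [(100, 0.9, 0.9), (100, 0.5, 0.1), (100, 0.4, 0.1)]
choice = doubly_optimistic_choice(stats, threshold=0.3, t=300)  # 1
```

Because the risk check is applied in every round, the constraint is enforced per round rather than on average, which matches the per-round notion of safety described above.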