Goto

Collaborating Authors

 exp4




Nonstochastic Bandits with Infinitely Many Experts

arXiv.org Machine Learning

We study the problem of nonstochastic bandits with infinitely many experts: A learner aims to maximize the total reward by taking actions sequentially based on bandit feedback while benchmarking against a countably infinite set of experts. We propose a variant of Exp4.P that, for finitely many experts, enables inference of correct expert rankings while preserving the order of the regret upper bound. We then incorporate the variant into a meta-algorithm that works on infinitely many experts. We prove a high-probability upper bound of $\tilde{\mathcal{O}} \big( i^*K + \sqrt{KT} \big)$ on the regret, up to polylog factors, where $i^*$ is the unknown position of the best expert, $K$ is the number of actions, and $T$ is the time horizon. We also provide an example of structured experts and discuss how to expedite learning in such case. Our meta-learning algorithm achieves the tightest regret upper bound for the setting considered when $i^* = \tilde{\mathcal{O}} \big( \sqrt{T/K} \big)$. If a prior distribution is assumed to exist for $i^*$, the probability of satisfying a tight regret bound increases with $T$, the rate of which can be fast.


When Humans and Machines Make Joint Decisions: A Non-Symmetric Bandit Model

arXiv.org Machine Learning

As machine learning algorithms have become on par or even superior to humans in a number of decision making problems (Rajpurkar et al., 2017; Silver et al., 2018), the idea that humans might be assisted by computer programs across a large variety of tasks has gained momentum. Already today, automated decision making systems are being used to predict cardiac arrests, credit scores, and recidivism (Tonekaboni et al., 2018; Board of Governors, 2007; Angwin et al., 2016). An emerging literature asks how humans, who often remain the final decision makers, can and should interact with such systems (Tonekaboni et al., 2019; Lucic et al., 2020). Machine learners, in turn, want to understand how algorithms should be designed such that interaction with humans is as fruitful as possible (Carroll et al., 2019). In most real-world decision making problems, humans have access to information that is unobservable to any algorithm. In the medical domain, doctors can obtain important information from personal interaction with patients (Goldenberg and Engelhardt, 2019). In judicial bail, judges may base their decision on the behavior of the defendant in the courtroom (Lakkaraju et al., 2017). From a machine learning point of view, such unobserved variables arise not due to a failure of the algorithm's designer to collect them. Instead, it is a property of many real-world decision problems that formulating all relevant aspects as inputs to an algorithm is impossible. 1


Piecewise-Stationary Off-Policy Optimization

arXiv.org Artificial Intelligence

Off-policy learning is a framework for evaluating and optimizing policies without deploying them, from data collected by another policy. Real-world environments are typically non-stationary and the offline learned policies should adapt to these changes. To address this challenge, we study the novel problem of off-policy optimization in piecewise-stationary contextual bandits. Our proposed solution has two phases. In the offline learning phase, we partition logged data into categorical latent states and learn a near-optimal sub-policy for each state. In the online deployment phase, we adaptively switch between the learned sub-policies based on their performance. This approach is practical and analyzable, and we provide guarantees on both the quality of off-policy optimization and the regret during online deployment. To show the effectiveness of our approach, we compare it to state-of-the-art baselines on both synthetic and real-world datasets. Our approach outperforms methods that act only on observed context.


Minimax Optimal Algorithms for Adversarial Bandit Problem with Multiple Plays

arXiv.org Artificial Intelligence

We investigate the adversarial bandit problem with multiple plays under semi-bandit feedback. We introduce a highly efficient algorithm that asymptotically achieves the performance of the best switching $m$-arm strategy with minimax optimal regret bounds. To construct our algorithm, we introduce a new expert advice algorithm for the multiple-play setting. By using our expert advice algorithm, we additionally improve the best-known high-probability bound for the multi-play setting by $O(\sqrt{m})$. Our results are guaranteed to hold in an individual sequence manner since we have no statistical assumption on the bandit arm gains. Through an extensive set of experiments involving synthetic and real data, we demonstrate significant performance gains achieved by the proposed algorithm with respect to the state-of-the-art algorithms.


Contextual Bandits with Continuous Actions: Smoothing, Zooming, and Adapting

arXiv.org Machine Learning

We study contextual bandit learning with an abstract policy class and continuous action space. We obtain two qualitatively different regret bounds: one competes with a smoothed version of the policy class under no continuity assumptions, while the other requires standard Lipschitz assumptions. Both bounds exhibit data-dependent "zooming" behavior and, with no tuning, yield improved guarantees for benign problems. We also study adapting to unknown smoothness parameters, establishing a price-of-adaptivity and deriving optimal adaptive algorithms that require no additional information.


Risk-Aware Algorithms for Adversarial Contextual Bandits

arXiv.org Machine Learning

In this work we consider adversarial contextual bandits with risk constraints. At each round, nature prepares a context, a cost for each arm, and additionally a risk for each arm. The learner leverages the context to pull an arm and then receives the corresponding cost and risk associated with the pulled arm. In addition to minimizing the cumulative cost, the learner also needs to satisfy long-term risk constraints -- the average of the cumulative risk from all pulled arms should not be larger than a pre-defined threshold. To address this problem, we first study the full information setting where in each round the learner receives an adversarial convex loss and a convex constraint. We develop a meta algorithm leveraging online mirror descent for the full information setting and extend it to contextual bandit with risk constraints setting using expert advice. Our algorithms can achieve near-optimal regret in terms of minimizing the total cost, while successfully maintaining a sublinear growth of cumulative risk constraint violation.