Tekin, Cem
Lexicographic Multiarmed Bandit
Hüyük, Alihan, Tekin, Cem
We consider a multiobjective multiarmed bandit problem with lexicographically ordered objectives. In this problem, the goal of the learner is to select lexicographic optimal arms as often as possible without knowing the arm reward distributions beforehand. We capture this goal by defining a multidimensional form of regret that measures the loss the learner incurs by not selecting lexicographic optimal arms, and then consider two settings in which the learner has prior information on the expected arm rewards. In the first setting, the learner knows, for each objective, only the lexicographic optimal expected reward. In the second setting, it knows, for each objective, only near-lexicographic optimal expected rewards. For both settings, we prove that the learner achieves expected regret uniformly bounded in time. The algorithm we propose for the second setting also attains bounded regret for the multiarmed bandit problem with satisficing objectives. In addition, we consider the harder prior-free case, and show that the learner can still achieve gap-free regret that is sublinear in time. Finally, we experimentally evaluate the performance of the proposed algorithms in a variety of multiobjective learning problems.
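As a concrete illustration of the first setting (prior knowledge of each objective's lexicographic optimal expected reward), the following Python sketch shows one way a learner could filter arms objective by objective using confidence bounds. It is a simplified illustration rather than the algorithm analyzed in the paper; the confidence radius, the fallback rule when the filter empties, and the least-pulled tie-breaking are assumptions made only for this example.

```python
import numpy as np

def lexicographic_ucb_round(counts, reward_sums, mu_star, t):
    """Pick the arm to play at round t (illustrative sketch, not the paper's algorithm).

    counts: (num_arms,) number of pulls per arm.
    reward_sums: (num_arms, num_objectives) cumulative rewards per arm and objective.
    mu_star: (num_objectives,) known lexicographic optimal expected reward per objective.
    """
    num_arms = counts.shape[0]
    candidates = np.arange(num_arms)
    unpulled = candidates[counts == 0]
    if unpulled.size > 0:                 # pull every arm once before filtering
        return int(unpulled[0])
    means = reward_sums / counts[:, None]
    radius = np.sqrt(2.0 * np.log(t + 1) / counts)
    for j, target in enumerate(mu_star):
        ucb_j = means[candidates, j] + radius[candidates]
        survivors = candidates[ucb_j >= target]   # arms that may still be optimal in objective j
        if survivors.size == 0:                    # fallback if the filter empties out
            break
        candidates = survivors
    return int(candidates[np.argmin(counts[candidates])])  # explore the least-pulled survivor
```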
Thompson Sampling for Combinatorial Network Optimization in Unknown Environments
Hüyük, Alihan, Tekin, Cem
Influence maximization, item recommendation, adaptive routing and dynamic spectrum allocation all require choosing the right action from a large set of alternatives. Thanks to advances in combinatorial optimization, these and many similar problems can be solved efficiently provided that the stochasticity of the environment is perfectly known. In this paper, we take this one step further and focus on combinatorial optimization in unknown environments. All of these settings fit into the general combinatorial learning framework called combinatorial multi-armed bandit with probabilistically triggered arms. We consider a very powerful Bayesian algorithm, Combinatorial Thompson Sampling (CTS), and analyze its regret under the semi-bandit feedback model. Assuming that the learner does not know the expected base arm outcomes beforehand but has access to an exact oracle, we show that when the expected reward is Lipschitz continuous in the expected base arm outcomes, CTS achieves $O(\sum_{i=1}^m \log T / (p_i \Delta_i))$ regret, where $m$ denotes the number of base arms, $p_i$ denotes the minimum non-zero triggering probability of base arm $i$, $\Delta_i$ denotes the minimum suboptimality gap of base arm $i$ and $T$ denotes the time horizon. In addition, we prove that when the triggering probabilities are at least $p^*>0$ for all base arms, CTS achieves $O(1/p^*\log(1/p^*))$ regret independent of the time horizon. We also numerically compare CTS with algorithms that use the principle of optimism in the face of uncertainty in several combinatorial networking problems, and show that CTS outperforms these algorithms by at least an order of magnitude in the majority of cases.
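The following Python sketch conveys the basic structure of CTS under simplifying assumptions (Bernoulli base arm outcomes with Beta(1,1) priors). Here `oracle` and `env_play` are hypothetical stand-ins for the exact optimization oracle and the environment's semi-bandit feedback; the sketch omits the details that matter for the regret analysis above.

```python
import numpy as np

def cts(num_base_arms, oracle, env_play, horizon):
    """Run a CTS-style loop for `horizon` rounds and return the posterior parameters."""
    alpha = np.ones(num_base_arms)   # Beta(1, 1) priors on the base arm means
    beta = np.ones(num_base_arms)
    for _ in range(horizon):
        theta = np.random.beta(alpha, beta)        # one posterior sample per base arm
        super_arm = oracle(theta)                  # exact oracle: best super arm for the sampled means
        triggered, outcomes = env_play(super_arm)  # semi-bandit feedback: triggered arms and 0/1 outcomes
        for i, x in zip(triggered, outcomes):
            alpha[i] += x                          # posterior update only for triggered arms
            beta[i] += 1 - x
    return alpha, beta
```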
Exploiting Relevance for Online Decision-Making in High-Dimensions
Turgay, Eralp, Bulucu, Cem, Tekin, Cem
Many sequential decision-making tasks require choosing, at each decision step, the right action out of a vast set of possibilities by extracting actionable intelligence from high-dimensional data streams. Most of the time, the high dimensionality of actions and data makes learning the optimal actions with traditional learning methods impractical. In this work, we investigate how to discover and leverage the low-dimensional structure in actions and data to enable fast learning. As our learning model, we consider a structured contextual multi-armed bandit (CMAB) with high-dimensional arm (action) and context (data) sets, where the rewards depend only on a few relevant dimensions of the joint context-arm set. We depart from prior work by assuming a high-dimensional and uncountable arm set, and allow the relevant context dimensions to vary for each arm. We propose a new online learning algorithm called CMAB with Relevance Learning (CMAB-RL) and prove that its time-averaged regret asymptotically goes to zero. CMAB-RL enjoys a substantially improved regret bound compared to classical CMAB algorithms, whose regrets depend on the dimensions $d_x$ and $d_a$ of the context and arm sets. Importantly, we show that if the learner knows upper bounds $\overline{d}_x$ and $\overline{d}_a$ on the number of relevant context and arm dimensions, then CMAB-RL achieves $\tilde{O}(T^{1 - 1 /(2 + 2\overline{d}_x + \overline{d}_a)})$ regret. Finally, we illustrate how CMAB algorithms can be used for optimal personalized blood glucose control in type 1 diabetes mellitus patients, and show that CMAB-RL outperforms other contextual MAB algorithms in this task, where the contexts represent multimodal physiological data streams obtained from sensor readings and the arms represent bolus insulin doses that are appropriate for injection.
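To give a feel for the improvement, the short computation below evaluates the regret exponent $1 - 1/(2 + 2\overline{d}_x + \overline{d}_a)$ from the bound above at hypothetical example values and, purely for comparison, evaluates the same exponent form at the full dimensions; the numbers are illustrative only.

```python
def regret_exponent(context_dim, arm_dim):
    # Exponent of T in a bound of the form T^{1 - 1/(2 + 2*context_dim + arm_dim)}.
    return 1.0 - 1.0 / (2 + 2 * context_dim + arm_dim)

d_x, d_a = 20, 10        # hypothetical full context and arm dimensions
dx_bar, da_bar = 2, 1    # hypothetical upper bounds on the relevant dimensions

print(regret_exponent(d_x, d_a))        # ~0.981 (same exponent form at full dimensions, comparison only)
print(regret_exponent(dx_bar, da_bar))  # ~0.857 (CMAB-RL bound with relevance upper bounds)
```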
Long term impact of fair machine learning in sequential decision making: representation disparity and group retention
Zhang, Xueru, Khalili, Mohammad Mahdi, Tekin, Cem, Liu, Mingyan
Machine learning models trained on data from multiple demographic groups can inherit representation disparity (Hashimoto et al., 2018) that may exist in the data: the group contributing less to the training process may suffer higher loss in model accuracy; this in turn can degrade population retention in these groups over time in terms of their contribution to the training process of future models, which then exacerbates representation disparity in the long run. In this study, we seek to understand the interplay between model accuracy and the underlying group representation, how they evolve in a sequential decision setting over an infinite horizon, and how the use of fair machine learning plays a role in this process. Using a simple user dynamics (arrival and departure) model, we characterize the long-term properties of using machine learning models under a set of fairness criteria imposed on each stage of the decision process, including the commonly used statistical parity and equal opportunity fairness. We show that under this particular arrival/departure model, both of these criteria cause the representation disparity to worsen over time, to the point where some groups vanish entirely from the sample pool, while the criterion of equalized loss fares much better. Our results serve to highlight the fact that fairness cannot be defined outside the larger feedback loop, in which past actions taken by users (who are either subject to the decisions made by the algorithm, or whose data are used to train the algorithm, or both) determine future observations and decisions.
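The toy simulation below illustrates the kind of feedback loop described above. It is not the paper's arrival/departure model; the quadratic per-group losses, the exponential retention rule, and all numerical values are assumptions chosen only to make the dynamics visible.

```python
import numpy as np

def simulate(rounds=50):
    targets = np.array([0.0, 1.0])    # hypothetical per-group optimal decisions
    sizes = np.array([800.0, 200.0])  # initial group sizes (majority and minority)
    arrivals = 50.0                   # new users per group per round
    for _ in range(rounds):
        weights = sizes / sizes.sum()
        theta = weights @ targets               # minimizer of the size-weighted squared loss
        losses = (theta - targets) ** 2         # per-group loss of the shared decision
        retention = np.exp(-losses)             # higher loss -> lower retention
        sizes = sizes * retention + arrivals    # departures plus fixed arrivals
    return sizes

print(simulate())  # the minority group, which suffers higher loss, shrinks toward a small steady state
```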
Thompson Sampling for Combinatorial Multi-armed Bandit with Probabilistically Triggered Arms
Hüyük, Alihan, Tekin, Cem
We analyze the regret of combinatorial Thompson sampling (CTS) for the combinatorial multi-armed bandit with probabilistically triggered arms under the semi-bandit feedback setting. We assume that the learner has access to an exact optimization oracle but does not know the expected base arm outcomes beforehand. When the expected reward function is Lipschitz continuous in the expected base arm outcomes, we derive an $O(\sum_{i=1}^m \log T / (p_i \Delta_i))$ regret bound for CTS, where $m$ denotes the number of base arms, $p_i$ denotes the minimum non-zero triggering probability of base arm $i$ and $\Delta_i$ denotes the minimum suboptimality gap of base arm $i$. We also show via numerical experiments that CTS outperforms the combinatorial upper confidence bound (CUCB) algorithm.
Multi-objective Contextual Bandit Problem with Similarity Information
Turğay, Eralp, Öner, Doruk, Tekin, Cem
In this paper, we introduce the multi-objective contextual bandit problem with similarity information. This problem extends the classical contextual bandit problem with similarity information by introducing multiple and possibly conflicting objectives. Since the best arm in each objective can be different for a given context, learning the best arm based on a single objective can jeopardize the rewards obtained from the other objectives. In order to evaluate the performance of the learner in this setup, we use a performance metric called the contextual Pareto regret. Essentially, the contextual Pareto regret is the sum of the distances of the arms chosen by the learner to the context-dependent Pareto front. For this problem, we develop a new online learning algorithm called Pareto Contextual Zooming (PCZ), which exploits the idea of contextual zooming to learn the arms that are close to the Pareto front for each observed context by adaptively partitioning the joint context-arm set according to the observed rewards and the locations of the context-arm pairs selected in the past. Then, we prove that PCZ achieves $\tilde O (T^{(1+d_p)/(2+d_p)})$ Pareto regret, where $d_p$ is the Pareto zooming dimension that depends on the size of the set of near-optimal context-arm pairs. Moreover, we show that this regret bound is nearly optimal by providing an almost matching $\Omega (T^{(1+d_p)/(2+d_p)})$ lower bound.
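For readers unfamiliar with the Pareto notions used above, the snippet below spells them out for a toy finite-arm, single-context example (the paper itself works with continuous context-arm spaces): an arm is Pareto optimal if no other arm dominates it, and a natural per-arm gap is the smallest uniform boost making it non-dominated. The reward vectors are made up for illustration.

```python
import numpy as np

def is_dominated(v, others):
    # v is dominated if some other reward vector is at least as good in every
    # objective and strictly better in at least one.
    return any(np.all(o >= v) and np.any(o > v) for o in others)

def pareto_front(rewards):
    return [i for i, v in enumerate(rewards)
            if not is_dominated(v, np.delete(rewards, i, axis=0))]

def pareto_gap(v, rewards):
    # Infimum uniform boost eps >= 0 such that v + eps is no longer dominated.
    return max(0.0, max(float(np.min(o - v)) for o in rewards))

rewards = np.array([[0.9, 0.10],
                    [0.2, 0.80],
                    [0.5, 0.05]])
print(pareto_front(rewards))             # [0, 1]: the first two arms are Pareto optimal
print(pareto_gap(rewards[2], rewards))   # 0.05: gap of the dominated third arm
```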
Episodic Multi-armed Bandits
Tekin, Cem, van der Schaar, Mihaela
We introduce a new class of reinforcement learning methods referred to as {\em episodic multi-armed bandits} (eMAB). In eMAB the learner proceeds in {\em episodes}, each composed of several {\em steps}, in which it chooses an action and observes a feedback signal. Moreover, in each step, it can take a special action, called the $stop$ action, that ends the current episode. After the $stop$ action is taken, the learner collects a terminal reward, and observes the costs and terminal rewards associated with each step of the episode. The goal of the learner is to maximize its cumulative gain (i.e., the terminal reward minus costs) over all episodes by learning to choose the best sequence of actions based on the feedback. First, we define an {\em oracle} benchmark, which sequentially selects the actions that maximize the expected immediate gain. Then, we propose our online learning algorithm, named {\em FeedBack Adaptive Learning} (FeedBAL), and prove that its regret with respect to the benchmark is bounded with high probability and increases logarithmically in expectation. Moreover, the regret only has polynomial dependence on the number of steps, actions and states. eMAB can be used to model applications that involve humans in the loop, ranging from personalized medical screening to personalized web-based education, where sequences of actions are taken in each episode, and optimal behavior requires adapting the chosen actions based on the feedback.
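The short sketch below mimics the eMAB interaction protocol described above with a made-up toy environment and a threshold stopping policy; it only illustrates the episode structure (regular actions, the stop action, and the terminal reward minus costs) and is not the FeedBAL algorithm.

```python
import random

class ToyEnv:
    """Hypothetical environment: each step costs 1 and nudges a progress signal upward."""
    def __init__(self):
        self.costs = []
        self.progress = 0.0

    def step(self, action):
        self.costs.append(1.0)                       # per-step cost
        self.progress += random.uniform(0.0, 0.3)    # feedback signal
        return self.progress

    def stop(self):
        return 5.0 * self.progress, self.costs       # terminal reward and step costs

def run_episode(env, policy, max_steps=20):
    history = []
    for step in range(max_steps):
        action = policy(step, history)
        if action == "stop":                         # the special stop action ends the episode
            break
        history.append((action, env.step(action)))
    terminal_reward, costs = env.stop()              # revealed only after stopping
    return terminal_reward - sum(costs)              # episode gain

def threshold_policy(step, history, threshold=1.0):
    # Stop once the latest feedback signal exceeds a threshold; otherwise act randomly.
    if history and history[-1][1] >= threshold:
        return "stop"
    return random.choice(["action_a", "action_b"])

print(run_episode(ToyEnv(), threshold_policy))
```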
Combinatorial Multi-armed Bandit with Probabilistically Triggered Arms: A Case with Bounded Regret
Sarıtaç, A. Ömer, Tekin, Cem
In this paper, we study the combinatorial multi-armed bandit (CMAB) problem with probabilistically triggered arms (PTAs). Under the assumption that the arm triggering probabilities (ATPs) are positive for all arms, we prove that a class of upper confidence bound (UCB) policies, named Combinatorial UCB with exploration rate $\kappa$ (CUCB-$\kappa$), and Combinatorial Thompson Sampling (CTS), which estimates the expected states of the arms via Thompson sampling, achieve bounded regret. In addition, we prove that CUCB-$0$ and CTS incur $O(\sqrt{T})$ gap-independent regret. These results improve upon the results in previous works, which show $O(\log T)$ gap-dependent and $O(\sqrt{T\log T})$ gap-independent regrets, respectively, under no assumptions on the ATPs. Then, we numerically evaluate the performance of CUCB-$\kappa$ and CTS in a real-world movie recommendation problem, where the actions correspond to recommending a set of movies, the arms correspond to the edges between the movies and the users, and the goal is to maximize the total number of users that are attracted by at least one movie. Our numerical results complement our theoretical findings on bounded regret. Apart from this problem, our results also directly apply to the online influence maximization (OIM) problem studied in numerous prior works.
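The sketch below shows the basic structure of a CUCB-$\kappa$ round under simplifying assumptions (base arm outcomes in $[0,1]$). Here `oracle` and `env_play` are hypothetical stand-ins for the exact optimization oracle and the environment's semi-bandit feedback, and setting $\kappa = 0$ corresponds to acting greedily on the empirical means, as in CUCB-$0$.

```python
import numpy as np

def cucb_kappa(num_base_arms, oracle, env_play, horizon, kappa=1.0):
    counts = np.zeros(num_base_arms)
    means = np.zeros(num_base_arms)
    for t in range(1, horizon + 1):
        played = counts > 0
        bonus = np.zeros(num_base_arms)
        bonus[played] = kappa * np.sqrt(1.5 * np.log(t) / counts[played])  # exploration scaled by kappa
        index = np.where(played, np.clip(means + bonus, 0.0, 1.0), 1.0)    # optimistic value for unplayed arms
        super_arm = oracle(index)                                          # exact oracle on the indices
        triggered, outcomes = env_play(super_arm)                          # semi-bandit feedback
        for i, x in zip(triggered, outcomes):
            counts[i] += 1
            means[i] += (x - means[i]) / counts[i]                         # incremental mean update
    return means, counts
```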
Adaptive Ensemble Learning with Confidence Bounds
Tekin, Cem, Yoon, Jinsung, van der Schaar, Mihaela
Extracting actionable intelligence from distributed, heterogeneous, correlated and high-dimensional data sources requires run-time processing and learning both locally and globally. In the last decade, a large number of meta-learning techniques have been proposed in which local learners make online predictions based on their locally-collected data instances, and feed these predictions to an ensemble learner, which fuses them and issues a global prediction. However, most of these works do not provide performance guarantees or, when they do, these guarantees are asymptotic. None of these existing works provide confidence estimates about the issued predictions or rate-of-learning guarantees for the ensemble learner. In this paper, we provide a systematic ensemble learning method called Hedged Bandits, which comes with both long-run (asymptotic) and short-run (rate of learning) performance guarantees. Moreover, our approach yields performance guarantees with respect to the optimal local prediction strategy, and is also able to adapt its predictions in a data-driven manner. We illustrate the performance of Hedged Bandits in the context of medical informatics and show that it outperforms numerous online and offline ensemble learning methods.
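As a rough illustration of the ensemble layer, the sketch below runs a Hedge-style exponential-weights update over the local learners' predictions. This is a simplified stand-in rather than the exact Hedged Bandits construction; the learning rate and the 0-1 losses are assumptions made for the example.

```python
import numpy as np

def hedge_ensemble(local_predictions, true_labels, eta=0.5, seed=0):
    """local_predictions: (num_learners, T) array of per-round predictions.
    true_labels: length-T array of realized labels."""
    rng = np.random.default_rng(seed)
    num_learners = local_predictions.shape[0]
    weights = np.ones(num_learners)
    ensemble_losses = []
    for t, label in enumerate(true_labels):
        probs = weights / weights.sum()
        chosen = rng.choice(num_learners, p=probs)                 # follow one local learner
        ensemble_losses.append(float(local_predictions[chosen, t] != label))
        losses = (local_predictions[:, t] != label).astype(float)  # every learner's loss is observed
        weights *= np.exp(-eta * losses)                           # multiplicative (Hedge) update
    return float(np.mean(ensemble_losses))
```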
Adaptive Ensemble Learning with Confidence Bounds for Personalized Diagnosis
Tekin, Cem (Bilkent University), Yoon, Jinsung (University of California, Los Angeles), van der Schaar, Mihaela (University of California, Los Angeles)
With the advances in the field of medical informatics, automated clinical decision support systems are becoming the de facto standard in personalized diagnosis. In order to establish high accuracy and confidence in personalized diagnosis, massive amounts of distributed, heterogeneous, correlated and high-dimensional patient data from different sources such as wearable sensors, mobile applications, Electronic Health Record (EHR) databases, etc., need to be processed. This requires learning both locally and globally due to privacy constraints and/or the distributed nature of the multi-modal medical data. In the last decade, a large number of meta-learning techniques have been proposed in which local learners make online predictions based on their locally-collected data instances, and feed these predictions to an ensemble learner, which fuses them and issues a global prediction. However, most of these works do not provide performance guarantees or, when they do, these guarantees are asymptotic. None of these existing works provide confidence estimates about the issued predictions or rate-of-learning guarantees for the ensemble learner. In this paper, we provide a systematic ensemble learning method called Hedged Bandits, which comes with both long-run (asymptotic) and short-run (rate of learning) performance guarantees. Moreover, we show that our proposed method outperforms all existing ensemble learning techniques, even in the presence of concept drift.