The Batch Complexity of Bandit Pure Exploration
Tuynman, Adrienne, Degenne, Rémy
A Multi-Armed Bandit (MAB) is a model of sequential interaction introduced in (Thompson, 1933) to design better medical trials. The framework has since been extended to various fields and has found applications in online advertising and recommendation systems. In a MAB, an algorithm chooses (pulls) an arm among a finite set at each time step, then observes a sample from a probability distribution associated with that arm. The goal of the interaction is to identify as quickly as possible the arm whose distribution has the highest mean. By using past observations to continuously update the way they sample, MAB algorithms reach this objective faster than traditional fixed randomized trials. For applications like online advertising, feedback can be obtained quickly, for example when the feedback is a click on an advertisement.
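To make the interaction concrete, here is a minimal sketch of the MAB protocol in Python, with an epsilon-greedy sampler standing in for a generic adaptive algorithm. The arm means, Gaussian rewards and the sampler itself are illustrative placeholders, not the setting or algorithm of the paper:

```python
import random

def run_bandit(means, horizon, eps=0.1, seed=0):
    rng = random.Random(seed)
    counts = [0] * len(means)   # number of pulls per arm
    sums = [0.0] * len(means)   # sum of observed rewards per arm
    for t in range(horizon):
        if t < len(means):
            arm = t                                  # pull each arm once to initialize
        elif rng.random() < eps:
            arm = rng.randrange(len(means))          # explore a random arm
        else:
            arm = max(range(len(means)),
                      key=lambda a: sums[a] / counts[a])  # exploit the empirical best
        reward = rng.gauss(means[arm], 1.0)          # observe a sample from the arm
        counts[arm] += 1
        sums[arm] += reward
    return counts

print(run_bandit([0.2, 0.5, 0.4], horizon=1000))  # most pulls go to the best arm
```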
Best-Arm Identification in Unimodal Bandits
Poiani, Riccardo, Jourdan, Marc, Kaufmann, Emilie, Degenne, Rémy
We study the fixed-confidence best-arm identification problem in unimodal bandits, in which the means of the arms increase with the index of the arm up to their maximum, then decrease. We derive two lower bounds on the stopping time of any algorithm. The instance-dependent lower bound suggests that, due to the unimodal structure, only three arms contribute to the leading confidence-dependent cost. However, a worst-case lower bound shows that a linear dependence on the number of arms is unavoidable in the confidence-independent cost. We propose modifications of Track-and-Stop and a Top Two algorithm that leverage the unimodal structure. Both versions of Track-and-Stop are asymptotically optimal for one-parameter exponential families. The Top Two algorithm is asymptotically near-optimal for Gaussian distributions, and we prove a non-asymptotic guarantee matching the worst-case lower bound. The algorithms can be implemented efficiently and we demonstrate their competitive empirical performance.
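As a rough illustration of the structural insight behind the instance-dependent lower bound, the following hypothetical helper picks the empirical best arm and its immediate neighbours, the at most three arms that drive the confidence-dependent cost. This is a sketch of the structure only, not the paper's modified Track-and-Stop or Top Two algorithm:

```python
def unimodal_candidates(empirical_means):
    # Under unimodality, only the leader and its immediate neighbours
    # can matter for the leading term of the sample complexity.
    leader = max(range(len(empirical_means)), key=empirical_means.__getitem__)
    neighbours = [a for a in (leader - 1, leader, leader + 1)
                  if 0 <= a < len(empirical_means)]
    return leader, neighbours

leader, cands = unimodal_candidates([0.1, 0.4, 0.9, 0.7, 0.2])
print(leader, cands)  # 2 [1, 2, 3]
```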
Optimal Multi-Fidelity Best-Arm Identification
Poiani, Riccardo, Degenne, Rémy, Kaufmann, Emilie, Metelli, Alberto Maria, Restelli, Marcello
In bandit best-arm identification, an algorithm is tasked with finding the arm with the highest mean reward, with a specified accuracy, as fast as possible. We study multi-fidelity best-arm identification, in which the algorithm can choose to sample an arm at a lower fidelity (less accurate mean estimate) for a lower cost. Several methods have been proposed for tackling this problem, but their optimality remains elusive, notably due to loose lower bounds on the total cost needed to identify the best arm. Our first contribution is a tight, instance-dependent lower bound on the cost complexity. The study of the optimization problem featured in the lower bound provides new insights to devise computationally efficient algorithms, and leads us to propose a gradient-based approach with asymptotically optimal cost complexity. We demonstrate the benefits of the new algorithm compared to existing methods in experiments. Our theoretical and empirical findings also shed light on an intriguing concept of optimal fidelity for each arm.
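The idea of an optimal fidelity per arm can be illustrated with a hedged sketch: assume each fidelity comes with a sampling cost and a known bound on the bias of the resulting mean estimate, and pick the cheapest fidelity whose bias is small relative to the gap one needs to resolve. This heuristic and its parameters are illustrative assumptions, not the paper's gradient-based algorithm:

```python
def pick_fidelity(costs, biases, gap, slack=0.5):
    # Admissible fidelities: the bias must not eat more than `slack`
    # of the mean gap we are trying to resolve.
    admissible = [m for m in range(len(costs)) if biases[m] <= slack * gap]
    if not admissible:
        return len(costs) - 1          # fall back to the highest fidelity
    return min(admissible, key=lambda m: costs[m])

# Three fidelities with increasing cost and decreasing bias.
print(pick_fidelity(costs=[1.0, 3.0, 10.0], biases=[0.3, 0.1, 0.0], gap=0.4))
```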
Finding good policies in average-reward Markov Decision Processes without prior knowledge
Tuynman, Adrienne, Degenne, Rémy, Kaufmann, Emilie
We revisit the identification of an $\varepsilon$-optimal policy in average-reward Markov Decision Processes (MDP). In such MDPs, two measures of complexity have appeared in the literature: the diameter, $D$, and the optimal bias span, $H$, which satisfy $H\leq D$. Prior work has studied the complexity of $\varepsilon$-optimal policy identification only when a generative model is available. In this case, it is known that there exists an MDP with $D \simeq H$ for which the sample complexity to output an $\varepsilon$-optimal policy is $\Omega(SAD/\varepsilon^2)$, where $S$ and $A$ are the sizes of the state and action spaces. Recently, an algorithm with a sample complexity of order $SAH/\varepsilon^2$ has been proposed, but it requires knowledge of $H$. We first show that the sample complexity required to estimate $H$ is not bounded by any function of $S$, $A$ and $H$, ruling out the possibility of easily making the previous algorithm agnostic to $H$. By relying instead on a diameter estimation procedure, we propose the first algorithm for $(\varepsilon,\delta)$-PAC policy identification that does not need any form of prior knowledge on the MDP. Its sample complexity scales as $SAD/\varepsilon^2$ in the regime of small $\varepsilon$, which is near-optimal. In the online setting, our first contribution is a lower bound which implies that a sample complexity polynomial in $H$ cannot be achieved in this setting. Then, we propose an online algorithm with a sample complexity in $SAD^2/\varepsilon^2$, as well as a novel approach based on a data-dependent stopping rule that we believe is promising for further reducing this bound.
Non-Asymptotic Analysis of a UCB-based Top Two Algorithm
Jourdan, Marc, Degenne, Rémy
A Top Two sampling rule for bandit identification is a method which selects the next arm to sample from among two candidate arms, a leader and a challenger. Due to their simplicity and good empirical performance, Top Two methods have received increased attention in recent years. However, for fixed-confidence best-arm identification, theoretical guarantees for Top Two methods have only been obtained in the asymptotic regime, when the error level vanishes. In this paper, we derive the first non-asymptotic upper bound on the expected sample complexity of a Top Two algorithm, which holds for any error level. Our analysis highlights sufficient properties for a regret minimization algorithm to be used as leader. These properties are satisfied by the UCB algorithm, and our proposed UCB-based Top Two algorithm simultaneously enjoys non-asymptotic guarantees and competitive empirical performance.
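A schematic sketch of one step of a Top Two rule with a UCB leader, assuming unit-variance Gaussian arms and all pull counts positive; the exact rule and its tuning are in the paper:

```python
import math, random

def top_two_ucb_step(means, counts, t, beta=0.5, rng=random):
    # means: empirical means; counts: pull counts (all assumed positive).
    K = len(means)
    ucb = [means[a] + math.sqrt(2.0 * math.log(t + 1) / counts[a]) for a in range(K)]
    leader = max(range(K), key=ucb.__getitem__)      # leader chosen by UCB
    def cost(a):
        # Gaussian transportation cost of swapping arm a with the leader.
        gap = max(means[leader] - means[a], 0.0)
        return gap ** 2 / (2.0 * (1.0 / counts[leader] + 1.0 / counts[a]))
    challenger = min((a for a in range(K) if a != leader), key=cost)
    # Sample the leader with probability beta, the challenger otherwise.
    return leader if rng.random() < beta else challenger
```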
An $\varepsilon$-Best-Arm Identification Algorithm for Fixed-Confidence and Beyond
Jourdan, Marc, Degenne, Rémy, Kaufmann, Emilie
We propose EB-TC$\varepsilon$, a novel sampling rule for $\varepsilon$-best arm identification in stochastic bandits. It is the first instance of a Top Two algorithm analyzed for approximate best-arm identification. EB-TC$\varepsilon$ is an *anytime* sampling rule that can therefore be employed without modification for fixed-confidence or fixed-budget identification (without prior knowledge of the budget). We provide three types of theoretical guarantees for EB-TC$\varepsilon$. First, we prove bounds on its expected sample complexity in the fixed-confidence setting, notably showing its asymptotic optimality in combination with an adaptive tuning of its exploration parameter. We complement these findings with upper bounds on its probability of error at any time and for any error parameter, which further yield upper bounds on its simple regret at any time. Finally, we show through numerical simulations that EB-TC$\varepsilon$ performs favorably compared to existing algorithms in different settings.
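The leader/challenger pair behind an EB-TC$\varepsilon$-style rule can be sketched as follows, assuming unit-variance Gaussian arms: the leader is the Empirical Best arm, and the challenger minimizes a Transportation Cost relaxed by $\varepsilon$, so the leader only needs to win by $\varepsilon$. This is a hedged sketch of the structure, not the exact rule:

```python
def eb_tc_eps_pair(means, counts, eps):
    # means: empirical means; counts: pull counts (all assumed positive).
    K = len(means)
    leader = max(range(K), key=means.__getitem__)    # Empirical Best leader
    def eps_cost(a):
        # Transportation cost where the leader only needs an eps margin.
        gap = means[leader] - means[a] + eps
        return gap ** 2 / (2.0 * (1.0 / counts[leader] + 1.0 / counts[a]))
    challenger = min((a for a in range(K) if a != leader), key=eps_cost)
    return leader, challenger
```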
On the Existence of a Complexity in Fixed Budget Bandit Identification
Degenne, Rémy
In fixed budget bandit identification, an algorithm sequentially observes samples from several distributions up to a given final time. It then answers a query about the set of distributions. A good algorithm will have a small probability of error. While that probability decreases exponentially with the final time, the best attainable rate is not known precisely for most identification tasks. We show that if a fixed budget task admits a complexity, defined as a lower bound on the probability of error which is attained by the same algorithm on all bandit problems, then that complexity is determined by the best non-adaptive sampling procedure for that problem. We show that there is no such complexity for several fixed budget identification tasks including Bernoulli best arm identification with two arms: there is no single algorithm that attains everywhere the best possible rate.
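For concreteness, here is a minimal sketch of a non-adaptive procedure of the kind the result refers to: the sampling proportions are fixed before any observation (uniform here, as a placeholder) and the empirical best arm is recommended once the budget is spent:

```python
import random

def fixed_allocation_bai(sample, K, budget, weights=None, seed=0):
    # sample(a, rng) draws one observation from arm a; weights fix the
    # allocation in advance, independently of the observations.
    rng = random.Random(seed)
    weights = weights or [1.0 / K] * K
    pulls = [max(1, int(budget * w)) for w in weights]
    means = [sum(sample(a, rng) for _ in range(n)) / n for a, n in enumerate(pulls)]
    return max(range(K), key=means.__getitem__)

# Example: two Bernoulli arms with means 0.4 and 0.6.
best = fixed_allocation_bai(lambda a, rng: float(rng.random() < [0.4, 0.6][a]),
                            K=2, budget=200)
print(best)  # 1 with high probability
```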
Dealing with Unknown Variances in Best-Arm Identification
Jourdan, Marc, Degenne, Rémy, Kaufmann, Emilie
The problem of identifying the best arm among a collection of items with Gaussian reward distributions is well understood when the variances are known. Despite its practical relevance for many applications, few works have studied it for unknown variances. In this paper we introduce and analyze two approaches to deal with unknown variances, either by plugging in the empirical variance or by adapting the transportation costs. In order to calibrate our two stopping rules, we derive new time-uniform concentration inequalities, which are of independent interest. Then, we illustrate the theoretical and empirical performance of our two sampling rule wrappers on Track-and-Stop and on a Top Two algorithm. Moreover, by quantifying the impact on the sample complexity of not knowing the variances, we reveal that it is rather small.
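The plug-in approach can be sketched for Gaussian arms as the usual transportation cost with empirical variances substituted for the unknown true ones; the calibrated thresholds from the time-uniform concentration inequalities are what make the resulting stopping rule correct (schematic only):

```python
def plugin_transport_cost(mu_i, var_i, n_i, mu_j, var_j, n_j):
    # Empirical transportation cost between arms i and j for Gaussian arms,
    # with empirical variances plugged in for the unknown true variances.
    if mu_i <= mu_j:
        return 0.0  # arm i does not beat arm j, nothing to transport
    return (mu_i - mu_j) ** 2 / (2.0 * (var_i / n_i + var_j / n_j))

print(plugin_transport_cost(0.9, 1.2, 50, 0.5, 0.8, 40))
```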
Dealing With Misspecification In Fixed-Confidence Linear Top-m Identification
Réda, Clémence, Tirinzoni, Andrea, Degenne, Rémy
We study the problem of identifying the m arms with the largest means under a fixed error rate $\delta$ (fixed-confidence Top-m identification) in misspecified linear bandit models. This problem is motivated by practical applications, especially in medicine and recommendation systems, where linear models are popular due to their simplicity and the existence of efficient algorithms, but where data inevitably deviates from linearity. In this work, we first derive a tractable lower bound on the sample complexity of any $\delta$-correct algorithm for the general Top-m identification problem. We show that knowing the scale of the deviation from linearity is necessary to exploit the structure of the problem. We then describe the first algorithm for this setting, which is both practical and adapts to the amount of misspecification. We derive an upper bound on its sample complexity which confirms this adaptivity and matches the lower bound when $\delta \rightarrow 0$. Finally, we evaluate our algorithm on both synthetic and real-world data, showing competitive performance with respect to existing baselines.
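A hedged sketch of how a known misspecification scale can enter a linear-bandit confidence width: a regularized least-squares elliptical norm plus an extra term proportional to the misspecification level $\varepsilon$. The inflation shown is a common form from the misspecified linear bandit literature, not necessarily the paper's exact bonus:

```python
import numpy as np

def misspecified_ucb_width(X, x, lam, beta, eps):
    # X: (n, d) features of pulled arms; x: (d,) feature of a candidate arm.
    V = X.T @ X + lam * np.eye(X.shape[1])            # regularized design matrix
    norm = float(np.sqrt(x @ np.linalg.solve(V, x)))  # ||x||_{V^{-1}}
    # beta * norm is the usual linear width; the eps term inflates it to
    # absorb a deviation from linearity of magnitude at most eps per arm.
    return beta * norm + eps * np.sqrt(X.shape[0]) * norm

X = np.random.default_rng(0).normal(size=(100, 3))
print(misspecified_ucb_width(X, np.array([1.0, 0.0, 0.0]), lam=1.0, beta=2.0, eps=0.1))
```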
Gamification of Pure Exploration for Linear Bandits
Degenne, Rémy, Ménard, Pierre, Shang, Xuedong, Valko, Michal
We investigate an active pure-exploration setting, that includes best-arm identification, in the context of linear stochastic bandits. While asymptotically optimal algorithms exist for standard multi-armed bandits, the existence of such algorithms for best-arm identification in linear bandits has been elusive despite several attempts to address it. First, we provide a thorough comparison and new insight over different notions of optimality in the linear case, including G-optimality, transductive optimality from optimal experimental design and asymptotic optimality.
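G-optimal design, one of the notions of optimality compared in the paper, can be computed with standard Frank-Wolfe iterations. The sketch below is the textbook procedure, not the paper's algorithm:

```python
import numpy as np

def g_optimal_design(arms, iters=200, lam=1e-6):
    # arms: (K, d) matrix of arm features; returns sampling weights on arms
    # that minimize the worst-case leverage max_x ||x||^2_{A(w)^{-1}}.
    K, d = arms.shape
    w = np.full(K, 1.0 / K)
    for t in range(iters):
        V = arms.T @ (w[:, None] * arms) + lam * np.eye(d)
        leverage = np.einsum('kd,dj,kj->k', arms, np.linalg.inv(V), arms)
        k = int(np.argmax(leverage))            # most uncertain arm
        gamma = 1.0 / (t + 2)                   # standard Frank-Wolfe step size
        w = (1 - gamma) * w
        w[k] += gamma
    return w

w = g_optimal_design(np.eye(3))
print(np.round(w, 3))  # roughly uniform for the standard basis
```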