combinatorial pure exploration
Combinatorial Pure Exploration of Multi-Armed Bandits
We study the {\em combinatorial pure exploration (CPE)} problem in the stochastic multi-armed bandit setting, where a learner explores a set of arms with the objective of identifying the optimal member of a \emph{decision class}, which is a collection of subsets of arms with certain combinatorial structures such as size-$K$ subsets, matchings, spanning trees or paths, etc. The CPE problem represents a rich class of pure exploration tasks which covers not only many existing models but also novel cases where the object of interest has a non-trivial combinatorial structure. In this paper, we provide a series of results for the general CPE problem. We present general learning algorithms which work for all decision classes that admit offline maximization oracles in both fixed confidence and fixed budget settings. We prove problem-dependent upper bounds of our algorithms. Our analysis exploits the combinatorial structures of the decision classes and introduces a new analytic tool. We also establish a general problem-dependent lower bound for the CPE problem. Our results show that the proposed algorithms achieve the optimal sample complexity (within logarithmic factors) for many decision classes. In addition, applying our results back to the problems of top-$K$ arms identification and multiple bandit best arms identification, we recover the best available upper bounds up to constant factors and partially resolve a conjecture on the lower bounds.
An Optimal Algorithm for the Real-Valued Combinatorial Pure Exploration of Multi-Armed Bandit
Nakamura, Shintaro, Sugiyama, Masashi
We study the real-valued combinatorial pure exploration problem in the stochastic multi-armed bandit (R-CPE-MAB). We study the case where the size of the action set is polynomial with respect to the number of arms. In such a case, the R-CPE-MAB can be seen as a special case of the so-called transductive linear bandits. Existing methods in the R-CPE-MAB and transductive linear bandits have a gap of problem-dependent constant terms and logarithmic terms between the upper and lower bounds of the sample complexity, respectively. We close these gaps by proposing an algorithm named the combinatorial gap-based exploration (CombGapE) algorithm, whose sample complexity upper bound matches the lower bound. Finally, we numerically show that the CombGapE algorithm outperforms existing methods significantly.
Thompson Sampling for Real-Valued Combinatorial Pure Exploration of Multi-Armed Bandit
Nakamura, Shintaro, Sugiyama, Masashi
We study the real-valued combinatorial pure exploration of the multi-armed bandit (R-CPE-MAB) problem. In R-CPE-MAB, a player is given $d$ stochastic arms, and the reward of each arm $s\in\{1, \ldots, d\}$ follows an unknown distribution with mean $\mu_s$. In each time step, a player pulls a single arm and observes its reward. The player's goal is to identify the optimal \emph{action} $\boldsymbol{\pi}^{*} = \argmax_{\boldsymbol{\pi} \in \mathcal{A}} \boldsymbol{\mu}^{\top}\boldsymbol{\pi}$ from a finite-sized real-valued \emph{action set} $\mathcal{A}\subset \mathbb{R}^{d}$ with as few arm pulls as possible. Previous methods in the R-CPE-MAB assume that the size of the action set $\mathcal{A}$ is polynomial in $d$. We introduce an algorithm named the Generalized Thompson Sampling Explore (GenTS-Explore) algorithm, which is the first algorithm that can work even when the size of the action set is exponentially large in $d$. We also introduce a novel problem-dependent sample complexity lower bound of the R-CPE-MAB problem, and show that the GenTS-Explore algorithm achieves the optimal sample complexity up to a problem-dependent constant factor.
Combinatorial Pure Exploration with Full-bandit Feedback and Beyond: Solving Combinatorial Optimization under Uncertainty with Limited Observation
Kuroki, Yuko, Honda, Junya, Sugiyama, Masashi
Combinatorial optimization is one of the fundamental research fields that has been extensively studied in theoretical computer science and operations research. When developing an algorithm for combinatorial optimization, it is commonly assumed that parameters such as edge weights are exactly known as inputs. However, this assumption may not be fulfilled since input parameters are often uncertain or initially unknown in many applications such as recommender systems, crowdsourcing, communication networks, and online advertisement. To resolve such uncertainty, the problem of combinatorial pure exploration of multi-armed bandits (CPE) and its variants have recieved increasing attention. Earlier work on CPE has studied the semi-bandit feedback or assumed that the outcome from each individual edge is always accessible at all rounds. However, due to practical constraints such as a budget ceiling or privacy concern, such strong feedback is not always available in recent applications. In this article, we review recently proposed techniques for combinatorial pure exploration problems with limited feedback.
Combinatorial Pure Exploration of Dueling Bandit
Chen, Wei, Du, Yihan, Huang, Longbo, Zhao, Haoyu
In this paper, we study combinatorial pure exploration for dueling bandits (CPE-DB): we have multiple candidates for multiple positions as modeled by a bipartite graph, and in each round we sample a duel of two candidates on one position and observe who wins in the duel, with the goal of finding the best candidate-position matching with high probability after multiple rounds of samples. CPE-DB is an adaptation of the original combinatorial pure exploration for multi-armed bandit (CPE-MAB) problem to the dueling bandit setting. We consider both the Borda winner and the Condorcet winner cases. For Borda winner, we establish a reduction of the problem to the original CPE-MAB setting and design PAC and exact algorithms that achieve both the sample complexity similar to that in the CPE-MAB setting (which is nearly optimal for a subclass of problems) and polynomial running time per round. For Condorcet winner, we first design a fully polynomial time approximation scheme (FPTAS) for the offline problem of finding the Condorcet winner with known winning probabilities, and then use the FPTAS as an oracle to design a novel pure exploration algorithm ${\sf CAR}$-${\sf Cond}$ with sample complexity analysis. ${\sf CAR}$-${\sf Cond}$ is the first algorithm with polynomial running time per round for identifying the Condorcet winner in CPE-DB.
Combinatorial Pure Exploration of Multi-Armed Bandits
Chen, Shouyuan, Lin, Tian, King, Irwin, Lyu, Michael R., Chen, Wei
We study the {\em combinatorial pure exploration (CPE)} problem in the stochastic multi-armed bandit setting, where a learner explores a set of arms with the objective of identifying the optimal member of a \emph{decision class}, which is a collection of subsets of arms with certain combinatorial structures such as size-$K$ subsets, matchings, spanning trees or paths, etc. The CPE problem represents a rich class of pure exploration tasks which covers not only many existing models but also novel cases where the object of interest has a non-trivial combinatorial structure. In this paper, we provide a series of results for the general CPE problem. We present general learning algorithms which work for all decision classes that admit offline maximization oracles in both fixed confidence and fixed budget settings. We prove problem-dependent upper bounds of our algorithms.
Combinatorial Pure Exploration with Continuous and Separable Reward Functions and Its Applications (Extended Version)
Huang, Weiran, Ok, Jungseul, Li, Liang, Chen, Wei
We study the Combinatorial Pure Exploration problem with Continuous and Separable reward functions (CPE-CS) in the stochastic multi-armed bandit setting. In a CPE-CS instance, we are given several stochastic arms with unknown distributions, as well as a collection of possible decisions. Each decision has a reward according to the distributions of arms. The goal is to identify the decision with the maximum reward, using as few arm samples as possible. The problem generalizes the combinatorial pure exploration problem with linear rewards, which has attracted significant attention in recent years. In this paper, we propose an adaptive learning algorithm for the CPE-CS problem, and analyze its sample complexity. In particular, we introduce a new hardness measure called the consistent optimality hardness, and give both the upper and lower bounds of sample complexity. Moreover, we give examples to demonstrate that our solution has the capacity to deal with non-linear reward functions.
Disagreement-based combinatorial pure exploration: Efficient algorithms and an analysis with localization
Cao, Tongyi, Krishnamurthy, Akshay
We design new algorithms for the combinatorial pure exploration problem in the multi-arm bandit framework. In this problem, we are given K distributions and a collection of subsets $\mathcal{V} \subset 2^K$ of these distributions, and we would like to find the subset $v \in \mathcal{V}$ that has largest cumulative mean, while collecting, in a sequential fashion, as few samples from the distributions as possible. We study both the fixed budget and fixed confidence settings, and our algorithms essentially achieve state-of-the-art performance in all settings, improving on previous guarantees for structures like matchings and submatrices that have large augmenting sets. Moreover, our algorithms can be implemented efficiently whenever the decision set V admits linear optimization. Our analysis involves precise concentration-of-measure arguments and a new algorithm for linear programming with exponentially many constraints.