mean reward function
To update or not to update? Delayed Nonparametric Bandits with Randomized Allocation
Contextual bandits provide a natural framework to model a lot of practical sequential decision making problems in various fields. Woodroofe (1979) started studying multiarmed bandit problems with side information in a parametric framework, and Yang and Zhu (2002) initiated an investigation from a nonparametric perspective. See Lai (2001);Bartroff et al. (2008) for reviews on general sequential problems and Bubeck and Cesa-Bianchi (2012) for bandits exclusively. In recent years, bandit problems have gained popularity and have been studied extensively under different names, such as contextual bandits, multi-armed bandits with covariates (MABC), associative bandit problems and multi-armed bandits with side information. For example, when treating patients of a disease, the doctor needs to decide which treatment amongst several competing treatments would be the best for the current patient, given the patient's covariate information and data available from previous patients. Most of the bandit algorithms assume instantaneous observance of rewards, but in most practical situations, rewards are only obtained at some delayed time. For example, it is often the case that several other patients have to be treated before the outcome for the current patient is observed. One way to tackle this problem is to adopt black-box procedures incorporating delayed rewards using the already existing no-delay policies in the stochastic bandits setting.
Exploiting Correlation in Finite-Armed Structured Bandits
Gupta, Samarth, Joshi, Gauri, Yağan, Osman
We consider a correlated multi-armed bandit problem in which rewards of arms are correlated through a hidden parameter. Our approach exploits the correlation among arms to identify some arms as sub-optimal and pulls them only $\mathcal{O}(1)$ times. This results in significant reduction in cumulative regret, and in fact our algorithm achieves bounded (i.e., $\mathcal{O}(1)$) regret whenever possible; explicit conditions needed for bounded regret to be possible are also provided by analyzing regret lower bounds. We propose several variants of our approach that generalize classical bandit algorithms such as UCB, Thompson sampling, KL-UCB to the structured bandit setting, and empirically demonstrate their superiority via simulations.
Stochastic continuum armed bandit problem of few linear parameters in high dimensions
Tyagi, Hemant, Stich, Sebastian, Gärtner, Bernd
We consider a stochastic continuum armed bandit problem where the arms are indexed by the $\ell_2$ ball $B_{d}(1+\nu)$ of radius $1+\nu$ in $\mathbb{R}^d$. The reward functions $r :B_{d}(1+\nu) \rightarrow \mathbb{R}$ are considered to intrinsically depend on $k \ll d$ unknown linear parameters so that $r(\mathbf{x}) = g(\mathbf{A} \mathbf{x})$ where $\mathbf{A}$ is a full rank $k \times d$ matrix. Assuming the mean reward function to be smooth we make use of results from low-rank matrix recovery literature and derive an efficient randomized algorithm which achieves a regret bound of $O(C(k,d) n^{\frac{1+k}{2+k}} (\log n)^{\frac{1}{2+k}})$ with high probability. Here $C(k,d)$ is at most polynomial in $d$ and $k$ and $n$ is the number of rounds or the sampling budget which is assumed to be known beforehand.