An Optimization-based Algorithm for Non-stationary Kernel Bandits without Prior Knowledge

Kihyuk Hong, Yuhang Li, Ambuj Tewari

arXiv.org Artificial Intelligence 

The linear bandit (LB) problem [1] and the kernel bandit (KB) problem [2] are important paradigms for sequential decision making under uncertainty. They extend the multi-armed bandit (MAB) problem [3] by modeling the reward as a function of side information about each arm, provided as a feature vector. LB assumes the reward function is linear in the feature vector. KB extends LB to model non-linearity by assuming the reward function lies in a reproducing kernel Hilbert space (RKHS) induced by a kernel. A recent line of work studies non-stationary variants of LB and KB, in which the reward function can vary over time subject to one of two main types of non-stationarity budget: the number of changes and the total variation of the sequence of reward functions. A common design principle for adapting to non-stationarity is forgetting the past. It has been applied to the non-stationary MAB to design nearly minimax optimal algorithms [4, 5], and similarly to the non-stationary LB [6-9] and the non-stationary KB [10, 11]. Recently, Zhao et al. [12] found an error in a key technical lemma of Cheung et al. [6] that affects the concentration bound for regression-based reward estimates under non-stationarity.
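
As a point of reference (the notation here is illustrative; the exact definitions and choice of norm vary across the cited works), for a sequence of reward functions $f_1, \dots, f_T$ the two budgets are commonly written as
$$
P_T \;=\; 1 + \sum_{t=2}^{T} \mathbb{1}\{f_t \neq f_{t-1}\},
\qquad
V_T \;=\; \sum_{t=2}^{T} \lVert f_t - f_{t-1} \rVert,
$$
where $P_T$ counts the number of change points and $V_T$ measures the total variation of the sequence, with the norm typically taken to be the sup-norm over arms or the RKHS norm.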
