Best-of-Both-Worlds Algorithms for Linear Contextual Bandits

Yuko Kuroki, Alberto Rumi, Taira Tsuchiya, Fabio Vitale, Nicolò Cesa-Bianchi

arXiv.org Machine Learning 

Because of their relevance in practical applications, contextual bandits are a fundamental model of sequential decision-making with partial feedback. In particular, linear contextual bandits [Abe and Long, 1999; Auer, 2002], in which contexts are feature vectors and the loss is a linear function of the context, are among the most studied variants of contextual bandits. Traditionally, contextual bandits (and their linear variant in particular) have been investigated under stochastic assumptions on the generation of rewards: the loss of each action is a fixed and unknown linear function of the context, to which zero-mean noise is added. For this setting, efficient and nearly optimal algorithms, such as OFUL [Abbasi-Yadkori et al., 2011] and a contextual variant of Thompson Sampling [Agrawal and Goyal, 2013], have been proposed. Recently, Neu and Olkhovskaya [2020] introduced an adversarial variant of linear contextual bandits with $K$ arms, where the linear loss associated with each arm is chosen adversarially in each round. They prove an upper bound on the regret of order $\sqrt{KdT}$, disregarding logarithmic factors, where $d$ is the dimensionality of the contexts and $T$ is the time horizon. A matching lower bound $\Omega(\sqrt{KdT})$ for this model is implied by the results of Zierahn et al. [2023]. The upper bound has recently been extended by Olkhovskaya et al. [2023], who show first- and second-order regret bounds respectively of order $K\sqrt{dL^*}$ and $K\sqrt{d\Lambda^*}$, where $L^*$ is the cumulative loss of the best policy and $\Lambda^*$ is a cumulative second moment of the losses.
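To make the stochastic protocol above concrete, here is a minimal simulation sketch, not taken from the paper: each arm's loss is a fixed unknown linear function of the context plus zero-mean noise, and the learner is a disjoint, LinUCB-style optimistic learner used as a simple stand-in for OFUL (which maintains a similar confidence ellipsoid). The dimensions d, K, T, the noise level, and the confidence width alpha are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
d, K, T = 5, 3, 10_000  # illustrative context dimension, number of arms, horizon

# Stochastic regime: one fixed, unknown parameter vector per arm.
theta = rng.normal(size=(K, d)) / np.sqrt(d)

# Per-arm ridge-regression state, as in disjoint LinUCB.
lam, alpha = 1.0, 1.0       # regularization and confidence width (assumed values)
A = np.stack([lam * np.eye(d) for _ in range(K)])  # per-arm Gram matrices
b = np.zeros((K, d))                               # per-arm response vectors

regret = 0.0
for t in range(T):
    x = rng.normal(size=d) / np.sqrt(d)  # context revealed at the start of the round

    # Optimistic loss estimate for each arm: point estimate minus confidence width.
    scores = np.empty(K)
    for a in range(K):
        A_inv = np.linalg.inv(A[a])
        theta_hat = A_inv @ b[a]
        width = alpha * np.sqrt(x @ A_inv @ x)
        scores[a] = x @ theta_hat - width  # optimism for losses: subtract the width

    a_t = int(np.argmin(scores))
    loss = x @ theta[a_t] + 0.1 * rng.normal()  # linear loss plus zero-mean noise

    # Bandit feedback: update only the played arm's statistics.
    A[a_t] += np.outer(x, x)
    b[a_t] += loss * x

    # Pseudo-regret against the best arm for this context (uses the hidden theta).
    regret += x @ theta[a_t] - min(x @ theta[a] for a in range(K))

print(f"cumulative pseudo-regret after {T} rounds: {regret:.1f}")

In the adversarial variant studied by Neu and Olkhovskaya [2020], the per-arm parameter vectors would instead be chosen anew by an adversary in every round, so the fixed theta above would become a sequence theta_t and this optimistic learner would no longer be appropriate.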
