Goto

Collaborating Authors

 weighted linear bandit


Weighted Linear Bandits for Non-Stationary Environments

Neural Information Processing Systems

We consider a stochastic linear bandit model in which the available actions correspond to arbitrary context vectors whose associated rewards follow a non-stationary linear regression model. In this setting, the unknown regression parameter is allowed to vary in time. To address this problem, we propose D-LinUCB, a novel optimistic algorithm based on discounted linear regression, where exponential weights are used to smoothly forget the past. This involves studying the deviations of the sequential weighted least-squares estimator under generic assumptions. As a by-product, we obtain novel deviation results that can be used beyond non-stationary environments. We provide theoretical guarantees on the behavior of D-LinUCB in both slowly-varying and abruptly-changing environments. We obtain an upper bound on the dynamic regret that is of order d B T is a measure of non-stationarity (d and T being, respectively, dimension and horizon). This rate is known to be optimal. We also illustrate the empirical performance of D-LinUCB and compare it with recently proposed alternatives in simulated environments.



Reviews: Weighted Linear Bandits for Non-Stationary Environments

Neural Information Processing Systems

Update (after reading the rebuttals): After reading the rebuttal of authors, I have addressed my concerns on the novelty of the new self-normalized concentration, since the key point is that the coefficient of regularizer is changing. I indeed appreciate this work. The idea of this paper is natural but there indeed exist technical challenges, and the authors address these issues elegantly. So I think it deserves an acceptance. Nevertheless, there are still many typos in current verison besides those listed before, for example, in Theorem 2, eq.


Weighted Linear Bandits for Non-Stationary Environments

Neural Information Processing Systems

We consider a stochastic linear bandit model in which the available actions correspond to arbitrary context vectors whose associated rewards follow a non-stationary linear regression model. In this setting, the unknown regression parameter is allowed to vary in time. To address this problem, we propose D-LinUCB, a novel optimistic algorithm based on discounted linear regression, where exponential weights are used to smoothly forget the past. This involves studying the deviations of the sequential weighted least-squares estimator under generic assumptions. As a by-product, we obtain novel deviation results that can be used beyond non-stationary environments.


Weighted Linear Bandits for Non-Stationary Environments

Neural Information Processing Systems

We consider a stochastic linear bandit model in which the available actions correspond to arbitrary context vectors whose associated rewards follow a non-stationary linear regression model. In this setting, the unknown regression parameter is allowed to vary in time. To address this problem, we propose D-LinUCB, a novel optimistic algorithm based on discounted linear regression, where exponential weights are used to smoothly forget the past. This involves studying the deviations of the sequential weighted least-squares estimator under generic assumptions. As a by-product, we obtain novel deviation results that can be used beyond non-stationary environments.