On Slowly-varying Non-stationary Bandits

Krishnamurthy, Ramakrishnan, Gopalan, Aditya

Oct-25-2021, arXiv.org Machine Learning

Reinforcement learning, and specifically bandit optimization, in dynamically changing environments remains an active topic of study in machine learning. A variety of non-stationary bandit settings have been studied, incorporating a range of structural assumptions. At one end are classical stochastic models such as restless bandits [Whittle, 1988], where the state of the arms governs the bandit problem at any instant, but the transitions between these problems (states) follow probabilistic dynamics. At the other extreme are settings with non-stochastic and arbitrarily changing rewards, such as prediction with expert advice (and the EXP3 algorithm) [Cesa-Bianchi and Lugosi, 2006; Auer et al., 2002]. In between these extremes lie settings of changing environments where the adversary (environment) is assumed to be limited in its ability to change the rewards, i.e., a structural constraint is put on the amount of change in the rewards across time. These include the abrupt change (or switching experts) model [Garivier and Moulines, 2011], where at most k arbitrary changes to the reward distributions are allowed over the entire time horizon, and the variation-budgeted (drifting) change model [Besbes et al., 2014], in which the total magnitude of changes in rewards across successive time steps is constrained to be within an overall budget. In this paper, we focus on slowly-varying bandits, a different and arguably commonly encountered, yet less studied, model of non-stationary bandits. In this setting, the arms are allowed to change arbitrarily over time as long as the amount of change in their mean rewards between two successive time steps is bounded uniformly across the horizon.
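To make the three structural constraints above concrete, the following is a minimal Python sketch, not taken from the paper. It assumes the mean rewards are given as a table `mu[t][a]` (time step `t`, arm `a`), and it measures change between successive steps as the largest absolute change over arms. The function names, the drift bound `delta`, and the tolerance `tol` are illustrative choices rather than the paper's notation.

```python
from typing import List, Sequence

# Assumed layout: mu[t][a] is the mean reward of arm a at time step t.
MeanPath = Sequence[Sequence[float]]


def per_step_drift(mu: MeanPath) -> List[float]:
    """Largest absolute change in any arm's mean between steps t and t+1."""
    return [
        max(abs(nxt - cur) for cur, nxt in zip(mu[t], mu[t + 1]))
        for t in range(len(mu) - 1)
    ]


def num_changes(mu: MeanPath, tol: float = 1e-12) -> int:
    """Abrupt-change view: number of steps at which some mean changes."""
    return sum(1 for d in per_step_drift(mu) if d > tol)


def total_variation(mu: MeanPath) -> float:
    """Variation-budget view: summed per-step change over the horizon."""
    return sum(per_step_drift(mu))


def is_slowly_varying(mu: MeanPath, delta: float) -> bool:
    """Slowly-varying view: every single step changes by at most delta."""
    return all(d <= delta for d in per_step_drift(mu))


# Hypothetical two-arm mean paths, chosen only for illustration.
# ramp: arm 1 drifts upward by 0.01 per step for 50 steps.
ramp = [[0.5, 0.2 + 0.01 * t] for t in range(50)]
# abrupt: the two arms swap means once, halfway through 10 steps.
abrupt = [[0.2, 0.8]] * 5 + [[0.8, 0.2]] * 5

if __name__ == "__main__":
    for name, path in (("ramp", ramp), ("abrupt", abrupt)):
        print(
            name,
            "changes:", num_changes(path),
            "variation:", round(total_variation(path), 3),
            "slowly varying (delta=0.02):", is_slowly_varying(path, 0.02),
        )
```

In this sketch the ramp changes at every step, so it is a poor fit for a model that allows only a few changes, yet each step is small, so it satisfies the per-step bound. The single swap is the opposite case: one change, but far too large for a small per-step bound.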
Many real-world settings naturally involve observables whose distributions are 'smooth' over time, in the sense that their instantaneous rate of change is not too large, e.g., slowly drifting distributions in natural language tasks [Lu et al., 2020], data from physical transducers (position, velocity, power, temperature, chemical concentration), and slowly fading wireless


