Bandit Convex Optimisation
arXiv.org Artificial Intelligence
Bandit problems are most commonly thought of in terms of sequential decision making, or as an elementary version of reinforcement learning. This is certainly true, but they are also intimately connected with optimisation. These notes focus on the convex bandit problem, where the set of actions is a convex subset of Euclidean space and the function mapping actions to losses is convex. The learner chooses actions and observes the corresponding losses, possibly with additive noise, and its task is to minimise the loss. Phrased like this, the problem resembles zeroth-order optimisation, and these notes largely take that viewpoint. We do, however, borrow two key notions from the bandit community that give the algorithms and analysis a unique flavour:

- we focus on cumulative regret as the performance criterion; and
- we (mostly) consider the adversarial setting, where the loss function may change from one query to the next.

These differences mean the algorithms and theorems presented here are often not directly comparable to standard settings in the optimisation literature. Generally speaking, the theorems presented here hold under fewer assumptions and are consequently sometimes weaker.
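To make the interaction protocol concrete, here is a minimal, entirely illustrative sketch of the convex bandit loop: a time-varying quadratic loss on the interval [-1, 1], bandit feedback (the learner sees only the loss value at its queried point), and cumulative regret measured against the best fixed action in hindsight. The learner shown is a simple one-point zeroth-order gradient method in the spirit of Flaxman-Kalai-McMahan; all constants, the loss family, and the step sizes are assumptions chosen for the sketch, not anything prescribed by the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: action set K = [-1, 1], and an "adversary" that
# picks a convex loss f_t(x) = (x - c_t)^2 at each round (assumed here).
T = 500
centres = rng.uniform(-0.5, 0.5, size=T)

def loss(t, x):
    return (x - centres[t]) ** 2

# A simple zeroth-order learner using a one-point gradient estimate.
# Bandit feedback: only the loss value at the queried action is observed.
x = 0.0
delta, eta = 0.1, 0.05          # perturbation size and step size (assumed)
cumulative_loss = 0.0
for t in range(T):
    s = rng.choice([-1.0, 1.0])             # random perturbation direction
    y = float(np.clip(x + delta * s, -1.0, 1.0))  # queried action
    ell = loss(t, y)                        # observed loss (the only feedback)
    cumulative_loss += ell
    g = (ell / delta) * s                   # one-point gradient estimate
    x = float(np.clip(x - eta * g, -1.0 + delta, 1.0 - delta))

# Cumulative regret: total loss minus that of the best fixed action,
# approximated here by a grid search over K.
grid = np.linspace(-1.0, 1.0, 2001)
best_fixed = min(sum(loss(t, a) for t in range(T)) for a in grid)
regret = cumulative_loss - best_fixed
```

The exploration perturbation `delta` means the learner pays a small extra loss every round even when its base point is near-optimal, which is one reason cumulative regret, rather than final suboptimality, is the natural yardstick in this setting.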
Feb-9-2024