Minimax Policy for Heavy-tailed Multi-armed Bandits
In these contexts, exploration means learning the environment while exploitation means taking empirically computed best actions. When finite-time performance is concerned, i.e., in scenarios in which one cannot learn indefinitely, ensuring a good balance of exploration and exploitation is the key to good performance. MAB and its variations are prototypical models for these problems, and they are widely used in many areas such as network routing, recommendation systems, and resource allocation; see [1, Chapter 1].

The stochastic MAB problem was originally proposed by Robbins [2]. In this problem, at each time, an agent chooses an arm from a set of K arms and receives the associated reward.

In the worst-case setting, the lower and upper bounds are distribution-free. Assuming the rewards are bounded, Audibert and Bubeck [6] establish an Ω(√(KT)) lower bound on the minimax regret. They also studied a modified UCB algorithm called Minimax Optimal Strategy in the Stochastic case (MOSS) and proved that it achieves an order-optimal worst-case regret while maintaining a logarithmic distribution-dependent regret. Degenne and Perchet [7] extend MOSS to an anytime version called MOSS-anytime.

The rewards being bounded or sub-Gaussian is a common assumption that gives the sample mean an exponential convergence rate and simplifies the MAB problem.
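To make the reference to MOSS concrete, the following is a minimal sketch of a MOSS-style index policy for bounded rewards. It is an illustration under assumed notation (horizon T, K arms, per-arm pull counts and sample means), not the policy proposed in this paper, and all function and variable names (moss_index, run_moss, and so on) are hypothetical.

import math
import random

def moss_index(mean_estimate, pulls, horizon, num_arms):
    # MOSS index: sample mean plus the exploration bonus
    # sqrt(max(log(T / (K * n_i)), 0) / n_i), where T is the horizon,
    # K the number of arms, and n_i the pull count of arm i.
    bonus = math.sqrt(max(math.log(horizon / (num_arms * pulls)), 0.0) / pulls)
    return mean_estimate + bonus

def run_moss(arms, horizon):
    # arms: list of callables, each returning a bounded random reward.
    K = len(arms)
    pulls = [0] * K
    means = [0.0] * K
    for t in range(horizon):
        if t < K:
            i = t  # pull each arm once to initialize its sample mean
        else:
            i = max(range(K),
                    key=lambda a: moss_index(means[a], pulls[a], horizon, K))
        reward = arms[i]()
        pulls[i] += 1
        means[i] += (reward - means[i]) / pulls[i]  # incremental sample mean
    return means, pulls

if __name__ == "__main__":
    # Example: two Bernoulli arms with means 0.4 and 0.6.
    arms = [lambda: float(random.random() < 0.4),
            lambda: float(random.random() < 0.6)]
    print(run_moss(arms, horizon=10000))

Compared with standard UCB, the only change in this sketch is the exploration bonus, which uses the known horizon T through log(T/(K n_i)) instead of a log t term; this is what drives the order-optimal worst-case regret attributed to MOSS above.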