Chained Information-Theoretic bounds and Tight Regret Rate for Linear Bandit Problems

Gouverneur, Amaury, Rodríguez-Gálvez, Borja, Oechtering, Tobias J., Skoglund, Mikael

arXiv.org Machine Learning 

Bandit problems are a class of decision problems in which an agent interacts sequentially with an unknown environment by choosing actions and earning rewards in return. The goal of the agent is to maximize its expected cumulative reward, that is, the expected sum of rewards earned throughout its interaction with the environment. This necessitates a delicate balance between exploring different actions to gather information for potential future rewards, and exploiting known actions to receive immediate gains. The theoretical study of an algorithm's performance in a bandit problem is done by analyzing its expected regret, defined as the difference between the hypothetical cumulative reward that an oracle would obtain by choosing the optimal action at each time step and the cumulative reward of the algorithm. An effective method for achieving small regret is the Thompson Sampling (TS) algorithm [3], which, despite its simplicity, has shown remarkable performance [4, 5, 6]. Studying the Thompson Sampling regret, [1] introduced the concept of the information ratio, a statistic that captures the trade-off between the information gained by the algorithm about the environment and the immediate regret.
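To make the exploration-exploitation mechanism concrete, here is a minimal sketch of Thompson Sampling on a Bernoulli bandit (not the linear setting studied in the paper): each arm's unknown mean reward gets a Beta prior, the agent samples a mean from each posterior and plays the argmax, then updates the posterior of the played arm. The arm means and horizon below are arbitrary illustrative values.

```python
import random

def thompson_sampling(true_means, horizon, seed=0):
    """Thompson Sampling on a Bernoulli bandit (illustrative sketch).

    true_means: the arms' unknown success probabilities (hypothetical
    values for the simulation; the agent never reads them directly).
    Returns (cumulative reward, cumulative pseudo-regret).
    """
    rng = random.Random(seed)
    k = len(true_means)
    # Beta(1, 1) prior on each arm's mean reward.
    alpha = [1] * k
    beta = [1] * k
    best = max(true_means)
    reward_sum = 0.0
    regret = 0.0
    for _ in range(horizon):
        # Posterior sampling: draw a plausible mean for each arm and play
        # the argmax. This single step balances exploration (uncertain arms
        # produce spread-out samples) and exploitation (well-estimated good
        # arms usually win the argmax).
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        a = max(range(k), key=lambda i: samples[i])
        r = 1 if rng.random() < true_means[a] else 0
        # Conjugate Beta-Bernoulli update of the played arm's posterior.
        alpha[a] += r
        beta[a] += 1 - r
        reward_sum += r
        # Pseudo-regret: gap to the oracle that always plays the best arm.
        regret += best - true_means[a]
    return reward_sum, regret

rew, reg = thompson_sampling([0.3, 0.5, 0.7], horizon=2000)
```

After a few hundred rounds the posteriors concentrate, so the per-step regret shrinks and the cumulative regret grows sublinearly, which is exactly the behavior the information-ratio analysis of [1] quantifies.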
