UCB Algorithm for Exponential Distributions
Jouini, Wassim, Moy, Christophe
Several sequential decision making problems face a dilemma between the exploration of a space of choices, or solutions, and the exploitation of the information available to the decision maker. The problem described herein is known as sequential decision making under uncertainty. In this paper we focus on a subclass of this problem, where the decision maker has a discrete set of stateless choices and the added information is a real valued sequence (of feedbacks, or rewards) that quantifies how well the decision maker behaved in the previous time steps. This particular instance of sequential decision making problems is generally known as the multi-armed bandit (MAB) problem [1], [2]. A common approach to solving the exploration versus exploitation dilemma within MAB problems consists in assigning an utility value to every arm.
Apr-7-2012