Correlated Bandits for Dynamic Pricing via the ARC algorithm
Samuel Cohen, Tanut Treetanthiploet
In the multi-armed bandit problem, a decision maker must sequentially decide between acting to reveal data about a system and acting to generate profit. The central idea of the multi-armed bandit is that the agent has K 'options', or equivalently, a bandit with K arms, and must choose which arm to play at each time. Playing an arm yields a reward generated from a fixed but unknown distribution, which must be inferred 'on-the-fly'. In the classic multi-armed bandit problem, the reward of each arm is assumed to be independent of the others (Gittins and Jones [9], Agrawal [1], Lattimore and Szepesvári [13]), and it is the only observation obtained by the decision maker at each step. In practice, we often observe signals in addition to the rewards, and there is often correlation between the distributions of outcomes for different choices. For example, in a dynamic pricing problem (Dubé and Misra [6], Misra et al. [14]), an agent wants to fix the price of a single product, from a finite set of prices {c
Feb-8-2021
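The classic independent-arm setting described above can be sketched with a simple epsilon-greedy policy, where the agent estimates each arm's mean reward on-the-fly. This is an illustrative sketch only, not the ARC algorithm of the paper; the arm means and policy parameters below are assumptions for the example.

```python
import random

def epsilon_greedy_bandit(true_means, n_rounds=5000, epsilon=0.1, seed=0):
    """Play a K-armed Bernoulli bandit with an epsilon-greedy policy.

    true_means: success probability of each arm (unknown to the agent).
    Returns the empirical mean-reward estimates and pull counts per arm.
    """
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k
    estimates = [0.0] * k
    for _ in range(n_rounds):
        # Explore a random arm with probability epsilon;
        # otherwise exploit the arm with the best current estimate.
        if rng.random() < epsilon:
            arm = rng.randrange(k)
        else:
            arm = max(range(k), key=lambda a: estimates[a])
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running mean reward for this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts

estimates, counts = epsilon_greedy_bandit([0.2, 0.5, 0.8])
```

Because each arm's estimate is updated only from its own pulls, this sketch treats the arms as independent, which is precisely the assumption the correlated-bandit setting relaxes.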