Functional Bandits
Tran-Thanh, Long, Yu, Jia Yuan
The stochastic multi-armed bandit (MAB) model consists of a slot machine with K arms (or actions), each of which delivers rewards that are independently and randomly drawn from an unknown distribution when pulled. In the optimalarm identification problem, the aim is to find an arm with the highest expected reward value. To do so, we can pull the arms and learn (i.e., estimate) their mean rewards. That is, our goal is to distribute a finite budget of T pulls among the arms, such that at the end of the process, we can identify the optimal arm as accurately as possible. This stochastic optimisation problem models many practical applications, ranging from keyword bidding strategy optimisation in sponsored search[Amin et al., 2012], to identifying the best medicines in medical trials [Robbins, 1952], and efficient transmission channel detection in wireless communication networks [Avner, Mannor, and Shamir, 2012]. Although this MAB optimisation model is a well-studied in the online learning community, the focus is on finding the arm with the highest expected reward value [Maron and Moore, 1993, Mnih, Szepesvári, and Audibert, 2008, Audibert, Bubeck, and Munos, 2010b, Karnin, Koren, and Somekh, 2013].
May-10-2014
- Country:
- Europe
- Ireland > Leinster
- County Dublin > Dublin (0.04)
- United Kingdom > England
- Cambridgeshire > Cambridge (0.04)
- Hampshire > Southampton (0.04)
- Ireland > Leinster
- Europe
- Genre:
- Research Report (0.50)
- Technology: