Neural Thompson Sampling
Weitong Zhang, Dongruo Zhou, Lihong Li, Quanquan Gu
The stochastic multi-armed bandit (Bubeck and Cesa-Bianchi, 2012; Lattimore and Szepesvári, 2020) has been studied extensively as a model for optimizing the tradeoff between exploration and exploitation in sequential decision making. Among its many variants, the contextual bandit is widely used in real-world applications such as recommendation (Li et al., 2010), advertising (Graepel et al., 2010), robotic control (Mahler et al., 2016), and healthcare (Greenewald et al., 2017). In each round of a contextual bandit, the agent observes a feature vector (the "context") for each of the K arms, pulls one of them, and in return receives a scalar reward. The goal is to maximize the cumulative reward, or equivalently to minimize regret (to be defined later), over a total of T rounds. To do so, the agent must balance exploration and exploitation. One of the most effective and widely used techniques for this is Thompson Sampling, or TS (Thompson, 1933). The basic idea is to compute the posterior probability that each arm is optimal for the present context, and to sample an arm from the resulting distribution. TS is often easy to implement and has found great success in practice (Chapelle and Li, 2011; Graepel et al., 2010; Kawale et al., 2015; Russo et al., 2017). Recently, a series of works has applied TS or its variants to exploration in contextual bandits with neural network models (Blundell et al., 2015; Kveton et al., 2020; Lu and Van Roy, 2017; Riquelme
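The round-by-round protocol and the sampling idea described above can be illustrated with a minimal sketch. Note that this is not the neural method the paper proposes: it assumes a linear reward model with a Gaussian prior and Gaussian noise, which keeps the posterior in closed form. The function name `linear_thompson_sampling`, its parameters, and the use of expected-reward regret are all assumptions made for illustration.

```python
import numpy as np


def linear_thompson_sampling(contexts, true_theta, noise_std=0.1,
                             prior_var=1.0, seed=0):
    """Linear-Gaussian Thompson Sampling on a simulated contextual bandit.

    contexts:   array of shape (T, K, d), one feature vector per arm per round
    true_theta: array of shape (d,), the hidden parameter (assumed linear model)
    Returns the cumulative expected regret after each round, shape (T,).
    """
    rng = np.random.default_rng(seed)
    T, K, d = contexts.shape
    noise_var = noise_std ** 2

    # Gaussian posterior over theta, stored as a precision matrix and a vector.
    precision = np.eye(d) / prior_var
    b = np.zeros(d)

    cumulative_regret = np.zeros(T)
    total = 0.0
    for t in range(T):
        X = contexts[t]                      # (K, d) contexts for this round

        # Draw one parameter vector from the current posterior.
        cov = np.linalg.inv(precision)
        cov = (cov + cov.T) / 2.0            # guard against round-off asymmetry
        mean = cov @ b
        theta_sample = rng.multivariate_normal(mean, cov)

        # Pull the arm that is best under the sampled parameter.
        arm = int(np.argmax(X @ theta_sample))

        # Observe a noisy scalar reward for the pulled arm only.
        expected = X @ true_theta
        reward = expected[arm] + noise_std * rng.normal()

        # Bayesian linear-regression update with the new observation.
        precision += np.outer(X[arm], X[arm]) / noise_var
        b += X[arm] * reward / noise_var

        # Regret here: gap between the best arm's expected reward and ours.
        total += expected.max() - expected[arm]
        cumulative_regret[t] = total
    return cumulative_regret
```

A hypothetical run might draw `contexts` from a standard normal with shape `(T, K, d)` and pass any fixed `true_theta`; plotting the returned array against the round index shows how quickly the per-round regret shrinks as the posterior concentrates. Sampling a parameter and acting greedily on it selects each arm with its posterior probability of being optimal, which is the sampling step the text describes.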
Oct-2-2020