Zhang, Weitong
Neural Thompson Sampling
Zhang, Weitong, Zhou, Dongruo, Li, Lihong, Gu, Quanquan
The stochastic multi-armed bandit (Bubeck and Cesa-Bianchi, 2012; Lattimore and Szepesvári, 2020) has been extensively studied as an important model for optimizing the tradeoff between exploration and exploitation in sequential decision making. Among its many variants, the contextual bandit is widely used in real-world applications such as recommendation (Li et al., 2010), advertising (Graepel et al., 2010), robotic control (Mahler et al., 2016), and healthcare (Greenewald et al., 2017). In each round of a contextual bandit, the agent observes a feature vector (the "context") for each of the K arms, pulls one of them, and in return receives a scalar reward. The goal is to maximize the cumulative reward, or equivalently to minimize the regret (defined later), over a total of T rounds. To do so, the agent must trade off exploration against exploitation. One of the most effective and widely used techniques is Thompson Sampling, or TS (Thompson, 1933). The basic idea is to compute the posterior probability of each arm being optimal for the present context and to sample an arm from this distribution. TS is often easy to implement and has found great success in practice (Chapelle and Li, 2011; Graepel et al., 2010; Kawale et al., 2015; Russo et al., 2017). Recently, a line of work has applied TS or its variants to exploration in contextual bandits with neural network models (Blundell et al., 2015; Kveton et al., 2020; Lu and Van Roy, 2017; Riquelme et al., 2018).
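To make the TS loop concrete, the sketch below shows Thompson Sampling for a contextual bandit with a linear reward model and a Gaussian posterior. This is an illustrative Lin-TS-style variant, not the neural algorithm studied in the paper; the callables `contexts_fn` and `reward_fn` and the parameters `nu` and `lam` are hypothetical placeholders.

```python
import numpy as np

def contextual_thompson_sampling(contexts_fn, reward_fn, T, d, nu=1.0, lam=1.0):
    """Minimal Thompson Sampling loop for a contextual bandit with a linear
    reward model and a Gaussian posterior (illustrative sketch only)."""
    A = lam * np.eye(d)          # posterior precision matrix
    b = np.zeros(d)              # accumulated reward-weighted contexts
    rewards = []
    for t in range(T):
        X = contexts_fn(t)                                # K x d context matrix for round t
        mu = np.linalg.solve(A, b)                        # posterior mean of the parameter
        cov = nu ** 2 * np.linalg.inv(A)                  # posterior covariance
        theta = np.random.multivariate_normal(mu, cov)    # sample a parameter from the posterior
        arm = int(np.argmax(X @ theta))                   # pull the arm with the largest sampled reward
        r = reward_fn(t, arm)                             # observe the scalar reward
        A += np.outer(X[arm], X[arm])                     # rank-one update of the posterior precision
        b += r * X[arm]
        rewards.append(r)
    return rewards
```

The sampling step is what drives exploration: arms whose posterior reward is uncertain are occasionally sampled as optimal and thus pulled, while arms that are confidently suboptimal are pulled rarely.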
A Finite Time Analysis of Two Time-Scale Actor Critic Methods
Wu, Yue, Zhang, Weitong, Xu, Pan, Gu, Quanquan
Actor-critic (AC) methods have exhibited great empirical success compared with other reinforcement learning algorithms, where the actor uses the policy gradient to improve the policy and the critic uses temporal difference learning to evaluate the current policy. Under the two time-scale learning rate schedule, the asymptotic convergence of AC has been well studied in the literature. However, the non-asymptotic convergence and finite sample complexity of actor-critic methods are largely open. In this work, we provide a non-asymptotic analysis for two time-scale actor-critic methods in the non-i.i.d. setting. We prove that the actor-critic method is guaranteed to find a first-order stationary point (i.e., $\|\nabla J(\boldsymbol{\theta})\|_2^2 \le \epsilon$) of the non-concave performance function $J(\boldsymbol{\theta})$ with $\tilde{\mathcal{O}}(\epsilon^{-2.5})$ sample complexity. To the best of our knowledge, this is the first work providing a finite-time analysis and sample complexity bound for two time-scale actor-critic methods.
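For intuition, here is a minimal single-sample sketch of a two time-scale actor-critic update with a softmax policy and a linear critic, where the critic's step size decays more slowly (the faster time scale) than the actor's. The environment interface (`env.reset`, `env.step`, `env.features`, `env.num_actions`) and the step-size exponents are illustrative assumptions, not the exact algorithm or schedule analyzed in the paper.

```python
import numpy as np

def two_timescale_actor_critic(env, num_steps, gamma=0.99):
    """Illustrative single-sample two time-scale actor-critic sketch.

    Assumed (hypothetical) env interface: env.reset() -> state,
    env.step(a) -> (next_state, reward, done),
    env.features(state) -> 1-D feature vector, env.num_actions -> int.
    """
    s = env.reset()
    d = env.features(s).shape[0]
    theta = np.zeros((env.num_actions, d))   # softmax policy (actor) parameters
    w = np.zeros(d)                          # linear value-function (critic) weights
    for t in range(num_steps):
        phi = env.features(s)
        logits = theta @ phi
        pi = np.exp(logits - logits.max())
        pi /= pi.sum()
        a = np.random.choice(env.num_actions, p=pi)
        # Score function grad_theta log pi(a|s) for the softmax policy.
        score = -np.outer(pi, phi)
        score[a] += phi
        s_next, r, done = env.step(a)
        phi_next = env.features(s_next)
        delta = r + gamma * (w @ phi_next) - w @ phi   # TD error from the critic
        # Two time scales: the critic step size decays more slowly than the actor's.
        alpha_t = 0.05 / (t + 1) ** 0.6                # actor (slow) step size
        beta_t = 0.10 / (t + 1) ** 0.4                 # critic (fast) step size
        w += beta_t * delta * phi                      # critic: TD(0) update
        theta += alpha_t * delta * score               # actor: policy gradient step
        s = env.reset() if done else s_next
    return theta, w
```

The separation of step sizes is the defining feature of the two time-scale schedule: the critic tracks the value of the slowly changing policy, and the actor ascends the (estimated) policy gradient built from the critic's TD error.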