Goto

Collaborating Authors

 thompson







Thompson sampling: Precise arm-pull dynamics and adaptive inference

Han, Qiyang

arXiv.org Machine Learning

Adaptive sampling schemes are well known to create complex dependence that may invalidate conventional inference methods. A recent line of work shows that this need not be the case for UCB-type algorithms in multi-armed bandits. A central emerging theme is a `stability' property with asymptotically deterministic arm-pull counts in these algorithms, making inference as easy as in the i.i.d. setting. In this paper, we study the precise arm-pull dynamics in another canonical class of Thompson-sampling type algorithms. We show that the phenomenology is qualitatively different: the arm-pull count is asymptotically deterministic if and only if the arm is suboptimal or is the unique optimal arm; otherwise it converges in distribution to the unique invariant law of an SDE. This dichotomy uncovers a unifying principle behind many existing (in)stability results: an arm is stable if and only if its interaction with statistical noise is asymptotically negligible. As an application, we show that normalized arm means obey the same dichotomy, with Gaussian limits for stable arms and a semi-universal, non-Gaussian limit for unstable arms. This not only enables the construction of confidence intervals for the unknown mean rewards despite non-normality, but also reveals the potential of developing tractable inference procedures beyond the stable regime. The proofs rely on two new approaches. For suboptimal arms, we develop an `inverse process' approach that characterizes the inverse of the arm-pull count process via a Stieltjes integral. For optimal arms, we adopt a reparametrization of the arm-pull and noise processes that reduces the singularity in the natural SDE to proving the uniqueness of the invariant law of another SDE. We prove the latter by a set of analytic tools, including the parabolic Hörmander condition and the Stroock-Varadhan support theorem.


BanditLP: Large-Scale Stochastic Optimization for Personalized Recommendations

Nguyen, Phuc, Zelditch, Benjamin, Chen, Joyce, Patra, Rohit, Wei, Changshuai

arXiv.org Machine Learning

We present BanditLP, a scalable multi-stakeholder contextual bandit framework that unifies neural Thompson Sampling for learning objective-specific outcomes with a large-scale linear program for constrained action selection at serving time. The methodology is application-agnostic, compatible with arbitrary neural architectures, and deployable at web scale, with an LP solver capable of handling billions of variables. Experiments on public benchmarks and synthetic data show consistent gains over strong baselines. We apply this approach in LinkedIn's email marketing system and demonstrate business win, illustrating the value of integrated exploration and constrained optimization in production.


Thompson Sampling and Approximate Inference

Neural Information Processing Systems

We study the effects of approximate inference on the performance of Thompson sampling in the $k$-armed bandit problems. Thompson sampling is a successful algorithm for online decision-making but requires posterior inference, which often must be approximated in practice.


Finite-Time Regret of Thompson Sampling Algorithms for Exponential Family Multi-Armed Bandits

Neural Information Processing Systems

We study the regret of Thompson sampling (TS) algorithms for exponential family bandits, where the reward distribution is from a one-dimensional exponential family, which covers many common reward distributions including Bernoulli, Gaussian, Gamma, Exponential, etc. We propose a Thompson sampling algorithm, termed ExpTS, which uses a novel sampling distribution to avoid the under-estimation of the optimal arm. We provide a tight regret analysis for ExpTS, which simultaneously yields both the finite-time regret bound as well as the asymptotic regret bound. In particular, for a $K$-armed bandit with exponential family rewards, ExpTS over a horizon $T$ is sub-UCB (a strong criterion for the finite-time regret that is problem-dependent), minimax optimal up to a factor $\sqrt{\log K}$, and asymptotically optimal, for exponential family rewards. Moreover, we propose ExpTS$^+$, by adding a greedy exploitation step in addition to the sampling distribution used in ExpTS, to avoid the over-estimation of sub-optimal arms. ExpTS$^+$ is an anytime bandit algorithm and achieves the minimax optimality and asymptotic optimality simultaneously for exponential family reward distributions. Our proof techniques are general and conceptually simple and can be easily applied to analyze standard Thompson sampling with specific reward distributions.


Bayesian decision-making under misspecified priors with applications to meta-learning

Neural Information Processing Systems

Thompson sampling and other Bayesian sequential decision-making algorithms are among the most popular approaches to tackle explore/exploit trade-offs in (contextual) bandits. The choice of prior in these algorithms offers flexibility to encode domain knowledge but can also lead to poor performance when misspecified. In this paper, we demonstrate that performance degrades gracefully with misspecification. We prove that the expected reward accrued by Thompson sampling (TS) with a misspecified prior differs by at most $\tilde{O}(H^2 \epsilon)$ from TS with a well-specified prior, where $\epsilon$ is the total-variation distance between priors and $H$ is the learning horizon. Our bound does not require the prior to have any parametric form. For priors with bounded support, our bound is independent of the cardinality or structure of the action space, and we show that it is tight up to universal constants in the worst case.Building on our sensitivity analysis, we establish generic PAC guarantees for algorithms in the recently studied Bayesian meta-learning setting and derive corollaries for various families of priors. Our results generalize along two axes: (1) they apply to a broader family of Bayesian decision-making algorithms, including a Monte-Carlo implementation of the knowledge gradient algorithm (KG), and (2) they apply to Bayesian POMDPs, the most general Bayesian decision-making setting, encompassing contextual bandits as a special case. Through numerical simulations, we illustrate how prior misspecification and the deployment of one-step look-ahead (as in KG) can impact the convergence of meta-learning in multi-armed and contextual bandits with structured and correlated priors.