Collaborating Authors

Neural Information Processing Systems

Lemma 2 (Chernoff bound for irreducible Markov chains). The proof is based on the argument given in Appendix A.2 of [7], adapted though for the case of Markov chains. We start the analysis by establishing the relation between the expected regret, Equation 1, and its proxy, Equation 17. For the first part, we show in Appendix C that the expected number of times that an arm $a \in \{1,\dots,N\}$ hasn't been played is of the order of $O(\log\log T)$. Assume that the one-parameter family of Markov chains on the finite state space $S$, together with the reward function $f: S \to \mathbb{R}$, satisfy conditions (18), (19), (20), (21), and (22).


Information Design with Unknown Prior

Lin, Tao, Li, Ce

arXiv.org Artificial Intelligence

Classical information design models (e.g., Bayesian persuasion and cheap talk) require players to have perfect knowledge of the prior distribution of the state of the world. Our paper studies repeated persuasion problems in which the information designer does not know the prior. The information designer learns to design signaling schemes from repeated interactions with the receiver. We design learning algorithms for the information designer to achieve no regret compared to using the optimal signaling scheme with known prior, under two models of the receiver's decision-making. (1) The first model assumes that the receiver knows the prior and can perform posterior update and best respond to signals. In this model, we design a learning algorithm for the information designer with $O(\log T)$ regret in the general case, and another algorithm with $\Theta(\log \log T)$ regret in the case where the receiver has only two actions. (2) The second model assumes that the receiver does not know the prior and employs a no-regret learning algorithm to take actions. We show that the information designer can achieve regret $O(\sqrt{\mathrm{rReg}(T) T})$, where $\mathrm{rReg}(T)=o(T)$ is an upper bound on the receiver's learning regret. Our work thus provides a learning foundation for the problem of information design with unknown prior.


Infrequent Resolving Algorithm for Online Linear Programming

Li, Guokai, Wang, Zizhuo, Zhang, Jingwei

arXiv.org Artificial Intelligence

Online linear programming (OLP) has gained significant attention from both researchers and practitioners due to its extensive applications, such as online auctions, network revenue management, and advertising. Existing OLP algorithms fall into two categories: LP-based algorithms and LP-free algorithms. The former typically guarantee better performance, even offering constant regret, but require solving a large number of LPs, which can be computationally expensive. In contrast, LP-free algorithms require only first-order computations but perform worse, lacking a constant regret bound. In this work, we bridge the gap between these two extremes by proposing an algorithm that achieves a constant regret while solving LPs only $O(\log\log T)$ times over the time horizon $T$. Moreover, when we are allowed to solve LPs only $M$ times, we propose an algorithm that can guarantee an $O\left(T^{(1/2+\epsilon)^{M-1}}\right)$ regret. Furthermore, when the arrival probabilities are known at the beginning, our algorithm can guarantee a constant regret by solving LPs $O(\log\log T)$ times, and an $O\left(T^{(1/2+\epsilon)^{M}}\right)$ regret by solving LPs only $M$ times. Numerical experiments are conducted to demonstrate the efficiency of the proposed algorithms.
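As a rough consistency check on the two bounds quoted above (not the paper's argument), one can ask how many resolves $M$ are needed before the factor $T^{(1/2+\epsilon)^{M-1}}$ drops to a constant. The sketch below uses the arbitrary threshold $e$ and ignores hidden constants; the function name and the threshold are illustrative assumptions. The answer grows like $\log\log T$, in line with the constant-regret result.

```python
import math


def min_resolves_for_constant_factor(horizon: float, eps: float) -> int:
    """Smallest M with horizon ** ((0.5 + eps) ** (M - 1)) <= e.

    Hypothetical helper for illustration only: the threshold e is an
    arbitrary stand-in for "constant", and constants hidden in the
    O(.) notation are ignored.
    """
    if horizon <= 1 or not (0.0 <= eps < 0.5):
        raise ValueError("need horizon > 1 and 0 <= eps < 0.5")
    base = 0.5 + eps
    log_horizon = math.log(horizon)
    m = 1
    # T^(base^(M-1)) <= e  is equivalent to  base^(M-1) * ln T <= 1.
    while (base ** (m - 1)) * log_horizon > 1.0:
        m += 1
    return m


if __name__ == "__main__":
    for horizon in (10**3, 10**6, 10**12):
        print(horizon, min_resolves_for_constant_factor(horizon, eps=0.1))
```

Equivalently, $M \ge 1 + \log\log T / \log\frac{1}{1/2+\epsilon}$, so squaring the horizon adds only a constant number of resolves.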


Contexts can be Cheap: Solving Stochastic Contextual Bandits with Linear Bandit Algorithms

Hanna, Osama A., Yang, Lin F., Fragouli, Christina

arXiv.org Artificial Intelligence

In this paper, we address the stochastic contextual linear bandit problem, where a decision maker is provided a context (a random set of actions drawn from a distribution). The expected reward of each action is specified by the inner product of the action and an unknown parameter. The goal is to design an algorithm that learns to play as close as possible to the unknown optimal policy after a number of action plays. This problem is considered more challenging than the linear bandit problem, which can be viewed as a contextual bandit problem with a *fixed* context. Surprisingly, in this paper, we show that the stochastic contextual problem can be solved as if it were a linear bandit problem. In particular, we establish a novel reduction framework that converts every stochastic contextual linear bandit instance to a linear bandit instance, when the context distribution is known. When the context distribution is unknown, we establish an algorithm that reduces the stochastic contextual instance to a sequence of linear bandit instances with small misspecifications and achieves nearly the same worst-case regret bound as the algorithm that solves the misspecified linear bandit instances. As a consequence, our results imply an $O(d\sqrt{T\log T})$ high-probability regret bound for contextual linear bandits, making progress in resolving an open problem in (Li et al., 2019), (Li et al., 2021). Our reduction framework opens up a new way to approach stochastic contextual linear bandit problems, and enables improved regret bounds in a number of instances including the batch setting, contextual bandits with misspecifications, contextual bandits with sparse unknown parameters, and contextual bandits with adversarial corruption.


Online Learning for Equilibrium Pricing in Markets under Incomplete Information

Jalota, Devansh, Sun, Haoyuan, Azizan, Navid

arXiv.org Artificial Intelligence

The study of market equilibria is central to economic theory, particularly in efficiently allocating scarce resources. However, the computation of equilibrium prices at which the supply of goods matches their demand typically relies on having access to complete information on private attributes of agents, e.g., suppliers' cost functions, which are often unavailable in practice. Motivated by this practical consideration, we consider the problem of setting equilibrium prices in the incomplete information setting wherein a market operator seeks to satisfy the customer demand for a commodity by purchasing the required amount from competing suppliers with privately known cost functions unknown to the market operator. In this incomplete information setting, we consider the online learning problem of learning equilibrium prices over time while jointly optimizing three performance metrics -- unmet demand, cost regret, and payment regret -- pertinent in the context of equilibrium pricing over a horizon of $T$ periods. We first consider the setting when suppliers' cost functions are fixed and develop algorithms that achieve a regret of $O(\log \log T)$ when the customer demand is constant over time, or $O(\sqrt{T} \log \log T)$ when the demand is variable over time. Next, we consider the setting when the suppliers' cost functions can vary over time and illustrate that no online algorithm can achieve sublinear regret on all three metrics when the market operator has no information about how the cost functions change over time. Thus, we consider an augmented setting wherein the operator has access to hints/contexts that, without revealing the complete specification of the cost functions, reflect the variation in the cost functions over time and propose an algorithm with sublinear regret in this augmented setting.


A Sharp Memory-Regret Trade-Off for Multi-Pass Streaming Bandits

Agarwal, Arpit, Khanna, Sanjeev, Patil, Prathamesh

arXiv.org Machine Learning

The stochastic $K$-armed bandit problem has been studied extensively due to its applications in various domains ranging from online advertising to clinical trials. In practice, however, the number of arms can be very large, resulting in large memory requirements for simultaneously processing them. In this paper we consider a streaming setting where the arms are presented in a stream and the algorithm uses limited memory to process these arms. Here, the goal is not only to minimize regret, but also to do so in minimal memory. Previous algorithms for this problem operate in one of two settings: they either use $\Omega(\log \log T)$ passes over the stream (Rathod, 2021; Chaudhuri and Kalyanakrishnan, 2020; Liau et al., 2018), or just a single pass (Maiti et al., 2021). In this paper we study the trade-off between memory and regret when $B$ passes over the stream are allowed, for any $B \geq 1$, and establish tight regret upper and lower bounds for any $B$-pass algorithm. Our results uncover a surprising *sharp transition phenomenon*: $O(1)$ memory is sufficient to achieve $\widetilde\Theta\Big(T^{\frac{1}{2} + \frac{1}{2^{B+2}-2}}\Big)$ regret in $B$ passes, and increasing the memory to any quantity that is $o(K)$ has almost no impact on further reducing this regret, unless we use $\Omega(K)$ memory. Our main technical contribution is our lower bound, which requires the use of information-theoretic techniques as well as ideas from round elimination to show that the *residual problem* remains challenging over subsequent passes.
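To make the stated exponent concrete, the short sketch below evaluates $\frac{1}{2} + \frac{1}{2^{B+2}-2}$ exactly for a few values of $B$. Only the formula comes from the abstract; the function name is an illustrative assumption.

```python
from fractions import Fraction


def regret_exponent(num_passes: int) -> Fraction:
    """Exponent 1/2 + 1/(2^(B+2) - 2) of T in the B-pass regret bound.

    Hypothetical helper: it only evaluates the formula quoted in the
    abstract and says nothing about logarithmic factors or constants.
    """
    if num_passes < 1:
        raise ValueError("the bound is stated for B >= 1")
    return Fraction(1, 2) + Fraction(1, 2 ** (num_passes + 2) - 2)


if __name__ == "__main__":
    for b in range(1, 6):
        exponent = regret_exponent(b)
        print(f"B={b}: exponent {exponent} ~ {float(exponent):.4f}")
```

With a single pass the exponent is $2/3$, with two passes it is $4/7$, and it approaches $1/2$ geometrically as $B$ grows.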


Linear Bandits with Limited Adaptivity and Learning Distributional Optimal Design

Ruan, Yufei, Yang, Jiaqi, Zhou, Yuan

arXiv.org Machine Learning

Motivated by practical needs such as large-scale learning, we study the impact of adaptivity constraints on linear contextual bandits, a central problem in online active learning. We consider two popular limited adaptivity models in the literature: batch learning and rare policy switches. We show that, when the context vectors are adversarially chosen in $d$-dimensional linear contextual bandits, the learner needs $O(d \log d \log T)$ policy switches to achieve the minimax-optimal regret, and this is optimal up to $\mathrm{poly}(\log d, \log \log T)$ factors; for stochastic context vectors, even in the more restricted batch learning model, only $O(\log \log T)$ batches are needed to achieve the optimal regret. Together with the known results in the literature, our results present a complete picture of the adaptivity constraints in linear contextual bandits. Along the way, we propose the distributional optimal design, a natural extension of the optimal experiment design, and provide a learning algorithm for the problem that is both statistically and computationally efficient, which may be of independent interest.
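For intuition on why $O(\log \log T)$ batches can suffice, the sketch below builds a batch grid with endpoints $T^{1-2^{-k}}$, a construction that is common in the batched-bandit literature. Whether this paper uses exactly this grid is not stated in the abstract, so the grid and the function name are assumptions for illustration.

```python
import math
from typing import List


def doubling_exponent_grid(horizon: int) -> List[int]:
    """Batch end points t_k = floor(T^(1 - 2^-k)), closed off at T.

    Illustrative assumption: a generic grid, not necessarily the
    schedule used by the paper summarised above.
    """
    if horizon < 4:
        raise ValueError("horizon must be at least 4")
    # After about log2(log2 T) steps, T^(1 - 2^-k) is within a
    # factor of 2 of T, so one final batch reaches the horizon.
    num_points = math.ceil(math.log2(math.log2(horizon)))
    grid: List[int] = []
    for k in range(1, num_points + 1):
        end = min(horizon, math.floor(horizon ** (1.0 - 2.0 ** (-k))))
        if not grid or end > grid[-1]:
            grid.append(end)
    if grid[-1] < horizon:
        grid.append(horizon)
    return grid


if __name__ == "__main__":
    for horizon in (10**3, 10**6, 10**9):
        grid = doubling_exponent_grid(horizon)
        print(horizon, len(grid), grid)
```

The number of batches produced is at most $\lceil \log_2 \log_2 T \rceil + 1$, so it grows very slowly with the horizon.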


Multinomial Logit Bandit with Low Switching Cost

Dong, Kefan, Li, Yingkai, Zhang, Qin, Zhou, Yuan

arXiv.org Machine Learning

We study the multinomial logit bandit with limited adaptivity, where the algorithms change their exploration actions as infrequently as possible while achieving almost optimal minimax regret. We propose two measures of adaptivity: the assortment switching cost and the more fine-grained item switching cost. We present an anytime algorithm (AT-DUCB) with $O(N \log T)$ assortment switches, almost matching the lower bound $\Omega(\frac{N \log T}{ \log \log T})$. In the fixed-horizon setting, our algorithm FH-DUCB incurs $O(N \log \log T)$ assortment switches, matching the asymptotic lower bound. We also present the ESUCB algorithm with item switching cost $O(N \log^2 T)$.