
Neural Information Processing Systems

Let $N(\mu, \sigma^2)$ denote a Gaussian distribution with mean $\mu$ and variance $\sigma^2$, and let $\chi^2(n)$ denote a $\chi^2$ distribution with $n$ degrees of freedom. Our analysis extensively uses the following facts about Gaussian and $\chi^2$ distributions. Definition A.1 (Gaussian and Wigner Random Matrices). We let $G \sim N(n)$ denote an $n \times n$ random Gaussian matrix with i.i.d. entries, and we let $W \sim W(n) = G + G^T$ denote an $n \times n$ Wigner matrix, where $G \sim N(n)$. Fact A.1 ($\chi^2$ Tail Bound, Lemma 1 of [1]).
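A minimal sketch of Definition A.1, assuming the i.i.d. entries of $G$ are standard normal (the excerpt does not state the entry distribution): sample $G$, form the Wigner matrix $W = G + G^T$, and check that it is symmetric.

```python
import numpy as np

def wigner(n, rng):
    """Sample W = G + G^T, where G is an n x n matrix with i.i.d.
    entries, per Definition A.1 (entry law assumed to be N(0, 1))."""
    G = rng.standard_normal((n, n))
    return G + G.T

rng = np.random.default_rng(0)
W = wigner(100, rng)
assert np.allclose(W, W.T)  # symmetric by construction, so eigenvalues are real
```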




Neural Information Processing Systems

Consider any predictor $\widehat{M}(\cdot \mid i)$ (as a function of the sample path $X$) for the $i$-th row of $M$, $i = 1, 2, 3$. In Section 6.2.2, we make the steps in (29) precise and bound the Bayes risk from below by an appropriate mutual information. In Section 6.2.3, we choose a prior distribution on the transition probabilities and prove a lower bound on the resulting mutual information, thereby completing the proof of Theorem 1, with the added bonus that the construction is restricted to irreducible and reversible chains. Let $(X_1, \ldots, X_n)$ be the trajectory of a stationary Markov chain with transition matrix $M$. We first relate the Bayes estimators of $M$ and $T$ (given the $X$ and $Y$ chains, respectively).
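The excerpt discusses Bayes estimation of the transition matrix $M$ from a stationary trajectory. As a hedged sketch (the excerpt does not specify the prior), one standard choice is an independent Dirichlet($\alpha$) prior on each row, whose posterior mean is additive smoothing of the transition counts:

```python
import numpy as np

def dirichlet_posterior_mean(path, k, alpha=1.0):
    """Posterior-mean estimate of a k-state transition matrix from a
    trajectory, under an independent Dirichlet(alpha) prior on each row.
    With alpha = 1 this reduces to add-one (Laplace) smoothing."""
    counts = np.zeros((k, k))
    for s, t in zip(path[:-1], path[1:]):
        counts[s, t] += 1
    smoothed = counts + alpha
    return smoothed / smoothed.sum(axis=1, keepdims=True)

path = [0, 1, 1, 2, 0, 1, 2, 2, 0]  # illustrative sample path
M_hat = dirichlet_posterior_mean(path, k=3)
```

Each row of `M_hat` is a probability distribution, and unvisited transitions get positive mass from the prior rather than a zero estimate.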



Neural Information Processing Systems

In the following equation, we use the results in Appendix D.1 to calculate the probability that there exists some arm whose mean value is above its confidence interval of width
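The sentence above is truncated in the extraction, so the exact bound is unavailable. For this kind of statement, a generic sketch (not the bound from Appendix D.1) combines Hoeffding's inequality with a union bound over arms, for rewards in $[0, 1]$:

```python
import math

def union_hoeffding_bound(num_arms, num_pulls, width):
    """Union bound over K arms on the event that some empirical mean
    deviates from its true mean by more than `width`, for rewards in
    [0, 1]:  P(exists a: |mu_hat_a - mu_a| > width)
                 <= 2 * K * exp(-2 * n * width^2)   (Hoeffding).
    A generic sketch, not the specific result of Appendix D.1."""
    return min(1.0, 2 * num_arms * math.exp(-2 * num_pulls * width ** 2))

p = union_hoeffding_bound(num_arms=10, num_pulls=1000, width=0.1)
```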


Optimal Regret Bounds for Collaborative Learning in Bandits

Shidani, Amitis, Vakili, Sattar

arXiv.org Machine Learning

We consider regret minimization in a general collaborative multi-agent multi-armed bandit model, in which each agent faces a finite set of arms and may communicate with other agents through a central controller. The optimal arm for each agent in this model is the arm with the largest expected mixed reward, where the mixed reward of each arm is a weighted average of its rewards across all agents, making communication among agents crucial. While near-optimal sample complexities for best arm identification are known under this collaborative model, the question of optimal regret remains open. In this work, we address this problem and propose the first algorithm with order optimal regret bounds under this collaborative bandit model. Furthermore, we show that only a small constant number of expected communication rounds is needed.
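The mixed reward defined in the abstract — a weighted average of each arm's rewards across agents — can be sketched directly as a matrix product. The weight and reward values below are illustrative, not from the paper:

```python
import numpy as np

# mu[j, k]: expected reward of arm k as observed by agent j.
# w[i, j]:  weight agent i places on agent j's rewards (rows sum to 1).
# Mixed reward of arm k for agent i is sum_j w[i, j] * mu[j, k],
# so the mixed-reward matrix is simply w @ mu.
mu = np.array([[0.9, 0.2],
               [0.1, 0.8]])
w = np.array([[0.7, 0.3],
              [0.2, 0.8]])
mixed = w @ mu
optimal_arm = mixed.argmax(axis=1)  # best arm per agent
```

Note that an agent's optimal arm depends on other agents' reward means, which is exactly why communication through the controller is crucial.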


Sample-Efficient Reinforcement Learning with loglog(T) Switching Cost

Qiao, Dan, Yin, Ming, Min, Ming, Wang, Yu-Xiang

arXiv.org Machine Learning

We study the problem of reinforcement learning (RL) with low (policy) switching cost - a problem well-motivated by real-life RL applications in which deployments of new policies are costly and the number of policy updates must be low. In this paper, we propose a new algorithm based on stage-wise exploration and adaptive policy elimination that achieves a regret of $\widetilde{O}(\sqrt{H^4S^2AT})$ while requiring a switching cost of $O(HSA \log\log T)$. This is an exponential improvement over the best-known switching cost $O(H^2SA\log T)$ among existing methods with $\widetilde{O}(\mathrm{poly}(H,S,A)\sqrt{T})$ regret. In the above, $S$ and $A$ denote the numbers of states and actions in an $H$-horizon episodic Markov Decision Process model with unknown transitions, and $T$ is the number of steps. We also prove an information-theoretical lower bound which says that a switching cost of $\Omega(HSA)$ is required for any no-regret algorithm. As a byproduct, our new algorithmic techniques allow us to derive a \emph{reward-free} exploration algorithm with an optimal switching cost of $O(HSA)$.
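The $O(\log\log T)$ switching cost comes from stage lengths that grow doubly exponentially, so only $O(\log\log T)$ stages are needed to cover the horizon. A sketch of this counting argument, using a simple squaring schedule (illustrative, not the paper's exact stage design):

```python
import math

def num_stages(T, first_stage=2):
    """Count stages needed to cover horizon T when each stage's length is
    the square of the previous one (doubly exponential growth). This gives
    O(log log T) stages, hence O(log log T) policy switches if the policy
    changes only at stage boundaries -- an illustrative schedule only."""
    total, length, stages = 0, first_stage, 0
    while total < T:
        total += length
        length = length * length
        stages += 1
    return stages

stages = num_stages(10**6)  # grows like log log T, not log T
```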


(Almost) Free Incentivized Exploration from Decentralized Learning Agents

Shi, Chengshuai, Xu, Haifeng, Xiong, Wei, Shen, Cong

arXiv.org Machine Learning

Incentivized exploration in multi-armed bandits (MAB), in which a principal offers bonuses to agents to explore on her behalf, has attracted increasing interest and seen much progress in recent years. However, almost all existing studies are confined to temporary, myopic agents. In this work, we break this barrier and study incentivized exploration with multiple long-term strategic agents, whose more complicated behaviors often appear in real-world applications. An important observation of this work is that strategic agents' intrinsic need for learning benefits (rather than harms) the principal's exploration by providing "free pulls". Moreover, it turns out that increasing the population of agents significantly lowers the principal's burden of incentivizing. The key and somewhat surprising insight revealed by our results is that when sufficiently many learning agents are involved, the exploration process of the principal can be (almost) free. Our main results are built upon three novel components which may be of independent interest: (1) a simple yet provably effective incentive-provision strategy; (2) a carefully crafted best arm identification algorithm for rewards aggregated under unequal confidences; and (3) a high-probability finite-time lower bound for UCB algorithms. Experimental results are provided to complement the theoretical analysis.
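Component (2) aggregates reward estimates held with unequal confidences. A standard way to do this (a hedged stand-in for the paper's algorithm) is precision weighting: weight each estimate by its pull count, which is inverse-variance weighting for a fixed noise level.

```python
def aggregate(estimates, pulls):
    """Precision-weighted aggregate of per-agent reward estimates:
    each estimate is weighted by its pull count, i.e. by the inverse of
    its variance for a fixed per-sample noise level. A generic sketch of
    aggregation under unequal confidences, not the paper's exact rule."""
    total = sum(pulls)
    return sum(m * n for m, n in zip(estimates, pulls)) / total

# The agent with 900 pulls dominates the one with 100 pulls.
agg = aggregate([0.5, 0.9], [900, 100])
```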


Asymptotically Optimal Information-Directed Sampling

Kirschner, Johannes, Lattimore, Tor, Vernade, Claire, Szepesvári, Csaba

arXiv.org Machine Learning

We introduce a computationally efficient algorithm for finite stochastic linear bandits. The approach is based on the frequentist information-directed sampling (IDS) framework, with an information gain potential that is derived directly from the asymptotic regret lower bound. We establish frequentist regret bounds, which show that the proposed algorithm is both asymptotically optimal and worst-case rate optimal in finite time. Our analysis sheds light on how IDS trades off regret and information to incrementally solve the semi-infinite concave program that defines the optimal asymptotic regret. Along the way, we uncover interesting connections to a recently proposed two-player game approach and to the Bayesian IDS algorithm.
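The regret-information trade-off in IDS can be sketched for two actions: IDS plays a distribution minimizing the ratio of squared expected regret to expected information gain. The gap and information values below are hypothetical, and grid search stands in for the exact one-dimensional minimizer:

```python
def ids_mixture(gaps, infos, grid=1001):
    """For two actions with instantaneous regrets `gaps` and information
    gains `infos`, find the mixing weight p on action 1 minimizing the
    IDS ratio (p*gap1 + (1-p)*gap0)^2 / (p*info1 + (1-p)*info0).
    Grid search is an illustrative stand-in for the exact minimizer."""
    best_p, best_ratio = 0.0, float("inf")
    for i in range(grid):
        p = i / (grid - 1)
        regret = p * gaps[1] + (1 - p) * gaps[0]
        info = p * infos[1] + (1 - p) * infos[0]
        if info > 0 and regret ** 2 / info < best_ratio:
            best_ratio = regret ** 2 / info
            best_p = p
    return best_p, best_ratio

# Action 0: near-greedy (small regret, little info); action 1: informative.
p, ratio = ids_mixture(gaps=(0.1, 1.0), infos=(0.01, 1.0))
```

Here both pure actions have ratio $1.0$, while an interior mixture does strictly better — the hallmark of IDS: randomizing between exploiting and informative actions beats committing to either.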