- North America > United States > Washington > King County > Seattle (0.14)
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > Canada > Quebec > Montreal (0.04)
- Asia > Middle East > Jordan (0.04)
- North America > United States > Texas > Brazos County > College Station (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
Appendices

Let $N(\mu, \sigma^2)$ denote a Gaussian distribution with mean $\mu$ and variance $\sigma^2$. Let $\chi^2(n)$ denote a $\chi^2$ distribution with $n$ degrees of freedom. Our analysis extensively uses the following facts about Gaussian and $\chi^2$ distributions:

Definition A.1 (Gaussian and Wigner Random Matrices). We let $G \sim N(n)$ denote an $n \times n$ random Gaussian matrix with i.i.d. entries. We let $W \sim W(n) = G + G^{T}$ denote an $n \times n$ Wigner matrix, where $G \sim N(n)$.

Fact A.1 ($\chi^2$ Tail Bound, Lemma 1 of [1]).
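The statement of Fact A.1 is cut off in the source. A standard $\chi^2$ tail bound of this form is the Laurent–Massart bound, which may be what the cited Lemma 1 of [1] states: for $X \sim \chi^2(n)$ and any $t > 0$,
$$\Pr\left[X \ge n + 2\sqrt{nt} + 2t\right] \le e^{-t}, \qquad \Pr\left[X \le n - 2\sqrt{nt}\right] \le e^{-t}.$$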
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.05)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > Texas > Brazos County > College Station (0.04)
- (3 more...)
Consider any predictor $\widehat{M}(\cdot \mid i)$ (as a function of the sample path $X$) for the $i$-th row of $M$, $i = 1, 2, 3$. In Section 6.2.2, we make the steps in (29) precise and bound the Bayes risk from below by an appropriate mutual information. In Section 6.2.3, we choose a prior distribution on the transition probabilities and prove a lower bound on the resulting mutual information, thereby completing the proof of Theorem 1, with the added bonus that the construction is restricted to irreducible and reversible chains. Let $(X_1, \ldots, X_n)$ be the trajectory of a stationary Markov chain with transition matrix $M$. We first relate the Bayes estimators of $M$ and $T$ (given the $X$ and $Y$ chains, respectively).
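Display (29) is not reproduced in this fragment. One standard identity behind this kind of reduction, assuming the risk is measured in KL (log) loss, is that the Bayes prediction risk equals a conditional mutual information:
$$\inf_{\widehat{P}} \mathbb{E}\left[\mathrm{KL}\left(M(\cdot \mid X_n) \,\middle\|\, \widehat{P}(\cdot \mid X_1, \ldots, X_n)\right)\right] = I\left(M; X_{n+1} \mid X_1, \ldots, X_n\right),$$
where the infimum ranges over predictors measurable in the trajectory; any lower bound on the mutual information then lower-bounds the Bayes risk. The paper's exact loss and the precise form of (29) may differ from this schematic version.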
Optimal Regret Bounds for Collaborative Learning in Bandits
Shidani, Amitis, Vakili, Sattar
We consider regret minimization in a general collaborative multi-agent multi-armed bandit model, in which each agent faces a finite set of arms and may communicate with other agents through a central controller. The optimal arm for each agent in this model is the arm with the largest expected mixed reward, where the mixed reward of each arm is a weighted average of its rewards across all agents, making communication among agents crucial. While near-optimal sample complexities for best arm identification are known under this collaborative model, the question of optimal regret remains open. In this work, we address this problem and propose the first algorithm with order-optimal regret bounds under this collaborative bandit model. Furthermore, we show that only a small constant number of expected communication rounds is needed.
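As a concrete illustration of the mixed-reward objective above, here is a minimal sketch; the weight matrix W, the helper names, and the toy numbers are illustrative and not part of the paper:

```python
import numpy as np

# Minimal sketch of the mixed-reward objective (illustrative names/numbers).
# mu[j, a] = expected reward of arm a for agent j.
# W[i, j]  = weight agent i places on agent j's rewards (rows sum to 1).

def mixed_rewards(mu: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Row i of the result is agent i's mixed reward for each arm."""
    return W @ mu

def best_arm_for(agent: int, mu: np.ndarray, W: np.ndarray) -> int:
    # The optimal arm maximizes the agent's mixed (weighted-average) reward.
    return int(np.argmax(mixed_rewards(mu, W)[agent]))

# Toy example: 2 agents, 3 arms.
mu = np.array([[0.9, 0.1, 0.5],
               [0.2, 0.8, 0.5]])
W = np.array([[0.5, 0.5],
              [0.1, 0.9]])
print([best_arm_for(i, mu, W) for i in range(2)])  # -> [0, 1]
```

Because each agent's objective mixes rewards observed by other agents, no agent can identify its optimal arm from its own samples alone, which is why communication through the controller is essential.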
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Sample-Efficient Reinforcement Learning with loglog(T) Switching Cost
Qiao, Dan, Yin, Ming, Min, Ming, Wang, Yu-Xiang
We study the problem of reinforcement learning (RL) with low (policy) switching cost - a problem well-motivated by real-life RL applications in which deployments of new policies are costly and the number of policy updates must be low. In this paper, we propose a new algorithm based on stage-wise exploration and adaptive policy elimination that achieves a regret of $\widetilde{O}(\sqrt{H^4S^2AT})$ while requiring a switching cost of $O(HSA \log\log T)$. This is an exponential improvement over the best-known switching cost $O(H^2SA\log T)$ among existing methods with $\widetilde{O}(\mathrm{poly}(H,S,A)\sqrt{T})$ regret. In the above, $S$ and $A$ denote the numbers of states and actions in an $H$-horizon episodic Markov Decision Process model with unknown transitions, and $T$ is the number of steps. We also prove an information-theoretic lower bound showing that a switching cost of $\Omega(HSA)$ is required for any no-regret algorithm. As a byproduct, our new algorithmic techniques allow us to derive a \emph{reward-free} exploration algorithm with an optimal switching cost of $O(HSA)$.
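The $O(\log\log T)$ dependence typically comes from a stage schedule whose stage lengths grow doubly exponentially, so the policy is updated only at $O(\log\log T)$ stage boundaries. The sketch below assumes a schedule of the form $T^{1 - 2^{-i}}$, a common construction for such bounds; the paper's exact stage design may differ:

```python
# Sketch of a doubling-exponent stage schedule (assumed form T^(1 - 2^-i));
# the policy changes only at stage boundaries, so the number of policy
# switches is the number of stages, which grows like log2(log2(T)).

def stage_endpoints(T: int) -> list[int]:
    ends, i = [], 1
    while not ends or ends[-1] < T:
        e = min(T, int(T ** (1 - 2.0 ** (-i))))
        if e >= T // 2:
            e = T  # final stage runs to the horizon
        if not ends or e > ends[-1]:
            ends.append(e)
        i += 1
    return ends

for T in (10**3, 10**6, 10**9):
    ends = stage_endpoints(T)
    print(T, len(ends), ends)  # stage count grows extremely slowly in T
```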
(Almost) Free Incentivized Exploration from Decentralized Learning Agents
Shi, Chengshuai, Xu, Haifeng, Xiong, Wei, Shen, Cong
Incentivized exploration in multi-armed bandits (MAB) has witnessed increasing interest and much progress in recent years, where a principal offers bonuses to agents to explore on her behalf. However, almost all existing studies are confined to temporary myopic agents. In this work, we break this barrier and study incentivized exploration with multiple long-term strategic agents, who have more complicated behaviors that often appear in real-world applications. An important observation of this work is that strategic agents' intrinsic need to learn benefits (instead of harming) the principal's exploration by providing "free pulls". Moreover, it turns out that increasing the population of agents significantly lowers the principal's burden of incentivizing. The key and somewhat surprising insight revealed by our results is that when sufficiently many learning agents are involved, the exploration process of the principal can be (almost) free. Our main results are built upon three novel components which may be of independent interest: (1) a simple yet provably effective incentive-provision strategy; (2) a carefully crafted best arm identification algorithm for rewards aggregated under unequal confidences; (3) a high-probability finite-time lower bound for UCB algorithms. Experimental results are provided to complement the theoretical analysis.
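Component (2) concerns combining reward estimates whose confidence levels differ across agents (different agents may have pulled an arm very different numbers of times). Below is a minimal sketch of precision-weighted aggregation, one natural way to handle unequal confidences and not necessarily the paper's exact rule; all names and numbers are illustrative:

```python
import numpy as np

def aggregate_unequal_confidence(means, pulls, delta=1e-4):
    """means[j]: agent j's empirical mean for one arm; pulls[j]: its pull count.
    Weighting by pull counts is inverse-variance weighting for sub-Gaussian
    rewards; the combined estimate behaves like a single sample mean over
    all pulls, with a correspondingly tighter confidence width."""
    means, pulls = np.asarray(means, float), np.asarray(pulls, float)
    w = pulls / pulls.sum()
    combined_mean = float(w @ means)
    # Hoeffding-style width at confidence 1 - delta for the pooled estimate.
    width = float(np.sqrt(2.0 * np.log(1.0 / delta) / pulls.sum()))
    return combined_mean, width

# One well-sampled agent and two poorly-sampled ones.
print(aggregate_unequal_confidence([0.62, 0.55, 0.70], [400, 50, 10]))
```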
- North America > United States > Virginia (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > China > Hong Kong (0.04)
Asymptotically Optimal Information-Directed Sampling
Kirschner, Johannes, Lattimore, Tor, Vernade, Claire, Szepesvári, Csaba
We introduce a computationally efficient algorithm for finite stochastic linear bandits. The approach is based on the frequentist information-directed sampling (IDS) framework, with an information gain potential that is derived directly from the asymptotic regret lower bound. We establish frequentist regret bounds, which show that the proposed algorithm is both asymptotically optimal and worst-case rate optimal in finite time. Our analysis sheds light on how IDS trades off regret and information to incrementally solve the semi-infinite concave program that defines the optimal asymptotic regret. Along the way, we uncover interesting connections to a recently proposed two-player game approach and the Bayesian IDS algorithm.
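At its core, IDS balances estimated regret against information: it minimizes the ratio of squared expected regret to expected information gain over action distributions, and the minimizer is supported on at most two actions. The sketch below is illustrative only; the gap and gain numbers are made up, and the paper derives its information-gain potential from the asymptotic lower bound, which this sketch does not reproduce:

```python
import numpy as np

def ids_distribution(gaps: np.ndarray, gains: np.ndarray):
    """Minimize E[gap]^2 / E[gain] over action distributions by
    grid-searching pairs of actions and mixing weights (the minimizer
    of the information ratio is supported on at most two actions)."""
    best_ratio, best_mix = np.inf, None
    for a in range(len(gaps)):
        for b in range(len(gaps)):
            for p in np.linspace(0.0, 1.0, 101):
                gap = p * gaps[a] + (1 - p) * gaps[b]
                gain = p * gains[a] + (1 - p) * gains[b]
                ratio = gap * gap / max(gain, 1e-12)
                if ratio < best_ratio:
                    best_ratio, best_mix = ratio, (a, b, p)
    return best_ratio, best_mix

# Gap estimates come from confidence bounds, so even the empirically
# best action has a small nonzero estimated gap; gains are per-action.
gaps = np.array([0.05, 0.12, 0.30])
gains = np.array([0.01, 0.08, 0.25])
print(ids_distribution(gaps, gains))
```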
- North America > Canada > Alberta (0.14)
- Europe > Switzerland > Zürich > Zürich (0.04)