r-max
We will improve the broader impact section by emphasizing the implications of our theoretical
We sincerely thank all the reviewers, and feel really honored to receive such positive and constructive comments. We will mention total variation distance in the appendix, and correct the typo on "Corollary Note that the smooth planning oracle is not needed throughout the paper, and is thus not the "primary It is only used in Sec. We have discussed R-MAX in lines 82-83. By saying "especially model-free ones..." this sentence, we simply meant The works on Q-learning in games you mentioned exactly conquered this issue, with non-trivial efforts. We will address all the grammatical comments/typos in the final version.
Increasingly Cautious Optimism for Practical PAC-MDP Exploration
Zhang, Liangpeng (University of Science and Technology of China) | Tang, Ke (University of Science and Technology of China) | Yao, Xin (University of Birmingham)
Exploration strategy is an essential part of learning agents in model-based Reinforcement Learning. R-MAX and V-MAX are PAC-MDP strategies proven to have polynomial sample complexity; yet, their exploration behavior tends to be overly cautious in practice. We propose the principle of Increasingly Cautious Optimism (ICO) to automatically cut off unnecessarily cautious exploration, and apply ICO to R-MAX and V-MAX, yielding two new strategies, namely Increasingly Cautious R-MAX (ICR) and Increasingly Cautious V-MAX (ICV). We prove that both ICR and ICV are PAC-MDP, and show that their improvement is guaranteed by a tighter sample complexity upper bound. Then, we demonstrate their significantly improved performance through empirical results.
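For readers unfamiliar with the baseline, the following is a minimal sketch of the optimistic planning step in standard R-MAX, not of ICR or ICV, whose details the abstract does not give. It assumes a small tabular MDP stored in plain nested lists; the function name `rmax_optimistic_q`, the visit threshold `m`, and all argument names are illustrative choices rather than anything taken from the paper. State-action pairs visited fewer than `m` times are treated as "unknown" and assigned the largest possible discounted return, which is what drives exploration.

```python
def rmax_optimistic_q(counts, reward_sums, next_counts, m, r_max, gamma, iters=200):
    """Value iteration on an optimistic R-MAX style model (illustrative sketch).

    counts[s][a]         : number of visits to (s, a)
    reward_sums[s][a]    : sum of rewards observed at (s, a)
    next_counts[s][a][t] : number of observed transitions (s, a) -> t
    m                    : visit threshold, assumed >= 1, below which (s, a) is "unknown"
    """
    n_states = len(counts)
    n_actions = len(counts[0])
    v_max = r_max / (1.0 - gamma)  # largest possible discounted return
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(iters):
        # State values from the previous sweep (synchronous update).
        v = [max(row) for row in q]
        for s in range(n_states):
            for a in range(n_actions):
                n = counts[s][a]
                if n < m:
                    # Unknown pair: optimistically assume the best outcome.
                    q[s][a] = v_max
                else:
                    # Known pair: use empirical reward and transition estimates.
                    r_hat = reward_sums[s][a] / n
                    exp_v = sum(next_counts[s][a][t] / n * v[t]
                                for t in range(n_states))
                    q[s][a] = r_hat + gamma * exp_v
    return q
```

Acting greedily with respect to the returned values steers the agent toward unknown pairs until every pair it can reach has been tried `m` times. A fixed threshold of this kind is one plausible source of the overly cautious behavior the abstract mentions, although the abstract itself does not say so.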
Learning to Coordinate Efficiently: A Model-based Approach
Brafman, R. I., Tennenholtz, M.
Players participating in such games must learn to coordinate with each other in order to receive the highest-possible value. A number of reinforcement learning algorithms have been proposed for this problem, and some have been shown to converge to good solutions in the limit. In this paper we show that using very simple model-based algorithms, much better (i.e., polynomial) convergence rates can be attained. Moreover, our model-based algorithms are guaranteed to converge to the optimal value, unlike many of the existing algorithms. The distributed nature of such systems makes the problem of learning to act in an unknown environment more difficult because the agents must coordinate both their learning process and their action choices. However, the need to coordinate is not restricted to distributed agents, as it arises naturally among self-interested agents in certain environments. A good model for such environments is that of a common-interest stochastic game (CISG). A stochastic game (Shapley, 1953) is a model of multi-agent interactions consisting of multiple finite or infinite stages, in each of which the agents play a one-shot strategic form game. The identity of each stage depends stochastically on the previous stage and the actions performed by the agents in that stage. The goal of each agent is to maximize some function of its reward stream: either its average reward or its sum of discounted rewards. A CISG is a stochastic game in which at each point the payoff of all agents is identical. Various algorithms for learning in CISGs have been proposed in the literature.
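To make the coordination problem concrete, here is a small sketch of a single common-interest stage game. The payoff table and the helper `best_joint_actions` are hypothetical and are not taken from the paper. Because the payoff is shared, one table indexed by joint action is enough to describe what every agent receives.

```python
from itertools import product

def best_joint_actions(payoff, n_actions_per_agent):
    """Return all joint actions that maximize the shared payoff."""
    joint_actions = list(product(*(range(n) for n in n_actions_per_agent)))
    best = max(payoff[ja] for ja in joint_actions)
    return [ja for ja in joint_actions if payoff[ja] == best]

# Hypothetical 2-agent stage game: both agents receive payoff[(a1, a2)].
payoff = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 1.0}
optimal = best_joint_actions(payoff, [2, 2])
```

In this example two joint actions are equally good, so agents that each pick an individually sensible action without coordinating can still land on a zero-payoff combination.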