ucrl2
the final version, we will better emphasize their value as it seems their importance was not properly conveyed
We would like to begin by highlighting two contributions of the paper that we feel went unnoticed by R#2 and R#3. Due to its generality, it is a powerful tool and is indeed central to all our analysis. RTDP is a well-known and practical algorithm. We thank the reviewer for his/her favorable review. Abstract/Line 124/Line 263 - will be corrected, thanks!
Reviews: Regret Bounds for Learning State Representations in Reinforcement Learning
The authors present a regret analysis for learning state representations. They propose an algorithm called UCB-MS with O(\sqrt{T}) regret, which improves over the currently best result. The paper is well-organized and easy to follow. The authors also explain possible methods and directions for further improving the bound. The paper would be clearer if Lemma 3 were proved in the appendix.
Reviews: Regret Bounds for Learning State Representations in Reinforcement Learning
This paper proposes a natural extension of UCRL2 to learning state representations. The proposed algorithm chooses optimistically over a finite set of candidate MDPs and their corresponding policies. The algorithm is analyzed and improves over existing regret bounds. The paper was discussed and all reviewers agree that this is a natural extension of UCRL2 that deserves to be published.
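As a rough illustration of what "choosing optimistically over a finite set of candidates" can look like, the sketch below picks the candidate with the largest optimistic index. It is not the paper's algorithm: the summary statistics (an empirical average reward and a sample count per candidate), the Hoeffding-style bonus, and the names `optimistic_index` and `select_candidate` are all assumptions made for this example.

```python
import math


def optimistic_index(mean_reward, n_samples, t, delta=0.05):
    """Empirical mean plus a Hoeffding-style bonus.

    The bonus is an illustrative assumption, not the confidence
    bound used in the reviewed paper.
    """
    if n_samples == 0:
        # A candidate that was never tried is treated as maximally optimistic.
        return float("inf")
    bonus = math.sqrt(math.log(2 * t / delta) / (2 * n_samples))
    return mean_reward + bonus


def select_candidate(stats, t, delta=0.05):
    """Return the name of the candidate with the largest optimistic index.

    `stats` maps a (hypothetical) candidate name to a pair
    (empirical mean reward, number of samples).
    """
    return max(stats, key=lambda name: optimistic_index(*stats[name], t, delta))


# Hypothetical statistics for three candidate models after t = 810 steps.
stats = {
    "model_a": (0.60, 400),
    "model_b": (0.55, 10),
    "model_c": (0.70, 400),
}
chosen = select_candidate(stats, t=810)
```

With these made-up numbers the rarely tried `model_b` is selected even though its empirical mean is the lowest, because its confidence bonus is the widest. This is the usual effect of optimism: uncertain candidates are tried until the data rule them out.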
Reviews: Near Optimal Exploration-Exploitation in Non-Communicating Markov Decision Processes
This is an excellent theoretical contribution. The analysis is quite heavy and has many subtleties. I did not have enough time to read the appended proofs; also, the subject of the paper is not in my area of research. The comments below are based on the impression I got after carefully reading the first 8 pages of the paper and glancing through the rest in the supplementary file. Summary: This paper is about reinforcement learning in weakly communicating MDPs under the average-reward criterion.