Reinforcement Learning in Factored MDPs: Oracle-Efficient Algorithms and Tighter Regret Bounds for the Non-Episodic Setting
We study reinforcement learning in non-episodic factored Markov decision processes (FMDPs). We propose two near-optimal and oracle-efficient algorithms for FMDPs. Assuming oracle access to an FMDP planner, they enjoy a Bayesian and a frequentist regret bound respectively, both of which reduce to the near-optimal bound $\tilde{O}(DS\sqrt{AT})$ for standard non-factored MDPs. We propose a tighter connectivity measure, factored span, for FMDPs and prove a lower bound that depends on the factored span rather than the diameter $D$. To narrow the gap between the lower and upper bounds, we propose an adaptation of the REGAL.C algorithm whose regret bound depends on the factored span. Our oracle-efficient algorithms outperform previously proposed near-optimal algorithms on computer network administration simulations.
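As background for readers less familiar with non-episodic regret analysis (the notation here is standard, not taken verbatim from the paper): the regret after $T$ steps in an MDP $M$ with optimal average reward $\rho^*(M)$ is $\mathrm{Reg}_T = T\,\rho^*(M) - \sum_{t=1}^{T} r_t$. For a communicating MDP with bias (relative value) function $h^*$, the span $\mathrm{sp}(h^*) = \max_s h^*(s) - \min_s h^*(s)$ satisfies $\mathrm{sp}(h^*) \le D$, where $D$ is the diameter. This is why span-style connectivity measures, such as the factored span proposed here, can only tighten diameter-dependent bounds.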
Review for NeurIPS paper: Reinforcement Learning in Factored MDPs: Oracle-Efficient Algorithms and Tighter Regret Bounds for the Non-Episodic Setting
Weaknesses: I have some concerns and questions: - In order to come up with an efficiently-implementable algorithm, for DORL the authors construct an optimistic MDP following a very simple construction. This construction only considers error bounds and completely ignores the value function. So, while the proof guarantees optimism, I believe the resulting optimistic MDP is overly optimistic: in favoring computational efficiency this way, one may sacrifice learning efficiency to a large extent. Indeed, the idea of optimizing over a finite set of MDPs (in lieu of the bounded-parameter MDP) is nice. However, I believe the current construction, which completely ignores the value function, is too naive to work in practice.
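To make the reviewer's concern concrete, here is a minimal, hypothetical sketch (not the authors' DORL pseudocode, and written for a flat rather than factored MDP) of optimism over a finite candidate set built from error bounds alone. The `planner` callable is an assumed stand-in for the planning oracle, returning an (average reward, policy) pair.

```python
# Illustrative sketch only: build finitely many candidate optimistic MDPs
# from confidence widths, ignoring the value function during construction.
import numpy as np

def candidate_mdps(p_hat, widths):
    """For each possible next state, shift up to `widths[s, a]` probability
    mass toward it for every (s, a) pair. Only error bounds are used here;
    the value function is never consulted -- the reviewer's concern."""
    S, A, _ = p_hat.shape
    for s_next in range(S):
        p = p_hat.copy()
        for s in range(S):
            for a in range(A):
                bonus = min(widths[s, a], 1.0 - p[s, a, s_next])
                p[s, a, s_next] += bonus
                # Rescale the remaining next states so the row sums to 1.
                others = [j for j in range(S) if j != s_next]
                excess = p[s, a, others].sum()
                if excess > 0:
                    p[s, a, others] *= (1.0 - p[s, a, s_next]) / excess
        yield p

def pick_optimistic(p_hat, widths, r_hat, planner):
    """Plan in every candidate model and keep the one whose optimal
    average reward is largest; the value function enters only here."""
    return max(candidate_mdps(p_hat, widths),
               key=lambda p: planner(p, r_hat)[0])
```

The point of the sketch is that only $S$ candidates are ever planned in, which buys oracle efficiency, but each candidate inflates transitions toward a fixed next state regardless of how valuable that state is.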
Meta-review for NeurIPS paper: Reinforcement Learning in Factored MDPs: Oracle-Efficient Algorithms and Tighter Regret Bounds for the Non-Episodic Setting
After discussing with the reviewers, we have decided to propose acceptance for the paper. Nonetheless, I would like to stress that the current submission has a number of critical aspects that need to be addressed by the authors to make the paper more solid. I strongly encourage the authors to read the reviewers' comments and focus on improving along the following directions:
- Clarity: Reviewers all agree that the writing could be improved to make the paper more accessible to an audience not strictly familiar with the factored MDP formalism and/or the technicalities behind UCRL proofs. In particular, it is unclear how the parameter c was chosen and why it takes significantly different values for different algorithms. It would be helpful to see the performance as c changes (see the sketch after this list).
- The way optimism is obtained is probably not very tight, and it may cause over-exploration for a long time.
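A minimal sketch of the kind of sensitivity study the meta-review asks for, assuming (as is typical for UCRL-style methods, though not stated in the paper) that c rescales the confidence widths. `run_algorithm` is a hypothetical experiment harness returning final cumulative regret.

```python
# Hedged sketch: sweep the confidence-scaling parameter c and report
# mean/std of final regret. All names here are illustrative assumptions.
import numpy as np

def confidence_width(c, n_visits, t, delta=0.05):
    """UCRL2-style width for an empirical estimate after n_visits samples;
    c rescales how aggressively the algorithm explores."""
    return c * np.sqrt(2.0 * np.log(t / delta) / max(n_visits, 1))

def sweep_c(run_algorithm, env, horizon, cs=(0.01, 0.1, 1.0, 10.0), seeds=5):
    """Average final regret of each c value over a few random seeds."""
    results = {}
    for c in cs:
        regrets = [run_algorithm(env, horizon, c=c, seed=s)
                   for s in range(seeds)]
        results[c] = (np.mean(regrets), np.std(regrets))
    return results
```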