Reinforcement Learning
REBEL: Reinforcement Learning via Regressing Relative Rewards Zhaolin Gao 1, Jonathan D. Chang
While originally developed for continuous control problems, Proximal Policy Optimization (PPO) has emerged as the work-horse of a variety of reinforcement learning (RL) applications, including the fine-tuning of generative models. Unfortunately, PPO requires multiple heuristics to enable stable convergence (e.g.
Reviews: Finite-Time Performance Bounds and Adaptive Learning Rate Selection for Two Time-Scale Reinforcement Learning
NeurIPS 2019 Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center "2626" "Finite-Time Performance Bounds and Adaptive Learning Rate Selection for Two Time-Scale Reinforcement Learning" The reviewers unanimously support acceptance. We encourage the authors to strongly consider the suggestions provided by the reviewers for improving a camera ready version.
We thank all the reviewers for their encouraging comments
We thank all the reviewers for their encouraging comments. In both these cases, τ is effectively zero. Liu et al. shows how GTD-class algorithms can be formally derived using a primal-dual saddle point Sutton et al. presents a (single time-scale) variant of linear TD learning, which they call emphatic TD and show that They also provide an asymptotic convergence analysis to the set of local optima. If the paper is accepted, we will work further on improving the clarity of the work.