 regression problem

REBEL: Reinforcement Learning via Regressing Relative Rewards
Zhaolin Gao, Jonathan D. Chang

Neural Information Processing Systems

While originally developed for continuous control problems, Proximal Policy Optimization (PPO) has emerged as the workhorse of a variety of reinforcement learning (RL) applications, including the fine-tuning of generative models. Unfortunately, PPO requires multiple heuristics to enable stable convergence (e.g.

8efb100a295c0c690931222ff4467bb8-AuthorFeedback.pdf

Neural Information Processing Systems

In the final response prediction, we treat all the neighbors equally, using the same weight. Their result also considered the runtime-precision tradeoff, which we did not take into account.