REBEL: Reinforcement Learning via Regressing Relative Rewards
Zhaolin Gao 1, Jonathan D. Chang
While originally developed for continuous control problems, Proximal Policy Optimization (PPO) has emerged as the workhorse of a variety of reinforcement learning (RL) applications, including the fine-tuning of generative models. Unfortunately, PPO requires multiple heuristics to enable stable convergence (e.g.