Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs