RRHF: Rank Responses to Align Language Models with Human Feedback

Neural Information Processing Systems 

InstructGPT implements RLHF through several stages, including Supervised Fine-Tuning (SFT), reward model training, and Proximal Policy Optimization (PPO).
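As an illustration of the final stage named above, the following is a minimal sketch of the standard PPO clipped surrogate objective for a single sample. It is not taken from InstructGPT or from this paper; the function name `ppo_clipped_objective`, its parameter names, and the default `clip_eps=0.2` are assumptions chosen for illustration.

```python
import math


def ppo_clipped_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """Per-sample PPO clipped surrogate objective (to be maximized).

    logp_new / logp_old: log-probability of the sampled response under the
    current policy and under the policy that generated the sample.
    advantage: advantage estimate, e.g. derived from a reward model score.
    clip_eps: hypothetical default; the clipping range is a tunable choice.
    """
    # Probability ratio between the current and the sampling policy.
    ratio = math.exp(logp_new - logp_old)
    # Restrict the ratio to the interval [1 - clip_eps, 1 + clip_eps].
    clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Take the more pessimistic of the unclipped and clipped terms.
    return min(ratio * advantage, clipped * advantage)
```

With a positive advantage, the objective stops growing once the ratio exceeds `1 + clip_eps`, which limits how far one update can move the policy from the one that produced the samples.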
