Self-Rewarding PPO: Aligning Large Language Models with Demonstrations Only