Cooper: Co-Optimizing Policy and Reward Models in Reinforcement Learning for Large Language Models
