Direct Advantage Regression: Aligning LLMs with Online AI Reward
