The Peril of Preference: Why GRPO fails on Ordinal Rewards
