ReDit: Reward Dithering for Improved LLM Policy Optimization
