LASeR: Learning to Adaptively Select Reward Models with Multi-Arm Bandits
–Neural Information Processing Systems
Reward Models (RMs) are crucial to aligning large language models (LLMs), but the degree to which an RM specialized to one task (e.g.
Neural Information Processing Systems
Jun-13-2026, 18:24:18 GMT
- Technology: