LASeR: Learning to Adaptively Select Reward Models with Multi-Arm Bandits

Neural Information Processing Systems 

Reward Models (RMs) are crucial to aligning large language models (LLMs), but the degree to which an RM specialized to one task (e.g.