InfoRM: Mitigating Reward Hacking in RLHF via Information-Theoretic Reward Modeling — Yuchun Miao

Neural Information Processing Systems 

With the advent of large language models (LLMs), reinforcement learning from human feedback (RLHF) has emerged as a pivotal technological paradigm for aligning models' behaviors with human values.
