InfoRM: Mitigating Reward Hacking in RLHF via Information-Theoretic Reward Modeling
Yuchun Miao
Neural Information Processing Systems
With the advent of large language models (LLMs), reinforcement learning from human feedback (RLHF) has emerged as a pivotal technological paradigm for aligning model behavior with human values [
Feb-18-2026, 16:24:08 GMT
- Country:
  - Asia
    - China > Hubei Province
      - Wuhan (0.04)
    - Middle East > Jordan (0.04)
    - Myanmar > Tanintharyi Region
      - Dawei (0.04)
    - Singapore (0.14)
  - Europe > Italy
    - Calabria > Catanzaro Province > Catanzaro (0.04)
  - North America > United States
    - Virginia (0.04)
- Genre:
  - Research Report
    - Experimental Study (0.93)
    - New Finding (0.93)
- Industry:
  - Information Technology > Security & Privacy (0.46)
- Technology: