β-DPO: Direct Preference Optimization with Dynamic β
Junkang Wu
Neural Information Processing Systems
Despite its effectiveness, RLHF's instability and computational requirements often limit its practical
Oct-10-2025, 20:20:21 GMT