Direct Preference-based Policy Optimization without Reward Modeling
Neural Information Processing Systems
We propose a preference-based reinforcement learning (PbRL) algorithm that learns directly from preferences, without requiring any reward modeling.
Nov-19-2025, 22:18:48 GMT
- Technology: