Direct Preference-based Policy Optimization without Reward Modeling

Neural Information Processing Systems 

We propose a preference-based reinforcement learning (PbRL) algorithm that learns directly from preferences, without requiring any reward modeling.
