BPO: Supercharging Online Preference Learning by Adhering to the Proximity of Behavior LLM
