Policy Optimization in RLHF: The Impact of Out-of-preference Data
