Simplify RLHF as Reward-Weighted SFT: A Variational Method
