Zeroth-Order Policy Gradient for Reinforcement Learning from Human Feedback without Reward Inference
