Stabilizing RLHF through Advantage Model and Selective Rehearsal
