Certifiable Safe RLHF: Fixed-Penalty Constraint Optimization for Safer Language Models
