Enhancing Safety in Reinforcement Learning with Human Feedback via Rectified Policy Optimization

Jun-12-2026, 08:48:18 GMT – Neural Information Processing Systems

Balancing helpfulness and safety (harmlessness) is a critical challenge in aligning large language models (LLMs). Current approaches often decouple these two objectives: they train separate preference models for helpfulness and safety, and frame safety as a constraint within a constrained Markov Decision Process (CMDP) framework. This paper identifies a potential issue with the widely adopted expected safety constraints for LLM safety alignment, termed "safety compensation": the constraints are satisfied in expectation, but individual prompts may trade off safety against one another, so some responses are overly restrictive while others remain unsafe.
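The "safety compensation" effect can be illustrated with a small numerical sketch. This is not the paper's method or data: the per-prompt safety costs, the threshold of 0, and the convention that a cost above the threshold means "unsafe" are all assumptions made for illustration, as are the function names.

```python
# Hypothetical per-prompt safety costs (made-up numbers, not from the paper).
# Assumed convention: a cost above the threshold marks an unsafe response.
costs = [-3.0, -2.5, 1.5, 2.0]
threshold = 0.0


def expected_constraint_satisfied(costs, threshold):
    """Check the constraint on the average cost across all prompts."""
    return sum(costs) / len(costs) <= threshold


def unsafe_prompts(costs, threshold):
    """Return the indices of prompts whose own cost exceeds the threshold."""
    return [i for i, c in enumerate(costs) if c > threshold]


mean_ok = expected_constraint_satisfied(costs, threshold)
violations = unsafe_prompts(costs, threshold)
print(mean_ok, violations)
```

In this toy case the two very low-cost (overly cautious) prompts pull the average below the threshold, so the expected constraint holds even though the last two prompts individually violate it.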

large language model, machine learning, natural language

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (0.87)
  - Machine Learning (0.79)