Stepwise Alignment for Constrained Language Model Policy Optimization

Mar-22-2026, 07:16:07 GMT – Neural Information Processing Systems

Safety and trustworthiness are indispensable requirements for real-world applications of AI systems that use large language models (LLMs). This paper formulates human value alignment as a constrained optimization problem in which the language model policy maximizes reward subject to a safety constraint, and then proposes an algorithm, Stepwise Alignment for Constrained Policy Optimization (SACPO). One key idea behind SACPO, supported by theory, is that the optimal policy accounting for both reward and safety can be obtained directly from a reward-aligned policy.
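
The formulation described above can be sketched as follows. This is only an illustrative reading of the abstract, not the paper's exact statement: the symbols r (reward), g (safety function), b (safety threshold), \beta (regularization strength), \lambda (Lagrange multiplier), and \pi_{\mathrm{ref}} (reference policy) are all assumed here, as is the KL-regularized form of the objective.

```latex
% Assumed constrained objective: maximize reward, stay close to a reference
% policy, and keep expected safety above a threshold b.
\max_{\pi}\;
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\bigl[ r(x, y) \bigr]
  - \beta\, D_{\mathrm{KL}}\bigl( \pi \,\|\, \pi_{\mathrm{ref}} \bigr)
\quad \text{s.t.} \quad
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\bigl[ g(x, y) \bigr] \ge b

% For a fixed multiplier \lambda \ge 0, the Lagrangian is maximized by
\pi_{\lambda}(y \mid x) \;\propto\;
  \pi_{\mathrm{ref}}(y \mid x)\,
  \exp\!\Bigl( \tfrac{1}{\beta} \bigl( r(x, y) + \lambda\, g(x, y) \bigr) \Bigr)

% Writing the reward-aligned policy as
% \pi_{r}(y \mid x) \propto \pi_{\mathrm{ref}}(y \mid x) \exp( r(x, y) / \beta ),
% the same solution factors as
\pi_{\lambda}(y \mid x) \;\propto\;
  \pi_{r}(y \mid x)\, \exp\!\Bigl( \tfrac{\lambda}{\beta}\, g(x, y) \Bigr)
```

Under these assumptions, the last line shows why a stepwise procedure is plausible: the policy that balances reward and safety is the reward-aligned policy reweighted by a safety term, so safety alignment can start from a policy that has already been aligned for reward.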

artificial intelligence, large language model, natural language

Technology:
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.64)