LLM Safety Alignment is Divergence Estimation in Disguise

Jun-23-2026, 09:25:20 GMT–Neural Information Processing Systems

We present a theoretical framework showing that popular LLM alignment methods, including RLHF and its variants, can be understood as divergence estimators between aligned (safe or preferred) and unaligned (harmful or less-preferred) distributions. This perspective explains why safe and harmful prompts become separated in the latent space after alignment. As an application of our general divergence framework, we propose KLDO, a novel KL divergence-based alignment method, and empirically validate its effectiveness. We further show that using compliance-refusal datasets, rather than standard preference-based datasets, leads to stronger separation and improved safety alignment. Finally, to quantify the separation effect, we propose a distance-based metric in the prompt representation space, which also serves as a statistically significant indicator of model safety.
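To make "divergence estimation" concrete, the sketch below illustrates one standard variational estimator of KL divergence, the Donsker-Varadhan bound: for any critic function T, E_P[T] - log E_Q[exp(T)] is at most KL(P || Q), with equality when T = log(p/q). This is an illustration only. The abstract does not state which estimator KLDO uses, and the distributions `p` and `q`, the critic values, and the function names here are hypothetical toy choices, not taken from the paper.

```python
import math


def kl_divergence(p, q):
    """Exact KL(P || Q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def dv_bound(p, q, t):
    """Donsker-Varadhan lower bound E_P[T] - log E_Q[exp(T)] for critic values t."""
    expect_p = sum(pi * ti for pi, ti in zip(p, t))
    log_expect_q = math.log(sum(qi * math.exp(ti) for qi, ti in zip(q, t)))
    return expect_p - log_expect_q


# Toy stand-ins (assumed for illustration): an "aligned" and an "unaligned"
# distribution over three outcomes.
p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]

# The optimal critic is the log density ratio; any other critic gives a
# looser (smaller) value.
optimal_t = [math.log(pi / qi) for pi, qi in zip(p, q)]
arbitrary_t = [1.0, 0.0, -1.0]

exact = kl_divergence(p, q)
tight = dv_bound(p, q, optimal_t)
loose = dv_bound(p, q, arbitrary_t)
```

In this toy setting, maximizing the bound over the critic recovers the divergence itself, which is the sense in which a trained scoring function can act as a divergence estimator between two distributions.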

Country:
- North America > United States (0.28)

Genre:
- Research Report
  - Experimental Study (1.00)
  - New Finding (0.67)

Industry:
- Health & Medicine > Consumer Health (0.93)
- Information Technology > Security & Privacy (0.67)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (0.92)
  - Machine Learning
    - Neural Networks > Deep Learning (0.90)
    - Performance Analysis > Accuracy (0.67)
