Rethinking Deep Alignment Through The Lens Of Incomplete Learning

Bach, Thong; Nguyen, Dung; Le, Thao Minh; Tran, Truyen

Nov-18-2025–arXiv.org Artificial Intelligence

Large language models exhibit systematic vulnerabilities to adversarial attacks despite extensive safety alignment. We provide a mechanistic analysis revealing that position-dependent gradient weakening during autoregressive training creates signal decay, leading to incomplete safety learning: safety training fails to fully transform model preferences in later response regions. We introduce base-favored tokens (vocabulary elements to which base models assign higher probability than aligned models) as computational indicators of incomplete safety learning, and we develop a targeted completion method that addresses undertrained regions through adaptive penalties and hybrid teacher distillation. Experimental evaluation across the Llama and Qwen model families demonstrates dramatic improvements in adversarial robustness, with 48-98% reductions in attack success rates while preserving general capabilities. These results establish both a mechanistic understanding of, and practical solutions for, fundamental limitations in safety alignment methodologies.
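The definition of base-favored tokens can be illustrated with a minimal sketch. It assumes we already have next-token logits from a base model and from its aligned counterpart at the same response position. The function names, the toy four-token vocabulary, and the `margin` parameter are hypothetical choices for illustration, not the authors' implementation, and the paper's exact criterion may differ.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def base_favored_tokens(base_logits, aligned_logits, margin=0.0):
    """Return vocabulary indices where the base model assigns higher
    probability than the aligned model at one position.

    `margin` (hypothetical, in log-probability units) requires the base
    model to exceed the aligned model by more than that amount.
    """
    if len(base_logits) != len(aligned_logits):
        raise ValueError("both models must share the same vocabulary size")
    base_lp = log_softmax(base_logits)
    aligned_lp = log_softmax(aligned_logits)
    return [
        i
        for i, (b, a) in enumerate(zip(base_lp, aligned_lp))
        if b - a > margin
    ]


# Toy logits over a 4-token vocabulary (made-up numbers, not real model output).
base = [2.0, 0.5, 0.1, -1.0]
aligned = [0.5, 0.5, 2.0, -1.0]

favored = base_favored_tokens(base, aligned)
strongly_favored = base_favored_tokens(base, aligned, margin=1.0)
print(favored, strongly_favored)
```

In this toy case the aligned model has shifted most of its probability mass onto token 2, so every other token is still favored by the base model; a positive margin keeps only the tokens where the gap is large.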

Genre:
- Research Report (1.00)

Industry:
- Information Technology > Security & Privacy (1.00)
- Law Enforcement & Public Safety (0.93)
- Health & Medicine > Therapeutic Area
  - Psychiatry/Psychology > Addiction Disorder (0.68)

Technology:
- Information Technology
  - Security & Privacy (1.00)
  - Artificial Intelligence
    - Natural Language > Large Language Model (0.70)
    - Machine Learning > Neural Networks
      - Deep Learning (0.70)