The Realignment Problem: When Right becomes Wrong in LLMs

Aakash Sen Sharma, Debdeep Sanyal, Vivek Srivastava, Shirish Karande, Murari Mandal

arXiv.org — Artificial Intelligence

The alignment of Large Language Models (LLMs) with human values is central to their safe deployment, yet current practice produces static, brittle, and costly-to-maintain models that fail to keep pace with evolving norms and policies. This misalignment, which we term the Alignment-Reality Gap, poses a growing challenge for reliable long-term use. Existing remedies are inadequate: large-scale re-annotation is economically prohibitive, and standard unlearning methods act as blunt instruments that erode utility rather than enable precise policy updates. We introduce TRACE (Triage and Re-align by Alignment Conflict Evaluation), a framework for principled unlearning that reconceives re-alignment as a programmatic policy application problem. TRACE programmatically triages existing preference data against a new policy, identifies high-impact conflicts via an alignment impact score, and applies a hybrid optimization that cleanly inverts, discards, or preserves preferences while safeguarding model performance. Empirical results show that TRACE achieves robust re-alignment across diverse model families, including Qwen2.5-7B. On both synthetic benchmarks and the PKU-SafeRLHF dataset under complex policy shift, TRACE enforces new principles without degrading general capabilities. Our work establishes a scalable, dynamic, and cost-effective paradigm for maintaining LLM alignment, providing a foundation for sustainable and responsible AI deployment.

The advent of Large Language Models (LLMs) aligned with human values through Reinforcement Learning from Human Feedback (RLHF) represents a landmark achievement in artificial intelligence. This process transforms raw predictive models into safe and helpful agents, forming the bedrock of their widespread deployment. Yet this bedrock is built on a profoundly brittle assumption: that human values are static.
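The triage step described above can be illustrated with a minimal sketch. This is not the paper's implementation: the `policy_prefers` judge, the `impact_score` function, and the `threshold` parameter are all hypothetical stand-ins for TRACE's actual policy evaluator and alignment impact score. The sketch shows only the decision logic: preferences the new policy agrees with are preserved, high-impact conflicts are inverted, and low-impact conflicts are discarded.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    PRESERVE = "preserve"
    INVERT = "invert"
    DISCARD = "discard"


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


def policy_prefers(policy, a, b):
    # Hypothetical policy judge: True if the new policy scores `a` above `b`.
    return policy(a) > policy(b)


def triage(pairs, policy, impact_score, threshold=0.5):
    """Triage existing preference data against a new policy.

    For each pair: if the new policy agrees with the stored preference,
    preserve it; if it disagrees and the conflict is high-impact (per a
    hypothetical `impact_score`), invert the labels; low-impact conflicts
    are discarded.
    """
    triaged = []
    for p in pairs:
        if policy_prefers(policy, p.chosen, p.rejected):
            triaged.append((p, Action.PRESERVE))
        elif impact_score(p) >= threshold:
            # High-impact conflict: swap chosen/rejected to encode the new policy.
            inverted = PreferencePair(p.prompt, p.rejected, p.chosen)
            triaged.append((inverted, Action.INVERT))
        else:
            triaged.append((p, Action.DISCARD))
    return triaged
```

The resulting partitions could then feed a hybrid objective, e.g. preference optimization on the preserved and inverted sets while the discarded set is simply excluded.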