Energy-Driven Steering: Reducing False Refusals in Large Language Models

Jiang, Eric Hanchen, Ou, Weixuan, Liu, Run, Pang, Shengyuan, Wan, Guancheng, Duan, Ranjie, Dong, Wei, Chang, Kai-Wei, Wang, XiaoFeng, Wu, Ying Nian, Li, Xinfeng

Oct-13-2025–arXiv.org Machine Learning

Safety alignment of large language models (LLMs) faces a key challenge: current alignment techniques often focus only on improving safety against harmful prompts, which makes LLMs over-cautious and prone to refusing benign prompts. A key objective of safety alignment is therefore to enhance safety while simultaneously reducing false refusals. In this paper, we introduce Energy-Driven Steering (EDS), a novel, fine-tuning-free framework designed to resolve this challenge through dynamic, inference-time intervention. We train a lightweight, external Energy-Based Model (EBM) to assign high energy to undesirable states (false refusal or jailbreak) and low energy to desirable ones (helpful response or safe rejection). We use the gradient of the energy function to dynamically steer the LLM's hidden states toward low-energy regions, correcting the model to generate a desirable response in real time without modifying its weights. Extensive experiments across a wide range of models show that our method achieves this objective: it substantially lowers false refusal rates. Our work presents an effective paradigm for building LLMs that achieve both low false refusal rates and high safety.

Note: This paper contains examples with potentially disturbing content.

The alignment of large language models (LLMs) with human safety remains a central challenge in artificial intelligence research (Bianchi et al., 2023; Anwar et al., 2024; Xu et al., 2020; Röttger et al., 2020; Sun et al., 2021; Vidgen et al., 2023). Common approaches such as Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), system prompt engineering, and vector ablation have proven effective. However, these methods often introduce an unintended tradeoff: they can lead either to excessive refusal (over-rejection) or to lapses in safety. This behavior is not merely an inconvenience; it severely undermines model utility and reliability in critical domains.
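The steering step described in the abstract (moving a hidden state along the negative gradient of an energy function) can be illustrated with a minimal sketch. This is not the paper's implementation: the quadratic `toy_energy`, its `target` vector, and the `step_size` and `num_steps` values are all hypothetical stand-ins chosen so the example runs with the standard library alone. In EDS the energy comes from a trained EBM and the hidden states come from the LLM.

```python
from typing import Callable, List

Vector = List[float]


def toy_energy(h: Vector, target: Vector) -> float:
    """Hypothetical stand-in for a learned energy: 0.5 * ||h - target||^2."""
    return 0.5 * sum((hi - ti) ** 2 for hi, ti in zip(h, target))


def toy_energy_grad(h: Vector, target: Vector) -> Vector:
    """Analytic gradient of toy_energy with respect to h."""
    return [hi - ti for hi, ti in zip(h, target)]


def steer(
    h: Vector,
    grad_fn: Callable[[Vector], Vector],
    step_size: float = 0.1,  # assumed value, for illustration only
    num_steps: int = 5,      # assumed value, for illustration only
) -> Vector:
    """Repeat the update h <- h - step_size * grad E(h); no weights change."""
    for _ in range(num_steps):
        g = grad_fn(h)
        h = [hi - step_size * gi for hi, gi in zip(h, g)]
    return h


# A made-up 3-dimensional "hidden state" and low-energy target.
target = [1.0, -2.0, 0.5]
hidden = [3.0, 0.0, -1.0]

steered = steer(hidden, lambda v: toy_energy_grad(v, target))
energy_before = toy_energy(hidden, target)
energy_after = toy_energy(steered, target)
print(energy_before, energy_after)
```

In this toy setting each step shrinks the distance to `target` by a factor of `1 - step_size`, so the energy falls monotonically. With a learned, non-convex energy no such guarantee holds, and the step size and number of steps would need to be tuned.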

large language model, machine learning, natural language

Country:
- North America > United States
  - California > Los Angeles County > Los Angeles (0.14)
- Europe > Latvia
  - Lubāna Municipality > Lubāna (0.04)
- Asia > China
  - Shanghai > Shanghai (0.04)

Genre:
- Research Report > New Finding (0.93)

Industry:
- Law (0.68)
- Information Technology (0.46)
- Health & Medicine (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.88)
