REAL: Reading Out Transformer Activations for Precise Localization in Language Model Steering
Li-Ming Zhan, Bo Liu, Chengqiang Xie, Jiannong Cao, Xiao-Ming Wu
arXiv.org Artificial Intelligence
Inference-time steering aims to alter a large language model's (LLM's) responses without changing its parameters, but a central challenge is identifying the internal modules that most strongly govern the target behavior. Existing approaches often rely on simplistic cues or ad hoc heuristics, leading to suboptimal or unintended effects. We introduce REAL, a framework for identifying behavior-relevant modules (attention heads or layers) in Transformer models. For each module, REAL trains a vector-quantized autoencoder (VQ-AE) on its hidden activations and uses a shared, learnable codebook to partition the latent space into behavior-relevant and behavior-irrelevant subspaces. REAL quantifies a module's behavioral relevance by how well its VQ-AE encodings discriminate behavior-aligned from behavior-violating responses via a binary classification metric; this score guides both module selection and steering strength. We evaluate REAL across eight LLMs from the Llama and Qwen families and nine datasets spanning truthfulness enhancement, open-domain QA under knowledge conflicts, and general alignment tasks. REAL enables more effective inference-time interventions, achieving an average relative improvement of 20% (up to 81.5%) over the ITI method on truthfulness steering. In addition, the modules selected by REAL exhibit strong zero-shot generalization in cross-domain truthfulness-steering scenarios.
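The module-scoring idea in the abstract can be illustrated with a deliberately simplified sketch. It is not the authors' implementation: a plain k-means codebook stands in for the VQ-AE, accuracy of a per-code majority vote stands in for the unspecified binary classification metric, and the names `quantize`, `relevance_score`, and `rank_modules` are hypothetical. It only shows how a discrete encoding of each module's activations could be turned into a relevance score used to rank modules.

```python
import numpy as np


def quantize(acts, k=4, iters=20, seed=0):
    """Assign each activation to its nearest of k code vectors.

    Assumption: plain k-means replaces the learned VQ-AE codebook.
    """
    rng = np.random.default_rng(seed)
    codebook = acts[rng.choice(len(acts), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared distance from every activation to every code vector.
        dists = ((acts[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        codes = dists.argmin(1)
        for j in range(k):
            if (codes == j).any():
                codebook[j] = acts[codes == j].mean(0)
    return codes


def relevance_score(acts, labels, k=4):
    """Accuracy of predicting the 0/1 label from the code index alone."""
    codes = quantize(acts, k=k)
    correct = 0
    for j in range(k):
        members = labels[codes == j]
        if len(members):
            # Majority vote inside each code: count the larger class.
            ones = int(members.sum())
            correct += max(ones, len(members) - ones)
    return correct / len(labels)


def rank_modules(module_acts, labels, top_k=2):
    """Score every module and return the top_k most discriminative ones."""
    scores = {name: relevance_score(a, labels) for name, a in module_acts.items()}
    selected = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return scores, selected
```

Here `module_acts` maps a module name (for example, an attention head) to an array of shape `(n_responses, hidden_dim)`, and `labels` is a NumPy array marking each response as behavior-aligned (1) or behavior-violating (0). The abstract also states that the score guides steering strength; how that mapping works is not described there, so it is left out of this sketch.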
Oct-2-2025