A Single Direction of Truth: An Observer Model's Linear Residual Probe Exposes and Steers Contextual Hallucinations
O'Neill, Charles, Chalnev, Slava, Zhao, Chi Chi, Kirkby, Max, Jayasekara, Mudith
arXiv.org Artificial Intelligence
Contextual hallucinations (statements unsupported by the given context) remain a significant challenge in AI. We demonstrate a practical interpretability insight: a generator-agnostic observer model detects hallucinations via a single forward pass and a linear probe on its residual stream. This probe isolates a single, transferable linear direction separating hallucinated from faithful text, outperforming baselines by 5-27 points and showing robust mid-layer performance across Gemma-2 models (2B to 27B). Gradient-times-activation attribution localises this signal to sparse, late-layer MLP activity. Critically, manipulating this direction causally steers generator hallucination rates, showing that the signal is actionable. Our results offer novel evidence of internal, low-dimensional hallucination tracking linked to specific MLP sub-circuits, which can be exploited for detection and mitigation. We release the 2000-example ContraTales benchmark for realistic assessment of such solutions.
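The probe-and-steer idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the activations are synthetic Gaussian vectors with a planted direction standing in for an observer model's residual stream, and all names and sizes (`d_model`, `fit_linear_probe`, the steering strength) are hypothetical choices made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n = 64, 400  # hypothetical residual width and example count

# Assumption: synthetic stand-in for residual-stream activations.
# Label 1 = "hallucinated", 0 = "faithful"; the classes differ along one planted direction.
true_dir = rng.normal(size=d_model)
true_dir /= np.linalg.norm(true_dir)
labels = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, d_model)) + 4.0 * labels[:, None] * true_dir


def fit_linear_probe(x, y, lr=0.1, steps=500):
    """Logistic-regression probe trained with plain gradient descent."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        grad = p - y  # gradient of the log-loss w.r.t. the logits
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


# Train on the first 300 examples, evaluate on the remaining 100.
w, b = fit_linear_probe(acts[:300], labels[:300])
probe_dir = w / np.linalg.norm(w)  # the single linear direction found by the probe


def probe_prob(x):
    """Probe's estimated probability that an activation is 'hallucinated'."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))


test_x, test_y = acts[300:], labels[300:]
test_acc = ((probe_prob(test_x) > 0.5) == test_y).mean()

# Toy "steering": shift hallucinated activations against the probe direction.
halluc = test_x[test_y == 1]
steered = halluc - 4.0 * probe_dir

print(f"held-out probe accuracy: {test_acc:.2f}")
print(f"cosine(probe_dir, planted dir): {probe_dir @ true_dir:.2f}")
print(f"mean P(halluc) before/after steering: "
      f"{probe_prob(halluc).mean():.2f} / {probe_prob(steered).mean():.2f}")
```

In a real setting the rows of `acts` would be residual-stream vectors read from a chosen layer of the observer model, and steering would be applied inside the generator's forward pass rather than to stored vectors. Both steps are outside the scope of this sketch.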
Aug-1-2025
- Country:
  - Asia > Singapore (0.04)
  - Europe
    - Germany (0.04)
    - Greece (0.04)
    - United Kingdom
      - England > Greater London
        - London (0.04)
      - Wales > Pembrokeshire (0.04)
  - North America
    - Canada (0.04)
    - United States (0.04)
- Genre:
  - Research Report > New Finding (0.66)
- Industry:
  - Health & Medicine (0.46)
- Technology: