A Single Direction of Truth: An Observer Model's Linear Residual Probe Exposes and Steers Contextual Hallucinations

O'Neill, Charles, Chalnev, Slava, Zhao, Chi Chi, Kirkby, Max, Jayasekara, Mudith

Aug-1-2025, arXiv.org Artificial Intelligence

Contextual hallucinations (statements unsupported by the given context) remain a significant challenge in AI. We demonstrate a practical interpretability insight: a generator-agnostic observer model detects hallucinations via a single forward pass and a linear probe on its residual stream. This probe isolates a single, transferable linear direction separating hallucinated from faithful text, outperforming baselines by 5-27 points and showing robust mid-layer performance across Gemma-2 models (2B to 27B). Gradient-times-activation localises this signal to sparse, late-layer MLP activity. Critically, manipulating this direction causally steers generator hallucination rates, proving its actionability. Our results offer novel evidence of internal, low-dimensional hallucination tracking linked to specific MLP sub-circuits, exploitable for detection and mitigation. We release the 2000-example ContraTales benchmark for realistic assessment of such solutions.
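
To make the idea of a linear probe and a steerable direction concrete, here is a minimal sketch. It is not the authors' implementation. It assumes toy Gaussian vectors in place of real residual-stream activations, a logistic-regression probe trained by plain gradient descent, and a planted "hallucination direction". All function names, dimensions, and hyperparameters are hypothetical. Steering here only shifts a vector and re-scores it with the probe; it does not intervene on a generator.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_toy_activations(n, direction, separation, rng):
    """Toy stand-in for residual-stream vectors (NOT real model activations).

    Label 1 ("hallucinated") is shifted by +separation/2 along `direction`,
    label 0 ("faithful") by -separation/2; everything else is unit Gaussian noise.
    """
    data = []
    for i in range(n):
        label = i % 2
        sign = 1.0 if label == 1 else -1.0
        x = [rng.gauss(0.0, 1.0) + sign * 0.5 * separation * d for d in direction]
        data.append((x, label))
    return data


def train_probe(data, dim, lr=0.5, epochs=100):
    """Fit weights w and bias b by full-batch gradient descent on the logistic loss."""
    w, b = [0.0] * dim, 0.0
    n = len(data)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in data:
            err = sigmoid(dot(w, x) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def steer(x, direction, alpha):
    """Move x by alpha units along the unit-normalised direction."""
    norm = math.sqrt(dot(direction, direction))
    return [xi + alpha * di / norm for xi, di in zip(x, direction)]


rng = random.Random(0)
dim = 8
# Hypothetical planted direction (unit norm), used only to build the toy data.
true_direction = [1.0 / math.sqrt(2), -1.0 / math.sqrt(2)] + [0.0] * (dim - 2)

train = make_toy_activations(200, true_direction, separation=4.0, rng=rng)
test = make_toy_activations(200, true_direction, separation=4.0, rng=rng)
w, b = train_probe(train, dim)

# Detection: threshold the probe logit at zero.
accuracy = sum((dot(w, x) + b > 0) == (y == 1) for x, y in test) / len(test)

# How closely the learned weight vector aligns with the planted direction.
cosine = dot(w, true_direction) / math.sqrt(dot(w, w))

# Steering: push one "hallucinated" vector against the probe direction.
x_hall = next(x for x, y in test if y == 1)
logit_before = dot(w, x_hall) + b
logit_after = dot(w, steer(x_hall, w, alpha=-4.0)) + b

print(f"accuracy={accuracy:.2f} cosine={cosine:.2f}")
print(f"logit before={logit_before:.2f} after={logit_after:.2f}")
```

In this toy setting the probe's weight vector plays the role of the single direction: its dot product with an activation gives the detection score, and moving an activation against it lowers that score. With a real observer model, the input vectors would instead be residual-stream activations read from a chosen layer.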

large language model, machine learning, natural language, (19 more...)


Country:
- Europe > United Kingdom (0.68)

Genre:
- Research Report > New Finding (0.66)

Industry:
- Health & Medicine (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning (0.93)
  - Natural Language
    - Large Language Model (0.96)
    - Text Processing (0.68)
  - Machine Learning
    - Statistical Learning (0.94)
    - Neural Networks > Deep Learning (0.69)
