Do Androids Know They're Only Dreaming of Electric Sheep?
Sky CH-Wang, Benjamin Van Durme, Jason Eisner, Chris Kedzie
–arXiv.org Artificial Intelligence
We design probes trained on the internal representations of a transformer language model that are predictive of its hallucinatory behavior on in-context generation tasks. To facilitate this detection, we create a span-annotated dataset of organic and synthetic hallucinations over several tasks. We find that probes trained on the force-decoded states of synthetic hallucinations are generally ecologically invalid in organic hallucination detection. Furthermore, hidden state information about hallucination appears to be task and distribution-dependent. Intrinsic and extrinsic hallucination saliency varies across layers, hidden state types, and tasks; notably, extrinsic hallucinations tend to be more salient in a transformer's internal representations. Outperforming multiple contemporary baselines, we show that probing is a feasible and efficient alternative to language model hallucination evaluation when model states are available.
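The sketch below illustrates the general idea of probing hidden states for hallucination, not the authors' implementation: it extracts per-token hidden states from an open causal language model via Hugging Face Transformers and fits a linear probe against token-level hallucination labels. The model choice (`gpt2`), layer index, and toy span annotations are placeholder assumptions.

```python
# Minimal sketch of a hidden-state probe for hallucination detection.
# Assumptions (not from the paper): gpt2 as the base model, layer 6,
# and a toy span-annotated example with per-token 0/1 labels.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from sklearn.linear_model import LogisticRegression

model_name = "gpt2"  # placeholder; any causal LM exposing hidden states works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

def hidden_states_at_layer(text: str, layer: int) -> torch.Tensor:
    """Return per-token hidden states from one layer for a single input."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # outputs.hidden_states is a tuple of (num_layers + 1) tensors,
    # each of shape (batch, seq_len, hidden_dim).
    return outputs.hidden_states[layer][0]  # (seq_len, hidden_dim)

# Hypothetical span-annotated data: (text, per-token hallucination labels).
# Real labels must be aligned with the tokenizer's segmentation.
train_data = [
    ("The capital of France is Berlin.", [0, 0, 0, 0, 0, 1, 0]),
]

X, y = [], []
for text, labels in train_data:
    states = hidden_states_at_layer(text, layer=6)
    n = min(len(labels), states.shape[0])  # guard against misalignment
    X.extend(states[:n].numpy())
    y.extend(labels[:n])

# Linear probe: logistic regression over per-token hidden states.
probe = LogisticRegression(max_iter=1000).fit(X, y)
```

At inference time, such a probe would score each generated token's hidden state from the same layer, and high-scoring spans would be flagged as likely hallucinations.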
Dec-28-2023
- Country:
- Asia
- Japan (0.14)
- Middle East (0.14)
- Europe
- Belgium (0.14)
- Germany (0.14)
- Spain > Canary Islands (0.14)
- North America
- Canada (0.14)
- United States (0.14)
- Oceania > Australia (0.14)
- Genre:
- Research Report (1.00)
- Industry:
- Consumer Products & Services > Restaurants (0.68)
- Leisure & Entertainment > Sports (0.68)