Do Androids Know They're Only Dreaming of Electric Sheep?
Sky CH-Wang, Benjamin Van Durme, Jason Eisner, Chris Kedzie
–arXiv.org Artificial Intelligence
We design probes trained on the internal representations of a transformer language model that are predictive of its hallucinatory behavior on in-context generation tasks. To facilitate this detection, we create a span-annotated dataset of organic and synthetic hallucinations over several tasks. We find that probes trained on the force-decoded states of synthetic hallucinations are generally ecologically invalid in organic hallucination detection. Furthermore, hidden state information about hallucination appears to be task and distribution-dependent. Intrinsic and extrinsic hallucination saliency varies across layers, hidden state types, and tasks; notably, extrinsic hallucinations tend to be more salient in a transformer's internal representations. Outperforming multiple contemporary baselines, we show that probing is a feasible and efficient alternative to language model hallucination evaluation when model states are available.
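The sketch below illustrates the general idea of probing hidden states for hallucination, not the authors' implementation: it extracts per-token hidden states from an open causal language model via Hugging Face Transformers and fits a linear probe against token-level hallucination labels. The model choice (`gpt2`), layer index, and toy span annotations are placeholder assumptions.

```python
# Minimal sketch of a hidden-state probe for hallucination detection.
# Assumptions (not from the paper): gpt2 as the base model, layer 6,
# and a toy span-annotated example with per-token 0/1 labels.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from sklearn.linear_model import LogisticRegression

model_name = "gpt2"  # placeholder; any causal LM exposing hidden states works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

def hidden_states_at_layer(text: str, layer: int) -> torch.Tensor:
    """Return per-token hidden states from one layer for a single input."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # outputs.hidden_states is a tuple of (num_layers + 1) tensors,
    # each of shape (batch, seq_len, hidden_dim).
    return outputs.hidden_states[layer][0]  # (seq_len, hidden_dim)

# Hypothetical span-annotated data: (text, per-token hallucination labels).
# Real labels must be aligned with the tokenizer's segmentation.
train_data = [
    ("The capital of France is Berlin.", [0, 0, 0, 0, 0, 1, 0]),
]

X, y = [], []
for text, labels in train_data:
    states = hidden_states_at_layer(text, layer=6)
    n = min(len(labels), states.shape[0])  # guard against misalignment
    X.extend(states[:n].numpy())
    y.extend(labels[:n])

# Linear probe: logistic regression over per-token hidden states.
probe = LogisticRegression(max_iter=1000).fit(X, y)
```

At inference time, such a probe would score each generated token's hidden state from the same layer, and high-scoring spans would be flagged as likely hallucinations.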
Dec-28-2023
- Country:
- Asia
- Japan (0.14)
- Middle East (0.14)
- Europe
- Belgium (0.14)
- Germany (0.14)
- Spain > Canary Islands (0.14)
- North America
- Canada (0.14)
- United States (0.14)
- Oceania > Australia (0.14)
- Genre:
- Research Report (1.00)
- Industry:
- Consumer Products & Services > Restaurants (0.68)
- Leisure & Entertainment > Sports (0.68)