INSIDE: LLMs' Internal States Retain the Power of Hallucination Detection

Chen, Chao, Liu, Kai, Chen, Ze, Gu, Yi, Wu, Yue, Tao, Mingyuan, Fu, Zhihang, Ye, Jieping

Feb-6-2024–arXiv.org Artificial Intelligence

Knowledge hallucination has raised widespread concerns about the security and reliability of deployed LLMs. Previous hallucination detection methods operate at the level of logit-based uncertainty estimation or language-level self-consistency evaluation, where semantic information is inevitably lost during the token decoding procedure. Thus, we propose to explore the dense semantic information retained within LLMs' INternal States for hallucInation DEtection (INSIDE). In particular, a simple yet effective EigenScore metric is proposed to better evaluate responses' self-consistency; it exploits the eigenvalues of the responses' covariance matrix to measure semantic consistency/diversity in the dense embedding space. Furthermore, from the perspective of self-consistent hallucination detection, a test-time feature clipping approach is explored to truncate extreme activations in the internal states, which reduces overconfident generations and potentially benefits the detection of overconfident hallucinations. Extensive experiments and ablation studies are performed on several popular LLMs and question-answering (QA) benchmarks, showing the effectiveness of our proposal.

Large Language Models (LLMs) have recently achieved a milestone breakthrough and demonstrated impressive abilities in various applications (Ouyang et al., 2022; OpenAI, 2023). However, it has been widely observed that even state-of-the-art LLMs often produce factually incorrect or nonsensical generations (Cohen et al., 2023; Ren et al., 2022; Kuhn et al., 2022), a phenomenon also known as knowledge hallucination (Ji et al., 2023). These potentially unreliable generations make it risky to deploy LLMs in practical scenarios.
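The two ideas named in the abstract, an eigenvalue-based consistency score and test-time feature clipping, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract only states that the eigenvalues of the responses' covariance matrix are used, so the specific aggregation (mean log-eigenvalue of a regularized covariance), the centering choice, the regularizer `alpha`, and the percentile-based thresholds drawn from a hypothetical `activation_bank` are all assumptions made for illustration. The function names `eigenscore` and `clip_features` are likewise hypothetical.

```python
import numpy as np


def eigenscore(embeddings, alpha=1e-3):
    """Illustrative consistency score over K sampled responses.

    embeddings: array of shape (K, d), one sentence embedding per response.
    Higher values suggest more semantic diversity among the responses
    (assumed here to indicate lower self-consistency).
    """
    z = np.asarray(embeddings, dtype=float)
    k = z.shape[0]
    # Assumption: center each embedding over its feature dimension.
    centered = z - z.mean(axis=1, keepdims=True)
    # K x K covariance (Gram) matrix between the responses.
    cov = centered @ centered.T
    # Assumption: add alpha * I so every eigenvalue is strictly positive.
    eigvals = np.linalg.eigvalsh(cov + alpha * np.eye(k))
    # Mean log-eigenvalue, i.e. (1/K) * log-determinant.
    return float(np.mean(np.log(eigvals)))


def clip_features(hidden, activation_bank, pct=0.2):
    """Truncate extreme activations to percentile-based thresholds.

    activation_bank: hypothetical sample of previously seen activations
    used only to pick the lower and upper clipping thresholds.
    """
    low, high = np.percentile(activation_bank, [pct, 100.0 - pct])
    return np.clip(hidden, low, high)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Five identical embeddings stand in for fully consistent responses.
    consistent = np.tile(rng.normal(size=(1, 16)), (5, 1))
    # Five unrelated embeddings stand in for inconsistent responses.
    diverse = rng.normal(size=(5, 16))
    print("consistent:", eigenscore(consistent))
    print("diverse:   ", eigenscore(diverse))
```

In this sketch, identical response embeddings yield a rank-one covariance matrix, so all but one eigenvalue collapse to roughly `alpha` and the score is low; semantically scattered responses spread the eigenvalue mass and raise the score.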

eigenscore, large language model, machine learning

Country:
- Asia > Middle East (0.68)
- North America > United States (0.70)

Genre:
- Research Report > New Finding (0.93)

Industry:
- Leisure & Entertainment (1.00)
- Media > Television (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks
    - Deep Learning (0.88)
  - Natural Language > Large Language Model (1.00)