Latent Space Chain-of-Embedding Enables Output-free LLM Self-Evaluation

Wang, Yiming, Zhang, Pei, Yang, Baosong, Wong, Derek F., Wang, Rui

arXiv.org Artificial Intelligence 

LLM self-evaluation relies on the LLM's own ability to estimate response correctness, which can greatly improve its deployment reliability. In this research track, we propose the Chain-of-Embedding (CoE) in the latent space to enable LLMs to perform output-free self-evaluation. CoE consists of all progressive hidden states produced during inference time, which can be treated as the latent thinking path of LLMs. We find that the CoE features of LLMs differ when they respond correctly versus incorrectly, and these discrepancies help us estimate LLM response correctness. Experiments across four diverse domains and seven LLMs fully demonstrate the effectiveness of our method. Meanwhile, its label-free design, which requires no training, and its millisecond-level computational cost ensure real-time feedback in large-scale scenarios. More importantly, we provide interesting insights into LLM response correctness from the perspective of hidden state changes inside LLMs.

Large Language Models (LLMs) have significantly enhanced their ability to generalize across diverse scenarios (Brown et al., 2020; Achiam et al., 2023; GLM et al., 2024). However, their outputs can sometimes be unstable, leading to incorrect responses that may threaten social safety. Therefore, label-free LLM self-evaluation -- estimating the correctness of LLM responses entirely through LLMs' own capabilities -- has emerged as a crucial research area. It can provide real-time response monitoring and feedback in large-scale deployments, enhancing the reliability of LLMs (Sun et al., 2024).

Popular self-evaluation research in the era of LLMs focuses more on output-based forms (Zhang et al., 2023). Two typical paradigms that do not assess the internal states of LLMs involve directly asking LLMs to express confidence in their responses through well-designed prompts (Lin et al., 2022a; Tian et al., 2023), and generating multiple responses by perturbing prompts (Gao et al., 2024) or decoding sampling (Wang et al., 2023) to calculate response consistency (Xiong et al., 2024). Beyond these two types, other methods largely draw on uncertainty estimation concepts from the era of deep neural networks, leveraging output logits or probability distributions to gauge the confidence of model responses (Malinin & Gales, 2020; Si et al., 2022; Huang et al., 2023; Kuhn et al., 2023).

Recently, some research has revealed that the latent space of LLMs contains a substantial amount of untapped hidden state information, which can largely reflect response correctness (Azaria & Mitchell, 2023; Liu et al., 2023; Duan et al., 2024) and is usually more interpretable than LLM output (Li et al., 2024a). However, these output-free approaches often require 0/1 correctness labels to train probing classifiers that extract features from hidden states (Burns et al., 2022; Sky et al., 2024; Su et al., 2024). This contradicts our goal of being "label-free" and limits generalization to unseen data.
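To make the notion of a "chain of embedding" concrete, the following is a minimal sketch of how one might collect the progressive per-layer hidden states for a prompt using the Hugging Face transformers API. The model name, mean pooling, and step-norm statistic are illustrative assumptions for exposition, not the paper's exact CoE features.

```python
# Sketch: collect per-layer hidden states (a "chain of embedding") for a prompt.
# Assumptions: Hugging Face transformers; "gpt2" is a placeholder causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM with accessible hidden states works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

prompt = "Q: What is the capital of France?\nA:"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.hidden_states is a tuple of (num_layers + 1) tensors, each of shape
# (batch, seq_len, hidden_dim): the embedding output followed by every layer's output.
# Mean-pool over the sequence to get one vector per layer -- a simple stand-in
# for the latent "thinking path" traced through the layers.
coe = torch.stack([h.mean(dim=1).squeeze(0) for h in outputs.hidden_states])
print(coe.shape)  # (num_layers + 1, hidden_dim)

# A label-free trajectory statistic (assumed here for illustration): how far the
# representation moves between consecutive layers. The paper's finding is that
# such trajectory features differ between correct and incorrect responses.
step_norms = (coe[1:] - coe[:-1]).norm(dim=-1)
print(step_norms)
```

Because this only reads hidden states already produced during the forward pass, the extra cost is a handful of vector operations per layer, which is consistent with the millisecond-level overhead claimed above.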