Itzhak, Itay
Trust Me, I'm Wrong: High-Certainty Hallucinations in LLMs
Simhi, Adi, Itzhak, Itay, Barez, Fazl, Stanovsky, Gabriel, Belinkov, Yonatan
Large Language Models (LLMs) often generate outputs that lack grounding in real-world facts, a phenomenon known as hallucinations. Prior research has associated hallucinations with model uncertainty, leveraging this relationship for hallucination detection and mitigation. In this paper, we challenge the underlying assumption that all hallucinations are associated with uncertainty. Using knowledge detection and uncertainty measurement methods, we demonstrate that models can hallucinate with high certainty even when they have the correct knowledge. We further show that high-certainty hallucinations are consistent across models and datasets, distinctive enough to be singled out, and challenge existing mitigation methods. Our findings reveal an overlooked aspect of hallucinations, emphasizing the need to understand their origins and improve mitigation.
Figure 1: Do high-certainty hallucinations exist? An illustrative categorization of hallucinations based on a model's knowledge and certainty. Highlighted is the phenomenon of high-certainty hallucinations (purple) - where models confidently produce incorrect outputs, even when they have the correct knowledge. While other types of hallucinations can potentially be explained by the model not knowing, being mistaken, or uncertain, high-certainty hallucinations are harder to rationalize, making their existence particularly intriguing.
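A minimal sketch of the kind of certainty measurement this line of work relies on: scoring an answer by the mean probability the model assigns to its tokens, so that a factually wrong answer with a high score would correspond to a high-certainty hallucination. This is a generic proxy, not the paper's detection or knowledge-probing pipeline, and the model, prompt, and answer below are illustrative placeholders.

```python
# Sketch: average token probability of an answer continuation as a certainty proxy.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper studies larger LLMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def answer_certainty(prompt: str, answer: str) -> float:
    """Mean probability the model assigns to each token of `answer` given `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    answer_ids = tokenizer(answer, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, answer_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Logits at position i predict token i+1, so slice out the answer span.
    start = prompt_ids.shape[1] - 1
    answer_logits = logits[0, start : start + answer_ids.shape[1]]
    probs = torch.softmax(answer_logits, dim=-1)
    token_probs = probs[torch.arange(answer_ids.shape[1]), answer_ids[0]]
    return token_probs.mean().item()

print(answer_certainty("The capital of France is", " Paris"))
```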
Instructed to Bias: Instruction-Tuned Language Models Exhibit Emergent Cognitive Bias
Itzhak, Itay, Stanovsky, Gabriel, Rosenfeld, Nir, Belinkov, Yonatan
Recent studies show that instruction tuning and learning from human feedback improve the abilities of large language models (LMs) dramatically. While these tuning methods can make models generate high-quality text, we conjecture that more implicit cognitive biases may arise in these fine-tuned models. Our work provides evidence that these fine-tuned models exhibit biases that were absent or less pronounced in their pretrained predecessors. We examine the extent of this phenomenon in three cognitive biases - the decoy effect, the certainty effect, and the belief bias - all of which are known to influence human decision-making and reasoning. Our findings highlight the presence of these biases in various models, especially those that have undergone instruction tuning, such as Flan-T5, GPT3.5, and GPT4. This research constitutes a step toward comprehending cognitive biases in instruction-tuned LMs, which is crucial for the development of more reliable and unbiased language models.
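A hypothetical sketch of how a decoy-effect probe could be run against a causal LM: compare the model's preference between two options with and without an asymmetrically dominated decoy. The prompts, model, and scoring method below are assumptions for illustration, not the paper's actual stimuli or evaluation protocol.

```python
# Sketch: does adding a dominated decoy option (C) shift preference toward the target (A)?
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def option_prob(prompt: str, option: str) -> float:
    """Next-token probability of a single-character option letter after `prompt`."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    probs = torch.softmax(logits, dim=-1)
    option_id = tokenizer.encode(option)[0]  # "A" and "B" are single tokens here
    return probs[option_id].item()

base = ("Choose a laptop.\n(A) $900, 14-hour battery\n(B) $650, 9-hour battery\n"
        "Answer: (")
decoy = ("Choose a laptop.\n(A) $900, 14-hour battery\n(B) $650, 9-hour battery\n"
         "(C) $950, 13-hour battery\nAnswer: (")  # (C) is dominated by (A)

for prompt, label in [(base, "no decoy"), (decoy, "with decoy")]:
    p_a, p_b = option_prob(prompt, "A"), option_prob(prompt, "B")
    print(f"{label}: P(A)={p_a:.4f}  P(B)={p_b:.4f}")
```

A human-like decoy effect would show up as a larger relative preference for option (A) in the decoy condition than in the base condition.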
Models In a Spelling Bee: Language Models Implicitly Learn the Character Composition of Tokens
Itzhak, Itay, Levy, Omer
Standard pretrained language models operate on sequences of subword tokens without direct access to the characters that compose each token's string representation. We probe the embedding layer of pretrained language models and show that models learn the internal character composition of whole-word and subword tokens to a surprising extent, without ever seeing the characters coupled with the tokens. Our results show that the embedding layer of RoBERTa holds enough information to accurately spell up to a third of the vocabulary and reach high average character n-gram overlap on all token types. We further test whether enriching subword models with additional character information can improve language modeling, and observe that this method has a learning curve nearly identical to training without spelling-based enrichment. Overall, our results suggest that language modeling objectives incentivize the model to implicitly learn some notion of spelling, and that explicitly teaching the model how to spell does not enhance its performance on such tasks.
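A rough sketch in the spirit of, but much simpler than, the paper's probing setup: train a linear probe on RoBERTa's static input embeddings to predict whether a given character appears in each token's string. The choice of character ("e") and of classifier are assumptions made for illustration, not the paper's spelling or n-gram experiments.

```python
# Sketch: linear probe over RoBERTa's embedding layer for character presence.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")
embeddings = model.get_input_embeddings().weight.detach().numpy()

# Build (embedding, label) pairs: label = does the token string contain "e"?
vocab = tokenizer.get_vocab()  # maps token string -> id
tokens, ids = zip(*vocab.items())
X = embeddings[list(ids)]
y = np.array(["e" in t.lower() for t in tokens])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))
print("majority baseline:", max(y_te.mean(), 1 - y_te.mean()))
```

If the probe beats the majority-class baseline, the static embeddings carry some character-level information despite the model never observing characters paired with tokens.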