Lost-in-Distance: Impact of Contextual Proximity on LLM Performance in Graph Tasks

Hamed Firooz, Maziar Sanjabi, Wenlong Jiang, Xiaoling Zhai

arXiv.org Artificial Intelligence 

Despite significant advancements, Large Language Models (LLMs) exhibit blind spots that impair their ability to retrieve and process relevant contextual data effectively. We demonstrate that LLM performance in graph tasks with complexities beyond the "needle-in-a-haystack" scenario—where solving the problem requires cross-referencing and reasoning across multiple subproblems jointly—is influenced by the proximity of relevant information within the context, a phenomenon we term "lost-in-distance". We examine two fundamental graph tasks: identifying common connections between two nodes and assessing similarity among three nodes, and show that the model's performance in these tasks significantly depends on the relative positioning of common edges. We evaluate three publicly available LLMs using various graph encoding techniques that represent graph structures for LLM input. We propose a formulation for the lost-in-distance phenomenon and demonstrate that lost-in-distance and lost-in-the-middle phenomena occur independently. Results indicate that model accuracy can decline by up to 6x as the distance between node connections increases, independent of graph encoding and model size.

Large Language Models (LLMs) have attained an unprecedented level of generality by leveraging scale and attention-based architectures (Kaplan et al., 2020; Vaswani et al., 2017). Additionally, LLMs are increasingly serving as essential and flexible building blocks for various user-facing machine learning and artificial intelligence applications beyond traditional language processing domains, such as recommendation systems (Geng et al., 2022), graph-related tasks (Wang et al., 2024), and knowledge bases (AlKhamissi et al., 2022; Petroni et al., 2019). These applications highlight the versatility of LLMs but also expose new challenges in handling domain-specific data encoded as textual input.
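To make the "common connections" task concrete, the following is a minimal sketch of how such an evaluation instance can be set up: a graph is serialized as text for the LLM prompt, and a ground-truth answer is computed programmatically. The edge-list encoding, node labels, and helper functions here are illustrative assumptions, not the paper's exact experimental setup.

```python
# Illustrative sketch of the "common connections" task. The textual
# encoding below is one simple way to represent a graph for an LLM
# prompt; the ground truth is the set of shared neighbors of two nodes.

def encode_edge_list(edges):
    """Render an undirected graph as a plain-text edge list for a prompt."""
    return "\n".join(f"Node {u} is connected to node {v}." for u, v in edges)

def common_connections(edges, a, b):
    """Ground-truth answer: neighbors shared by nodes a and b."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    return sorted(neighbors.get(a, set()) & neighbors.get(b, set()))

# Toy graph: nodes 0 and 1 share neighbors 2 and 3.
edges = [(0, 2), (1, 2), (0, 3), (1, 3), (0, 4)]
prompt = encode_edge_list(edges)
print(common_connections(edges, 0, 1))  # -> [2, 3]
```

Under the lost-in-distance hypothesis, the key manipulated variable is how far apart the lines mentioning the shared edges (e.g., "0–2" and "1–2") sit within the serialized prompt; the combinatorial answer itself does not change.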