How Do LLMs Use Their Depth?
Akshat Gupta, Jay Yeung, Gopala Anumanchipalli, Anna Ivanova
–arXiv.org Artificial Intelligence
Growing evidence suggests that large language models do not use their depth uniformly, yet we still lack a fine-grained understanding of their layer-wise prediction dynamics. In this paper, we trace the intermediate representations of several open-weight models during inference and reveal a structured and nuanced use of depth. Specifically, we propose a "Guess-then-Refine" framework that explains how LLMs internally structure their computations to make predictions. We first show that the top-ranked predictions in early LLM layers are composed primarily of high-frequency tokens, which act as statistical guesses proposed by the model early on due to the lack of appropriate contextual information. As contextual information develops deeper into the model, these initial guesses get refined into contextually appropriate tokens. Even high-frequency token predictions from early layers get refined more than 70% of the time, indicating that correct token prediction is not "one-and-done". We then go beyond frequency-based prediction to examine the dynamic usage of layer depth across three case studies. Together, our results provide a detailed view of depth usage in LLMs, shedding light on the layer-by-layer computations that underlie successful predictions and providing insights for future work on improving computational efficiency in transformer-based models.

Despite the remarkable performance of large language models (LLMs), their internal computations remain poorly understood. One critical question is: how do LLMs internally structure their computations during inference and use their depth, layer by layer, to arrive at predictions? Are specific token predictions always computed at the last layer, or does the model settle on predictable tokens early on and simply propagate these predictions? These questions have implications both for interpreting the internal computations of these models and for building more efficient LLMs that can use their compute dynamically.
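The abstract describes tracing each layer's intermediate representation to see where the top-ranked prediction settles. A minimal sketch of this kind of analysis, in the spirit of logit-lens-style probing, is shown below; it is an illustration under assumed shapes (per-layer hidden states and an unembedding matrix), not the authors' exact method, and the function names are hypothetical.

```python
import numpy as np

def layerwise_top1(hidden_states, W_U):
    """Decode each layer's hidden state for one token position through the
    unembedding matrix and return the top-1 predicted token id per layer.
    hidden_states: (num_layers, d_model); W_U: (d_model, vocab_size).
    (Assumed interface; real pipelines also apply the final layer norm.)"""
    logits = hidden_states @ W_U          # (num_layers, vocab_size)
    return logits.argmax(axis=-1)         # top-1 token id at every layer

def refinement_layer(top1_per_layer):
    """Earliest layer from which the top-1 prediction already equals the
    final layer's prediction and never changes again -- a simple proxy for
    where the 'guess' has been refined into the final answer."""
    final = top1_per_layer[-1]
    for i in range(len(top1_per_layer)):
        if all(t == final for t in top1_per_layer[i:]):
            return i
    return len(top1_per_layer) - 1
```

On a toy example where early layers favor one token and later layers switch to another, `refinement_layer` returns the depth at which the prediction stabilizes, mirroring the paper's observation that early guesses are frequently overwritten deeper in the network.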
Oct-22-2025