Perivolaropoulos, Christos
Round and Round We Go! What makes Rotary Positional Encodings useful?
Barbero, Federico, Vitvitskyi, Alex, Perivolaropoulos, Christos, Pascanu, Razvan, Veličković, Petar
Positional Encodings (PEs) are a critical component of Transformer-based Large Language Models (LLMs), providing the attention mechanism with important sequence-position information. One of the most popular types of encoding used today in LLMs is Rotary Positional Encodings (RoPE), which rotates the queries and keys based on their relative distance. A common belief is that RoPE is useful because it helps to decay token dependency as relative distance increases. In this work, we argue that this is unlikely to be the core reason. We study the internals of a trained Gemma 7B model to understand how RoPE is being used at a mechanical level. We find that Gemma learns to use RoPE to construct robust 'positional' attention patterns by exploiting the highest frequencies. We also find that, in general, Gemma greatly prefers to use the lowest frequencies of RoPE, which we suspect are used to carry semantic information. We mathematically prove interesting behaviours of RoPE and conduct experiments to verify our findings, proposing a modification of RoPE that fixes some highlighted issues and improves performance. We believe that this work represents an interesting step towards better understanding PEs in LLMs, which holds crucial value for scaling LLMs to large sizes and context lengths.

It is common to provide positional information to the attention mechanism in Transformers through the use of absolute positional encodings (Vaswani et al., 2017), relative positional encodings (Su et al., 2024), or by introducing a bias directly to the activations (Press et al., 2021). One of the currently most widely adopted encodings, especially in Large Language Models (LLMs), is Rotary Positional Encodings (RoPE) (Su et al., 2024), used in popular models such as Llama 3 (Dubey et al., 2024) and Gemma (Gemma Team et al., 2024). The method can be implemented efficiently and provides an interesting geometric approach to positional encodings.
Despite the significant adoption of RoPE, the specific reasons why this method is useful to Transformer models remain poorly understood. One of the main arguments in favour of RoPE made by Su et al. (2024) is that the method helps to decay attention coefficients as the relative distance grows. Most such claims, however, rely on the queries and keys being constant vectors, which is uncommon in practice. In fact, in this work we find that there are many situations in which this decay does not occur, and that this is exploited at times by attention heads in Gemma 7B (Gemma Team et al., 2024).
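The core mechanism described above, rotating query and key vectors by position-dependent angles so that their dot product depends only on relative distance, can be sketched as follows. This is a minimal NumPy illustration, not the implementation used in the paper; the function name `rope_rotate` and the base of 10000 (the common default) are assumptions here.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Apply a RoPE-style rotation to a vector of even dimension d.

    Each pair of dimensions (2i, 2i+1) is rotated by the angle
    pos * base**(-2i/d); small i gives the highest frequencies,
    large i the lowest (slowest-rotating) ones.
    """
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)  # one frequency per pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x, dtype=float)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Relative-position property: the inner product of a rotated query at
# position m and a rotated key at position n depends only on m - n.
rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
a = rope_rotate(q, 5) @ rope_rotate(k, 2)    # positions (5, 2), offset 3
b = rope_rotate(q, 13) @ rope_rotate(k, 10)  # positions (13, 10), offset 3
assert np.allclose(a, b)
```

The final assertion checks the property that motivates RoPE: because each pair is rotated by an orthogonal matrix, the attention score between a query and a key is a function of their relative distance alone, while vector norms are preserved.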
softmax is not enough (for sharp out-of-distribution)
Veličković, Petar, Perivolaropoulos, Christos, Barbero, Federico, Pascanu, Razvan
A key property of reasoning systems is the ability to make sharp decisions on their input data. For contemporary AI systems, a key carrier of sharp behaviour is the softmax function, with its capability to perform differentiable query-key lookups. It is a common belief that the predictive power of networks leveraging softmax arises from "circuits" which sharply perform certain kinds of computations consistently across many diverse inputs. However, for these circuits to be robust, they would need to generalise well to arbitrary valid inputs. In this paper, we dispel this myth: even for tasks as simple as finding the maximum key, any learned circuitry must disperse as the number of items grows at test time. We attribute this to a fundamental limitation of the softmax function to robustly approximate sharp functions, prove this phenomenon theoretically, and propose adaptive temperature as an ad-hoc technique for improving the sharpness of softmax at inference time.
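The dispersion phenomenon above is easy to see numerically: if one logit exceeds all others by a fixed gap, the softmax weight on that maximum item is e^gap / (e^gap + n - 1), which shrinks towards zero as n grows. A small sketch, with lowering the temperature at inference standing in for the paper's adaptive-temperature idea (the specific schedule here is an assumption, not the paper's method):

```python
import numpy as np

def softmax(z, temp=1.0):
    """Numerically stable softmax with a temperature parameter."""
    z = np.asarray(z, dtype=float) / temp
    e = np.exp(z - z.max())
    return e / e.sum()

# One item has a logit `gap` above the rest. As the number of items n
# grows, the weight on that max item decays -- the "circuit" disperses.
gap = 2.0
for n in [10, 100, 1000, 10000]:
    logits = np.zeros(n)
    logits[0] = gap
    print(n, softmax(logits)[0])

# Sharpening at inference time: a lower temperature restores most of the
# mass to the max item without retraining.
logits = np.zeros(1000)
logits[0] = gap
print("temp=1.0 :", softmax(logits, temp=1.0)[0])
print("temp=0.25:", softmax(logits, temp=0.25)[0])
```

With gap = 2 and n = 1000, the max item receives under 1% of the mass at temperature 1, but roughly three quarters of it at temperature 0.25, which is the sense in which temperature adaptation can re-sharpen softmax out of distribution.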