Scaling Laws of RoPE-based Extrapolation

Xiaoran Liu, Hang Yan, Shuo Zhang, Chenxin An, Xipeng Qiu, Dahua Lin

arXiv.org Artificial Intelligence 

The extrapolation capability of Large Language Models (LLMs) based on Rotary Position Embedding (Su et al., 2021) is currently a topic of considerable interest. In this work, we first observe that fine-tuning a RoPE-based LLM with either a smaller or larger base, while keeping the pre-training context length, can significantly enhance its extrapolation performance. We then propose the Scaling Laws of RoPE-based Extrapolation, a unified framework from the periodic perspective, to describe the relationship between extrapolation performance, the base value, and the tuning context length. In the process, we also explain the origin of the RoPE-based extrapolation issue via the critical dimension for extrapolation. Beyond these observations and analyses, we achieve extrapolation up to a context length of 1 million tokens with only 16K training length on LLaMA2 7B and 13B (Touvron et al., 2023b).

Large Language Models (LLMs) have become the dominant architecture in a variety of natural language processing tasks (OpenAI, 2023; Touvron et al., 2023a;b), while Transformers (Vaswani et al., 2017) based on Rotary Position Embedding (RoPE) (Su et al., 2021) have become the dominant backbone in a wide range of LLM designs (Chowdhery et al., 2022; Nijkamp et al., 2022; Touvron et al., 2023a;b). Although RoPE can theoretically represent sequences through trigonometric functions, as detailed in Appendix A, its performance drops when the input sequence, or context length, surpasses the training length (Press et al., 2021; Chen et al., 2023), as seen in Figure 1.

Figure 1: RoPE fine-tuned with either a smaller or larger base, on the original training length of 4K or a much longer context of 16K, outperforms other extrapolation strategies and extrapolates to 100K context length.

Concerning the extrapolation issue with RoPE, different works have offered various interpretations and corresponding attempted solutions. These works can be divided into two schools of thought. One limits the scope of self-attention (Ratner et al., 2022; Han et al., 2023), given that self-attention computations in RoPE fail to remain stable beyond the training context and exhibit irregular attention patterns. The other aims to capture longer contexts by using smaller rotation angles and a longer fine-tuning context (Chen et al., 2023; Peng et al., 2023). Currently, popular methods such as Dynamic NTK (LocalLLaMA, 2023a) and Code LLaMA (Rozière et al., 2023) mainly come from the second approach. Both of these methods adapt RoPE to longer contexts with a larger rotary base.
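Because the discussion above turns on how the rotary base sets RoPE's per-dimension rotation angles, and hence the period of each dimension, the following minimal sketch may help make the periodic perspective concrete. It is not the authors' implementation: the function names (rope_angles, apply_rope), the head dimension of 128, and the sample base values are illustrative assumptions; only the standard RoPE angle formula θ_i = base^(−2i/d) is taken as given.

```python
# Minimal RoPE sketch (illustrative, not the paper's code), using the
# interleaved-pair convention of the original RoPE formulation.
import numpy as np

def rope_angles(head_dim: int, base: float = 10000.0) -> np.ndarray:
    """Per-pair rotation frequencies theta_i = base**(-2i/d), i = 0..d/2-1."""
    i = np.arange(head_dim // 2)
    return base ** (-2.0 * i / head_dim)

def apply_rope(x: np.ndarray, positions: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotate each feature pair (x_{2i}, x_{2i+1}) at position m by angle m * theta_i."""
    theta = rope_angles(x.shape[-1], base)          # (d/2,)
    ang = positions[:, None] * theta[None, :]       # (seq, d/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Example: rotate 8 query/key vectors with head_dim = 128 (hypothetical sizes).
x = np.random.randn(8, 128)
x_rot = apply_rope(x, np.arange(8))

# Periodic view: pair i completes a full rotation every 2 * pi * base**(2i/d)
# positions. Enlarging the base stretches these periods (smaller rotation
# angles per position), which is how NTK-style scaling and Code LLaMA adapt
# RoPE to longer contexts; the paper also studies the smaller-base regime.
for base in (500.0, 10000.0, 1000000.0):
    periods = 2 * np.pi / rope_angles(128, base)
    print(f"base={base:>9.0f}: longest period = {periods[-1]:,.0f} positions")
```

On this periodic view, dimensions whose period exceeds the training length never complete a full rotation during training, which is essentially the intuition behind the critical dimension mentioned in the abstract.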
