Han, Xiaodong
MiniMax-01: Scaling Foundation Models with Lightning Attention
MiniMax, Li, Aonian, Gong, Bangwei, Yang, Bo, Shan, Boji, Liu, Chang, Zhu, Cheng, Zhang, Chunhao, Guo, Congchao, Chen, Da, Li, Dong, Jiao, Enwei, Li, Gengxin, Zhang, Guojun, Sun, Haohai, Dong, Houze, Zhu, Jiadai, Zhuang, Jiaqi, Song, Jiayuan, Zhu, Jin, Han, Jingtao, Li, Jingyang, Xie, Junbin, Xu, Junhao, Yan, Junjie, Zhang, Kaishun, Xiao, Kecheng, Kang, Kexi, Han, Le, Wang, Leyang, Yu, Lianfei, Feng, Liheng, Zheng, Lin, Chai, Linbo, Xing, Long, Ju, Meizhi, Chi, Mingyuan, Zhang, Mozhi, Huang, Peikai, Niu, Pengcheng, Li, Pengfei, Zhao, Pengyu, Yang, Qi, Xu, Qidi, Wang, Qiexiang, Wang, Qin, Li, Qiuhui, Leng, Ruitao, Shi, Shengmin, Yu, Shuqi, Li, Sichen, Zhu, Songquan, Huang, Tao, Liang, Tianrun, Sun, Weigao, Sun, Weixuan, Cheng, Weiyu, Li, Wenkai, Song, Xiangjun, Su, Xiao, Han, Xiaodong, Zhang, Xinjie, Hou, Xinzhu, Min, Xu, Zou, Xun, Shen, Xuyang, Gong, Yan, Zhu, Yingjie, Zhou, Yipeng, Zhong, Yiran, Hu, Yongyi, Fan, Yuanxiang, Yu, Yue, Yang, Yufeng, Li, Yuhao, Huang, Yunan, Li, Yunji, Huang, Yunpeng, Xu, Yunzhi, Mao, Yuxin, Li, Zehan, Li, Zekang, Tao, Zewei, Ying, Zewen, Cong, Zhaoyang, Qin, Zhen, Fan, Zhenhua, Yu, Zhihang, Jiang, Zhuo, Wu, Zijia
We introduce the MiniMax-01 series, comprising MiniMax-Text-01 and MiniMax-VL-01, which are comparable to top-tier models while offering superior capabilities for processing longer contexts. The core of these models is lightning attention and its efficient scaling. To maximize computational capacity, we integrate it with Mixture of Experts (MoE), creating a model with 32 experts and 456 billion total parameters, of which 45.9 billion are activated for each token. We develop an optimized parallel strategy and highly efficient computation-communication overlap techniques for MoE and lightning attention. This approach enables efficient training and inference for models with hundreds of billions of parameters across contexts spanning millions of tokens. The context window of MiniMax-Text-01 reaches up to 1 million tokens during training and extrapolates to 4 million tokens during inference at an affordable cost. Our vision-language model, MiniMax-VL-01, is built through continued training with 512 billion vision-language tokens. Experiments on both standard and in-house benchmarks show that our models match the performance of state-of-the-art models such as GPT-4o and Claude-3.5-Sonnet while offering a 20-32 times longer context window. We publicly release MiniMax-01 at https://github.com/MiniMax-AI.
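Below is a minimal, illustrative sketch of the sparse Mixture-of-Experts routing pattern the abstract describes, in which only a small subset of experts runs for each token. It is not MiniMax-01's implementation; the hidden sizes, the top-k value, and the class name are assumptions chosen for brevity, and lightning attention itself is sketched separately under the TransNormerLLM entry below.

```python
# Hedged sketch of top-k Mixture-of-Experts routing (PyTorch).
# Hyperparameters are illustrative only; they are not the MiniMax-01
# configuration beyond the 32-expert count quoted in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=1024, d_ff=4096, num_experts=32, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                       # x: (tokens, d_model)
        scores = self.router(x)                 # (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # normalize over the chosen experts only
        out = torch.zeros_like(x)
        for slot in range(self.top_k):          # only top_k experts run per token
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out

tokens = torch.randn(8, 1024)
print(TopKMoE()(tokens).shape)  # torch.Size([8, 1024])
```

Routing only top_k of the experts per token is what keeps the activated parameter count (45.9 billion) far below the total parameter count (456 billion).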
Scaling TransNormer to 175 Billion Parameters
Qin, Zhen, Li, Dong, Sun, Weigao, Sun, Weixuan, Shen, Xuyang, Han, Xiaodong, Wei, Yunshen, Lv, Baohong, Yuan, Fei, Luo, Xiao, Qiao, Yu, Zhong, Yiran
We present TransNormerLLM, the first linear attention-based Large Language Model (LLM) to outperform conventional softmax attention-based models in both accuracy and efficiency. TransNormerLLM evolves from the previous linear attention architecture TransNormer through advanced modifications that include positional embedding, linear attention acceleration, a gating mechanism, tensor normalization, and inference acceleration and stabilization. Specifically, we use LRPE together with an exponential decay to avoid attention dilution while allowing the model to retain global interactions between tokens. Additionally, we propose Lightning Attention, a cutting-edge technique that more than doubles the runtime speed of linear attention and reduces its memory usage by a factor of four. To further enhance performance, we leverage a gating mechanism to smooth training and a new tensor normalization scheme to accelerate the model, resulting in an acceleration of over 20%. Furthermore, we develop a robust inference algorithm that ensures numerical stability and consistent inference speed regardless of sequence length, showcasing superior efficiency during both training and inference. Scalability is at the heart of the model's design, enabling seamless deployment on large-scale clusters and expansion to even larger models while maintaining strong performance. We validate the model design through a series of comprehensive experiments on our self-collected corpus, which exceeds 6TB in size and contains over 2 trillion tokens. To ensure data quality and relevance, we implement a new self-cleaning strategy to filter the collected data. Our pre-trained models will be released to foster community advances in efficient LLMs.
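The recurrence below is a clarity-first sketch of what linear attention with exponential decay computes: a running key-value state is decayed and updated once per token, so the overall cost is linear in sequence length. Lightning Attention's contribution, per the abstract, is a much faster tiled implementation of this computation, which is not reproduced here; the decay value and tensor shapes are illustrative assumptions.

```python
# Recurrent form of linear attention with exponential decay,
# written for clarity rather than speed.
import torch

def decayed_linear_attention(q, k, v, decay=0.99):
    # q, k: (seq, d_k); v: (seq, d_v). The state has shape (d_k, d_v),
    # so the cost per step is O(d_k * d_v) and linear in sequence length overall.
    d_k, d_v = q.shape[-1], v.shape[-1]
    state = torch.zeros(d_k, d_v, dtype=q.dtype)
    outputs = []
    for t in range(q.shape[0]):
        state = decay * state + torch.outer(k[t], v[t])  # fold in the new token
        outputs.append(q[t] @ state)                     # read out with the query
    return torch.stack(outputs)

q, k, v = torch.randn(16, 64), torch.randn(16, 64), torch.randn(16, 128)
print(decayed_linear_attention(q, k, v).shape)  # torch.Size([16, 128])
```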
Linearized Relative Positional Encoding
Qin, Zhen, Sun, Weixuan, Lu, Kaiyue, Deng, Hui, Li, Dongxu, Han, Xiaodong, Dai, Yuchao, Kong, Lingpeng, Zhong, Yiran
Relative positional encoding is widely used in vanilla and linear transformers to represent positional information. However, existing encoding methods for vanilla transformers are not always directly applicable to linear transformers, because the latter require a decomposition of the query and key representations into separate kernel functions. Principles for designing encoding methods suitable for linear transformers remain understudied. In this work, we unify a variety of existing linear relative positional encoding approaches under a canonical form and further propose a family of linear relative positional encoding algorithms via unitary transformation. Our formulation leads to a principled framework for developing new relative positional encoding methods that preserve linear space-time complexity. Equipped with different models, the proposed linearized relative positional encoding (LRPE) family yields effective encodings for various applications. Experiments show that, compared with existing methods, LRPE achieves state-of-the-art performance in language modeling, text classification, and image classification. It also highlights a general paradigm for designing a broader range of relative positional encoding methods applicable to linear transformers. The code is available at https://github.com/OpenNLPLab/Lrpe.
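As a hedged illustration of the unitary-transformation idea, the sketch below applies a position-dependent complex rotation to queries and keys, one simple member of this kind of family rather than the paper's general construction. It checks that the resulting score depends only on the relative offset between positions, which is what lets such an encoding coexist with the kernel decomposition required by linear attention. The frequencies and dimensions are placeholder assumptions.

```python
# One instance of a unitary positional map: rotate each channel of q and k by a
# position-dependent phase, so the (real part of the) inner product depends only
# on the relative position m - n. Shapes and frequencies are illustrative.
import torch

def unitary_rotation(x, positions, freqs):
    # x: (seq, d) viewed as d complex channels; channel j at position m gets phase m * theta_j.
    phase = positions[:, None] * freqs[None, :]                     # (seq, d)
    return torch.polar(torch.ones_like(phase), phase) * x.to(torch.complex64)

seq, d = 8, 4
freqs = torch.rand(d)
pos = torch.arange(seq, dtype=torch.float32)
q, k = torch.randn(seq, d), torch.randn(seq, d)

qr, kr = unitary_rotation(q, pos, freqs), unitary_rotation(k, pos, freqs)
m, n = 5, 2
score = (qr[m] * kr[n].conj()).sum().real

# Shifting both positions by the same amount leaves the score unchanged,
# i.e. the encoding is purely relative.
qr2, kr2 = unitary_rotation(q, pos + 3, freqs), unitary_rotation(k, pos + 3, freqs)
score_shifted = (qr2[m] * kr2[n].conj()).sum().real
print(torch.allclose(score, score_shifted, atol=1e-5))  # True
```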
Toeplitz Neural Network for Sequence Modeling
Qin, Zhen, Han, Xiaodong, Sun, Weixuan, He, Bowen, Li, Dong, Li, Dongxu, Dai, Yuchao, Kong, Lingpeng, Zhong, Yiran
Sequence modeling has important applications in natural language processing and computer vision. Recently, transformer-based models have shown strong performance on various sequence modeling tasks, relying on attention to capture pairwise token relations and on position embedding to inject positional information. Despite their good performance, transformer models are inefficient to scale to long input sequences, mainly due to the quadratic space-time complexity of attention. To overcome this inefficiency, we propose to model sequences with a relative-position-encoded Toeplitz matrix and use a Toeplitz matrix-vector product trick to reduce the space-time complexity of sequence modeling to log-linear. A lightweight sub-network called the relative position encoder generates relative position coefficients with a fixed parameter budget, enabling the proposed Toeplitz neural network (TNN) to handle varying sequence lengths. In addition, despite being trained on 512-token sequences, our model can extrapolate to input sequences of up to 14K tokens at inference with consistent performance. Extensive experiments on autoregressive and bidirectional language modeling, image modeling, and the challenging Long-Range Arena benchmark show that our method outperforms its competitors on most downstream tasks while being significantly faster. The code is available at https://github.com/OpenNLPLab/Tnn.
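The following sketch illustrates the Toeplitz matrix-vector product trick the abstract refers to: the Toeplitz matrix of relative-position coefficients is embedded in a circulant matrix and applied with FFTs in O(n log n). The coefficients here are random placeholders; in TNN they come from the learned relative position encoder, which is not reproduced in this sketch.

```python
# Toeplitz matrix-vector product in O(n log n) via circulant embedding and FFT.
# Coefficients t_{-(n-1)}..t_{n-1} are random placeholders for illustration.
import torch

def toeplitz_matvec_fft(t_neg, t_pos, x):
    # t_pos = [t_0, t_1, ..., t_{n-1}] (lower-triangular side),
    # t_neg = [t_{-1}, ..., t_{-(n-1)}] (upper-triangular side).
    n = x.shape[0]
    # First column of the 2n x 2n circulant embedding:
    # [t_0, ..., t_{n-1}, 0, t_{-(n-1)}, ..., t_{-1}]
    c = torch.cat([t_pos, torch.zeros(1), t_neg.flip(0)])
    x_pad = torch.cat([x, torch.zeros(n)])
    y = torch.fft.irfft(torch.fft.rfft(c) * torch.fft.rfft(x_pad), n=2 * n)
    return y[:n]

n = 6
t_pos, t_neg, x = torch.randn(n), torch.randn(n - 1), torch.randn(n)

# Reference: build the dense Toeplitz matrix T[i, j] = t_{i - j} and multiply directly.
T = torch.empty(n, n)
for i in range(n):
    for j in range(n):
        T[i, j] = t_pos[i - j] if i >= j else t_neg[j - i - 1]
print(torch.allclose(T @ x, toeplitz_matvec_fft(t_neg, t_pos, x), atol=1e-5))  # True
```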