lightning attention
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
MiniMax: Chen, Aili, Li, Aonian, Gong, Bangwei, Jiang, Binyang, Fei, Bo, Yang, Bo, Shan, Boji, Yu, Changqing, Wang, Chao, Zhu, Cheng, Xiao, Chengjun, Du, Chengyu, Zhang, Chi, Qiao, Chu, Zhang, Chunhao, Du, Chunhui, Guo, Congchao, Chen, Da, Ding, Deming, Sun, Dianjun, Li, Dong, Jiao, Enwei, Zhou, Haigang, Zhang, Haimo, Ding, Han, Sun, Haohai, Feng, Haoyu, Cai, Huaiguang, Zhu, Haichao, Sun, Jian, Zhuang, Jiaqi, Cai, Jiaren, Song, Jiayuan, Zhu, Jin, Li, Jingyang, Tian, Jinhao, Liu, Jinli, Xu, Junhao, Yan, Junjie, Liu, Junteng, He, Junxian, Feng, Kaiyi, Yang, Ke, Xiao, Kecheng, Han, Le, Wang, Leyang, Yu, Lianfei, Feng, Liheng, Li, Lin, Zheng, Lin, Du, Linge, Yang, Lingyu, Zeng, Lunbin, Yu, Minghui, Tao, Mingliang, Chi, Mingyuan, Zhang, Mozhi, Lin, Mujie, Hu, Nan, Di, Nongyu, Gao, Peng, Li, Pengfei, Zhao, Pengyu, Ren, Qibing, Xu, Qidi, Li, Qile, Wang, Qin, Tian, Rong, Leng, Ruitao, Chen, Shaoxiang, Chen, Shaoyu, Shi, Shengmin, Weng, Shitong, Guan, Shuchang, Yu, Shuqi, Li, Sichen, Zhu, Songquan, Li, Tengfei, Cai, Tianchi, Liang, Tianrun, Cheng, Weiyu, Kong, Weize, Li, Wenkai, Chen, Xiancai, Song, Xiangjun, Luo, Xiao, Su, Xiao, Li, Xiaobo, Han, Xiaodong, Hou, Xinzhu, Lu, Xuan, Zou, Xun, Shen, Xuyang, Gong, Yan, Ma, Yan, Wang, Yang, Shi, Yiqi, Zhong, Yiran, Duan, Yonghong, Fu, Yongxiang, Hu, Yongyi, Gao, Yu, Fan, Yuanxiang, Yang, Yufeng, Li, Yuhao, Hu, Yulin, Huang, Yunan, Li, Yunji, Xu, Yunzhi, Mao, Yuxin, Shi, Yuxuan, Wenren, Yuze, Li, Zehan, Li, Zelin, Tian, Zhanxu, Zhu, Zhengmao, Fan, Zhenhua, Wu, Zhenzhen, Xu, Zhichao, Yu, Zhihang, Lyu, Zhiheng, Jiang, Zhuo, Gao, Zibo, Wu, Zijia, Song, Zijian, Sun, Zijun
We introduce MiniMax-M1, the world's first open-weight, large-scale hybrid-attention reasoning model. MiniMax-M1 is powered by a hybrid Mixture-of-Experts (MoE) architecture combined with a lightning attention mechanism. The model is developed based on our previous MiniMax-Text-01 model, which contains a total of 456 billion parameters with 45.9 billion parameters activated per token. The M1 model natively supports a context length of 1 million tokens, 8x the context size of DeepSeek R1. Furthermore, the lightning attention mechanism in MiniMax-M1 enables efficient scaling of test-time compute. These properties make M1 particularly suitable for complex tasks that require processing long inputs and thinking extensively. MiniMax-M1 is trained using large-scale reinforcement learning (RL) on diverse problems including sandbox-based, real-world software engineering environments. In addition to M1's inherent efficiency advantage for RL training, we propose CISPO, a novel RL algorithm to further enhance RL efficiency. CISPO clips importance sampling weights rather than token updates, outperforming other competitive RL variants. Combining hybrid-attention and CISPO enables MiniMax-M1's full RL training on 512 H800 GPUs to complete in only three weeks, with a rental cost of just $534,700. We release two versions of MiniMax-M1 models with 40K and 80K thinking budgets respectively, where the 40K model represents an intermediate phase of the 80K training. Experiments on standard benchmarks show that our models are comparable or superior to strong open-weight models such as the original DeepSeek-R1 and Qwen3-235B, with particular strengths in complex software engineering, tool utilization, and long-context tasks. We publicly release MiniMax-M1 at https://github.com/MiniMax-AI/MiniMax-M1.
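The abstract's one-line description of CISPO (clip the importance sampling weight, not the token update) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a per-token surrogate in which the clipped ratio is treated as a constant multiplying the log-probability gradient, and contrasts it with PPO-style clipping. The function names and the `eps` values are hypothetical.

```python
import numpy as np

def ppo_token_weights(ratio, adv, eps=0.2):
    """Per-token weight on grad(log pi) under PPO-style clipping.

    The PPO surrogate is min(r*A, clip(r)*A). Once a token's ratio leaves
    the trust region in the direction favoured by its advantage, the
    clipped branch is selected and the token contributes zero gradient.
    """
    clipped_out = ((adv > 0) & (ratio > 1 + eps)) | ((adv < 0) & (ratio < 1 - eps))
    return np.where(clipped_out, 0.0, ratio * adv)

def cispo_token_weights(ratio, adv, eps_low=0.2, eps_high=0.2):
    """Per-token weight on grad(log pi) when only the IS weight is clipped.

    Assumption: the clipped ratio acts as a stop-gradient constant, so every
    token keeps a bounded, non-zero weight instead of being dropped.
    """
    return np.clip(ratio, 1 - eps_low, 1 + eps_high) * adv

# Three tokens with the same positive advantage and different ratios.
ratio = np.array([0.5, 1.0, 1.5])
adv = np.array([1.0, 1.0, 1.0])
print(ppo_token_weights(ratio, adv))    # the token with ratio 1.5 gets weight 0
print(cispo_token_weights(ratio, adv))  # every token keeps a bounded weight
```

Under these assumptions, the token whose ratio has drifted to 1.5 is silenced by PPO-style clipping but still contributes, with a capped weight, in the CISPO-style variant.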
Various Lengths, Constant Speed: Efficient Language Modeling with Lightning Attention
Qin, Zhen, Sun, Weigao, Li, Dong, Shen, Xuyang, Sun, Weixuan, Zhong, Yiran
We present Lightning Attention, the first linear attention implementation that maintains a constant training speed for various sequence lengths under fixed memory consumption. Due to the issue with cumulative summation operations (cumsum), previous linear attention implementations cannot achieve their theoretical advantage in a causal setting. However, this issue can be effectively solved by utilizing different attention calculation strategies to compute the different parts of attention. Specifically, we split the attention calculation into intra-blocks and inter-blocks and use conventional attention computation for intra-blocks and linear attention kernel tricks for inter-blocks. This eliminates the need for cumsum in the linear attention calculation. Furthermore, a tiling technique is adopted through both forward and backward procedures to take full advantage of the GPU hardware. To enhance accuracy while preserving efficacy, we introduce TransNormerLLM (TNL), a new architecture that is tailored to our lightning attention. We conduct rigorous testing on standard and self-collected datasets with varying model sizes and sequence lengths. TNL is notably more efficient than other language models. In addition, benchmark results indicate that TNL performs on par with state-of-the-art LLMs utilizing conventional transformer structures. The source code is released at github.com/OpenNLPLab/TransnormerLLM.
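The intra-block / inter-block split described in this abstract can be sketched in NumPy as follows. This is an illustrative CPU version, not the released GPU kernel. It assumes unnormalized causal linear attention without decay, and the function names and `block_size` are hypothetical. Within a block, the masked product (Q K^T) V is computed directly; across blocks, a running K^T V state stands in for the per-position cumulative sum.

```python
import numpy as np

def causal_linear_attention_reference(q, k, v):
    """Quadratic reference: O = (Q K^T * lower-triangular mask) V."""
    n = q.shape[0]
    mask = np.tril(np.ones((n, n)))
    return ((q @ k.T) * mask) @ v

def blockwise_causal_linear_attention(q, k, v, block_size=4):
    """Same output, computed block by block.

    intra: conventional masked attention inside the current block.
    inter: Q_block @ KV, where KV accumulates K^T V over all earlier blocks.
    """
    n, d = q.shape
    out = np.zeros_like(v)
    kv = np.zeros((d, v.shape[1]))  # running sum of K_j^T V_j over past blocks
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        qb, kb, vb = q[start:end], k[start:end], v[start:end]
        b = end - start
        mask = np.tril(np.ones((b, b)))
        intra = ((qb @ kb.T) * mask) @ vb  # tokens attend within their block
        inter = qb @ kv                    # tokens attend to all earlier blocks
        out[start:end] = intra + inter
        kv += kb.T @ vb                    # update the state once per block
    return out

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(10, 3)), rng.normal(size=(10, 3)), rng.normal(size=(10, 5))
print(np.allclose(blockwise_causal_linear_attention(q, k, v),
                  causal_linear_attention_reference(q, k, v)))
```

In this sketch the state is updated once per block rather than once per token, which is the sense in which the block decomposition avoids a token-level cumsum.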
Scaling TransNormer to 175 Billion Parameters
Qin, Zhen, Li, Dong, Sun, Weigao, Sun, Weixuan, Shen, Xuyang, Han, Xiaodong, Wei, Yunshen, Lv, Baohong, Yuan, Fei, Luo, Xiao, Qiao, Yu, Zhong, Yiran
We present TransNormerLLM, the first linear attention-based Large Language Model (LLM) that outperforms conventional softmax attention-based models in terms of both accuracy and efficiency. TransNormerLLM evolves from the previous linear attention architecture TransNormer by making advanced modifications that include positional embedding, linear attention acceleration, gating mechanism, tensor normalization, inference acceleration and stabilization. Specifically, we use LRPE together with an exponential decay to avoid attention dilution issues while allowing the model to retain global interactions between tokens. Additionally, we propose Lightning Attention, a technique that makes linear attention more than twice as fast at runtime and reduces its memory usage by a factor of four. To further enhance the performance of TransNormer, we leverage a gating mechanism to smooth training and a new tensor normalization scheme to accelerate the model, resulting in an acceleration of over 20%. Furthermore, we have developed a robust inference algorithm that ensures numerical stability and consistent inference speed, regardless of the sequence length, showcasing superior efficiency during both training and inference stages. Scalability is at the heart of our model's design, enabling seamless deployment on large-scale clusters and facilitating expansion to even more extensive models, all while maintaining outstanding performance metrics. Rigorous validation of our model design is achieved through a series of comprehensive experiments on our self-collected corpus, which exceeds 6TB in size and contains over 2 trillion tokens. To ensure data quality and relevance, we implement a new self-cleaning strategy to filter our collected data. Our pre-trained models will be released to foster community advancements in efficient LLMs.
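The claim of consistent inference speed regardless of sequence length can be illustrated with a recurrent form of linear attention with exponential decay. The NumPy sketch below is a simplification under stated assumptions: a single scalar `decay`, no LRPE, no gating, no normalization, and none of the paper's numerical-stabilization details. The function names are hypothetical. Each decoding step updates a fixed-size state, so the per-token cost does not grow with the number of previous tokens.

```python
import numpy as np

def decayed_attention_parallel(q, k, v, decay):
    """Training-style form: O = (Q K^T * M) V with M[i, j] = decay**(i - j) for i >= j."""
    n = q.shape[0]
    idx = np.arange(n)
    diff = idx[:, None] - idx[None, :]
    mask = np.where(diff >= 0, decay ** np.maximum(diff, 0), 0.0)
    return ((q @ k.T) * mask) @ v

def decayed_attention_recurrent(q, k, v, decay):
    """Inference-style form: one fixed-size state update per token.

    kv_t = decay * kv_{t-1} + k_t^T v_t,   o_t = q_t kv_t
    """
    n, d = q.shape
    kv = np.zeros((d, v.shape[1]))  # state size is independent of sequence length
    out = np.zeros_like(v)
    for t in range(n):
        kv = decay * kv + np.outer(k[t], v[t])
        out[t] = q[t] @ kv
    return out

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(8, 4)), rng.normal(size=(8, 4)), rng.normal(size=(8, 6))
print(np.allclose(decayed_attention_recurrent(q, k, v, decay=0.9),
                  decayed_attention_parallel(q, k, v, decay=0.9)))
```

The two functions compute the same outputs; the recurrent one keeps only a d-by-d_v state between steps, which is why decoding cost per token stays flat in this sketch.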