AITopics | lightning attention

MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

MiniMax: Chen, Aili, Li, Aonian, Gong, Bangwei, Jiang, Binyang, Fei, Bo, Yang, Bo, Shan, Boji, Yu, Changqing, Wang, Chao, Zhu, Cheng, Xiao, Chengjun, Du, Chengyu, Zhang, Chi, Qiao, Chu, Zhang, Chunhao, Du, Chunhui, Guo, Congchao, Chen, Da, Ding, Deming, Sun, Dianjun, Li, Dong, Jiao, Enwei, Zhou, Haigang, Zhang, Haimo, Ding, Han, Sun, Haohai, Feng, Haoyu, Cai, Huaiguang, Zhu, Haichao, Sun, Jian, Zhuang, Jiaqi, Cai, Jiaren, Song, Jiayuan, Zhu, Jin, Li, Jingyang, Tian, Jinhao, Liu, Jinli, Xu, Junhao, Yan, Junjie, Liu, Junteng, He, Junxian, Feng, Kaiyi, Yang, Ke, Xiao, Kecheng, Han, Le, Wang, Leyang, Yu, Lianfei, Feng, Liheng, Li, Lin, Zheng, Lin, Du, Linge, Yang, Lingyu, Zeng, Lunbin, Yu, Minghui, Tao, Mingliang, Chi, Mingyuan, Zhang, Mozhi, Lin, Mujie, Hu, Nan, Di, Nongyu, Gao, Peng, Li, Pengfei, Zhao, Pengyu, Ren, Qibing, Xu, Qidi, Li, Qile, Wang, Qin, Tian, Rong, Leng, Ruitao, Chen, Shaoxiang, Chen, Shaoyu, Shi, Shengmin, Weng, Shitong, Guan, Shuchang, Yu, Shuqi, Li, Sichen, Zhu, Songquan, Li, Tengfei, Cai, Tianchi, Liang, Tianrun, Cheng, Weiyu, Kong, Weize, Li, Wenkai, Chen, Xiancai, Song, Xiangjun, Luo, Xiao, Su, Xiao, Li, Xiaobo, Han, Xiaodong, Hou, Xinzhu, Lu, Xuan, Zou, Xun, Shen, Xuyang, Gong, Yan, Ma, Yan, Wang, Yang, Shi, Yiqi, Zhong, Yiran, Duan, Yonghong, Fu, Yongxiang, Hu, Yongyi, Gao, Yu, Fan, Yuanxiang, Yang, Yufeng, Li, Yuhao, Hu, Yulin, Huang, Yunan, Li, Yunji, Xu, Yunzhi, Mao, Yuxin, Shi, Yuxuan, Wenren, Yuze, Li, Zehan, Li, Zelin, Tian, Zhanxu, Zhu, Zhengmao, Fan, Zhenhua, Wu, Zhenzhen, Xu, Zhichao, Yu, Zhihang, Lyu, Zhiheng, Jiang, Zhuo, Gao, Zibo, Wu, Zijia, Song, Zijian, Sun, Zijun

arXiv.org Artificial Intelligence, Jun-17-2025

We introduce MiniMax-M1, the world's first open-weight, large-scale hybrid-attention reasoning model. MiniMax-M1 is powered by a hybrid Mixture-of-Experts (MoE) architecture combined with a lightning attention mechanism. The model is developed based on our previous MiniMax-Text-01 model, which contains a total of 456 billion parameters with 45.9 billion parameters activated per token. The M1 model natively supports a context length of 1 million tokens, 8x the context size of DeepSeek R1. Furthermore, the lightning attention mechanism in MiniMax-M1 enables efficient scaling of test-time compute. These properties make M1 particularly suitable for complex tasks that require processing long inputs and thinking extensively. MiniMax-M1 is trained using large-scale reinforcement learning (RL) on diverse problems including sandbox-based, real-world software engineering environments. In addition to M1's inherent efficiency advantage for RL training, we propose CISPO, a novel RL algorithm to further enhance RL efficiency. CISPO clips importance sampling weights rather than token updates, outperforming other competitive RL variants. Combining hybrid-attention and CISPO enables MiniMax-M1's full RL training on 512 H800 GPUs to complete in only three weeks, with a rental cost of just $534,700. We release two versions of MiniMax-M1 models with 40K and 80K thinking budgets respectively, where the 40K model represents an intermediate phase of the 80K training. Experiments on standard benchmarks show that our models are comparable or superior to strong open-weight models such as the original DeepSeek-R1 and Qwen3-235B, with particular strengths in complex software engineering, tool utilization, and long-context tasks. We publicly release MiniMax-M1 at https://github.com/MiniMax-AI/MiniMax-M1.
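The abstract's one-line description of CISPO (clipping importance sampling weights rather than token updates) can be illustrated with a small comparison. The sketch below is one reading of that sentence, not the paper's implementation: the function names, the default clip ranges, and the choice to compare against a PPO-style clipped objective are all assumptions made for illustration. Each function returns the coefficient that multiplies the gradient of a token's log-probability.

```python
def ppo_grad_coeff(ratio, advantage, eps=0.2):
    """Gradient coefficient for the PPO-style objective min(r*A, clip(r)*A).

    Once the clipped branch is selected, the objective is constant in the
    policy, so the token contributes no update at all.
    """
    clipped = min(max(ratio, 1 - eps), 1 + eps)
    if ratio * advantage <= clipped * advantage:
        return ratio * advantage  # unclipped branch: d(r*A)/d(log pi) = r*A
    return 0.0                    # clipped branch: token update is dropped


def cispo_style_grad_coeff(ratio, advantage, eps_low=0.2, eps_high=0.2):
    """Gradient coefficient when the clipped weight is treated as a constant.

    Assumed objective: stop_gradient(clip(r)) * A * log pi, so the weight is
    bounded but every token still contributes an update.
    """
    clipped = min(max(ratio, 1 - eps_low), 1 + eps_high)
    return clipped * advantage


# A token whose probability ratio has drifted far above 1 with positive advantage.
ratio, advantage = 1.5, 1.0
ppo_coeff = ppo_grad_coeff(ratio, advantage)
cispo_coeff = cispo_style_grad_coeff(ratio, advantage)
print(ppo_coeff, cispo_coeff)
```

Under these assumptions the PPO-style rule removes the token from the update entirely, while the weight-clipping rule keeps it with a bounded coefficient of about 1.2.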

arxiv preprint arxiv, machine learning, natural language, (16 more...)

2506.13585

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


Various Lengths, Constant Speed: Efficient Language Modeling with Lightning Attention

Qin, Zhen, Sun, Weigao, Li, Dong, Shen, Xuyang, Sun, Weixuan, Zhong, Yiran

arXiv.org Artificial Intelligence, Jun-20-2024

We present Lightning Attention, the first linear attention implementation that maintains a constant training speed for various sequence lengths under fixed memory consumption. Due to the issue with cumulative summation operations (cumsum), previous linear attention implementations cannot achieve their theoretical advantage in a causal setting. However, this issue can be effectively solved by utilizing different attention calculation strategies to compute the different parts of attention. Specifically, we split the attention calculation into intra-blocks and inter-blocks and use conventional attention computation for intra-blocks and linear attention kernel tricks for inter-blocks. This eliminates the need for cumsum in the linear attention calculation. Furthermore, a tiling technique is adopted through both forward and backward procedures to take full advantage of the GPU hardware. To enhance accuracy while preserving efficacy, we introduce TransNormerLLM (TNL), a new architecture that is tailored to our lightning attention. We conduct rigorous testing on standard and self-collected datasets with varying model sizes and sequence lengths. TNL is notably more efficient than other language models. In addition, benchmark results indicate that TNL performs on par with state-of-the-art LLMs utilizing conventional transformer structures. The source code is released at github.com/OpenNLPLab/TransnormerLLM.
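The intra-block / inter-block split described in this abstract can be sketched in a few lines of plain Python. This is a minimal CPU illustration under simplifying assumptions (no decay, no normalization, no GPU tiling, lists of rows instead of tensors); the function names are hypothetical and it is not the released kernel. Inside each block it uses the conventional masked product (Q K^T) V, and across blocks it multiplies Q by a running K^T V state, so no per-token cumulative sum is needed.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def causal_linear_attention_naive(q, k, v):
    """Reference: O = (Q K^T restricted to the lower triangle) V, without softmax."""
    scores = matmul(q, transpose(k))
    masked = [[s if j <= i else 0.0 for j, s in enumerate(row)]
              for i, row in enumerate(scores)]
    return matmul(masked, v)


def causal_linear_attention_blocked(q, k, v, block_size):
    """Block-wise form: masked attention inside a block, running state across blocks."""
    d_k, d_v = len(k[0]), len(v[0])
    kv_state = [[0.0] * d_v for _ in range(d_k)]  # sum of K^T V over earlier blocks
    out = []
    for start in range(0, len(q), block_size):
        qb = q[start:start + block_size]
        kb = k[start:start + block_size]
        vb = v[start:start + block_size]
        intra = causal_linear_attention_naive(qb, kb, vb)  # conventional computation
        inter = matmul(qb, kv_state)                       # linear-attention kernel trick
        out.extend(add(intra, inter))
        kv_state = add(kv_state, matmul(transpose(kb), vb))
    return out


# Usage: both forms should agree up to floating-point rounding.
random.seed(0)
n, d = 10, 4
q = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
k = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
v = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
reference = causal_linear_attention_naive(q, k, v)
blocked = causal_linear_attention_blocked(q, k, v, block_size=3)
max_diff = max(abs(a - b) for ra, rb in zip(reference, blocked) for a, b in zip(ra, rb))
print(max_diff < 1e-9)
```

In this sketch the work per block is fixed for a given block size and head dimension, so total cost grows linearly with sequence length.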

language model, lightning attention, tnl, (14 more...)

2405.17381

Country:

North America > United States (0.14)
Europe > Austria > Vienna (0.14)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
(2 more...)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


Scaling TransNormer to 175 Billion Parameters

Qin, Zhen, Li, Dong, Sun, Weigao, Sun, Weixuan, Shen, Xuyang, Han, Xiaodong, Wei, Yunshen, Lv, Baohong, Yuan, Fei, Luo, Xiao, Qiao, Yu, Zhong, Yiran

arXiv.org Artificial Intelligence, Jul-27-2023

We present TransNormerLLM, the first linear attention-based Large Language Model (LLM) that outperforms conventional softmax attention-based models in terms of both accuracy and efficiency. TransNormerLLM evolves from the previous linear attention architecture TransNormer by making advanced modifications that include positional embedding, linear attention acceleration, gating mechanism, tensor normalization, inference acceleration and stabilization. Specifically, we use LRPE together with an exponential decay to avoid attention dilution issues while allowing the model to retain global interactions between tokens. Additionally, we propose Lightning Attention, a cutting-edge technique that accelerates linear attention by more than twice in runtime and reduces memory usage by a remarkable four times. To further enhance the performance of TransNormer, we leverage a gating mechanism to smooth training and a new tensor normalization scheme to accelerate the model, resulting in an impressive acceleration of over 20%. Furthermore, we have developed a robust inference algorithm that ensures numerical stability and consistent inference speed, regardless of the sequence length, showcasing superior efficiency during both training and inference stages. Scalability is at the heart of our model's design, enabling seamless deployment on large-scale clusters and facilitating expansion to even more extensive models, all while maintaining outstanding performance metrics. Rigorous validation of our model design is achieved through a series of comprehensive experiments on our self-collected corpus, boasting a size exceeding 6TB and containing over 2 trillion tokens. To ensure data quality and relevance, we implement a new self-cleaning strategy to filter our collected data. Our pre-trained models will be released to foster community advancements in efficient LLMs.
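The claim of consistent inference speed regardless of sequence length follows from the recurrent form of linear attention, which can be sketched as below. This is an illustration only: it assumes a single scalar decay rate and omits LRPE, gating, and normalization, so it is not TransNormerLLM's actual inference algorithm, and the function names are hypothetical. The recurrent version keeps a fixed-size state, so each new token costs the same amount of work at any position.

```python
def decayed_linear_attention_parallel(q, k, v, decay):
    """Reference: o_t = sum over s <= t of decay**(t - s) * (q_t . k_s) * v_s."""
    out = []
    for t, q_t in enumerate(q):
        o_t = [0.0] * len(v[0])
        for s in range(t + 1):
            weight = decay ** (t - s) * sum(a * b for a, b in zip(q_t, k[s]))
            o_t = [o + weight * x for o, x in zip(o_t, v[s])]
        out.append(o_t)
    return out


def decayed_linear_attention_recurrent(q, k, v, decay):
    """Recurrent form with a fixed-size (d_k x d_v) state updated once per token."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    out = []
    for q_t, k_t, v_t in zip(q, k, v):
        # state <- decay * state + outer(k_t, v_t)
        state = [[decay * state[i][j] + k_t[i] * v_t[j] for j in range(d_v)]
                 for i in range(d_k)]
        # o_t = q_t @ state
        out.append([sum(q_t[i] * state[i][j] for i in range(d_k)) for j in range(d_v)])
    return out


# Usage: a tiny hand-written sequence of three tokens with 2-dimensional heads.
q = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
k = [[1.0, 2.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
parallel = decayed_linear_attention_parallel(q, k, v, decay=0.9)
recurrent = decayed_linear_attention_recurrent(q, k, v, decay=0.9)
max_gap = max(abs(a - b) for ra, rb in zip(parallel, recurrent) for a, b in zip(ra, rb))
print(max_gap < 1e-12)
```

The decay factor shrinks the contribution of older tokens geometrically, which is one simple way an exponential decay can be combined with a running state.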

large language model, machine learning, natural language, (17 more...)

2307.14995

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Middle East > Jordan (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
(3 more...)

Genre: Research Report > Promising Solution (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
