AITopics | linear attention transformer

Collaborating Authors

linear attention transformer

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Demystify Mamba in Vision: A Linear Attention Perspective

Neural Information Processing SystemsMar-22-2026, 18:35:04 GMT

Mamba is an effective state space model with linear computation complexity. It has recently shown impressive efficiency in dealing with high-resolution inputs across various vision tasks. In this paper, we reveal that the powerful Mamba model shares surprising similarities with linear attention Transformer, which typically underperform conventional Transformer in practice. By exploring the similarities and disparities between the effective Mamba and subpar linear attention Transformer, we provide comprehensive analyses to demystify the key factors behind Mamba's success. Specifically, we reformulate the selective state space model and linear attention within a unified formulation, rephrasing Mamba as a variant of linear attention Transformer with six major distinctions: input gate, forget gate, shortcut, no attention normalization, single-head, and modified block design.

artificial intelligence, linear attention transformer, proceedings, (7 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Vision (0.42)

Add feedback

Demystify MambainVision: ALinearAttention Perspective

Neural Information Processing SystemsFeb-18-2026, 12:15:17 GMT

Despite its efficiency, previous works [4, 39, 15, 16] proved that linear attention suffers from insufficient expressive power, making it impractical for real applications.

artificial intelligence, justification, machine learning, (14 more...)

Neural Information Processing Systems

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Vision (0.69)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.67)

Add feedback

e618724ac897c6cf3fbfb273f8695d67-Paper-Conference.pdf

Neural Information Processing SystemsOct-10-2025, 19:49:50 GMT

linear attention, linear attention transformer, mamba, (12 more...)

Neural Information Processing Systems

Country: Asia > China (0.04)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Add feedback

Demystify Mamba in Vision: A Linear Attention Perspective

Neural Information Processing SystemsMay-27-2025, 20:02:09 GMT

linear attention perspective, linear attention transformer, mamba, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Vision (0.45)

Add feedback

DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention

Zhu, Lianghui, Huang, Zilong, Liao, Bencheng, Liew, Jun Hao, Yan, Hanshu, Feng, Jiashi, Wang, Xinggang

arXiv.org Artificial IntelligenceMay-28-2024

Diffusion models with large-scale pre-training have achieved significant success in the field of visual content generation, particularly exemplified by Diffusion Transformers (DiT). However, DiT models have faced challenges with scalability and quadratic complexity efficiency. In this paper, we aim to leverage the long sequence modeling capability of Gated Linear Attention (GLA) Transformers, expanding its applicability to diffusion models. We introduce Diffusion Gated Linear Attention Transformers (DiG), a simple, adoptable solution with minimal parameter overhead, following the DiT design, but offering superior efficiency and effectiveness. In addition to better performance than DiT, DiG-S/2 exhibits $2.5\times$ higher training speed than DiT-S/2 and saves $75.7\%$ GPU memory at a resolution of $1792 \times 1792$. Moreover, we analyze the scalability of DiG across a variety of computational complexity. DiG models, with increased depth/width or augmentation of input tokens, consistently exhibit decreasing FID. We further compare DiG with other subquadratic-time diffusion models. With the same model size, DiG-XL/2 is $4.2\times$ faster than the recent Mamba-based diffusion model at a $1024$ resolution, and is $1.8\times$ faster than DiT with CUDA-optimized FlashAttention-2 under the $2048$ resolution. All these results demonstrate its superior efficiency among the latest diffusion models. Code is released at https://github.com/hustvl/DiG.

arxiv preprint arxiv, diffusion model, transformer, (10 more...)

arXiv.org Artificial Intelligence

2405.18428

Country:

Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)

Genre: Research Report (0.84)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Add feedback

Gated Linear Attention Transformers with Hardware-Efficient Training

Yang, Songlin, Wang, Bailin, Shen, Yikang, Panda, Rameswar, Kim, Yoon

arXiv.org Artificial IntelligenceDec-24-2023

Transformers with linear attention allow for efficient parallel training but can simultaneously be formulated as an RNN with 2D (matrix-valued) hidden states, thus enjoying linear (with respect to output length) inference complexity. Recent works such as RetNet (Sun et al., 2023) and TransNormerLLM (Qin et al., 2023a) observe that adding a global decay term to the additive RNN update rule greatly improves performance, sometimes outperforming standard Transformers with softmax attention when trained at scale. In this work we show that adding a data-dependent gating mechanism further improves performance. We derive a parallel form of this gated linear attention layer that enables efficient training. However, a straightforward, numerically stable implementation of this parallel form requires generalized matrix multiplications in log-space for numerical stability, and thus cannot take advantage of tensor cores on modern GPUs which are optimized for standard matrix multiplications. We develop a hardware-efficient version of the parallel form that can still make use of tensor cores through block-parallel computations over sequence chunks. Experiments on moderate-scale language modeling (340M-parameter models trained on 15B tokens, 1.3B-parameter models trained on 100B tokens) show that gated linear attention (GLA) Transformers perform competitively against a strong LLaMA-architecture Transformer baseline (Touvron et al., 2023) as well as Mamba (Gu & Dao, 2023), a recently introduced state-space model with a data-dependent state transition mechanism. For training speed, our Triton-based implementation performs comparably to CUDA-optimized FlashAttention-2 (Dao, 2023) under the regular 2048 training length setting, while outperforming FlashAttention-2 when training on longer sequences beyond 4096.

computation, parallel form, transformer, (12 more...)

arXiv.org Artificial Intelligence

2312.06635

Country:

North America > United States > Arizona > Maricopa County > Phoenix (0.04)
North America > Dominican Republic (0.04)
Africa > Rwanda > Kigali > Kigali (0.04)
(7 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Span-Selective Linear Attention Transformers for Effective and Robust Schema-Guided Dialogue State Tracking

Bebensee, Björn, Lee, Haejun

arXiv.org Artificial IntelligenceJun-15-2023

In schema-guided dialogue state tracking models estimate the current state of a conversation using natural language descriptions of the service schema for generalization to unseen services. Prior generative approaches which decode slot values sequentially do not generalize well to variations in schema, while discriminative approaches separately encode history and schema and fail to account for inter-slot and intent-slot dependencies. We introduce SPLAT, a novel architecture which achieves better generalization and efficiency than prior approaches by constraining outputs to a limited prediction space. At the same time, our model allows for rich attention among descriptions and history while keeping computation costs constrained by incorporating linear-time attention. We demonstrate the effectiveness of our model on the Schema-Guided Dialogue (SGD) and MultiWOZ datasets. Our approach significantly improves upon existing models achieving 85.3 JGA on the SGD dataset. Further, we show increased robustness on the SGD-X benchmark: our model outperforms the more than 30$\times$ larger D3ST-XXL model by 5.0 points.

artificial intelligence, natural language, representation, (14 more...)

arXiv.org Artificial Intelligence

2306.0934

Country:

North America > Dominican Republic (0.04)
North America > United States > Washington > King County > Seattle (0.04)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
(6 more...)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (0.87)

Add feedback