O(n) Connections are Expressive Enough: Universal Approximability of Sparse Transformers
Recently, Transformer networks have redefined the state of the art in many NLP tasks. However, these models incur a computational cost quadratic in the input sequence length $n$ to compute pairwise attention in each layer. This has prompted recent research into sparse Transformers that sparsify the connections in the attention layers. While empirically promising for long sequences, fundamental questions remain unanswered: Can sparse Transformers approximate any arbitrary sequence-to-sequence function, similar to their dense counterparts? How do the sparsity pattern and the sparsity level affect their performance? In this paper, we address these questions and provide a unifying framework that captures existing sparse attention models. We propose sufficient conditions under which a sparse attention model can universally approximate any sequence-to-sequence function. Surprisingly, our results show that sparse Transformers with only $O(n)$ connections per attention layer can approximate the same function class as the dense model with $n^2$ connections.
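To make the $O(n)$-connection claim concrete, here is a minimal sketch of one common sparse attention pattern: a sliding window of fixed width $w$, where each token attends only to tokens within distance $w$, giving $O(nw) = O(n)$ connections per layer for fixed $w$. This is an illustrative example of a sparsity pattern, not the paper's specific construction; all names and sizes below are assumptions for the sketch.

```python
import numpy as np

def sliding_window_mask(n, w):
    """Boolean mask: token i may attend to token j only if |i - j| <= w.
    Each row has at most 2*w + 1 True entries, so the total number of
    connections is O(n * w) = O(n) for fixed window width w."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w

def sparse_attention(Q, K, V, w):
    """Masked scaled dot-product attention: disallowed pairs are set to
    -inf before the softmax, so they receive exactly zero weight."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    scores = np.where(sliding_window_mask(n, w), scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V

n, d, w = 8, 4, 2
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = sparse_attention(Q, K, V, w)
mask = sliding_window_mask(n, w)
print(out.shape)    # (8, 4)
print(mask.sum())   # connection count grows linearly in n for fixed w
```

The diagonal is always unmasked, so every row of the softmax is well defined; dense attention corresponds to the limit $w \ge n - 1$.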
We are encouraged that the reviewers find our paper clear and well written (R1, R2, R3) and our method theoretically sound.
We would like to thank the reviewers for their helpful comments and their thorough evaluation of our work. Reversible layers are a technique introduced by Gomez et al. (2017) and are orthogonal to our approach. In contrast, clustered attention places no such restriction. We will also add Set Transformers to the related work section. Regarding whether speech is particularly favorable to clustering: we note our NLP approximation experiment on the GLUE and SQuAD tasks in Section 4.3, and we will consider NLP/vision tasks in the long-context setting, as suggested.
A Supplementary Materials
A.1 Dataset Description We describe additional details of each dataset in the following. We use the first 90% of the data as the training set and the last 10% as the validation set, and we leave the last 210 days as the test set. We further evaluate sensitivity to different hyper-parameters. The distribution of attention densities for different α is shown in Figure 2.
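The chronological splits described above can be sketched as follows. This is a hypothetical illustration, assuming an ordered daily series; the series length and variable names are stand-ins, not details from the paper.

```python
def chronological_split(series, train_frac=0.9):
    """Split an ordered sequence into train/validation without shuffling,
    so the validation set always comes strictly after the training set."""
    cut = int(len(series) * train_frac)
    return series[:cut], series[cut:]

daily_values = list(range(1000))   # stand-in for 1000 days of ordered data

# Hold out the final 210 days as the test set, as described above.
dev, test = daily_values[:-210], daily_values[-210:]

# 90/10 chronological split of the remaining days.
train, val = chronological_split(dev)

print(len(train), len(val), len(test))
```

Keeping the splits chronological avoids leaking future information into training, which matters for time-series evaluation.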
Sparse Transformer Architectures via Regularized Wasserstein Proximal Operator with $L_1$ Prior
Fuqun Han, Stanley Osher, Wuchen Li
Modern generative models, such as neural ordinary differential equations (neural ODEs) [4], transformers [25], and diffusion models [22], have demonstrated remarkable ability to learn and generate samples from complex, high-dimensional probability distributions. These architectures have achieved broad success in scientific computing, image processing, and data science, offering scalable frameworks for data-driven modeling. However, training and sampling in such high-dimensional spaces remain expensive and highly sensitive to architectural and optimization choices. Despite these advances, the curse of dimensionality continues to present a fundamental challenge in many real-world applications. Fortunately, numerous problems in scientific computing exhibit intrinsic structures, such as sparsity, low-rank representations, or approximate invariances, that can be interpreted as prior information about the underlying data or operators. Leveraging such priors within generative models offers a promising avenue to improve both computational efficiency and generalization. A classical way to incorporate prior information, such as sparsity or piecewise regularity, is through Bayesian modeling, where the posterior combines a prior distribution encoding structural knowledge with a likelihood function derived from observations.
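As a concrete instance of the Bayesian construction just described, a Gaussian likelihood combined with a Laplace ($L_1$) prior yields a posterior whose negative log-density is the familiar $L_1$-regularized least-squares objective. The symbols here ($A$, $y$, $\sigma$, $\lambda$) are illustrative, not the paper's specific model:

$$
p(x \mid y) \;\propto\; \exp\!\Big(-\tfrac{1}{2\sigma^2}\|Ax - y\|_2^2\Big)\,\exp\!\big(-\lambda \|x\|_1\big),
$$

so that

$$
-\log p(x \mid y) \;=\; \tfrac{1}{2\sigma^2}\|Ax - y\|_2^2 \;+\; \lambda \|x\|_1 \;+\; \text{const},
$$

where the $\|x\|_1$ term is exactly the sparsity-promoting prior: maximizing the posterior drives many coordinates of $x$ to zero.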