O(n) Connections are Expressive Enough: Universal Approximability of Sparse Transformers
Recently, Transformer networks have redefined the state of the art in many NLP tasks. However, these models incur a computational cost quadratic in the input sequence length $n$ to compute pairwise attention in each layer. This has prompted recent research into sparse Transformers that sparsify the connections in the attention layers. While empirically promising for long sequences, fundamental questions remain unanswered: Can sparse Transformers approximate any arbitrary sequence-to-sequence function, similar to their dense counterparts? How do the sparsity pattern and the sparsity level affect their performance? In this paper, we address these questions and provide a unifying framework that captures existing sparse attention models. We propose sufficient conditions under which a sparse attention model can universally approximate any sequence-to-sequence function. Surprisingly, our results show that sparse Transformers with only $O(n)$ connections per attention layer can approximate the same function class as the dense model with $n^2$ connections.
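To make the $O(n)$-connection claim concrete, here is a minimal sketch of one common sparse attention pattern: a sliding window of fixed width $w$, where each token attends only to tokens within distance $w$, giving $O(nw) = O(n)$ connections per layer for fixed $w$. This is an illustrative example of a sparsity pattern, not the paper's specific construction; all names and sizes below are assumptions for the sketch.

```python
import numpy as np

def sliding_window_mask(n, w):
    """Boolean mask: token i may attend to token j only if |i - j| <= w.
    Each row has at most 2*w + 1 True entries, so the total number of
    connections is O(n * w) = O(n) for fixed window width w."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w

def sparse_attention(Q, K, V, w):
    """Masked scaled dot-product attention: disallowed pairs are set to
    -inf before the softmax, so they receive exactly zero weight."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    scores = np.where(sliding_window_mask(n, w), scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V

n, d, w = 8, 4, 2
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = sparse_attention(Q, K, V, w)
mask = sliding_window_mask(n, w)
print(out.shape)    # (8, 4)
print(mask.sum())   # connection count grows linearly in n for fixed w
```

The diagonal is always unmasked, so every row of the softmax is well defined; dense attention corresponds to the limit $w \ge n - 1$.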
We are encouraged that the reviewers find our paper clear and well written (R1, R2, R3) and our method theoretically sound.
We would like to thank the reviewers for their helpful comments and their thorough evaluation of our work. Reversible layers are a technique introduced by Gomez et al. (2017) and are orthogonal to our approach. In contrast, clustered attention places no such restriction. We will also add Set Transformers to the related work section. Regarding whether speech is particularly favorable to clustering: we note our NLP approximation experiment on the GLUE and SQuAD tasks in Section 4.3, and we will consider NLP/vision tasks in the long-context setting, as suggested.
A Supplementary Materials
A.1 Dataset Description We describe additional details of each dataset in the following. We use the first 90% of the data as the training set and the last 10% as the validation set, and we leave the last 210 days as the test set. We further evaluate sensitivity to different hyper-parameters. The distribution of attention densities for different α is shown in Figure 2.
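The chronological splits described above can be sketched as follows. This is a hypothetical illustration, assuming an ordered daily series; the series length and variable names are stand-ins, not details from the paper.

```python
def chronological_split(series, train_frac=0.9):
    """Split an ordered sequence into train/validation without shuffling,
    so the validation set always comes strictly after the training set."""
    cut = int(len(series) * train_frac)
    return series[:cut], series[cut:]

daily_values = list(range(1000))   # stand-in for 1000 days of ordered data

# Hold out the final 210 days as the test set, as described above.
dev, test = daily_values[:-210], daily_values[-210:]

# 90/10 chronological split of the remaining days.
train, val = chronological_split(dev)

print(len(train), len(val), len(test))
```

Keeping the splits chronological avoids leaking future information into training, which matters for time-series evaluation.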
Sparse Transformer Architectures via Regularized Wasserstein Proximal Operator with $L_1$ Prior
Fuqun Han, Stanley Osher, Wuchen Li
Modern generative models, such as neural ordinary differential equations (neural ODEs) [4], transformers [25], and diffusion models [22], have demonstrated remarkable ability to learn and generate samples from complex, high-dimensional probability distributions. These architectures have achieved broad success in scientific computing, image processing, and data science, offering scalable frameworks for data-driven modeling. However, training and sampling in such high-dimensional spaces remain expensive and highly sensitive to architectural and optimization choices. Despite these advances, the curse of dimensionality continues to present a fundamental challenge in many real-world applications. Fortunately, numerous problems in scientific computing exhibit intrinsic structures, such as sparsity, low-rank representations, or approximate invariances, that can be interpreted as prior information about the underlying data or operators. Leveraging such priors within generative models offers a promising avenue to improve both computational efficiency and generalization. A classical way to incorporate prior information, such as sparsity or piecewise regularity, is through Bayesian modeling, where the posterior combines a prior distribution encoding structural knowledge with a likelihood function derived from observations.
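As a concrete instance of the Bayesian construction just described, a Gaussian likelihood combined with a Laplace ($L_1$) prior yields a posterior whose negative log-density is the familiar $L_1$-regularized least-squares objective. The symbols here ($A$, $y$, $\sigma$, $\lambda$) are illustrative, not the paper's specific model:

$$
p(x \mid y) \;\propto\; \exp\!\Big(-\tfrac{1}{2\sigma^2}\|Ax - y\|_2^2\Big)\,\exp\!\big(-\lambda \|x\|_1\big),
$$

so that

$$
-\log p(x \mid y) \;=\; \tfrac{1}{2\sigma^2}\|Ax - y\|_2^2 \;+\; \lambda \|x\|_1 \;+\; \text{const},
$$

where the $\|x\|_1$ term is exactly the sparsity-promoting prior: maximizing the posterior drives many coordinates of $x$ to zero.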