AITopics | self-attention model

Collaborating Authors

self-attention model

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Funnel-Transformer: FilteringoutSequential RedundancyforEfficientLanguageProcessing

Neural Information Processing SystemsFeb-7-2026, 22:54:55 GMT

With the success of language pretraining, it is highly desirable to develop more efficient architectures ofgood scalability thatcanexploit theabundant unlabeled dataatalowercost.

artificial intelligence, machine learning, natural language, (16 more...)

Neural Information Processing Systems

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection

Neural Information Processing SystemsDec-24-2025, 14:11:48 GMT

While the self-attention mechanism has been widely used in a wide variety of tasks, it has the unfortunate property of a quadratic cost with respect to the input length, which makes it difficult to deal with long inputs. In this paper, we present a method for accelerating and structuring self-attentions: Sparse Adaptive Connection (SAC). In SAC, we regard the input sequence as a graph and attention operations are performed between linked nodes. In contrast with previous self-attention models with pre-defined structures (edges), the model learns to construct attention edges to improve task-specific performances. In this way, the model is able to select the most salient nodes and reduce the quadratic complexity regardless of the sequence length. Based on SAC, we show that previous variants of self-attention models are its special cases. Through extensive experiments on neural machine translation, language modeling, graph representation learning and image classification, we demonstrate SAC is competitive with state-of-the-art models while significantly reducing memory cost.

accelerating and structuring self-attention, name change, sparse adaptive connection, (4 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.98)

Add feedback

Rethinking Graph-Based Document Classification: Learning Data-Driven Structures Beyond Heuristic Approaches

Bugueño, Margarita, de Melo, Gerard

arXiv.org Artificial IntelligenceAug-5-2025

In document classification, graph-based models effectively capture document structure, overcoming sequence length limitations and enhancing contextual understanding. However, most existing graph document representations rely on heuristics, domain-specific rules, or expert knowledge. Unlike previous approaches, we propose a method to learn data-driven graph structures, eliminating the need for manual design and reducing domain dependence. Our approach constructs homogeneous weighted graphs with sentences as nodes, while edges are learned via a self-attention model that identifies dependencies between sentence pairs. A statistical filtering strategy aims to retain only strongly correlated sentences, improving graph quality while reducing the graph size. Experiments on three document classification datasets demonstrate that learned graphs consistently outperform heuristic-based graphs, achieving higher accuracy and $F_1$ score. Furthermore, our study demonstrates the effectiveness of the statistical filtering in improving classification robustness. These results highlight the potential of automatic graph generation over traditional heuristic approaches and open new directions for broader applications in NLP.

machine learning, natural language, text classification, (15 more...)

arXiv.org Artificial Intelligence

2508.00864

Country:

Europe (0.93)
North America > United States > Minnesota (0.28)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Does Self-Attention Need Separate Weights in Transformers?

Kowsher, Md, Prottasha, Nusrat Jahan, Yu, Chun-Nam

arXiv.org Artificial IntelligenceNov-29-2024

The success of self-attention lies in its ability to capture long-range dependencies and enhance context understanding, but it is limited by its computational complexity and challenges in handling sequential data with inherent directionality. This work introduces a shared weight self-attention-based BERT model that only learns one weight matrix for (Key, Value, and Query) representations instead of three individual matrices for each of them. Our shared weight attention reduces the training parameter size by more than half and training time by around one-tenth. Furthermore, we demonstrate higher prediction accuracy on small tasks of GLUE over the BERT baseline and in particular a generalization power on noisy and out-of-domain data. Experimental results indicate that our shared self-attention method achieves a parameter size reduction of 66.53% in the attention block. In the GLUE dataset, the shared weight self-attention-based BERT model demonstrates accuracy improvements of 0.38%, 5.81%, and 1.06% over the standard, symmetric, and pairwise attention-based BERT models, respectively. The model and source code are available at Anonymous.

arxiv preprint arxiv, mechanism, self-attention mechanism, (12 more...)

arXiv.org Artificial Intelligence

2412.00359

Country: North America > United States (0.14)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection

Neural Information Processing SystemsOct-11-2024, 07:30:39 GMT

accelerating and structuring self-attention, sac, sparse adaptive connection, (1 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.43)

Add feedback

Dynamic metastability in the self-attention model

Geshkovski, Borjan, Koubbi, Hugo, Polyanskiy, Yury, Rigollet, Philippe

arXiv.org Artificial IntelligenceOct-9-2024

We consider the self-attention model - an interacting particle system on the unit sphere, which serves as a toy model for Transformers, the deep neural network architecture behind the recent successes of large language models. We prove the appearance of dynamic metastability conjectured in [GLPR23] - although particles collapse to a single cluster in infinite time, they remain trapped near a configuration of several clusters for an exponentially long period of time. By leveraging a gradient flow interpretation of the system, we also connect our result to an overarching framework of slow motion of gradient flows proposed by Otto and Reznikoff [OR07] in the context of coarsening and the Allen-Cahn equation. We finally probe the dynamics beyond the exponentially long period of metastability, and illustrate that, under an appropriate time-rescaling, the energy reaches its global maximum in finite time and has a staircase profile, with trajectories manifesting saddle-to-saddle-like behavior, reminiscent of recent works in the analysis of training dynamics via gradient descent for two-layer neural networks.

configuration, particle, theorem 1, (13 more...)

arXiv.org Artificial Intelligence

2410.06833

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Japan > Honshū > Kansai > Kyoto Prefecture > Kyoto (0.04)
(5 more...)

Genre: Research Report (0.70)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

Add feedback

Optimizing Cycle Life Prediction of Lithium-ion Batteries via a Physics-Informed Model

Nicolae, Constantin-Daniel, Sameer, Sara, Sun, Nathan, Yan, Karena

arXiv.org Artificial IntelligenceApr-26-2024

Accurately measuring the cycle lifetime of commercial lithium-ion batteries is crucial for performance and technology development. We introduce a novel hybrid approach combining a physics-based equation with a self-attention model to predict the cycle lifetimes of commercial lithium iron phosphate graphite cells via early-cycle data. After fitting capacity loss curves to this physics-based equation, we then use a self-attention layer to reconstruct entire battery capacity loss curves. Our model exhibits comparable performances to existing models while predicting more information: the entire capacity loss curve instead of cycle life. This provides more robustness and interpretability: our model does not need to be retrained for a different notion of end-of-life and is backed by physical intuition.

battery, capacity loss curve, cycle life, (12 more...)

arXiv.org Artificial Intelligence

2404.17174

Genre: Research Report (1.00)

Industry:

Energy > Energy Storage (1.00)
Electrical Industrial Apparatus (1.00)
Transportation > Ground > Road (0.47)
Materials > Metals & Mining > Lithium (0.45)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Add feedback

From Self-Attention to Markov Models: Unveiling the Dynamics of Generative Transformers

Ildiz, M. Emrullah, Huang, Yixiao, Li, Yingcong, Rawat, Ankit Singh, Oymak, Samet

arXiv.org Artificial IntelligenceFeb-20-2024

Modern language models rely on the transformer architecture and attention mechanism to perform language understanding and text generation. In this work, we study learning a 1-layer self-attention model from a set of prompts and associated output data sampled from the model. We first establish a precise mapping between the self-attention mechanism and Markov models: Inputting a prompt to the model samples the output token according to a context-conditioned Markov chain (CCMC) which weights the transition matrix of a base Markov chain. Additionally, incorporating positional encoding results in position-dependent scaling of the transition probabilities. Building on this formalism, we develop identifiability/coverage conditions for the prompt distribution that guarantee consistent estimation and establish sample complexity guarantees under IID samples. Finally, we study the problem of learning from a single output trajectory generated from an initial prompt. We characterize an intriguing winner-takes-all phenomenon where the generative process implemented by self-attention collapses into sampling a limited subset of tokens due to its non-mixing nature. This provides a mathematical explanation to the tendency of modern LLMs to generate repetitive text. In summary, the equivalence to CCMC provides a simple but powerful framework to study self-attention and its properties.

markov chain, probability, trajectory, (13 more...)

arXiv.org Artificial Intelligence

2402.13512

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > Michigan > Washtenaw County > Ann Arbor (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.64)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (1.00)

Add feedback

Introduction to Transformers: an NLP Perspective

Xiao, Tong, Zhu, Jingbo

arXiv.org Artificial IntelligenceNov-29-2023

Transformers have dominated empirical machine learning models of natural language processing. In this paper, we introduce basic concepts of Transformers and present key techniques that form the recent advances of these models. This includes a description of the standard Transformer architecture, a series of model refinements, and common applications. Given that Transformers and related deep learning techniques might be evolving in ways we have never seen, we cannot dive into all the model details or cover all the technical areas. Instead, we focus on just those concepts that are helpful for gaining a good understanding of Transformers and their variants. We also summarize the key ideas that impact this field, thereby yielding some insights into the strengths and limitations of these models.

deep neural network, information processing system, self-attention model, (14 more...)

arXiv.org Artificial Intelligence

2311.17633

Country:

Asia > China > Liaoning Province > Shenyang (0.04)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)

Genre:

Research Report (1.00)
Overview (0.93)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(3 more...)

Add feedback

Recasting Self-Attention with Holographic Reduced Representations

Alam, Mohammad Mahmudul, Raff, Edward, Biderman, Stella, Oates, Tim, Holt, James

arXiv.org Artificial IntelligenceMay-30-2023

In recent years, self-attention has become the dominant paradigm for sequence modeling in a variety of domains. However, in domains with very long sequence lengths the $\mathcal{O}(T^2)$ memory and $\mathcal{O}(T^2 H)$ compute costs can make using transformers infeasible. Motivated by problems in malware detection, where sequence lengths of $T \geq 100,000$ are a roadblock to deep learning, we re-cast self-attention using the neuro-symbolic approach of Holographic Reduced Representations (HRR). In doing so we perform the same high-level strategy of the standard self-attention: a set of queries matching against a set of keys, and returning a weighted response of the values for each key. Implemented as a ``Hrrformer'' we obtain several benefits including $\mathcal{O}(T H \log H)$ time complexity, $\mathcal{O}(T H)$ space complexity, and convergence in $10\times$ fewer epochs. Nevertheless, the Hrrformer achieves near state-of-the-art accuracy on LRA benchmarks and we are able to learn with just a single layer. Combined, these benefits make our Hrrformer the first viable Transformer for such long malware classification sequences and up to $280\times$ faster to train on the Long Range Arena benchmark. Code is available at \url{https://github.com/NeuromorphicComputationResearchProgram/Hrrformer}

hrrformer, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2305.19534

Country:

North America > United States > New York > New York County > New York City (0.14)
North America > United States > Maryland > Baltimore County (0.14)
North America > United States > Maryland > Baltimore (0.14)
(8 more...)

Genre: Research Report (0.65)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(3 more...)

Add feedback