AITopics | standard transformer

Collaborating Authors

standard transformer

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

995f693b73050f90977ed2828202645c-Supplemental-Conference.pdf

Neural Information Processing SystemsFeb-16-2026, 01:59:03 GMT

artificial intelligence, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

995f693b73050f90977ed2828202645c-Paper-Conference.pdf

Neural Information Processing SystemsFeb-16-2026, 01:58:59 GMT

logic & formal reasoning, machine learning, programming language, (22 more...)

Neural Information Processing Systems

Country:

North America > United States (0.14)
Oceania > Australia (0.04)
North America > Canada > Ontario > Toronto (0.04)
(5 more...)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Software > Programming Languages (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Logic & Formal Reasoning (0.68)

Add feedback

f621585df244e9596dc70a39b579efb1-Paper.pdf

Neural Information Processing SystemsFeb-11-2026, 23:07:15 GMT

fmmformer, matrix, transformer, (14 more...)

Neural Information Processing Systems

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.28)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > California > Orange County > Irvine (0.14)
(6 more...)

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.49)

Add feedback

321387ba926b8e58d3591c0aeb52ffc2-Paper-Conference.pdf

Neural Information Processing SystemsFeb-10-2026, 20:45:37 GMT

large language model, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
Asia > Middle East > Jordan (0.04)
(20 more...)

Genre: Research Report > Experimental Study (1.00)

Industry:

Information Technology (0.46)
Energy (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.94)
Information Technology > Artificial Intelligence > Cognitive Science (0.67)

Add feedback

FMMformer: Efficient and Flexible Transformer via Decomposed Near-field and Far-field Attention

Neural Information Processing SystemsDec-25-2025, 06:51:43 GMT

We propose FMMformers, a class of efficient and flexible transformers inspired by the celebrated fast multipole method (FMM) for accelerating interacting particle simulation. FMM decomposes particle-particle interaction into near-field and far-field components and then performs direct and coarse-grained computation, respectively. Similarly, FMMformers decompose the attention into near-field and far-field attention, modeling the near-field attention by a banded matrix and the far-field attention by a low-rank matrix. Computing the attention matrix for FMMformers requires linear complexity in computational time and memory footprint with respect to the sequence length. In contrast, standard transformers suffer from quadratic complexity. We analyze and validate the advantage of FMMformers over the standard transformer on the Long Range Arena and language modeling benchmarks. FMMformers can even outperform the standard transformer in terms of accuracy by a significant margin. For instance, FMMformers achieve an average classification accuracy of $60.74\%$ over the five Long Range Arena tasks, which is significantly better than the standard transformer's average accuracy of $58.70\%$.

decomposed near-field and far-field attention, efficient and flexible transformer, fmmformer, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.40)

Add feedback

MoEUT: Mixture-of-Experts Universal Transformers

Neural Information Processing SystemsDec-24-2025, 20:51:45 GMT

Previous work on Universal Transformers (UTs) has demonstrated the importance of parameter sharing across layers. By allowing recurrence in depth, UTs have advantages over standard Transformers in learning compositional generalizations, but layer-sharing comes with a practical limitation of parameter-compute ratio: it drastically reduces the parameter count compared to the non-shared model with the same dimensionality. Naively scaling up the layer size to compensate for the loss of parameters makes its computational resource requirements prohibitive. In practice, no previous work has succeeded in proposing a shared-layer Transformer design that is competitive in parameter count-dominated tasks such as language modeling. Here we propose MoEUT (pronounced moot), an effective mixture-of-experts (MoE)-based shared-layer Transformer architecture, which combines several recent advances in MoEs for both feedforward and attention layers of standard Transformers together with novel layer-normalization and grouping schemes that are specific and crucial to UTs. The resulting UT model, for the first time, slightly outperforms standard Transformers on language modeling tasks such as BLiMP and PIQA, while using significantly less compute and memory.

artificial intelligence, natural language, transformer, (5 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language (0.83)

Add feedback

Nexus: Higher-Order Attention Mechanisms in Transformers

Chen, Hanting, Zhu, Chong, Han, Kai, Tian, Yuchuan, Liang, Yuchen, Guo, Tianyu, Chen, Xinghao, Tao, Dacheng, Wang, Yunhe

arXiv.org Artificial IntelligenceDec-5-2025

Transformers have achieved significant success across various domains, relying on self-attention to capture dependencies. However, the standard first-order attention mechanism is often limited by a low-rank bottleneck, struggling to capture intricate, multi-hop relationships within a single layer. In this paper, we propose the Nexus, a novel architecture designed to enhance representational power through a recursive framework. Unlike standard approaches that use static linear projections for Queries and Keys, Nexus dynamically refines these representations via nested self-attention mechanisms. Specifically, the Query and Key vectors are themselves outputs of inner attention loops, allowing tokens to aggregate global context and model high-order correlations \textit{prior} to the final attention computation. We enforce a parameter-efficient weight-sharing strategy across recursive steps, ensuring that this enhanced expressivity incurs $\mathcal{O}(1)$ additional parameters. We provide theoretical analysis demonstrating that our method breaks the linear bottleneck of standard attention. Empirically, Nexus outperforms standard Transformers on multiple benchmarks.

large language model, machine learning, mechanism, (16 more...)

arXiv.org Artificial Intelligence

2512.03377

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.94)
Information Technology > Artificial Intelligence > Vision (0.68)

Add feedback

321387ba926b8e58d3591c0aeb52ffc2-Paper-Conference.pdf

Neural Information Processing SystemsOct-9-2025, 22:44:24 GMT

experiment, standard transformer, transformer, (13 more...)

Neural Information Processing Systems

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
Asia > Middle East > Jordan (0.04)
(22 more...)

Genre: Research Report > Experimental Study (1.00)

Industry:

Information Technology (0.46)
Energy (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.94)
Information Technology > Artificial Intelligence > Cognitive Science (0.67)

Add feedback

Kraken: Inherently Parallel Transformers For Efficient Multi-Device Inference

Neural Information Processing SystemsOct-9-2025, 18:38:24 GMT

Parallelism introduces collective communication that is both expensive and represents a phase when hardware resources are underuti-lized.

parallelism, proceedings, transformer, (17 more...)

Neural Information Processing Systems

Country:

Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
North America > United States > California > Santa Clara County > Santa Clara (0.04)
North America > Canada > Ontario > Toronto (0.04)
(2 more...)

Genre: Research Report > Experimental Study (0.93)

Industry: Government (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)

Add feedback

A Method details 476 A.1 Categorical attention

Neural Information Processing SystemsOct-9-2025, 02:21:08 GMT

As described in Section 3.2, we implement categorical attention by associating each attention head In this example, an attention head ( left) calculates the histogram for each position. This allows us to compress the corresponding function. Illustrative programs are depicted in Figures 8 and 9 . This is illustrated in Figure 9 . In this section we describe additional implementation details for the experiments in Section 4 .W e We train each model for 250 epochs with a batch size of 512, a learning rate of 0.05, and We take one Gumbel sample per step.

artificial intelligence, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback