AITopics | token

Collaborating Authors

token

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Dynamical Properties of Tokens in Self-Attention and Effects of Positional Encoding

Neural Information Processing SystemsJun-23-2026, 01:19:35 GMT

This paper investigates the dynamical properties of tokens in pre-trained Transformer models and explores their application to improving Transformers. To this end, we analyze the dynamical system governing the continuous-time limit of the pre-trained model and characterize the asymptotic behavior of its solutions. Specifically, we characterize when tokens move closer to or farther from one another over time, depending on the model parameters. We provide sufficient conditions, based on these parameters, to identify scenarios where tokens either converge to zero or diverge to infinity. Unlike prior works, our conditions are broader in scope and more applicable to real-world models. Furthermore, we investigate how different forms of positional encoding - specifically absolute and rotary - affect these dynamical regimes. Empirical evidence reveals that the convergence scenario adversely impacts model performance. Motivated by these insights, we propose simple refinements to Transformer architectures that mitigate convergence behavior in models with absolute or rotary positional encoding. These findings support theoretical foundations and design principles for improving Transformer models.

large language model, machine learning, natural language, (21 more...)

Neural Information Processing Systems

Country:

Asia (1.00)
North America > United States > Minnesota (0.28)

Genre:

Research Report > Experimental Study (1.00)
Overview (0.87)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Neural Attention Search

Neural Information Processing SystemsJun-22-2026, 22:54:56 GMT

We present Neural Attention Search (NAtS), an end-to-end learnable sparse transformer that automatically evaluates the importance of each token within a sequence and determines if the corresponding token can be dropped after several steps. To this end, we design a search space that contains three token types: (i) Global Tokens will be preserved and queried by all the following tokens; (ii) Local Tokens survive until the next global token appears; and (iii) Sliding Window Tokens have an impact on the inference of a fixed size of the next following tokens. Similar to the One-Shot Neural Architecture Search approach, this token-type information can be learned jointly with the architecture weights via a learnable attention mask. Experiments on both training a new transformer from scratch and fine-tuning existing large language models show that NAtS can efficiently reduce the KV cache size and the inference costs for the models while maintaining the models' performance.

information, large language model, machine learning, (21 more...)

Neural Information Processing Systems

Country: North America > United States (0.27)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.92)
Overview (0.67)

Industry: Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Neural Attention Search

Neural Information Processing SystemsJun-14-2026, 05:47:32 GMT

artificial intelligence, machine learning, neural information processing system 38, (7 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.75)

Add feedback

MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning

Neural Information Processing SystemsJun-12-2026, 12:20:48 GMT

Chain-of-Thought (CoT) has widely enhanced mathematical reasoning in Large Language Models (LLMs), but it still remains challenging for extending it to multimodal domains. Existing works either adopt a similar textual reasoning for image input, or seek to interleave visual signals into mathematical CoT. However, they face three key limitations for math problem-solving: .

artificial intelligence, large language model, natural language, (9 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.60)

Add feedback

A Token is Worth over 1,000 Tokens: Efficient Knowledge Distillation through Low-Rank Clone

Neural Information Processing SystemsJun-12-2026, 02:44:03 GMT

Training high-performing Small Language Models (SLMs) remains computationally expensive, even with knowledge distillation and pruning from larger teacher models. Existing approaches often face three key challenges: (1) information loss from hard pruning, (2) inefficient alignment of representations, and (3) underutilization of informative activations, particularly from Feed-Forward Networks (FFNs). To address these challenges, we introduce \textbf{Low-Rank Clone (LRC)}, an efficient pre-training method that constructs SLMs aspiring to behavioral equivalence with strong teacher models. LRC trains a set of low-rank projection matrices that jointly enable soft pruning by compressing teacher weights, and activation clone by aligning student activations, including FFN signals, with those of the teacher.

artificial intelligence, machine learning, proceedings, (7 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.40)

Add feedback

Mixture of Tokens: Continuous MoE through Cross-Example Aggregation

Neural Information Processing SystemsMar-22-2026, 06:47:16 GMT

Mixture of Experts (MoE) models based on Transformer architecture are pushing the boundaries of language and vision tasks. The allure of these models lies in their ability to substantially increase the parameter count without a corresponding increase in FLOPs. Most widely adopted MoE models are discontinuous with respect to their parameters - often referred to as . At the same time, existing continuous MoE designs either lag behind their sparse counterparts or are incompatible with autoregressive decoding. Motivated by the observation that the adaptation of fully continuous methods has been an overarching trend in Deep Learning, we develop Mixture of Tokens (MoT), a simple, continuous architecture that is capable of scaling the number of parameters similarly to sparse MoE models. Unlike conventional methods, MoT assigns mixtures of tokens from different examples to each expert. This architecture is fully compatible with autoregressive training and generation. Our best models not only achieve a 3x increase in training speed over dense Transformer models in language pretraining but also match the performance of state-of-the-art MoE architectures. Additionally, a close connection between MoT and MoE is demonstrated through a novel technique we call .

artificial intelligence, machine learning, proceedings, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.60)

Add feedback

ComBack: A Versatile Dataset for Enhancing Compiler Backend Development Efficiency

Neural Information Processing SystemsFeb-18-2026, 04:40:26 GMT

Compiler backends are tasked with generating executable machine code for processors. With the proliferation of diverse processors, it is imperative for programmers to tailor specific compiler backends to accommodate each one.

large language model, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Country:

Europe > Austria (0.04)
Asia > China > Beijing > Beijing (0.04)
North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
(7 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Software (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.69)

Add feedback

Mixture of Tokens: Continuous MoE through Cross-Example Aggregation

Neural Information Processing SystemsFeb-17-2026, 19:42:19 GMT

Most widely adopted MoE models are discontinuous with respect to their parameters - often referred to as sparse .

large language model, machine learning, natural language, (21 more...)

Neural Information Processing Systems

Country:

Europe > Poland > Masovia Province > Warsaw (0.05)
Asia > Middle East > Jordan (0.05)
Europe > Monaco (0.04)
Asia > Japan > Honshū > Chūbu > Toyama Prefecture > Toyama (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Vision (0.93)

Add feedback

1 Experiment Details

Neural Information Processing SystemsFeb-10-2026, 03:44:45 GMT

Clearly, with only 26M learnable parameters, the performance can be boosted from 79.9 to 84.4

artificial intelligence, machine learning, resolution, (15 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Learning Viewpoint-Agnostic Visual Representations by Recovering Tokens in 3D Space

Neural Information Processing SystemsDec-25-2025, 06:31:44 GMT

Humans are remarkably flexible in understanding viewpoint changes due to visual cortex supporting the perception of 3D structure. In contrast, most of the computer vision models that learn visual representation from a pool of 2D images often fail to generalize over novel camera viewpoints. Recently, the vision architectures have shifted towards convolution-free architectures, visual Transformers, which operate on tokens derived from image patches. However, these Transformers do not perform explicit operations to learn viewpoint-agnostic representation for visual understanding. To this end, we propose a 3D Token Representation Layer (3DTRL) that estimates the 3D positional information of the visual tokens and leverages it for learning viewpoint-agnostic representations.

learning viewpoint-agnostic visual representation, name change, transformer, (5 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Vision (1.00)

Add feedback