AITopics | pause token

Collaborating Authors

pause token

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Pause Tokens Strictly Increase the Expressivity of Constant-Depth Transformers

Neural Information Processing SystemsJun-19-2026, 03:04:06 GMT

Pause tokens, simple filler symbols such as "...", consistently improve Transformer performance on both language and mathematical tasks, yet their theoretical effect remains unexplained. We provide the first formal separation result, proving that adding pause tokens to constant-depth, logarithmic-width Transformers strictly increases their computational expressivity. With bounded-precision activations, Transformers without pause tokens compute only a strict subset of AC0 functions, while adding a polynomial number of pause tokens allows them to express the entire class. For logarithmic-precision Transformers, we show that adding pause tokens achieves expressivity equivalent to TC0, matching known upper bounds. Empirically, we demonstrate that two-layer causally masked Transformers can learn parity when supplied with pause tokens, a function that they appear unable to learn without them. Our results provide a rigorous theoretical explanation for prior empirical findings, clarify how pause tokens interact with width, depth, and numeric precision, and position them as a distinct mechanism, complementary to chain-of-thought prompting, for enhancing Transformer reasoning.

large language model, machine learning, natural language, (21 more...)

Neural Information Processing Systems

Country: Europe > United Kingdom (0.28)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Understanding the Failure Modes of Transformers through the Lens of Graph Neural Networks

Lee, Hunjae

arXiv.org Artificial IntelligenceDec-11-2025

Transformers and more specifically decoder-only transformers dominate modern LLM architectures. While they have shown to work exceptionally well, they are not without issues, resulting in surprising failure modes and predictably asymmetric performance degradation. This article is a study of many of these observed failure modes of transformers through the lens of graph neural network (GNN) theory. We first make the case that much of deep learning, including transformers, is about learnable information mixing and propagation. This makes the study of model failure modes a study of bottlenecks in information propagation. This naturally leads to GNN theory, where there is already a rich literature on information propagation bottlenecks and theoretical failure modes of models. We then make the case that many issues faced by GNNs are also experienced by transformers. In addition, we analyze how the causal nature of decoder-only transformers create interesting geometric properties in information propagation, resulting in predictable and potentially devastating failure modes. Finally, we observe that existing solutions in transformer research tend to be ad-hoc and driven by intuition rather than grounded theoretical motivation. As such, we unify many such solutions under a more theoretical perspective, providing insight into why they work, what problem they are actually solving, and how they can be further improved to target specific failure modes of transformers. Overall, this article is an attempt to bridge the gap between observed failure modes in transformers and a general lack of theoretical understanding of them in this space. Much of modern deep learning can be understood as the study of learnable information mixing and propagation, a perspective that unifies seemingly disparate architectures under a common lens.

artificial intelligence, information, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2512.09182

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.44)

Add feedback

Catch Your Breath: Adaptive Computation for Self-Paced Sequence Production

Galashov, Alexandre, Jones, Matt, Ke, Rosemary, Cao, Yuan, Nagarajan, Vaishnavh, Mozer, Michael C.

arXiv.org Artificial IntelligenceOct-17-2025

We explore a class of supervised training objectives that allow a language model to dynamically and autonomously scale the number of compute steps used for each input token. For any token, the model can request additional compute steps by emitting a output. If the model is granted a delay, a specialized token is inserted at the next input step, providing the model with additional compute resources to generate an output. The model can request multiple pauses. To train the model to use outputs judiciously and to calibrate its uncertainty, we frame the selection of each output token as a sequential-decision problem with a time cost. We refer to the class of methods as $\textit{Catch Your Breath}$ losses and we study three methods in this class: CYB-AP frames the model's task as anytime prediction, where an output may be required at any step and accuracy is discounted over time; CYB-VA is a variational approach that aims to maximize prediction accuracy subject to a specified distribution over stopping times; and CYB-DP imposes a penalty based on a computational budget. Through fine-tuning experiments, we identify the best performing loss variant. The CYB model needs only one third as much training data as the baseline (no pause) model needs to achieve the same performance, and half as much data as a model with pauses and a cross-entropy loss. We find that the CYB model requests additional steps when doing so improves accuracy, and the model adapts its processing time to token-level complexity and context. For example, it often pauses after plural nouns like $\textit{patients}$ and $\textit{challenges}$ but never pauses after the first token of contracted words like $\textit{wasn}$ and $\textit{didn}$, and it shows high variability for ambiguous tokens like $\textit{won}$, which could function as either a verb or part of a contraction.

large language model, machine learning, pause token, (21 more...)

arXiv.org Artificial Intelligence

2510.13879

Genre:

Research Report (0.64)
Workflow (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Add feedback

MADI: Masking-Augmented Diffusion with Inference-Time Scaling for Visual Editing

Kadambi, Shreya, Garrepalli, Risheek, Borse, Shubhankar, Hyatt, Munawar, Porikli, Fatih

arXiv.org Artificial IntelligenceJul-21-2025

Despite the remarkable success of diffusion models in text-to-image generation, their effectiveness in grounded visual editing and compositional control remains challenging. Motivated by advances in self-supervised learning and in-context generative modeling, we propose a series of simple yet powerful design choices that significantly enhance diffusion model capacity for structured, controllable generation and editing. We introduce Masking-Augmented Diffusion with Inference-Time Scaling (MADI), a framework that improves the editability, compositionality and controllability of diffusion models through two core innovations. First, we introduce Masking-Augmented gaussian Diffusion (MAgD), a novel training strategy with dual corruption process which combines standard denoising score matching and masked reconstruction by masking noisy input from forward process. MAgD encourages the model to learn discriminative and compositional visual representations, thus enabling localized and structure-aware editing. Second, we introduce an inference-time capacity scaling mechanism based on Pause Tokens, which act as special placeholders inserted into the prompt for increasing computational capacity at inference time. Our findings show that adopting expressive and dense prompts during training further enhances performance, particularly for MAgD. Together, these contributions in MADI substantially enhance the editability of diffusion models, paving the way toward their integration into more general-purpose, in-context generative diffusion architectures.

artificial intelligence, instruction, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2507.13401

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Pause Tokens Strictly Increase the Expressivity of Constant-Depth Transformers

London, Charles, Kanade, Varun

arXiv.org Artificial IntelligenceMay-28-2025

Pause tokens, simple filler symbols such as "...", consistently improve Transformer performance on both language and mathematical tasks, yet their theoretical effect remains unexplained. We provide the first formal separation result, proving that adding pause tokens to constant-depth, logarithmic-width Transformers strictly increases their computational expressivity. With bounded-precision activations, Transformers without pause tokens compute only a strict subset of $\mathsf{AC}^0$ functions, while adding a polynomial number of pause tokens allows them to express the entire class. For logarithmic-precision Transformers, we show that adding pause tokens achieves expressivity equivalent to $\mathsf{TC}^0$, matching known upper bounds. Empirically, we demonstrate that two-layer causally masked Transformers can learn parity when supplied with pause tokens, a function that they appear unable to learn without them. Our results provide a rigorous theoretical explanation for prior empirical findings, clarify how pause tokens interact with width, depth, and numeric precision, and position them as a distinct mechanism, complementary to chain-of-thought prompting, for enhancing Transformer reasoning.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2505.21024

Country: Europe > United Kingdom (0.28)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Pause-Tuning for Long-Context Comprehension: A Lightweight Approach to LLM Attention Recalibration

Begin, James, Agrawal, Namit, Singh, Eshan, Fu, Yicheng, O'Brien, Sean, Sharma, Vasu, Zhu, Kevin

arXiv.org Artificial IntelligenceFeb-1-2025

LLMs have demonstrated remarkable proficiency in understanding tasks but continue to struggle with long-context comprehension, particularly with content located in the middle of extensive inputs. This limitation, known as the Lost-in-the-Middle (LITM) problem, hinders models from fully processing and utilizing information across lengthy contexts. To address this issue, we introduce pause-tuning, a technique that redistributes attention to enhance comprehension of long-context inputs. Our approach involves fine-tuning language models on datasets with artificially inserted pause tokens, which serve to segment the input into smaller, more manageable parts. We evaluate pause-tuning against alternative approaches using the Needle-in-a-Haystack benchmark, where models must retrieve information embedded within contexts of up to 128K tokens. Experimental results demonstrate significant performance gains, with the LLaMA 3.2 3B Instruct model and the LLaMA 3.1 8B Instruct model improving by 10.61% and 3.57% respectively on average, suggesting that pause-tuning successfully enhances attention redistribution and improves long-context retention. The code and data are available at https://anonymous.4open.science/r/LITM-PauseTokens-7357.

information, llama 3, pause token, (15 more...)

arXiv.org Artificial Intelligence

2502.20405

Country: Oceania > Australia > Victoria > Melbourne (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)

Add feedback

Seq-VCR: Preventing Collapse in Intermediate Transformer Representations for Enhanced Reasoning

Arefin, Md Rifat, Subbaraj, Gopeshh, Gontier, Nicolas, LeCun, Yann, Rish, Irina, Shwartz-Ziv, Ravid, Pal, Christopher

arXiv.org Artificial IntelligenceNov-4-2024

Decoder-only Transformers often struggle with complex reasoning tasks, particularly arithmetic reasoning requiring multiple sequential operations. In this work, we identify representation collapse in the model's intermediate layers as a key factor limiting their reasoning capabilities. To address this, we propose Sequential Variance-Covariance Regularization (Seq-VCR), which enhances the entropy of intermediate representations and prevents collapse. Combined with dummy pause tokens as substitutes for chain-of-thought (CoT) tokens, our method significantly improves performance in arithmetic reasoning problems. In the challenging $5 \times 5$ integer multiplication task, our approach achieves $99.5\%$ exact match accuracy, outperforming models of the same size (which yield $0\%$ accuracy) and GPT-4 with five-shot CoT prompting ($44\%$). We also demonstrate superior results on arithmetic expression and longest increasing subsequence (LIS) datasets. Our findings highlight the importance of preventing intermediate layer representation collapse to enhance the reasoning capabilities of Transformers and show that Seq-VCR offers an effective solution without requiring explicit CoT supervision.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2411.02344

Country:

North America > Canada > Quebec > Montreal (0.04)
North America > United States > New York (0.04)
North America > United States > California (0.04)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.50)

Add feedback

VideoDubber: Machine Translation with Speech-Aware Length Control for Video Dubbing

Wu, Yihan, Guo, Junliang, Tan, Xu, Zhang, Chen, Li, Bohan, Song, Ruihua, He, Lei, Zhao, Sheng, Menezes, Arul, Bian, Jiang

arXiv.org Artificial IntelligenceDec-4-2023

Video dubbing aims to translate the original speech in a film or television program into the speech in a target language, which can be achieved with a cascaded system consisting of speech recognition, machine translation and speech synthesis. To ensure the translated speech to be well aligned with the corresponding video, the length/duration of the translated speech should be as close as possible to that of the original speech, which requires strict length control. Previous works usually control the number of words or characters generated by the machine translation model to be similar to the source sentence, without considering the isochronicity of speech as the speech duration of words/characters in different languages varies. In this paper, we propose a machine translation system tailored for the task of video dubbing, which directly considers the speech duration of each token in translation, to match the length of source and target speech. Specifically, we control the speech length of generated sentence by guiding the prediction of each word with the duration information, including the speech duration of itself as well as how much duration is left for the remaining words. We design experiments on four language directions (German -> English, Spanish -> English, Chinese <-> English), and the results show that the proposed method achieves better length control ability on the generated speech than baseline methods. To make up the lack of real-world datasets, we also construct a real-world test set collected from films to provide comprehensive evaluations on the video dubbing task.

length control, speech, video, (15 more...)

arXiv.org Artificial Intelligence

2211.16934

Country:

North America > Canada > Quebec > Montreal (0.04)
Asia > South Korea > Incheon > Incheon (0.04)
Asia > China (0.04)
(3 more...)

Genre: Research Report > New Finding (0.34)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Add feedback

Think before you speak: Training Language Models With Pause Tokens

Goyal, Sachin, Ji, Ziwei, Rawat, Ankit Singh, Menon, Aditya Krishna, Kumar, Sanjiv, Nagarajan, Vaishnavh

arXiv.org Artificial IntelligenceOct-3-2023

Language models generate responses by producing a series of tokens in immediate succession: the $(K+1)^{th}$ token is an outcome of manipulating $K$ hidden vectors per layer, one vector per preceding token. What if instead we were to let the model manipulate say, $K+10$ hidden vectors, before it outputs the $(K+1)^{th}$ token? We operationalize this idea by performing training and inference on language models with a (learnable) $\textit{pause}$ token, a sequence of which is appended to the input prefix. We then delay extracting the model's outputs until the last pause token is seen, thereby allowing the model to process extra computation before committing to an answer. We empirically evaluate $\textit{pause-training}$ on decoder-only models of 1B and 130M parameters with causal pretraining on C4, and on downstream tasks covering reasoning, question-answering, general understanding and fact recall. Our main finding is that inference-time delays show gains when the model is both pre-trained and finetuned with delays. For the 1B model, we witness gains on 8 of 9 tasks, most prominently, a gain of $18\%$ EM score on the QA task of SQuAD, $8\%$ on CommonSenseQA and $1\%$ accuracy on the reasoning task of GSM8k. Our work raises a range of conceptual and practical future research questions on making delayed next-token prediction a widely applicable new paradigm.

computational linguistic, downstream task, inference, (15 more...)

arXiv.org Artificial Intelligence

2310.02226

Country:

North America > United States > Washington > King County > Seattle (0.28)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)

Add feedback